Generative AI in endpoint devices: the new demands on the MCU

By: Henrik Flodell, Senior Director of Marketing, Alif Semiconductor

Large language models (LLMs) and LLM-based services such as ChatGPT and Gemini are a shining example of the capability of generative AI. But these AI models are enormous – as of early 2025, the largest have more than one trillion parameters.

Huge data centers hosting the most advanced cloud computing systems are now necessary to provide the compute and power resources demanded by generative AI services. This raises the question: how are makers of embedded devices operating at the edge or the endpoint to scale generative AI systems down to fit their much more constrained hardware resources?

In fact, device manufacturers are already working on solutions to this puzzle, and some early lessons are emerging – it turns out that both the hardware and the software in endpoints need to be specially adapted for generative AI. The microcontroller architectures on which embedded devices have long been based are not up to the task of implementing generative AI, and new models optimized for constrained resources need to be deployed, providing functions similar to those of cloud-based AI, but in a different way.

The uses of generative AI at the endpoint

The broad set of software capabilities that we refer to as generative AI appeals to the embedded world because of the scope it offers to make embedded systems more intelligently autonomous.

The hallmark of generative AI systems is their ability to ‘remember’, and so to put new inputs in the context of previous data. This is what enables:

  • Natural language understanding and text generation
  • The implementation of long command sequences
  • Intelligent response to inputs from multiple sensor modes, such as a combination of audio, video and text

In a consumer wearable device, such as smart glasses, generative AI offers scope, for instance, for real-time translation of foreign-language text in a shop window or on a road sign. In sectors such as medical equipment, manufacturing, and transportation, OEMs are excited by the potential of generative AI in the human-machine interface – for instance, adopting agentic capabilities, or learning user behavior and autonomously deciding on actions rather than following a pre-programmed menu of responses.

In many of these cases, local AI processing is going to be essential because of latency: user expectations will not allow for the round-trip time of cloud-based operation. Cloud storage of generative AI data is also a growing concern: the number of installed IoT devices is forecast to reach 50 billion units by 2030, and the global datasphere is expected to exceed 300 zettabytes. Both the cost and the energy involved in storing an accumulating mass of generative AI data inputs in the cloud are substantial.

For these reasons, endpoint device manufacturers are designing their products to perform most or all AI processing locally.

Mastering the scaling challenge

But how is an endpoint system, smart glasses for instance, to perform language operations such as real-time translation when the LLM software that enables such operations has a memory footprint measured in terabytes? Even with an ordinary scaling technique such as quantization, these models cannot conceivably be cut to less than multiple gigabytes – still a huge compute and memory overhead for most embedded products, let alone a wearable device such as smart glasses.
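
A rough back-of-envelope calculation makes the point. Model weights occupy roughly the parameter count multiplied by the bits stored per parameter; the figures below are illustrative orders of magnitude, not measurements of any particular model:

```c
#include <stdio.h>

/* Rough model-weight footprint: parameter count x bits per parameter.
   Figures are illustrative orders of magnitude, not measurements. */
static double footprint_gb(double params, double bits_per_param)
{
    return params * bits_per_param / 8.0 / 1e9; /* bytes -> GB */
}

int main(void)
{
    /* A trillion-parameter LLM, from 16-bit weights down to 4-bit */
    printf("1T params @ 16-bit: %.0f GB\n", footprint_gb(1e12, 16)); /* ~2000 GB */
    printf("1T params @  4-bit: %.0f GB\n", footprint_gb(1e12, 4));  /* ~500 GB  */

    /* A small language model sized for an MCU-class device */
    printf("100M params @ 4-bit: %.2f GB\n", footprint_gb(1e8, 4));  /* ~0.05 GB */
    return 0;
}
```

Even aggressive 4-bit quantization leaves a trillion-parameter model at around 500 GB, while a 100-million-parameter model fits in tens of megabytes – the scale at which MCU-class hardware becomes realistic.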

It’s clear that the answer for embedded devices is not to use an LLM at all, but different models which are better suited to constrained hardware resources. The only viable candidate to perform the AI processing and system control functions is the microcontroller: it alone can meet the power, size, feature integration and cost constraints of endpoint devices. And for MCU-based products, OEMs are finding a sweet spot for generative AI in the deployment of small language models (SLMs) and of convolutional neural networks (CNNs) and recurrent neural networks (RNNs) that are enhanced with generative AI elements (see Figure 1). In other words, at the endpoint, generative AI will not be implemented with scaled-down versions of the models running in the cloud, but with new models optimized for embedded device hardware.

Fig. 1: while high-load AI requires a GPU, low- and mid-load AI software can run on new AI MCUs

So what demands do these endpoint-optimized generative AI models make of the MCU?

Today’s most AI-capable MCUs are performing largely voice-, video- and motion-based operations such as face detection, keyword recognition, and condition-based monitoring in factory equipment. The best such MCUs offer throughput of up to a few hundred giga-operations per second (GOPS).

The transition to generative AI at the endpoint will see demand for raw neural processing capability rise to as much as 10 tera-operations per second (TOPS) by 2030. This will require MCU architectures that combine one or more CPUs with one or more neural processing units (NPUs). To perform generative AI functions, new NPUs will be needed that can support the transformer operations on which generative AI algorithms depend.
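
To see where such numbers come from, a common rule of thumb for transformer inference is roughly two operations per parameter per generated token. The sketch below uses that approximation (an assumption for illustration, not a benchmark of any device) to relate model size and token rate to required throughput:

```c
#include <stdio.h>

/* Rule-of-thumb compute for transformer inference:
   ~2 operations per parameter per generated token. */
static double required_gops(double params, double tokens_per_s)
{
    return 2.0 * params * tokens_per_s / 1e9;
}

int main(void)
{
    /* Illustrative SLM workloads on an endpoint device */
    printf("100M-param SLM @ 10 tok/s: %.0f GOPS\n", required_gops(1e8, 10)); /* ~2 GOPS  */
    printf("1B-param SLM   @ 20 tok/s: %.0f GOPS\n", required_gops(1e9, 20)); /* ~40 GOPS */
    return 0;
}
```

On these assumptions, text generation alone sits in the tens of GOPS; it is sustained vision and audio pipelines layered on top that push the aggregate demand toward the TOPS range.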

But when evaluating hardware options for the new generative AI applications at the endpoint, OEMs are learning not to focus exclusively on raw throughput: other features of an MCU’s architecture determine whether or not it is capable of running generative AI models:

Memory provision – the requirement for very fast access to data is higher for generative AI than for other types of AI, which in turn have a much larger memory footprint than the real-time control functions that conventional MCUs are designed to support. Internal memory accesses are inherently faster than external ones, so an MCU specified for generative AI should put particular emphasis on the size and speed of its internal memory.

Even with enhanced internal memory provision, many generative AI applications will require external memory as well: here, the speed of the memory interface is a crucial parameter, to avoid external memory accesses introducing latency.
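
One practical consequence: many embedded toolchains let the developer pin the most latency-sensitive data into internal memory through linker sections. The sketch below uses GCC-style section attributes; the section names, buffer sizes, and the assumption that the linker script maps them to internal SRAM and external flash are illustrative, not features of any particular MCU:

```c
#include <stdint.h>

/* Hot data: the attention KV cache is small and reused on every token,
   so it is pinned to internal SRAM (assumes the linker script maps the
   ".sram_data" section to on-chip memory). */
static int8_t kv_cache[64 * 1024]
    __attribute__((section(".sram_data"), aligned(16)));

/* Cold data: bulk quantized weights live in external flash and are
   streamed in over a fast memory interface; their contents are
   programmed at flash time rather than initialized here. */
static const int8_t model_weights[4 * 1024 * 1024]
    __attribute__((section(".ext_flash_rodata")));
```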

System bus bandwidth – to achieve high performance in generative AI applications, the MCU needs to orchestrate multiple operations allocated to various functional blocks. These include not only the NPU and the CPU for implementing neural networking operations, but also auxiliary processing functions such as a hardware image signal processor (ISP) for curating and pre-processing images before they are fed to a neural networking algorithm. This mix of operations before an inferencing result is produced calls for frictionless movement of data inside the system, and requires generous internal bandwidth on a bus to which all functional blocks involved in AI operations are attached (see Figure 2).

Fig. 2: in second-generation Ensemble MCUs from Alif Semiconductor, a wider internal bus connects all internal processor blocks and memory
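
The scale of the traffic involved is easy to underestimate. The figures below are assumptions chosen purely for scale, but they show how quickly camera frames and repeated weight reads add up to hundreds of megabytes per second crossing the internal bus:

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative per-second data movement inside the MCU; all
       figures are assumptions for scale, not measurements. */
    double camera_bps = 640.0 * 480 * 2 * 30; /* VGA, 16-bit pixels, 30 fps */
    double weight_bps = 50e6 * 10;            /* 50 MB of SLM weights re-read
                                                 once per token at 10 tokens/s */

    printf("camera -> ISP -> NPU: %5.1f MB/s\n", camera_bps / 1e6); /* ~18 MB/s  */
    printf("weight traffic:       %5.1f MB/s\n", weight_bps / 1e6); /* ~500 MB/s */
    return 0;
}
```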

Ultra-low power consumption – it is in the nature of AI applications, including generative AI, that a stream of data is continuously scanned for relevance in a background monitoring mode, while high-performance inferencing hardware is applied only periodically, when relevant data is found.

An MCU architecture that reflects this dual nature of generative AI operation can allocate background monitoring to a low-power, lower-speed hardware block, reserving a high-performance, higher-power block for use only when a fast and accurate inferencing result is required, as the sketch below illustrates.
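
In outline, the firmware pattern looks something like this. The HAL calls are hypothetical stand-ins for a vendor SDK, assumed here purely for illustration:

```c
#include <stdint.h>

/* Hypothetical HAL calls (assumed for illustration, not a real SDK API) */
extern void  lp_mic_read(int16_t *buf, int samples);
extern float tiny_detector_score(const int16_t *buf);
extern void  hp_core_wake(void);
extern void  hp_submit_inference(const int16_t *buf);
extern void  lp_sleep_until_next_frame(void);

#define DETECT_THRESHOLD 0.9f  /* assumed tuning value */

/* Runs continuously on the low-power core: a cheap always-on detector
   scans the microphone stream; the high-performance core and NPU stay
   powered down until something worth inferencing appears. */
void lp_monitor_loop(void)
{
    int16_t frame[320];                      /* 20 ms of 16 kHz audio */
    for (;;) {
        lp_mic_read(frame, 320);
        if (tiny_detector_score(frame) > DETECT_THRESHOLD) {
            hp_core_wake();                  /* power up big core + NPU */
            hp_submit_inference(frame);      /* hand over triggering data */
        }
        lp_sleep_until_next_frame();         /* core sleeps between frames */
    }
}
```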

A low-power architecture enables generative AI functions to be implemented even in highly power-constrained devices such as smart glasses or true wireless stereo earbuds, which only have space for a very small and light battery.

High operating efficiency also reduces the thermal footprint of the MCU system, helping the designer to eliminate the risk of hotspots, which are incompatible with wearable form factors such as earbuds and smart glasses.

Small footprint – endpoint devices that can benefit from the implementation of generative AI will necessarily be complex systems. Smart glasses, for instance, might have to integrate cameras, microphones, speakers, a display, a battery and more into a frame that needs to be light, comfortable, and attractive to look at. This puts a premium on cutting component count to shrink the system footprint, which in turn requires the MCU to integrate as many as possible of the functions needed for generative AI – not only the CPU and NPU, but also supporting functions such as an ISP and fast memory.

Data and device security – OEMs which can implement generative AI at the endpoint will have intellectual property (IP) of huge value embedded in their product. This needs to be protected from potential competitors. Generative AI systems which capture images and speech are also subject to privacy concerns.

For both these reasons, security capabilities are an essential element of a generative AI system. It is preferable for security functions to be integrated into the MCU, to prevent the exposure of secrets on board traces, and to eliminate the need for additional security components on the board.

Optimizing MCU hardware and software for generative AI

For the reasons explained above, legacy MCU architectures provide an inadequate basis for the implementation of generative AI: even if a capable NPU is bolted on to a previously CPU-centric architecture, it will lack the memory capacity, internal bandwidth, support for low-power monitoring, integration of AI functions, and security capabilities required for generative AI at the endpoint. This is why new MCU architectures capable of handling SLMs and other endpoint-optimized models are now emerging, providing genuine scope to run generative AI algorithms on video, audio and motion data in endpoint devices that operate on battery power in highly space- and power-constrained form factors.
