The challenge of integration: What it really takes for a general-purpose MCU to function as a complete AI SoC

By Mark Rootz, Vice President of Marketing, Alif Semiconductor

The application of AI to endpoint devices can transform their value. In the medical arena, wearable AI devices promise to take the detection and diagnosis of serious conditions such as atrial fibrillation out of the clinic and into the patient’s everyday world for continuous monitoring. And by applying AI algorithms, hearing aids can be transformed from simple amplifiers into intelligent vocal discriminators, isolating the voice of the person of interest while cancelling or muting all other voices and background noise.

These AI capabilities could multiply the value of almost every type of wearable and portable battery-powered product. And at the endpoint, AI functions often cannot be performed in the cloud for reasons of power, latency, privacy, wireless reach, security and cost. Such devices need local AI processing capability.

But the successful local deployment of AI in these products must find a way past the severe design constraints of space and power. Wearable devices such as earbuds, rings, smart glasses, and patient monitors have small enclosures which can accommodate few components and only a small battery.

Pre-AI, many of these types of products (although endowed with fewer features and physically larger) were based on general-purpose microcontrollers or microprocessors which could integrate the main functions. Integration helps designers meet their goals for space and power while reducing component count and board footprint.

In the AI era, integration of system functions into a general-purpose MCU remains a highly desirable way to save space and power. But an AI MCU must integrate even more functions than earlier conventional MCUs, while at the same time providing the ultra-low power consumption that enables devices with small batteries to run AI at the endpoint without sacrificing run-time between charges.

This is not as simple as bolting AI capability onto a legacy MCU architecture. Alif Semiconductor, founded in 2019 to create a new generation of AI-enabled MCUs for endpoint devices, has had the advantage of thinking from first principles about how to integrate AI into an MCU. This thinking reflects the hundreds of conversations in which OEMs have described to Alif the factors that affect their choice of MCU to serve as an AI system-on-chip (SoC) for battery-powered endpoint devices. Here are four of the most important such factors.

1. Neural processing acceleration must be tightly coupled to the CPU

    The MCU market’s initial response to customer demand for AI functionality was to develop AI software development kits (SDKs) to enable AI/machine learning (ML) algorithms to run on the same Arm® Cortex®-M CPU that also performed conventional control functions. However, an MCU for endpoint ML applications needs a neural processing unit (NPU): this type of processor is optimized for the multiply-accumulate (MAC) operations that are the bread and butter of neural networking applications. An embedded CPU on its own will struggle with meaningful ML workloads because inferences derived from the highly parallel ML network must be resolved in a serial fashion, taking excessive time while burning a lot of energy.
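The MAC primitive in question can be sketched in a few lines of C: a quantized dot product, the building block of convolution and fully connected layers (the function name and int8 quantization here are illustrative, not from any particular SDK). On a plain CPU each MAC costs at least one loop iteration; an NPU such as the Ethos-U55 executes many such MACs per cycle in parallel.

```c
#include <stdint.h>

/* Serial multiply-accumulate over two int8 vectors: one MAC per loop
   iteration on a plain CPU. A neural-network layer performs millions
   of these, which is exactly what an NPU parallelizes in hardware. */
static int32_t dot_q7(const int8_t *a, const int8_t *b, int n) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (int32_t)a[i] * (int32_t)b[i]; /* the MAC operation */
    return acc;
}
```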

    Figure 1 illustrates the contrast in AI performance between an MCU’s CPU and its NPU. Alif Semiconductor’s Ensemble MCU family pairs the contemporary Cortex-M55 CPU core with an Ethos™-U55 NPU co-processor. The metrics show a single ML inference for four trained ML models running on an Ensemble MCU: the models represent the operations required for keyword spotting, object detection, image classification, and speech recognition. The Cortex-M55 is among the best embedded CPU architectures, and on its own already runs ML workloads on the order of 5x faster than previous generations of Cortex-M CPUs. As good as the Cortex-M55 is, however, the figures in the yellow columns reveal a substantial additional uplift: close to, and in some cases more than, two orders of magnitude improvement when using the NPU+CPU compared to the CPU only. Multiplying those gains by the Cortex-M55’s roughly 5x advantage over earlier, widely used Cortex-M architectures gives, for speech recognition, around 800x faster execution and around 400x less energy per inference compared to legacy Cortex-M CPUs.
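The compounding of these factors is simple arithmetic, sketched below. The per-model NPU factors (160x time, 80x energy for speech recognition) are assumed values chosen to be consistent with the "close to two orders of magnitude" uplift quoted above, not datasheet figures.

```c
/* Back-of-envelope combination of the two uplift factors:
   (NPU+CPU vs. Cortex-M55 alone) x (Cortex-M55 vs. legacy Cortex-M). */
enum {
    NPU_TIME_GAIN   = 160, /* assumed: speech recognition, time      */
    NPU_ENERGY_GAIN = 80,  /* assumed: speech recognition, energy    */
    M55_GEN_GAIN    = 5    /* Cortex-M55 vs. earlier Cortex-M cores  */
};

static int total_time_gain(void)   { return NPU_TIME_GAIN   * M55_GEN_GAIN; }
static int total_energy_gain(void) { return NPU_ENERGY_GAIN * M55_GEN_GAIN; }
```

With these assumptions the totals land at roughly 800x faster and 400x less energy per inference relative to a legacy Cortex-M CPU, matching the figures in the text.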

    Fig.1: Benchmark tests show the superior performance and efficiency of an NPU over a CPU when executing common ML functions

    Another important factor in tightly coupling the NPU to the CPU is the software development environment. There are many options for deploying proprietary and third-party NPU cores into SoCs. OEMs are very clear, however: they do not want to retool their entire development infrastructure to work with a new architecture just for AI/ML workloads, which would force them to adopt new toolchains and new instruction sets.

    If they already work in the Arm ecosystem for embedded control functions, they want to stay in the Arm ecosystem for AI/ML functions as well. Combining a Cortex-M CPU with an Ethos-U NPU, both from Arm, achieves this requirement.

    In fact, the Ethos-U NPU is essentially a co-processor which integrates seamlessly with the Cortex-M CPU. The Arm Vela compiler will automatically split the ML workload between them, with 95% or more typically falling onto the NPU. A great side benefit is that the Cortex-M CPU can sleep or do other work while the ML inference is being resolved.

    2. Integration has to encompass the whole system

      It’s fair to say that the NPU is likely to be the center of attention in an integrated AI/ML MCU. However, what is integrated around the processing cores, and specifically how the integration is crafted, is very important. At the top of the list are memory and peripherals.

      Figure 1 showed that enhanced processing capability is key to high performance and power efficiency, but without an optimized memory system behind it, the results will fail to meet expectations. 

      A simplified view of the Ensemble MCU memory topology is shown in Figure 2. The upper half represents the real-time section with very fast Tightly Coupled Memory (TCM) connected to the CPU and NPU cores. For fast inference times these TCM SRAM memories must be sufficiently large to hold the ML model’s tensor arena. The lower half of the diagram shows other system memories connected by a common high-speed bus. A large, shared bulk SRAM is required to hold sensor data, such as the input from a camera and microphones, and a large non-volatile memory contains the ML model itself plus application code. When large on-chip memories are distributed this way to minimize competing bus traffic, then concurrent memory transactions flourish, bottlenecks are cleared, memory access times are minimized, and power consumption is compatible with the use of a small battery.
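In firmware, this topology shows up as a placement decision: the tensor arena goes in fast TCM next to the CPU+NPU, while bulk sensor buffers go in the larger shared SRAM. The sketch below illustrates the idea with GCC section attributes; the section names, buffer sizes, and frame format are hypothetical, since the real names come from the device's linker script and board support package.

```c
#include <stdint.h>

/* Hypothetical section names standing in for the linker-script regions. */
#define IN_DTCM      __attribute__((section(".bss.dtcm")))   /* fast TCM       */
#define IN_BULK_SRAM __attribute__((section(".bss.sram1")))  /* shared SRAM    */

/* Tensor arena: working memory for the ML model's activations, sized
   (here, as an assumption) at 512 KB and kept in TCM for fast inference. */
#define ARENA_BYTES (512u * 1024u)
static uint8_t tensor_arena[ARENA_BYTES] IN_DTCM __attribute__((aligned(16)));

/* Sensor data, e.g. one QVGA RGB565 camera frame, kept in bulk SRAM. */
#define FRAME_BYTES (320u * 240u * 2u)
static uint8_t frame_buffer[FRAME_BYTES] IN_BULK_SRAM;
```

Keeping the two buffers in separate memories on separate buses is what lets the NPU stream activations from TCM while a camera interface fills the frame buffer concurrently.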

      Fig. 2: The Ensemble MCUs’ internal memory topology

      The correct peripheral set is also critical for MCUs in endpoint ML applications which often operate in one or more of the ‘three V’ domains: vision, voice and vibration. This means that connectivity to image sensors, microphones, inertial measurement units and more is required, in addition to traditional MCU peripherals such as high-speed serial channels, analog interfaces, and display interconnects.

      For endpoint AI devices, all these functions should be integrated into the MCU.

      Whole-system integration not only eliminates the need for additional power rails and external power conversion such as a Power Management IC (PMIC), but also enables power to be controlled dynamically at a much more granular level on the chip – and this is the third desirable feature of an integrated AI MCU.

      3. Adaptive power management to extend battery runtime

        Alif recognized early on that the concentration of local ML capability at the endpoint will skyrocket in the near future, while at the same time the physical size of these products will rapidly shrink, especially for wearable devices, entailing the use of smaller and smaller batteries.

        Alif’s approach to stretching battery life to address this problem took several forms. Two prime examples are:

        1. Partitioning the system so that a low-power portion of the chip can be always-on yet still offer robust compute capability, enabling it to selectively wake a much higher-performance portion of the chip to execute heavy workloads and then return to sleep
        2. Managing power dynamically so that only the portions of the chip that are needed are powered on, and shutting them off when not required, all at a finely granular level

        To facilitate this division of functions, many Ensemble MCUs have two pairs of Cortex-M55+Ethos-U55 cores as shown in Figure 3:

        • One pair in the High-Efficiency region of the chip, built on low-leakage transistors so that it can be always-on, operating at up to 160MHz
        • The other pair in the High-Performance region, operating at up to 400MHz

        To picture the advantage this brings, imagine a smart occupancy camera which continuously scans a room at a low frame rate using the High-Efficiency pair of cores to classify a valid event (such as a human falling to the floor, or a specific gesture) which wakes the High-Performance pair to identify a person or persons, check for blocked exits, dial for help, and so on. In this case the camera can be intelligently vigilant, produce fewer false positives, and extend battery life. Similar uses for these two pairs of CPU+NPU cores can be applied just as well to the classification of sounds, voices, words, text, vibrations, and sensor data in a wide variety of applications.
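The battery-life benefit of this duty-cycled scheme is easy to estimate. The currents below are assumptions for illustration only, not Alif datasheet figures: say the always-on High-Efficiency pair draws ~1 mA while scanning at a low frame rate, and the High-Performance pair draws ~50 mA but is awake only ~1% of the time.

```c
/* Average current of a duty-cycled two-region system:
   the HE region is always on, the HP region wakes for a fraction
   hp_duty of the time. All figures are illustrative assumptions. */
static double avg_current_ma(double i_he_ma, double i_hp_ma, double hp_duty) {
    return i_he_ma + hp_duty * i_hp_ma;
}

/* Battery runtime in hours for a given capacity and average draw. */
static double runtime_hours(double battery_mah, double avg_ma) {
    return battery_mah / avg_ma;
}
```

With these assumed numbers and a 100 mAh battery, the average draw is 1.5 mA for a runtime of roughly 66 hours, versus about 2 hours if the High-Performance pair ran continuously at 50 mA.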

        Fig. 3: Ensemble E3 MCU block diagram showing the High-Efficiency and High-Performance regions of the chip

        Additionally, all Ensemble MCUs employ Alif’s aiPM™ (autonomous intelligent Power Management) technology to manipulate in real time up to 12 individual power domains in the chip as needed to match the use case being executed. Only domains that are actively executing tasks are powered on (such as those supplying specific processing cores, memories, or peripherals) while the other domains remain off. All of this is transparent to the software developer.
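Conceptually, each power domain can be pictured as one bit in an on/off mask; only the domains the current use case needs are set. The domain names and the mask model below are invented for illustration (aiPM itself manages domains autonomously in hardware).

```c
#include <stdint.h>

/* Hypothetical domain identifiers: one bit per power domain,
   up to 12 domains as described for aiPM. */
enum power_domain {
    PD_HE_CPU_NPU = 1u << 0,  /* High-Efficiency M55+U55 pair  */
    PD_HP_CPU_NPU = 1u << 1,  /* High-Performance M55+U55 pair */
    PD_BULK_SRAM  = 1u << 2,
    PD_CAMERA_IF  = 1u << 3,
    PD_DISPLAY    = 1u << 4
    /* ... further domains up to 12 ... */
};

/* Turn domains on or off in the mask of currently powered domains. */
static uint32_t domain_on(uint32_t mask, uint32_t d)  { return mask | d; }
static uint32_t domain_off(uint32_t mask, uint32_t d) { return mask & ~d; }
```

In the occupancy-camera scenario above, the low-power scan would keep only the HE pair and camera interface powered; a detected event adds the HP pair and bulk SRAM, which are dropped again once the heavy workload completes.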

        4. Protection for valuable machine learning models and other IP

          The final key feature which needs to be integrated into an AI MCU for endpoints is security protection. This is of course needed to resist the ever-present forms of cyber-attack. But for many OEMs the most important protection is for their IP embedded in AI models.

          OEMs invest vast amounts of time and money in collating training datasets, building AI models, and developing and refining inferencing algorithms. This gives unscrupulous manufacturers a strong incentive to steal this expensive IP by copying it from insufficiently protected production units.

          A common approach to building strong security into conventional MCU-based designs is to add an external secure MCU, which enables the OEM to establish a root-of-trust, manage secret keys and certificates, facilitate a secure boot, and so on. It is rare, however, to find a full secure ‘enclave’ with these functions, and more, built into a conventional MCU.

          Yet battery-powered and wearable AI products benefit particularly from the space and power savings, and from the heightened security, that come when this functionality is integrated into the MCU. The secure enclave (see Figure 4), standard in all Alif devices, is a dedicated, isolated subsystem for the management of vital security functions such as secure key management and storage, secure boot with an immutable Root-of-Trust, attestation at runtime using certificates, hardware cryptographic services, secure debugging, read-out protection, secure firmware updates, and complete lifecycle management.

          Fig. 4: The secure enclave in Ensemble MCUs governs the security policy for the entire chip

          An AI-ready MCU platform

          These four characteristics of an AI MCU – tight coupling of the NPU and CPU with a standard development ecosystem, whole-system integration, adaptive power management, and built-in IP protection – are in strong demand from the manufacturers of battery-powered endpoint devices that Alif Semiconductor has engaged with.

          Designers who evaluate the Ensemble family will find a wide selection of scalable and compatible devices ranging from a single CPU core to quad-core devices supporting the Linux® operating system to fit varied projects while enabling re-use of software across all of them.

          Learn more at www.alifsemi.com.