Transition of MCU & MPU-based embedded devices to PQC
Davide Bellizia (Telsy)
04/03/2024
Introduction
Internet-of-Things (IoT) devices, embedded systems and dependable devices are pervasive in our daily life, as well as pillars of the modern concept of industry. Their capability to gather data from sensors, perform local processing and interconnect with other IoT devices or larger systems is clearly one of the key success factors of this family of technologies. Security of data at rest and of the communication between devices has become a major concern in the very design of IoT devices, and the advent of quantum computing represents a big challenge in this direction.
The world of IoT and embedded systems has to be prepared for the quantum menace, and the security of these constrained devices clearly has to deal with the transition from traditional cryptography to Post-Quantum Cryptography (PQC). Within the NIST PQC contest and the related working group, relevant figures of merit and the impact of the proposed PQC algorithms on classical embedded devices have already been reported [1]. This underlines the importance that the cryptographic community attaches to this particular field. In this blogpost, we discuss the key technologies used in embedded devices and IoT applications and the challenges of their transition to PQC.
MCU & MPU
Embedded devices can come in various forms, depending on the needs of the application and its constraints, such as:
- power consumption (e.g. battery-powered or energy harvesting devices);
- computational power;
- input/output and interfacing capabilities;
- cost;
- security;
- etc.
Clearly, the choice of a particular device for manipulating data is usually tailored to a specific application (or at least to a family of applications) and its constraints. For embedded applications, we can usually distinguish two major families of devices:
- Micro-Controller Units (MCUs);
- Micro-Processor Units (MPUs).
MCUs are the simplest and (usually) cheapest processing devices that can be found in embedded systems. A common architecture of an MCU is shown in Figure 1. Typically, they come with an on-chip non-volatile memory that stores the code that the MCU runs to perform the desired actions. They offer a large number of general-purpose input/output pins and usually some communication interfaces (e.g., UART, SPI, I2C, etc.) to gather data from sensors and exchange it with other systems. MCUs are typically used for control-intensive tasks and applications, where limited computational power is required.
Figure 1: Example of an MCU architecture.
In some applications, MCUs are not the optimal solution due to their limitations in terms of computational power. Therefore, when complex tasks have to be performed and/or the manipulation of data comes with tighter timing requirements, a possible solution is to adopt MPU-based systems, at the cost of higher power consumption and more expensive hardware. MPUs share many things in common with MCUs, but they are generally more advanced and complex. From a processing viewpoint, MPUs are usually faster than MCUs and are generally capable of running a full-fledged Operating System (OS). Leveraging an OS is not always possible on MCUs due to their limited capabilities (some MCUs are able to run lightweight OSs and Real-Time Operating Systems; see [4] for more details), but it enables MPU-based systems to perform very complex tasks and eases development. In addition, MCUs' memory is usually limited, and code size represents a huge limitation for writing complex applications. From a connection standpoint, MPUs usually come with many peripherals that are similar to MCUs', but they can also connect to high-speed communication peripherals, such as USB 3.0 and/or Gigabit Ethernet ports, as their computational power and memory can support large amounts of data. Memory-wise, MPUs commonly store their code outside the device itself, and external memory for storage and RAM is usually needed, making the Bill of Materials (BOM) generally more expensive with respect to an MCU-based system.
PQC Transition and Impact on MCU & MPU-based systems
As stated in [3], implementing PQC algorithms in embedded systems is challenging. Compared to larger systems (e.g., laptops, servers, etc.), the constraints are very tight. Nevertheless, the transition to PQC is of primary importance also in this context, and QUBIP will address this aspect leveraging the IoT Digital Manufacturing pilot demonstrator (see Figure 2).
Figure 2: PQC-enabled MCU & MPU solution with off-chip and on-chip acceleration. The two systems implement the IoT Digital Manufacturing pilot demonstrator that QUBIP will use as case study for the PQC transition in the IoT domain.
The transition to PQC in such constrained devices usually comes at a cost. From a performance point of view, PQC algorithms are quite intensive, and this usually impacts code size and execution time (along with power consumption). If we consider the case of MCU-based systems, where memory resources are generally limited, implementing PQC algorithms may be difficult, especially if the specific application requires countermeasures against side-channel and/or fault-injection attacks. Likewise, if the application imposes tight timing constraints on the communication between devices, the execution of a PQC algorithm may become a strong bottleneck. In this regard, one possible way to carry out an effective PQC transition is to leverage hardware accelerators, speeding up execution and saving code size at the expense of more silicon area. It has to be noted that in some cases MCUs and MPUs already embed hardware accelerators for common traditional cryptographic functions (for example, AES, RSA and ECC) to speed up security-demanding tasks and/or to provide secure boot.
Quantum-safe algorithms typically require a large number of complex operations. For example, lattice-based PQC algorithms such as CRYSTALS-KYBER and CRYSTALS-DILITHIUM usually rely on polynomial multiplication, Pseudo-Random Functions (PRFs) and hash functions. Hash-based PQC algorithms such as SPHINCS+ heavily rely on the usage of hash functions. In general, all PQC algorithms need randomness, which may be generated by a true random number generator or by means of a pseudo-random number generator. As mentioned above, the execution of PQC algorithms can be sped up by ensuring that these "heavy" functions are all performed in hardware. In this case, the "heavy" operation is not performed by the CPU itself, but on a dedicated area of silicon specifically designed for it, which may be located on-chip or off-chip (see Figure 2).
In the case of on-chip accelerators, the CPU accesses the hardware accelerator through a driver, a low-level software layer that abstracts the hardware to the software. The CPU can configure the hardware accelerator, send data and read back the results. On the other hand, if no on-chip accelerator is available, the CPU may request the "heavy" operations from an external device specifically designed to offer cryptographic services, frequently called a Secure Element (SE), which is usually accessible through off-chip interfaces such as UART, SPI and I2C. Operation requests and data are usually exchanged over a protected channel (e.g., a Secure Channel Protocol). This scenario is typical of the many applications where the central unit is MCU-based with no hardware acceleration available.
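As a minimal illustration of the on-chip case, the C sketch below shows the usual shape of such a driver: write the operands, set a start bit, poll a status flag, read back the result. All addresses, register names, offsets and bit masks are hypothetical placeholders for illustration; the real values come from the SoC reference manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory-mapped accelerator registers: base address,
 * offsets and bit masks are placeholders, not those of a real SoC. */
#define ACC_BASE   0x40010000u
#define ACC_CTRL   (*(volatile uint32_t *)(ACC_BASE + 0x00))
#define ACC_STATUS (*(volatile uint32_t *)(ACC_BASE + 0x04))
#define ACC_DATAIN (*(volatile uint32_t *)(ACC_BASE + 0x08))
#define ACC_RESULT (*(volatile uint32_t *)(ACC_BASE + 0x0C))
#define ACC_START  (1u << 0)
#define ACC_DONE   (1u << 0)

/* Offload one operation: stream the operands, start the core,
 * busy-wait on the status flag, read back the result. */
uint32_t acc_process(const uint32_t *in, size_t words) {
    for (size_t i = 0; i < words; i++)
        ACC_DATAIN = in[i];           /* feed operands to the core   */
    ACC_CTRL = ACC_START;             /* kick off the computation    */
    while (!(ACC_STATUS & ACC_DONE))  /* poll until the core is done */
        ;
    return ACC_RESULT;                /* read back the result        */
}
```

In a real driver, the busy-wait would typically be replaced by an interrupt- or DMA-based completion mechanism so that the CPU can do useful work while the accelerator runs.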
From a computational point of view, we may summarize the "heavy" functions needed by PQC algorithms that call for hardware acceleration as follows:
- Polynomial multiplications;
- Hash functions;
- Randomness generation and conditioning.
Polynomial multiplications are widely used in lattice-based cryptography, and they are usually very demanding. The multiplication of two polynomials A and B of n coefficients has complexity O(n²) if performed with the schoolbook (convolution) method. A typical approach to overcome the burden of polynomial multiplication is to perform a domain transformation. As PQC algorithms usually work in finite fields, the domain transformation is typically performed by means of the Number Theoretic Transform (NTT). The two polynomials are first transformed into the NTT domain, and then multiplied point-wise. Since the NTT is invertible, the result of the polynomial multiplication in the original domain is obtained by applying the Inverse NTT (INTT) to the outcome of the point-wise multiplication, requiring an overall complexity of O(n log n):

A · B = INTT( NTT(A) ∘ NTT(B) ),

where ∘ denotes the point-wise multiplication.
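To make the flow concrete, the following C sketch multiplies two polynomials via forward NTT, point-wise product and inverse NTT. The parameters (p = 257, n = 16, root of unity 249) are toy values chosen for readability, not those of any standardized scheme; Kyber (q = 3329) and Dilithium (q = 8380417) use larger parameters and negacyclic variants.

```c
#include <stdint.h>
#include <stdio.h>

#define P 257  /* toy NTT-friendly prime: N divides P - 1           */
#define N 16   /* transform size (power of two)                     */
#define W 249  /* primitive N-th root of unity mod P (W = 3^16 % P) */

static uint32_t mulmod(uint32_t a, uint32_t b) { return (a * b) % P; }

static uint32_t powmod(uint32_t b, uint32_t e) {
    uint32_t r = 1;
    for (; e; e >>= 1, b = mulmod(b, b))
        if (e & 1) r = mulmod(r, b);
    return r;
}

/* In-place iterative Cooley-Tukey NTT: 'root' = W gives the forward
 * transform, 'root' = W^-1 the inverse. O(n log n) butterflies.     */
static void ntt(uint32_t a[N], uint32_t root) {
    for (uint32_t i = 1, j = 0; i < N; i++) {  /* bit-reversal permutation */
        uint32_t bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint32_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (uint32_t len = 2; len <= N; len <<= 1) {
        uint32_t wlen = powmod(root, N / len); /* primitive len-th root */
        for (uint32_t i = 0; i < N; i += len) {
            uint32_t w = 1;
            for (uint32_t k = 0; k < len / 2; k++) {
                uint32_t u = a[i + k], v = mulmod(a[i + k + len / 2], w);
                a[i + k]           = (u + v) % P;
                a[i + k + len / 2] = (u + P - v) % P;
                w = mulmod(w, wlen);
            }
        }
    }
}

int main(void) {
    /* A(x) = 1 + 2x + 3x^2, B(x) = 5 + 4x: product degree < N, so the
     * cyclic convolution equals the plain polynomial product.        */
    uint32_t A[N] = {1, 2, 3}, B[N] = {5, 4}, C[N];
    ntt(A, W);
    ntt(B, W);
    for (int i = 0; i < N; i++) C[i] = mulmod(A[i], B[i]); /* point-wise */
    ntt(C, powmod(W, P - 2));                 /* INTT via W^-1          */
    uint32_t ninv = powmod(N, P - 2);         /* scale by 1/N mod P     */
    for (int i = 0; i < N; i++) C[i] = mulmod(C[i], ninv);
    for (int i = 0; i < 4; i++) printf("%u ", C[i]); /* prints 5 14 23 12 */
    printf("\n");
    return 0;
}
```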
Clearly, an NTT hardware accelerator, as well as sub-modules to efficiently compute ring and modular operations, is a key enabler for the efficient implementation of PQC algorithms. It must be noted that performing operations with a certain modulus or algebraic structure may imply restrictions and/or requirements on the hardware design, and it is not always easy to change the modulus and/or the algebraic structure of a hardware accelerator. Hence, a flexible NTT accelerator covering different cases (e.g., KYBER and DILITHIUM at the same time) is challenging to design and can lead to very substantial performance/energy/security/area trade-offs.
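To see why the modulus is so deeply baked in, consider Barrett reduction as written in the style of the Kyber reference code (q = 3329), sketched below: the precomputed constant and the shift amounts are derived from that specific modulus, and a hardware datapath hard-wires the same kind of choices.

```c
#include <stdint.h>

#define KYBER_Q 3329

/* Barrett reduction in the style of the Kyber reference code: returns
 * a small representative congruent to 'a' modulo q. The constant v and
 * the 26/25-bit shifts are derived from q = 3329; a different modulus
 * (e.g. Dilithium's q = 8380417) needs different constants and bit
 * widths, which in hardware means a different datapath.              */
int16_t barrett_reduce(int16_t a) {
    const int16_t v = ((1 << 26) + KYBER_Q / 2) / KYBER_Q; /* 20159 */
    int16_t t = (int16_t)(((int32_t)v * a + (1 << 25)) >> 26);
    return (int16_t)(a - t * KYBER_Q);
}
```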
Many PQC algorithms make abundant use of hash and hash-based functions. For example, the hash-based digital signature algorithm SPHINCS+ makes use of (one out of) three families of hash functions: SHA2, SHAKE and Haraka. Both CRYSTALS-KYBER and CRYSTALS-DILITHIUM, as well as FALCON, heavily rely on Keccak-based functions, in particular SHA3-256/-512 and SHAKE-128/-256. It is interesting to note that the performance of CRYSTALS-KYBER/-DILITHIUM is tightly coupled with the execution of the underlying Keccak-based sub-routines [5]. It has to be noted that the hardware support for the Keccak permutation on commercially available MCUs and MPUs (in the form of complex System-on-Chips) is very limited (SHA1 and SHA2 accelerators are more common than Keccak on these platforms). Therefore, a transition exercise for embedded devices would require implementing a versatile Keccak permutation accelerator in hardware to efficiently speed up all PQC operations based on this specific family of functions.
Randomness generation in PQC is of utmost importance, as many security features rely on specific randomness distributions in a given domain. Popular choices in PQC algorithms are Centered Binomial Distributions (CBDs), discrete Gaussian and uniform distributions. One of the main challenges is obtaining a given randomness distribution (especially over a given algebraic structure) in a secure and constant-time manner, to avoid side-channel vulnerabilities. Uniform distributions are usually easy to obtain (in constant time) from a True Random Number Generator (TRNG) or from a seeded PRNG implemented with traditional block/stream ciphers or hash functions, and algorithms like CRYSTALS-KYBER have adopted this strategy to mitigate this issue. In general, CBDs are also easy to obtain from a uniform distribution, but this does not hold for discrete Gaussian distributions. A common technique to implement discrete Gaussian distributions in hardware is to use Cumulative Distribution Tables (CDTs) to process uniformly random samples [2, 6]. This approach guarantees a constant-time generation of discrete randomness distributions, starting from a uniform distribution, at a cost that follows the complexity of the distribution itself and its support space.
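As an example of constant-time sampling from a uniform source, the C sketch below implements a centered binomial sampler with η = 2, in the style of Kyber's CBD₂: each byte of uniform randomness (e.g., SHAKE output) yields two coefficients in [-2, 2], using only bit manipulation, with no secret-dependent branches or table lookups.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Constant-time centered binomial sampler (eta = 2), in the style of
 * Kyber's CBD_2. Each input byte of uniform randomness yields two
 * coefficients a - b in [-2, 2], where a and b are each the sum of
 * two uniform bits. No secret-dependent branches or table lookups.  */
void cbd2(const uint8_t *buf, size_t buflen, int8_t *out) {
    for (size_t i = 0; i < buflen; i++) {
        /* sum adjacent bit pairs: d holds four 2-bit sums, each 0..2 */
        uint32_t d = (buf[i] & 0x55u) + ((buf[i] >> 1) & 0x55u);
        for (int j = 0; j < 2; j++) {
            int8_t a = (int8_t)((d >> (4 * j))     & 0x3);
            int8_t b = (int8_t)((d >> (4 * j + 2)) & 0x3);
            out[2 * i + j] = (int8_t)(a - b); /* centered binomial sample */
        }
    }
}

int main(void) {
    uint8_t rnd[4] = {0x1B, 0xF0, 0x00, 0xFF}; /* stand-in for SHAKE output */
    int8_t coeffs[8];
    cbd2(rnd, sizeof rnd, coeffs);
    for (int i = 0; i < 8; i++) printf("%d ", coeffs[i]);
    printf("\n");
    return 0;
}
```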
PQ/T Hybrid
To ensure reliability and backward compatibility with quantum-unsafe systems, a possible path towards the PQC transition is to adopt hybrid cryptographic schemes. In hybrid cryptographic schemes (also referred to as PQ/T hybrid), traditional and post-quantum primitives are used together. This approach ensures a certain degree of cryptographic agility in the next generation of embedded and IoT devices, making them able to interoperate with both older and newer systems. "Legacy" support is a common nice-to-have in the embedded and IoT segment and, clearly, it would ease the transition to PQC for low-end devices.
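A minimal sketch of the idea, assuming a KEM-style flow: the traditional (e.g., ECDH) and post-quantum (e.g., ML-KEM) shared secrets are concatenated and fed to a key derivation function, so the session key remains secure as long as at least one of the two components resists the attacker. The toy_kdf() below is a non-cryptographic placeholder for illustration only; a real design would use a standard KDF such as HKDF-SHA-256.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* toy_kdf(): NON-cryptographic mixing placeholder, for illustration
 * only. A real combiner would use a standard KDF (e.g. HKDF).       */
static void toy_kdf(const uint8_t *in, size_t inlen, uint8_t out[16]) {
    memset(out, 0, 16);
    for (size_t i = 0; i < inlen; i++) {
        out[i % 16] ^= in[i];
        out[(i + 1) % 16] = (uint8_t)(out[(i + 1) % 16] * 31u + in[i]);
    }
}

/* Derive one session key from both shared secrets: if either the
 * traditional or the post-quantum exchange stays secure, part of the
 * KDF input remains secret.                                         */
void hybrid_combine(const uint8_t *ss_trad, size_t trad_len,
                    const uint8_t *ss_pq, size_t pq_len,
                    uint8_t session_key[16]) {
    uint8_t concat[128];            /* assumes trad_len + pq_len <= 128 */
    memcpy(concat, ss_trad, trad_len);
    memcpy(concat + trad_len, ss_pq, pq_len);
    toy_kdf(concat, trad_len + pq_len, session_key);
}

int main(void) {
    uint8_t ss_ecdh[32] = {0x11}, ss_mlkem[32] = {0x22}; /* stand-ins */
    uint8_t key[16];
    hybrid_combine(ss_ecdh, sizeof ss_ecdh, ss_mlkem, sizeof ss_mlkem, key);
    for (int i = 0; i < 16; i++) printf("%02x", key[i]);
    printf("\n");
    return 0;
}
```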
Conclusions
The transition to PQC for embedded and IoT devices requires a certain degree of cryptographic agility, as many commercial products in this field are not yet prepared to efficiently implement PQC algorithms and protocols. The path for next-generation IoT and embedded devices to mitigate the quantum menace will go through the efficient implementation of hardware accelerators for computationally intensive functions, such as the NTT, Keccak, and ring arithmetic. To some extent, the adoption of hybrid schemes could be a pivotal strategy to ease the PQC transition in the embedded and IoT segment.