SCIENTIFIC PUBLICATIONS

1. Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

Rutishauser, G., Conti, F. and Benini, L., 2023, June. Free bits: Latency optimization of mixed-precision quantized neural networks on the edge. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 1-5). IEEE.

Abstract: Mixed-precision quantization, where a deep neural network’s layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to 28.6% reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline, even on systems with no hardware support for sub-byte arithmetic at negligible accuracy drop. Furthermore, we show the superiority of our approach with respect to differentiable search targeting reduced binary operation counts as a proxy for latency.

2. SALSA: Simulated Annealing-based Loop-Ordering Scheduler for DNN Accelerators

Jung, V.J., Symons, A., Mei, L., Verhelst, M. and Benini, L., 2023. SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators. arXiv preprint arXiv:2304.12931.

Abstract: To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop, on 5 different DNNs. On average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively.
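As a rough illustration of the simulated-annealing half of such a scheduler, the sketch below searches over loop orderings by swapping two loops at a time. It is a generic annealer under stated assumptions, not SALSA's implementation: the loop names and sizes, the `toy_cost` function, the swap move, and the geometric temperature schedule are all hypothetical stand-ins, since the abstract does not specify the actual cost model or annealing parameters.

```python
import math
import random

# Hypothetical loop dimensions of one layer (names and sizes are made up).
LOOP_SIZES = {"K": 64, "C": 32, "OX": 14, "OY": 14, "FX": 3, "FY": 3}


def toy_cost(order):
    """Toy stand-in for an energy/latency model: deeper loops cost more."""
    return sum((depth + 1) * LOOP_SIZES[name] for depth, name in enumerate(order))


def anneal(order, cost_fn, steps=2000, t_start=100.0, t_end=0.1, seed=0):
    """Generic simulated annealing over permutations using pairwise swaps."""
    rng = random.Random(seed)
    current = list(order)
    current_cost = cost_fn(current)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / (steps - 1))
        i, j = rng.sample(range(len(current)), 2)
        candidate = list(current)
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = cost_fn(candidate) - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


initial = ["FX", "FY", "OX", "OY", "C", "K"]
best, best_cost = anneal(initial, toy_cost)
print(best, best_cost)
```

Tracking the best ordering separately from the current one means the returned schedule is never worse than the starting point, even if the walk ends on an uphill move.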

3. Dependability of Future Edge-AI Processors: Pandora’s Box

Gomony, M.D., Gebregiorgis, A., Fieback, M., Geilen, M., Stuijk, S., Richter-Brockmann, J., Bishnoi, R., Argo, S., Andradas, L.A., Güneysu, T. and Taouil, M., 2023, May. Dependability of Future Edge-AI Processors: Pandora’s Box. In 2023 IEEE European Test Symposium (ETS) (pp. 1-6). IEEE.

Abstract: This paper addresses one of the directions of the HORIZON EU CONVOLVE project: the dependability of smart edge processors based on computation-in-memory and emerging memristor devices such as RRAM. It discusses how this alternative computing paradigm will change the way manufacturing test is done. In addition, it describes how these emerging devices, which inherently suffer from many non-idealities, call for new solutions in order to ensure accurate and reliable edge computing. Moreover, the paper also covers the security aspects of future edge processors and shows the challenges and the future directions.

4. PetaOps/W edge-AI μProcessors: Myth or reality?

Gomony, M.D., De Putter, F., Gebregiorgis, A., Paulin, G., Mei, L., Jain, V., Hamdioui, S., Sanchez, V., Grosser, T., Geilen, M. and Verhelst, M., 2023, April. PetaOps/W edge-AI μProcessors: Myth or reality? In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: With the rise of deep learning (DL), our world braces for artificial intelligence (AI) in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high throughput, reliable and secure AI processing at ultra-low power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.

5. Challenges and Opportunities of Security-Aware EDA

Feldtkeller, J., Sasdrich, P. and Güneysu, T., 2023. Challenges and Opportunities of Security-Aware EDA. ACM Transactions on Embedded Computing Systems, 22(3), pp.1-34.

Abstract: The foundation of every digital system is based on hardware in which security, as a core service of many applications, should be deeply embedded. Unfortunately, the knowledge of system security and efficient hardware design is spread over different communities and, due to the complex and ever-evolving nature of hardware-based system security, state-of-the-art security is not always implemented in state-of-the-art hardware. However, automated security-aware hardware design seems to be a promising solution to bridge the gap between the different communities. In this work, we systematize state-of-the-art research with respect to security-aware Electronic Design Automation (EDA) and identify a modern security-aware EDA framework. As part of this work, we consider threats in the form of information flow, timing and power side channels, and fault injection, which are the fundamental building blocks of more complex hardware-based attacks. Based on the existing research, we provide important observations and research questions to guide future research in support of modern, holistic, and security-aware hardware design infrastructures.

6. A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling

Krausz, M., Land, G., Richter-Brockmann, J. and Güneysu, T., 2023, May. A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling. In IACR International Conference on Public-Key Cryptography (pp. 94-124). Cham: Springer Nature Switzerland.

Abstract: The sampling of polynomials with fixed weight is a procedure required by round-4 Key Encapsulation Mechanisms (KEMs) for Post-Quantum Cryptography (PQC) standardization (BIKE, HQC, McEliece) as well as NTRU, Streamlined NTRU Prime, and NTRU LPRime. Recent attacks have shown in this context that side-channel leakage of sampling methods can be exploited for key recoveries. While countermeasures against such timing attacks have already been presented, there is still no comprehensive work covering solutions that are also secure against power side channels. To close this gap, the contribution of this work is threefold: First, we analyze requirements for the different use cases of fixed-weight sampling. Second, we demonstrate how all known sampling methods can be implemented securely against timing and power/EM side channels and propose performance-enhancing modifications. Furthermore, we propose a new, comparison-based methodology that outperforms existing methods in the masked setting for the three round-4 KEMs BIKE, HQC, and McEliece. Third, we present bitsliced and arbitrary-order masked software implementations and benchmark them for all relevant cryptographic schemes to be able to infer recommendations for each use case. Additionally, we provide a hardware implementation of our new method as a case study and analyze the feasibility of implementing the other approaches in hardware.

7. Combined Private Circuits – Combined Security Refurbished

Feldtkeller, J., Güneysu, T., Moos, T., Richter-Brockmann, J., Saha, S., Sasdrich, P. and Standaert, F.X., 2023, November. Combined private circuits – combined security refurbished. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (pp. 990-1004).

Abstract: Physical attacks are well-known threats to cryptographic implementations. While countermeasures against passive Side-Channel Analysis (SCA) and active Fault Injection Analysis (FIA) exist individually, protecting against their combination remains a significant challenge. A recent attempt at achieving joint security has been published at CCS 2022 under the name CINI-MINIS. The authors introduce relevant security notions and aim to construct arbitrary-order gadgets that remain trivially composable in the presence of a combined adversary. Yet, we show that all CINI-MINIS gadgets at any order are susceptible to a devastating attack with only a single fault and probe due to a lack of error correction modules in the compression. We explain the details of the attack, pinpoint the underlying problem in the constructions, propose an additional design principle, and provide new (fixed) provably secure and composable gadgets for arbitrary order. Luckily, the changes in the compression stage help us to save correction modules and registers elsewhere, making the resulting Combined Private Circuits (CPC) more secure and more efficient than the original ones. We also explain why the discovered flaws have been missed by the associated formal verification tool VERICA (TCHES 2022) and propose fixes to remove its blind spot. Finally, we explore alternative avenues to repair the compression stage without additional corrections based on non-completeness, i.e., constructing a compression that never recombines any secret. Yet, while this approach could have merit for low-order gadgets, it is, for now, hard to generalize and scales poorly to higher orders. We conclude that our refurbished arbitrary order CINI gadgets provide a solid foundation for further research.

8. Quantitative Fault Injection Analysis

Feldtkeller, J., Güneysu, T. and Schaumont, P., 2023, December. Quantitative Fault Injection Analysis. In International Conference on the Theory and Application of Cryptology and Information Security (pp. 302-336). Singapore: Springer Nature Singapore.

Abstract: Active fault injection is a credible threat to real-world digital systems computing on sensitive data. Arguing about security in the presence of faults is non-trivial, and state-of-the-art criteria are overly conservative and lack the ability of fine-grained comparison. However, comparing two alternative implementations for their security is required to find a satisfying compromise between security and performance. In addition, the comparison of alternative fault scenarios can help optimize the implementation of effective countermeasures. In this work, we use quantitative information flow analysis to establish a vulnerability metric for hardware circuits under fault injection that measures the severity of an attack in terms of information leakage. Potential use cases range from comparing implementations with respect to their vulnerability to specific fault scenarios to optimizing countermeasures. We automate the computation of our metric by integrating it into a state-of-the-art evaluation tool for physical attacks and provide new insights into the security under an active fault attacker.

9. Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware

Land, G., Marotzke, A., Richter-Brockmann, J. and Güneysu, T., 2024. Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(1), pp.1-26.

Abstract: Streamlined NTRU Prime is a lattice-based Key Encapsulation Mechanism (KEM) that is, together with X25519, the default algorithm in OpenSSH 9. Based on lattice assumptions, it is assumed to be secure also against attackers with access to large-scale quantum computers. While Post-Quantum Cryptography (PQC) schemes have been subject to extensive research in recent years, challenges remain with respect to protection mechanisms against attackers that have additional side-channel information, such as the power consumption of a device processing secret data. As a countermeasure to such attacks, masking has been shown to be a promising and effective approach. For public-key schemes, including any recent PQC schemes, usually, a mixture of Boolean and arithmetic techniques is applied on an algorithmic level. Our generic hardware implementation of Streamlined NTRU Prime decapsulation, however, follows an idea that until now was assumed to be solely applicable efficiently to symmetric cryptography: gadget-based masking. The hardware design is transformed into a secure implementation by replacing each gate with a composable secure gadget that operates on uniform random shares of secret values. In our work, we show the feasibility of applying this approach also to PQC schemes and present the first Public-Key Cryptography (PKC) implementation, pre- and post-quantum, masked with the gadget-based approach, considering several trade-offs and design choices. By the nature of gadget-based masking, the implementation can be instantiated at arbitrary masking order. We synthesize our implementation both for Artix-7 Field-Programmable Gate Arrays (FPGAs) and 45nm Application-Specific Integrated Circuits (ASICs), yielding practically feasible results regarding the area, randomness requirement, and latency. We verify the side-channel security of our implementation using formal verification on the one hand, and practically using Test Vector Leakage Assessment (TVLA) on the other. Finally, we also analyze the applicability of our concept to Kyber and Dilithium, which will be standardized by the National Institute of Standards and Technology (NIST).

10. Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting

Miccini, R., Zniber, A., Laroche, C., Piechowiak, T., Schoeberl, M., Pezzarossa, L., Karrakchou, O., Sparsø, J. and Ghogho, M., 2023, September. Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.

Abstract: Although deep learning has made strides in the field of deep noise suppression, leveraging deep architectures on resource-constrained devices still proves challenging. Therefore, we present an early-exiting model based on nsNet2 that provides several levels of accuracy and resource savings by halting computations at different stages. Moreover, we adapt the original architecture by splitting the information flow to take into account the injected dynamism. We show the trade-offs between performance and computational complexity based on established metrics.

11. Differentiable Transportation Pruning

Li, Y., van Gemert, J.C., Hoefler, T., Moons, B., Eleftheriou, E. and Verhoef, B.E., 2023. Differentiable Transportation Pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16957-16967).

Abstract: Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities.

12. CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

Singh, S., Feliu, J., Acacio, M.E., Jimborean, A. and Ros, A., 2023, October. CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (pp. 1-13). IEEE.

Abstract: Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out-of-order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading processor (SMT), stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls when the LQ is full or when its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ could be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2-way SMT processor with CELLO avoids almost all conservative searches to the LQ and significantly reduces its occupancy. CELLO makes it possible i) to reduce the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) to shrink the LQ size from 192 to only 80 entries, reducing the LQ energy expenditure by as much as 69% while performing on par with a mainstream LQ implementation.

13. Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms

Miccini, R., Cerioli, A., Laroche, C., Piechowiak, T., Sparsø, J. and Pezzarossa, L., 2024. Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms. arXiv preprint arXiv:2402.12263.

Abstract: Despite the recent advances in model compression techniques for deep neural networks, deploying such models on ultra-low-power embedded devices still proves challenging. In particular, quantization schemes for Gated Recurrent Units (GRU) are difficult to tune due to their dependence on an internal state, preventing them from fully benefiting from sub-8-bit quantization. In this work, we propose a modular integer quantization scheme for GRUs where the bit width of each operator can be selected independently. We then employ Genetic Algorithms (GA) to explore the vast search space of possible bit widths, simultaneously optimizing for model size and accuracy. We evaluate our methods on four different sequential tasks and demonstrate that mixed-precision solutions exceed homogeneous-precision ones in terms of Pareto efficiency. Our results show a model size reduction between 25% and 55% while maintaining an accuracy comparable with the 8-bit homogeneous equivalent.
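To make the notion of Pareto efficiency concrete, the sketch below filters a set of candidate configurations down to its non-dominated front when both model size and error are to be minimized. It is a minimal illustration only: the `(size, error)` pairs are invented for the example and are not results from the paper, and the function names are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep only the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up (model size in kB, error rate) pairs for candidate bit-width configurations.
candidates = [(100, 0.10), (80, 0.12), (60, 0.11), (60, 0.20), (120, 0.09)]
front = pareto_front(candidates)
print(front)
```

Here `(80, 0.12)` and `(60, 0.20)` drop out because `(60, 0.11)` is at least as small and more accurate than both; the remaining three points each trade size against error.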

14. Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Sarda, G.M., Shah, N., Bhattacharjee, D., Debacker, P. and Verhelst, M., 2023, October. Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis. In 2023 IEEE International Symposium on Workload Characterization (IISWC) (pp. 226-228). IEEE.

Abstract: GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full hardware-mapping-algorithm compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.

15. DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling

Mei, L., Goetschalckx, K., Symons, A. and Verhelst, M., 2023, February. DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (pp. 570-583). IEEE.

Abstract: DNN workloads can be scheduled onto DNN accelerators in many different ways: from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a. layer fusion, or cascaded execution). This results in a very broad scheduling space, with each schedule leading to varying hardware (HW) costs in terms of energy and latency. To rapidly explore this vast space for a wide variety of hardware architectures, analytical cost models are crucial to estimate scheduling effects on the HW level. However, state-of-the-art cost models lack support for exploring the complete depth-first scheduling space, for instance focusing only on activations while ignoring weights, or modeling only DRAM accesses while overlooking on-chip data movements. These limitations prevent researchers from systematically and accurately understanding the depth-first scheduling space. After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in the gaps. DeFiNES enables analytically estimating the hardware cost for possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated, taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level. A comparison with generalized state-of-the-art demonstrates up to 10× better solutions found with DeFiNES.

16. Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

Colleman, S., Symons, A., Jung, V.J. and Verhelst, M., 2024, April. Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms. In 2024 25th International Symposium on Quality Electronic Design (ISQED) (pp. 1-6). IEEE.

Abstract: The impact of transformer networks is booming, yet they come with significant computational complexity. It is therefore essential to understand how to optimally map and execute these networks on modern neural processor hardware. So far, literature on transformer scheduling optimization has focused on deployment on GPUs and specific ASICs. This work enables extensive hardware/mapping exploration by extending the DSE framework Stream towards support for transformers across a wide variety of hardware architectures and different execution schedules. After validation, we explore the optimal schedule for transformer layers/attention heads and investigate whether layer fusion is beneficial to improve latency, energy or memory requirements. Our study shows that the memory requirements for active feature data can be drastically reduced by adapting the execution schedule based on the size of the input of the attention head.

17. Analog or Digital In-Memory Computing? Benchmarking Through Quantitative Modeling

Sun, J., Houshmand, P. and Verhelst, M., 2023, October. Analog or Digital In-Memory Computing? Benchmarking Through Quantitative Modeling. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (pp. 1-9). IEEE.

Abstract: In-Memory Computing (IMC) has emerged as a promising paradigm for energy-efficient, throughput-efficient and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full system performance on real workloads. This paper aims to address the lack of a quantitative comparison of Analog In-Memory Computing (AIMC) and Digital In-Memory Computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may have better energy efficiency than DIMC on convolutional layers and pointwise layers, which can exploit high spatial unrolling. On the other hand, DIMC with small macro size outperforms AIMC on depthwise layers, which feature limited spatial unrolling opportunities inside a macro.

18. CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories

Shi, M., Colleman, S., VanDeMieroop, C., Joseph, A., Meijer, M., Dehaene, W. and Verhelst, M., 2023, April. CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories. In 2023 24th International Symposium on Quality Electronic Design (ISQED) (pp. 1-8). IEEE.

Abstract: Deep neural networks (DNN) use a wide range of network topologies to achieve high accuracy within diverse applications. This model diversity makes it impossible to identify a single “dataflow” (execution schedule) to perform optimally across all possible layers and network topologies. Several frameworks support the exploration of the best dataflow for a given DNN layer and hardware. However, switching the dataflow from one layer to the next layer within one DNN model can result in hardware inefficiencies stemming from memory data layout mismatch among the layers. Unfortunately, all existing frameworks treat each layer independently and typically model memories as black boxes (one large monolithic wide memory), which ignores the data layout and cannot deal with the data layout dependencies of sequential layers. These frameworks are not capable of cross-layer dataflow optimization. This work, hence, aims at cross-layer dataflow optimization, taking the data dependency and data layout reshuffling overheads among layers into account. Additionally, we propose to exploit the multi-bank memories typically present in modern DNN accelerators to efficiently reshuffle data and support more dataflows at low overhead. These innovations are supported through the Cross-layer Memory-aware Dataflow Scheduler (CMDS). CMDS can model DNN execution energy/latency while considering the different data layout requirements due to the varied optimal dataflow of layers. Compared with the state-of-the-art (SOTA), which performs layer-optimized memory-unaware scheduling, CMDS achieves up to 5.5× energy reduction and 1.35× latency reduction with negligible hardware cost.

19. An Empirical Evaluation of Sliding Windows on Siren Detection Task using Spiking Neural Networks

Kshirasagar, S., Guntoro, A. and Mayr, C., 2024. An Empirical Evaluation of Sliding Windows on Siren Detection Task Using Spiking Neural Networks. Advances in Signal Processing and Artificial Intelligence, p.112.

Abstract: Anomalous acoustic cues such as siren sounds, when undetected, can lead to road safety issues such as collisions or accidents. Auditory perception systems are resource-bound when deployed on power-constrained sensory edge devices. Spiking neural networks (SNNs) promise brain-like computing with high energy efficiency. This work presents a quantitative analysis of how varying the sliding window affects the performance of an acoustic anomaly detection task for siren sounds. We perform FFT-based pre-processing and feed Mel-spectrogram features as input to a recurrent spiking neural network. The SNN model in this work comprises leaky-integrate-and-fire (LIF) neurons in the hidden layer and a single readout with a leaky integrator cell. The non-trivial motivation of this research is to understand the effect of the encoding behavior of spiking neurons with sliding windows. We conduct experiments with different window sizes and overlap ratios between windows. We present our results for performance measures such as accuracy and onset latency to provide insight into the choice of an optimal window.
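The two parameters varied in such experiments, window size and overlap ratio, can be illustrated with a generic framing routine. This is a minimal sketch under assumptions: the function name, the rounding of the hop length, and the choice to drop a trailing partial window are illustrative and are not taken from the paper.

```python
def sliding_windows(samples, window_size, overlap_ratio):
    """Split a sequence into fixed-size windows that overlap by overlap_ratio."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must be in [0, 1)")
    # Hop length: how far the window advances each step (at least one sample).
    hop = max(1, int(round(window_size * (1.0 - overlap_ratio))))
    # A trailing partial window is dropped (an assumption of this sketch).
    last_start = len(samples) - window_size
    return [samples[s:s + window_size] for s in range(0, last_start + 1, hop)]


signal = list(range(10))
half_overlap = sliding_windows(signal, window_size=4, overlap_ratio=0.5)
no_overlap = sliding_windows(signal, window_size=4, overlap_ratio=0.0)
print(len(half_overlap), len(no_overlap))
```

With ten samples and a window of four, a 50% overlap yields four windows where no overlap yields two, so a higher overlap ratio produces more frames per second of audio and a correspondingly finer time resolution at higher compute cost.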

20. COAC: Cross-Layer Optimization of Accelerator Configurability for Efficient CNN Processing

Colleman, S., Shi, M. and Verhelst, M., 2023. COAC: Cross-Layer Optimization of Accelerator Configurability for Efficient CNN Processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

Abstract: To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer’s spatial (dataflow parallelization) and temporal unrolling (TU, execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This article, therefore, presents cross-layer optimization of accelerator configurability (COAC), a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC not only provides a systematic analysis of the architectural overhead as a function of the supported spatial unrollings (SUs), but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% energy-delay-product (EDP) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.

21. ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators

Yin, J., Mei, L., Guntoro, A. and Verhelst, M., 2023, November. ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators. In 2023 IEEE 41st International Conference on Computer Design (ICCD) (pp. 391-398). IEEE.

Abstract: Spatio-Temporal Convolutional Neural Networks (ST-CNN) allow extending CNN capabilities from image processing to consecutive temporal-pattern recognition. Generally, state-of-the-art (SotA) ST-CNNs inflate the feature maps and weights from well-known CNN backbones to represent the additional time dimension. However, edge computing applications would suffer tremendously from such large computation or memory overhead. Fortunately, the overlapping nature of ST-CNN enables various optimizations, such as the dilated causal convolution structure and Depth-First (DF) layer fusion to reuse the computation between time steps and CNN sliding windows, respectively. Yet, no hardware-aware approach has been proposed that jointly explores the optimal strategy from a scheduling as well as a hardware point of view. To this end, we present ACCO, an automated optimizer that explores efficient Causal CNN transformation and DF scheduling for ST-CNNs on edge hardware accelerators. By cost-modeling the computation and data movement on the accelerator architecture, ACCO automatically selects the best scheduling strategy for the given hardware-algorithm target. Compared to the fixed dilated causal structure, ST-CNNs with ACCO reach an ~8.4x better Energy-Delay-Product. Meanwhile, ACCO improves ~20% in layer-fusion optimals compared to the SotA DF exploration toolchain. When jointly optimizing ST-CNN on the temporal and spatial dimension, ACCO’s scheduling outcomes are on average 19x faster and 37x more energy-efficient than spatial DF schemes.

22. Reliable and Energy-Efficient Diabetic Retinopathy Screening Using Memristor-Based Neural Networks

Diware, S., Chilakala, K., Joshi, R.V., Hamdioui, S. and Bishnoi, R., 2024. Reliable and Energy-efficient Diabetic Retinopathy Screening using Memristor-based Neural Networks. IEEE Access.

Abstract: Diabetic retinopathy (DR) is a leading cause of permanent vision loss worldwide. It refers to irreversible retinal damage caused by elevated glucose levels and blood pressure. Regular screening for DR can facilitate its early detection and timely treatment. Neural network-based DR classifiers can be leveraged to achieve such screening in a convenient and automated manner. However, these classifiers suffer from a reliability issue where they exhibit strong performance during development but degraded performance after deployment. Moreover, they do not provide supplementary information about the prediction outcome, which severely limits their widespread adoption. Furthermore, energy-efficient deployment of these classifiers on edge devices remains unaddressed, which is crucial to enhance their global accessibility. In this paper, we present reliable and energy-efficient hardware for DR detection, suitable for deployment on edge devices. We first develop a DR classification model using custom training data that incorporates diverse image quality and image sources along with improved class balance. This enables our model to effectively handle both on-field variations in retinal images and minority DR classes, enhancing its post-deployment reliability. We then propose a pseudo-binary classification scheme to further improve the model performance and provide supplementary information about the model prediction. Additionally, we present an energy-efficient hardware design for our model using memristor-based computation-in-memory, to facilitate its deployment on edge devices. Our proposed approach achieves reliable DR classification with three orders of magnitude reduction in energy consumption over state-of-the-art hardware platforms.

23. Alternate Path μ-op Cache Prefetching

S. Singh, A. Perais, A. Jimborean and A. Ros, “Alternate Path μ-op Cache Prefetching,” 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), Buenos Aires, Argentina, 2024, pp. 1230-1245, doi: 10.1109/ISCA59077.2024.00092.

Abstract: Datacenter applications are well-known for their large code footprints. This has caused frontend design to evolve by implementing decoupled fetching and large prediction structures – branch predictors, Branch Target Buffers (BTBs) – to mitigate the stagnating size of the instruction cache by prefetching instructions well in advance. In addition, many designs feature a micro operation (μ-op) cache, which primarily provides power savings by bypassing the instruction cache and decoders once warmed up. However, this μ-op cache often has lower reach than the instruction cache, and it is not filled up speculatively using the decoupled fetcher. As a result, the μ-op cache is often over-subscribed by datacenter applications, up to the point of becoming a burden. This paper first shows that because of this pressure, blindly prefetching into the μ-op cache using state-of-the-art standalone prefetchers would not provide significant gains. As a consequence, this paper proposes to prefetch only critical μ-ops into the μ-op cache, by focusing on execution points where the μ-op cache provides the most gains: pipeline refills. Concretely, we use hard-to-predict conditional branches as indicators that a pipeline refill is likely to happen in the near future, and prefetch into the μ-op cache the μ-ops that belong to the path opposed to the predicted path, which we call alternate path. Identifying hard-to-predict branches requires no additional state if the branch predictor confidence is used to classify branches. Including extra alternate branch predictors with limited budget (8.95 KB to 12.95 KB), our proposal provides average speedups of 1.9% to 2% and as high as 12% on a subset of CVP-1 traces.

24. Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps

Nilsson, M., Miccini, R., Laroche, C., Piechowiak, T., Zenke, F. (2024) Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. Proc. Interspeech 2024, 2975-2979, doi: 10.21437/Interspeech.2024-1979

Abstract: As speech processing systems in mobile and edge devices become more commonplace, the demand for unintrusive speech quality monitoring increases. Deep learning methods provide high-quality estimates of objective and subjective speech quality metrics. However, their significant computational requirements are often prohibitive on resource-constrained devices. To address this issue, we investigated binary activation maps (BAMs) for speech quality prediction on a convolutional architecture based on DNSMOS. We show that the binary activation model with quantization aware training matches the predictive performance of the baseline model. It further allows using other compression techniques. Combined with 8-bit weight quantization, our approach results in a 25-fold memory reduction during inference, while replacing almost all dot products with summations. Our findings show a path toward substantial resource savings by supporting mixed-precision binary multiplication in hard- and software.

25. Late Breaking Results: Language-level QoR modeling for High-Level Synthesis

Masouros, D., Ferikoglou, A., Zervakis, G., Xydis, S. and Soudris, D., 2024, June. Late breaking results: Language-level QoR modeling for high-level synthesis. In Proceedings of the 61st ACM/IEEE Design Automation Conference (pp. 1-2).

Abstract: This paper proposes a language-level modeling approach for High-Level Synthesis based on the state-of-the-art Transformer architecture. Our approach estimates the performance and required resources of HLS applications directly from the source code when different synthesis directives, in terms of HLS #pragmas, are applied. Results show that the proposed architecture achieves 96.02% accuracy for predicting the feasibility class of applications and an average of 0.95 and 0.91 R² scores for predicting the actual performance and required resources, respectively.

26. ESAM: Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning

Huijbregts, Lucas, et al. “Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning.” Proceedings of the 61st ACM/IEEE Design Automation Conference. 2024.

Abstract: Current Artificial Intelligence (AI) computation systems face challenges, primarily from the memory-wall issue, limiting overall system-level performance, especially for Edge devices with constrained battery budgets, such as smartphones, wearables, and Internet-of-Things sensor systems. In this paper, we propose a new SRAM-based Compute-In-Memory (CIM) accelerator optimized for Spiking Neural Networks (SNNs) Inference. Our proposed architecture employs a multiport SRAM design with multiple decoupled Read ports to enhance the throughput and Transposable Read-Write ports to facilitate online learning. Furthermore, we develop an Arbiter circuit for efficient data-processing and port allocations during the computation. Results for a 128×128 array in 3nm FinFET technology demonstrate a 3.1× improvement in speed and a 2.2× enhancement in energy efficiency with our proposed multiport SRAM design compared to the traditional single-port design. At system-level, a throughput of 44 MInf/s at 607 pJ/Inf and 29 mW is achieved.

27. Data-driven HLS optimization for reconfigurable accelerators

Ferikoglou, A., Kakolyris, A., Kypriotis, V., Masouros, D., Soudris, D. and Xydis, S., 2024, June. Data-driven HLS optimization for reconfigurable accelerators. In Proceedings of the 61st ACM/IEEE Design Automation Conference (pp. 1-6).

Abstract: High-Level Synthesis (HLS) has played a pivotal role in making FPGAs accessible to a broader audience by facilitating high-level device programming and rapid microarchitecture customization through the use of directives. However, manually selecting the right directives can be a formidable challenge for programmers lacking a hardware background. This paper introduces an ultra-fast, knowledge-based HLS design optimization method that automatically extracts and applies the most promising directive configurations to the original source code. This optimization approach is entirely data-driven, offering a generalized HLS tuning solution without reliance on Quality of Result (QoR) models or meta-heuristics. We design, implement, and evaluate our methodology using over 100 applications sourced from well-established benchmark suites and GitHub repositories, all running on a Xilinx ZCU104 FPGA. The results are promising, including an average geometric mean speedup of ×7.2 and ×1.35 compared to designer-optimized designs and resource over-provisioning strategies, respectively. Additionally, it demonstrates a high design feasibility score and maintains an average inference latency of 38ms. Comparative analysis with traditional genetic algorithm-based Design Space Exploration (DSE) methods and State-of-the-Art (SoA) approaches reveals that it produces designs of similar quality but at speeds 2-3 orders of magnitude faster. This suggests that it is a highly promising solution for ultra-fast and automated HLS optimization.

28. Falcon: A Scalable Analytical Cache Model

Pitchanathan, A., Grover, K. and Grosser, T., 2024. Falcon: A scalable analytical cache model. Proceedings of the ACM on Programming Languages, 8(PLDI), pp.1854-1878.

Abstract: Compilers often use performance models to decide how to optimize code. This is often preferred over using hardware performance measurements, since hardware measurements can be expensive, limited by hardware availability, and make the output of compilation non-deterministic. Analytical models, on the other hand, serve as efficient and noise-free performance indicators. Since many optimizations focus on improving memory performance, memory cache miss rate estimations can serve as an effective and noise-free performance indicator for superoptimizers, worst-case execution time analyses, manual program optimization, and many other performance-focused use cases. Existing methods to model the cache behavior of affine programs work on small programs such as those in the Polybench benchmark but do not scale to the larger programs we would like to optimize in production, which can be orders of magnitude bigger by lines of code. These analytical approaches hand off the whole program to a Presburger solver and perform expensive mathematical operations on the huge resulting formulas. We develop a scalable cache model for affine programs that splits the computation into smaller pieces that do not trigger the worst-case asymptotic behavior of these solvers. We evaluate our approach on 46 TorchVision neural networks, finding that our model has a geomean runtime of 44.9 seconds compared to over 32 minutes for the state-of-the-art prior cache model, and the latter is actually smaller than the true value because the prior model reached our four-hour time limit on 54% of the networks, and this limit was never reached by our tool. Our model exploits parallelism effectively: running it on sixteen cores is 8.2x faster than running it single-threaded. While the state-of-the-art model takes over four hours to analyze a majority of the benchmark programs, Falcon produces results in at most 3 minutes and 3 seconds; moreover, after a local modification to the program being analyzed, our model efficiently updates the predictions in 513 ms on average (geomean). Thus, we provide the first scalable analytical cache model.

29. Verifying Peephole Rewriting In SSA Compiler IRs

Bhat, S., Keizer, A., Hughes, C., Goens, A. and Grosser, T., 2024. Verifying Peephole Rewriting In SSA Compiler IRs. arXiv preprint arXiv:2407.03685.

Abstract: There is an increasing need for domain-specific reasoning in modern compilers. This has fueled the use of tailored intermediate representations (IRs) based on static single assignment (SSA), like in the MLIR compiler framework. Interactive theorem provers (ITPs) provide strong guarantees for the end-to-end verification of compilers (e.g., CompCert). However, modern compilers and their IRs evolve at a rate that makes proof engineering alongside them prohibitively expensive. Nevertheless, well-scoped push-button automated verification tools such as the Alive peephole verifier for LLVM-IR gained recognition in domains where SMT solvers offer efficient (semi) decision procedures. In this paper, we aim to combine the convenience of automation with the versatility of ITPs for verifying peephole rewrites across domain-specific IRs. We formalize a core calculus for SSA-based IRs that is generic over the IR and covers so-called regions (nested scoping used by many domain-specific IRs in the MLIR ecosystem). Our mechanization in the Lean proof assistant provides a user-friendly frontend for translating MLIR syntax into our calculus. We provide scaffolding for defining and verifying peephole rewrites, offering tactics to eliminate the abstraction overhead of our SSA calculus. We prove correctness theorems about peephole rewriting, as well as two classical program transformations. To evaluate our framework, we consider three use cases from the MLIR ecosystem that cover different levels of abstractions: (1) bitvector rewrites from LLVM, (2) structured control flow, and (3) fully homomorphic encryption. We envision that our mechanization provides a foundation for formally verified rewrites on new domain-specific IRs.

30. HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Van Delm, J., Vandersteegen, M., Burrello, A., Sarda, G.M., Conti, F., Pagliari, D.J., Benini, L. and Verhelst, M., 2023, July. HTVM: Efficient neural network deployment on heterogeneous TinyML platforms. In 2023 60th ACM/IEEE Design Automation Conference (DAC) (pp. 1-6). IEEE.

Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM – a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf™ Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.

31. Decoupled Access-Execute Enabled DVFS for TinyML Deployments on STM32 Microcontrollers

Alvanaki, E.L., Katsaragakis, M., Masouros, D., Xydis, S. and Soudris, D., 2024, March. Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: Over the last years, the number of Machine Learning (ML) inference applications deployed on the Edge has been rapidly increasing. Recent Internet of Things (IoT) devices and microcontrollers (MCUs) become more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clocking exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of the power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with power-saving modes of the STM32 MCUs, indicating that we can achieve up to 25.2% less energy consumption for varying QoS levels.

32. SECOMP: Formally Secure Compilation of Compartmentalized C Programs

Thibault, J., Blanco, R., Lee, D., Argo, S., Azevedo de Amorim, A., Georges, A.L., Hriţcu, C. and Tolmach, A., 2024, December. SECOMP: Formally Secure Compilation of Compartmentalized C Programs. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (pp. 1061-1075).

Abstract: Undefined behavior in C often causes devastating security vulnerabilities. One practical mitigation is compartmentalization, which allows developers to structure large programs into mutually distrustful compartments with clearly specified privileges and interactions. In this paper we introduce SECOMP, a compiler for compartmentalized C code that comes with machine-checked proofs guaranteeing that the scope of undefined behavior is restricted to the compartments that encounter it and become dynamically compromised. These guarantees are formalized as the preservation of safety properties against adversarial contexts, a secure compilation criterion similar to full abstraction, and this is the first time such a strong criterion is proven for a mainstream programming language. To achieve this we extend the languages of the CompCert verified C compiler with isolated compartments that can only interact via procedure calls and returns, as specified by cross-compartment interfaces. We adapt the passes and optimizations of CompCert as well as their correctness proofs to this compartment-aware setting. We then use compiler correctness as an ingredient in a larger secure compilation proof that involves several proof engineering novelties, needed to scale formally secure compilation up to a C compiler.

33. xDSL: Sidekick Compilation for SSA-Based Compilers

Fehr, M., Weber, M., Ulmann, C., Lopoukhine, A., Lücke, M.P., Degioanni, T., Vasiladiotis, C., Steuwer, M. and Grosser, T., 2025, March. xDSL: Sidekick Compilation for SSA-Based Compilers. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (pp. 179-192).

Abstract: Traditionally, compiler researchers either conduct experiments within an existing production compiler or develop their own prototype compiler; both options come with trade-offs. On one hand, prototyping in a production compiler can be cumbersome, as they are often optimized for program compilation speed at the expense of software simplicity and development speed. On the other hand, the transition from a prototype compiler to production requires significant engineering work. To bridge this gap, we introduce the concept of sidekick compiler frameworks, an approach that uses multiple frameworks that interoperate with each other by leveraging textual interchange formats and declarative descriptions of abstractions. Each such compiler framework is specialized for specific use cases, such as performance or prototyping. Abstractions are by design shared across frameworks, simplifying the transition from prototyping to production. We demonstrate this idea with xDSL, a sidekick for MLIR focused on prototyping and teaching. xDSL interoperates with MLIR through a shared textual IR and the exchange of IRs through an IR Definition Language. The benefits of sidekick compiler frameworks are evaluated by showing on three use cases how xDSL impacts their development: teaching, DSL compilation, and rewrite system prototyping. We also investigate the trade-offs that xDSL offers, and demonstrate how we simplify the transition between frameworks using the IRDL dialect. With sidekick compilation, we envision a future in which engineers minimize the cost of development by choosing a framework built for their immediate needs, and later transitioning to production with minimal overhead.

34. A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions

Lopoukhine, A., Ficarelli, F., Vasiladiotis, C., Lydike, A., Van Delm, J., Dutilleul, A., Benini, L., Verhelst, M. and Grosser, T., 2025, March. A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions. In Proceedings of the 23rd ACM/IEEE International Symposium on Code Generation and Optimization (pp. 163-178).

Abstract: High-performance micro-kernels must fully exploit today’s diverse and specialized hardware to deliver peak performance to deep neural networks (DNNs). While higher-level optimizations for DNNs are offered by numerous compilers (e.g., MLIR, TVM, OpenXLA), performance-critical micro-kernels are left to specialized code generators or handwritten assembly. Even though widely-adopted compilers (e.g., LLVM, GCC) offer tuned backends, their CPU-focused input abstraction, unstructured intermediate representation (IR) and general-purpose best-effort design inhibit tailored code generation for innovative hardware. We think it is time to widen the classical hourglass backend and embrace progressive lowering across a diverse set of structured abstractions to bring domain-specific code generation to compiler backends. We demonstrate this concept by implementing a custom backend for a RISC-V-based accelerator with hardware loops and streaming registers, leveraging knowledge about the hardware at levels of abstraction that match its custom instruction set architecture (ISA). We use incremental register allocation over structured IRs, while dropping classical spilling heuristics, and show up to 90% floating-point unit (FPU) utilization across key DNN kernels. By breaking the backend hourglass model, we reopen the path from domain-specific abstractions to specialized hardware.

35. Auditory Anomaly Detection using Recurrent Spiking Neural Networks

S. Kshirasagar, B. Cramer, A. Guntoro and C. Mayr, “Auditory Anomaly Detection using Recurrent Spiking Neural Networks,” 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, United Arab Emirates, 2024, pp. 278-281

Abstract: Brain-inspired networks promise capabilities of achieving high computational efficacy with low energy footprint. Auditory perception systems are resource constrained when deployed on low power edge AI devices. Hence, we employ spiking neural networks (SNNs) for auditory scene analysis, specifically targeting temporal detection of anomaly cues, particularly siren sounds. We generate artificial audio sequences from a publicly available dataset containing various siren and noise sounds. We train small-scale recurrent SNNs with leaky-integrate-and-fire (LIF) neurons in the hidden layer and achieve accurate predictions with precious few parameters. Further, we provide a baseline for conventional RNNs of similar network topology on the same task. With comparable accuracy, a reduced parameter count, and sparse spiking activity in the hidden layer in contrast to conventional methods, we found the bio-inspired approach realized using SNNs to be promising in solving the time-series auditory anomaly detection task.

36. Multi-Partner Project: Securing Future Edge-AI Processors in Practice

S. Argo et al., “Multi-Partner Project: Securing Future Edge-AI Processors in Practice (CONVOLVE),” 2025 Design, Automation & Test in Europe Conference (DATE), Lyon, France, 2025, pp. 1-7

Abstract: Artificial Intelligence (AI) has had a profound impact on our contemporary society, and it is indisputable that it will continue to play a significant role in the future. To further enhance AI experience and performance, a transition from large-scale server applications towards AI-powered edge devices is inevitable. In fact, current projections indicate that the market for Smart Edge Processors (SEPs) will grow beyond 70 Billion USD by 2026 [1]. Such a shift comes with major challenges, as these devices have limited computing and energy resources yet need to be highly performant. Additionally, security mechanisms need to be implemented to protect against diverse attack vectors as attackers now have physical access to the device. Besides cryptographic keys, Intellectual Property (IP), including neural network weights, may also be potential targets. The CONVOLVE [2] project (currently in its intermediate stage) follows a holistic approach to address these challenges and establish the EU in a leading position in embedded, ultra-low-power and secure processors for edge computing. It encompasses novel hardware technologies, end-to-end integrated workflows, and a security-by-design approach. This paper highlights the security aspects of future edge-AI processors by illustrating challenges encountered in CONVOLVE, the solutions we pursue including some early results, and directions for future research.

37. Scalable Speech Enhancement With Dynamic Channel Pruning

R. Miccini, C. Laroche, T. Piechowiak and L. Pezzarossa, “Scalable Speech Enhancement With Dynamic Channel Pruning,” ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 2025, pp. 1-5

Abstract: Speech Enhancement (SE) is essential for improving productivity in remote collaborative environments. Although deep learning models are highly effective at SE, their computational demands make them impractical for embedded systems. Furthermore, acoustic conditions can change significantly in terms of difficulty, whereas neural networks are usually static with regard to the amount of computation performed. To this end, we introduce Dynamic Channel Pruning (DynCP) to the audio domain for the first time and apply it to a custom convolutional architecture for SE. Our approach works by identifying unnecessary convolutional channels at runtime and saving computational resources by neither computing the activations for these channels nor retrieving their filters. When trained to only use 25% of channels, we save up to 32.4% of MACs while only causing a 0.32% drop in PESQ. Thus, DynCP offers a promising path toward deploying larger and more powerful SE solutions on resource-constrained devices.

38. NeuralCasting: A Front-End Compilation Infrastructure for Neural Networks

A. Cerioli et al., “NeuralCasting: A Front-End Compilation Infrastructure for Neural Networks,” 2024 11th International Conference on Internet of Things: Systems, Management and Security (IOTSMS), Malmö, Sweden, 2024, pp. 161-168

Abstract: This study presents the development of NeuralCasting, a front-end compiler framework capable of converting (casting) neural networks encoded in the ONNX format to optimized C code. The primary objective is to enable the compilation of neural networks depending only on standard C libraries, thus eliminating the need for a separate inference engine, such as ONNX runtime. Furthermore, this feature allows the generation of C code suitable both for bare metal embedded devices and other resource-constrained devices, extending model applicability to a plethora of hardware targets. The framework addresses critical applications such as real-time audio processing, especially regarding latency constraints. As an example use case, we compile three different models: a Multi-Layer Perceptron (MLP) composed of fully connected layers, a ResNet cell for image recognition, and an NSNet 2 for speech enhancement and noise suppression. We analyze and compare the performance with other widely used frameworks, such as ONNX Runtime and PyTorch. Our findings indicate that the developed compiler successfully generates optimized C code that meets real-time processing requirements for latency-sensitive applications, outperforming ONNX Runtime and PyTorch and reaching a speedup close to x10 for small-sized MLP models, which are suitable for deployment on edge devices.

39. Adaptive Slimming for Scalable and Efficient Speech Enhancement

R. Miccini, M. Kim, C. Laroche, L. Pezzarossa and P. Smaragdis, “Adaptive Slimming for Scalable and Efficient Speech Enhancement,” 2025 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Tahoe City, CA, USA, 2025, pp. 1-5

Abstract: Speech enhancement (SE) enables robust speech recognition, real-time communication, hearing aids, and other applications where speech quality is crucial. However, deploying such systems on resource-constrained devices involves choosing a static trade-off between performance and computational efficiency. In this paper, we introduce dynamic slimming to DEMUCS, a popular SE architecture, making it scalable and input-adaptive. Slimming lets the model operate at different utilization factors (UF), each corresponding to a different performance/efficiency trade-off, effectively mimicking multiple model sizes without the extra storage costs. In addition, a router subnet, trained end-to-end with the backbone, determines the optimal UF for the current input. Thus, the system saves resources by adaptively selecting smaller UFs when additional complexity is unnecessary. We show that our solution is Pareto-optimal against individual UFs, confirming the benefits of dynamic routing. When training the proposed dynamically-slimmable model to use 10 % of its capacity on average, we obtain the same or better speech quality as the equivalent static 25 % utilization while reducing MACs by 29 %.

40. Efficient Streaming Speech Quality Prediction with Spiking Neural Networks

Nilsson, M., Miccini, R., Rossbroich, J., Laroche, C., Piechowiak, T., Zenke, F. (2025) Efficient Streaming Speech Quality Prediction with Spiking Neural Networks. Proc. Interspeech 2025, 5423-5427

Abstract: As speech processing systems become more ubiquitous, the need for real-time, efficient speech quality prediction (SQP) is growing. Conventional artificial neural networks (ANNs) offer strong prediction performance but can be computationally demanding, which limits their deployment on mobile and edge devices. Spiking neural networks (SNNs) present a promising alternative for ultra-low-power, streaming inference due to their sparse activity and event-driven processing. However, their potential for SQP remains largely unexplored. This article introduces deep convolutional SNNs for SQP and evaluates their performance against state-of-the-art ANN models. Our results show that SNNs achieve comparable accuracy while significantly reducing computational cost. These findings highlight the potential of SNNs to enable real-time, energy-efficient SQP in resource-constrained settings.

41. DREAM-CIM: A Digital SRAM-Based CIM Accelerator for Energy- and Area-Efficient Edge AI

A. El Arrassi et al., “DREAM-CIM: A Digital SRAM-Based CIM Accelerator for Energy- and Area-Efficient Edge AI,” in IEEE Transactions on Circuits and Systems for Artificial Intelligence, vol. 2, no. 3, pp. 211-221, Sept. 2025

Abstract: With the rise of energy-constrained smart edge applications, there is a pressing need for energy-efficient computing engines that process generated data locally, at least for small and medium-sized applications. To address this issue, this paper proposes DREAM-CIM, a digital SRAM-based computation-in-memory (CIM) accelerator. It targets an energy- and area-efficient implementation of the multiply-and-accumulate (MAC) operation, which is the core operation of neural networks. The accelerator is based on a multi-sub-array macro to increase parallelism, integrates multiplication operations within the memory cells such that they are executed while reading the cells, makes use of pipelining to further optimize the throughput of the MAC operations, and gets rid of the expensive adder-tree structures commonly used in state-of-the-art (SOTA) digital CIM solutions by replacing them with a custom accumulation circuit to reduce power and area. The SPICE simulation results of the DREAM-CIM accelerator show an energy efficiency of 5097 TOPS/W (normalized to a 1-bit × 1-bit MAC operation) and an area efficiency of 3854 TOPS/mm² using a 22 nm technology node. The obtained circuit-level results were fed into a Python-based system-level simulator to benchmark the system architecture using two applications, i.e., image classification (using the MNIST and CIFAR-10 datasets on the LeNet5 and ResNet-20 models) and object detection (using the COCO dataset on the YOLOv6 model). The system-level results show that DREAM-CIM can achieve an energy efficiency of 0.1 mJ, 0.2 mJ, and 11.02 mJ per inference for the MNIST, YOLOv6, and CIFAR-10 datasets, respectively, while maintaining SOTA accuracy.

42. OpenGeMM: A Highly-Efficient GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling

Yi, X., Antonio, R., Dumoulin, J., Sun, J., Van Delm, J., Pereira Paim, G. and Verhelst, M., 2025, January. OpenGeMM: A Highly-Efficient GeMM Accelerator Generator with Lightweight RISC-V Control and Tight Memory Coupling. In Proceedings of the 30th Asia and South Pacific Design Automation Conference (pp. 1055-1061)

Abstract: Deep neural networks (DNNs) face significant challenges when deployed on resource-constrained extreme edge devices due to their computational and data-intensive nature. While standalone accelerators tailored for specific application scenarios suffer from inflexible control and limited programmability, generic hardware acceleration platforms coupled with RISC-V CPUs can enable high reusability and flexibility, yet typically at the expense of system-level efficiency and low utilization.

To fill this gap, we propose OpenGeMM, an open-source acceleration platform, jointly demonstrating high efficiency and utilization, as well as ease of configurability and programmability. OpenGeMM encompasses a parameterized Chisel-coded GeMM accelerator, a lightweight RISC-V processor, and a tightly coupled multi-banked scratchpad memory. The GeMM core utilization and system efficiency are boosted through three mechanisms: configuration pre-loading, input pre-fetching with output buffering, and programmable strided memory access. Experimental results show that OpenGeMM can consistently achieve hardware utilization ranging from 81.89% to 99.34% across diverse CNN and Transformer workloads. Compared to the SotA open-source Gemmini accelerator, OpenGeMM demonstrates a 3.58× to 16.40× speedup on normalized throughput across a wide variety of GeMM workloads, while achieving 4.68 TOPS/W system efficiency.

43. DataMaestro: A Versatile and Efficient Data Streaming Engine Bringing Decoupled Memory Access To Dataflow Accelerators

X. Yi, Y. Deng, R. Antonio, F. Kong, G. Paim and M. Verhelst, “DataMaestro: A Versatile and Efficient Data Streaming Engine Bringing Decoupled Memory Access To Dataflow Accelerators,” 2025 62nd ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 2025, pp. 1-7

Abstract: Deep Neural Networks (DNNs) have achieved remarkable success across various intelligent tasks but encounter performance and energy challenges in inference execution due to data movement bottlenecks. We introduce DataMaestro, a versatile and efficient data streaming unit that brings the decoupled access/execute architecture to DNN dataflow accelerators to address this issue. DataMaestro supports flexible and programmable access patterns to accommodate diverse workload types and dataflows, incorporates fine-grained prefetch and addressing mode switching to mitigate bank conflicts, and enables customizable on-the-fly data manipulation to reduce memory footprints and access counts. We integrate five DataMaestros with a Tensor Core-like GeMM accelerator and a Quantization accelerator into a RISC-V host system for evaluation. The FPGA prototype and VLSI synthesis results demonstrate that DataMaestro helps the GeMM core achieve nearly 100% utilization, which is 1.05× to 21.39× better than state-of-the-art solutions, while minimizing area and energy consumption to merely 6.43% and 15.06% of the total system.

44. Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators

R. Geens, M. Shi, A. Symons, C. Fang and M. Verhelst, “Energy Cost Modelling for Optimizing Large Language Model Inference on Hardware Accelerators,” 2024 IEEE 37th International System-on-Chip Conference (SOCC), Dresden, Germany, 2024, pp. 1-6

Abstract: The rise of Large Language Models (LLMs) has significantly escalated the demand for efficient LLM inference, primarily fulfilled through cloud-based GPU computing. This approach, while effective, is associated with high energy consumption resulting in large operating expenses and considerable carbon footprints. In the meantime, growing privacy concerns advocate for inference on edge devices, which are constrained by a limited battery capacity. Both cloud and edge scenarios necessitate energy-efficient LLM inference strategies. This paper addresses the urgent need for energy-efficient inference by proposing an open-source framework designed to model LLM workloads on dedicated accelerators. Our framework facilitates early identification of energy bottlenecks through rapid modeling of the execution efficiency of a wide range of LLMs on diverse hardware architectures. Key innovations include a PyTorch-based generalized LLM template to easily generate custom workload graphs, extensions of the ZigZag design space exploration framework and techniques to significantly speed up simulation time at a negligible loss of accuracy. Using a representative hardware architecture, we conduct three case studies to reveal critical energy bottlenecks in Llama2-7B inference, revealing that 1) memory-bound computing in the decode stage is detrimental not only for the latency, but also for the energy cost; 2) aggressive weight-only quantization can reduce the energy cost by 4.6× and shift the bottleneck from weight fetching to the attention mechanism; 3) in edge scenarios, the relative energy cost of the prefill stage is more significant, encouraging efforts to optimize both prefill and decode stage. Our framework is available open-source at github.com/KULeuven-MICAS/zigzag-llm.

45. Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators

A. Symons, L. Mei, S. Colleman, P. Houshmand, S. Karl and M. Verhelst, “Stream: Design Space Exploration of Layer-Fused DNNs on Heterogeneous Dataflow Accelerators,” in IEEE Transactions on Computers, vol. 74, no. 1, pp. 237-249, Jan. 2025

Abstract: As the landscape of deep neural networks evolves, heterogeneous dataflow accelerators, in the form of multi-core architectures or chiplet-based designs, promise more flexibility and higher inference performance through scalability. So far, these systems exploit the increased parallelism by coarsely mapping a single layer at a time across cores, which incurs frequent costly off-chip memory accesses, or by pipelining batches of inputs, which falls short in meeting the demands of latency-critical applications. To alleviate these bottlenecks, this work explores a new fine-grain mapping paradigm, referred to as layer fusion, on heterogeneous dataflow accelerators through a novel design space exploration framework called Stream. Stream captures a wide variety of heterogeneous dataflow architectures and mapping granularities, and implements a memory and communication-aware latency and energy analysis validated with three distinct state-of-the-art hardware implementations. As such, it facilitates a holistic exploration of architecture and mapping, by strategically allocating the workload through constraint optimization. The findings demonstrate that the integration of layer fusion with heterogeneous dataflow accelerators yields up to 2.2× lower energy-delay product in inference efficiency, addressing both energy consumption and latency concerns. The framework is available open-source at: github.com/kuleuven-micas/stream.

46. A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning Applications

P. Chaidos, G. Armeniakos, S. Xydis and D. Soudris, “A Bespoke Design Approach to Low-Power Printed Microprocessors for Machine Learning Applications,” 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, United Kingdom, 2025

Abstract: Printed electronics have gained significant traction in recent years, presenting a viable path to integrating computing into everyday items, from disposable products to low-cost healthcare. However, the adoption of computing in these domains is hindered by strict area and power constraints, limiting the effectiveness of general-purpose microprocessors. This paper proposes a bespoke microprocessor design approach to address these challenges, by tailoring the design to specific applications and eliminating unnecessary logic. Targeting machine learning applications, we further optimize core operations by integrating a SIMD MAC unit supporting 4 precision configurations that boost the efficiency of microprocessors. Our evaluation across 6 ML models and the large-scale Zero-Riscy core, shows that our methodology can achieve improvements of 22.2%, 23.6%, and 33.79% in area, power, and speed, respectively, without compromising accuracy. Against state-of-the-art printed processors, our approach can still offer significant speedups, but along with some accuracy degradation. This work explores how such trade-offs can enable low-power printed microprocessors for diverse ML applications.

47. EFLOP: a sparsity-aware metric for evaluating computational cost in spiking and non-spiking neural networks

Narduzzi, S., Zenke, F., Liu, S.C. and Dunbar, L.A., 2025. EFLOP: A Sparsity-Aware Metric for Evaluating Computational Cost in Spiking and Non-Spiking Neural Networks. Neuromorphic Computing and Engineering.

Abstract: Deploying energy-efficient deep neural networks on energy-constrained edge devices is an important research topic in both machine learning and circuit design communities. Both artificial neural networks (ANNs) and spiking neural networks (SNNs) have been proposed as candidates for these tasks. In particular, SNNs are considered energy-efficient because they leverage temporal sparsity in their outputs. However, existing computational frameworks fail to accurately estimate the cost of running sparse networks on modern time-stepped hardware, which exploits sparsity by skipping zero-valued operations. Meanwhile, weight sparsity-aware training remains underexplored for SNNs and lacks systematic benchmarking against optimized ANNs, making fair comparisons between the two paradigms difficult. To bridge this gap, we introduce the effective floating-point operation (EFLOP), a metric that accounts for the sparse operations during pre-activation updates of both ANNs and SNNs. Applying weight sparsity-aware training to both SNNs and ANNs, we achieve up to 8.9× reduction in EFLOPs for gated recurrent unit models and 3.6× for LIF models by sparsifying weights by 80%, without sacrificing accuracy on the Spiking Heidelberg Digits and Spiking Speech Command datasets. These findings highlight the critical role of network sparsity in designing energy-efficient neural networks and establish EFLOPs as a robust framework for cross-paradigm comparisons.

48. Decoding Finger Velocity from Cortical Spike Trains with Recurrent Spiking Neural Networks

T. Liu, J. Gygax, J. Rossbroich, Y. Chua, S. Zhang and F. Zenke, “Decoding Finger Velocity from Cortical Spike Trains with Recurrent Spiking Neural Networks,” 2024 IEEE Biomedical Circuits and Systems Conference (BioCAS), Xi’an, China, 2024, pp. 1-5

Abstract: Invasive cortical brain-machine interfaces (BMIs) can significantly improve the life quality of motor-impaired patients. Nonetheless, externally mounted pedestals pose an infection risk, which calls for fully implanted systems. Such systems, however, must meet strict latency and energy constraints while providing reliable decoding performance. While recurrent spiking neural networks (RSNNs) are ideally suited for ultra-low-power, low-latency processing on neuromorphic hardware, it is unclear whether they meet the above requirements. To address this question, we trained RSNNs to decode finger velocity from cortical spike trains (CSTs) of two macaque monkeys. First, we found that a large RSNN model outperformed existing feed-forward spiking neural networks (SNNs) and artificial neural networks (ANNs) in terms of their decoding accuracy. We next developed a tiny RSNN with a smaller memory footprint, low firing rates, and sparse connectivity. Despite its reduced computational requirements, the resulting model performed substantially better than existing SNN and ANN decoders. Our results thus demonstrate that RSNNs offer competitive CST decoding performance under tight resource constraints and are promising candidates for fully implanted ultra-low-power BMIs with the potential to revolutionize patient care.

49. Implicit variance regularization in non-contrastive SSL

Srinath Halvagal, M., Laborieux, A. and Zenke, F., 2023. Implicit variance regularization in non-contrastive SSL. Advances in Neural Information Processing Systems, 36, pp.63409-63436

Abstract: Non-contrastive SSL methods like BYOL and SimSiam rely on asymmetric predictor networks to avoid representational collapse without negative samples. Yet, how predictor networks facilitate stable learning is not fully understood. While previous theoretical analyses assumed Euclidean losses, most practical implementations rely on cosine similarity. To gain further theoretical insight into non-contrastive SSL, we analytically study learning dynamics in conjunction with Euclidean and cosine similarity in the eigenspace of closed-form linear predictor networks. We show that both avoid collapse through implicit variance regularization albeit through different dynamical mechanisms. Moreover, we find that the eigenvalues act as effective learning rate multipliers and propose a family of isotropic loss functions (IsoLoss) that equalize convergence rates across eigenmodes. Empirically, IsoLoss speeds up the initial learning dynamics and increases robustness, thereby allowing us to dispense with the EMA target network typically used with non-contrastive methods. Our analysis sheds light on the variance regularization mechanisms of non-contrastive SSL and lays the theoretical grounds for crafting novel loss functions that shape the learning dynamics of the predictor’s spectrum.

50. Time-Predictable Deep Noise Suppression on an Edge Device

A. Cerioli, T. B. Strøm, C. Laroche, T. Piechowiak, L. Pezzarossa and M. Schoeberl, “Time-Predictable Deep Noise Suppression on an Edge Device,” 2025 28th International Symposium on Real-Time Distributed Computing (ISORC), Toulouse, France, 2025, pp. 380-386

Abstract: Hearing aids and remote conference systems benefit from noise reduction. Current noise reduction approaches include machine-learning models that run on edge devices like hearing aids, AirPods, or headsets. Although not a safety-critical application, audio processing is a real-time application. We present a real-time enabled solution of speech enhancement with generation of C code for embedded devices, executing on a real-time processor, and analyzing the worst-case execution time for that application. Using the Patmos processor and the Platin WCET analysis tool, we can guarantee that we process noise canceling within the given deadline.

51. Enabling Automatic Compiler-Driven Vectorization of Transformers

Alladi, S., Ros, A. and Jimborean, A., 2026, January. Enabling Automatic Compiler-Driven Vectorization of Transformers. In 2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (pp. 319-333). IEEE.

Abstract: Compiling neural networks and Transformers for edge devices faces significant challenges due to resource constraints and the reliance on manually optimized operations for performance among others. These limitations hinder the scalability and portability of neural networks on resource-constrained platforms, such as edge devices utilizing the RISC-V ecosystem. Addressing these issues, this paper introduces innovative techniques to overcome the inefficiencies of current compilation methods and reduce dependence on manual optimizations. This work proposes a novel compilation flow, ONNX-MLIR-LLVM (OML), which leverages MLIR and LLVM IR to enable automatic optimizations and generate stand-alone RISC-V binaries. Through comprehensive analysis, we identify key barriers preventing the auto-vectorizer from handling vectorization-friendly operators, particularly reduction operations and vectorization-unfriendly data layouts. We address these through a versatile MLIR reduction detection pass and a compile-time transpose pass, respectively. Our automatic transformations (OML-vect) unlock the capabilities of the MLIR affine super-vectorizer, reducing reliance on manual vectorization. Evaluations on both x86 and RISC-V across eight neural networks and Transformer models demonstrate that automatic vectorization via OML-vect achieves, on average, 5% and 59% on x86 and RISC-V, respectively, compared to baseline (manually vectorized libraries), offering an efficient and portable solution for edge device deployments.

52. Compiler-Assisted Instruction Fusion

Reddy, R.R., Singh, S., Perais, A., Ros, A. and Jimborean, A., 2026, January. Compiler-Assisted Instruction Fusion. In 2026 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (pp. 726-739). IEEE.

Abstract: Hardware instruction fusion combines multiple architectural instructions into a single operation, improving performance by freeing up resources. While fusion typically involves consecutive instructions, there are proposals to fuse non-consecutive instructions to maximize potential. However, such approaches require complex and costly hardware to predict and either validate fusion or unfuse, which significantly increases the cost of fusion. In this work, we propose a compiler technique, CAIF (Compiler-Assisted Instruction Fusion), for fusion-aware instruction scheduling. CAIF identifies fusible but non-consecutive memory operations and reorders eligible pairs of instructions such that they appear consecutively in the instruction stream. Our experiments demonstrate that for neural network workloads, a hardware that only fuses consecutive instructions obtains 1.2% average performance improvements over a no-fusion baseline when applications are compiled with a standard compiler and 19.6% when compiled with CAIF. In addition, when non-consecutive hardware fusion (Helios) is enabled, CAIF boosts performance from 6.6% to 20.3%. Moreover, CAIF can effectively handle the statically challenging general-purpose application and boost performance on SPEC CPU 2017 from 2.4% to 6.4%, and from 14.4% to 17.7%, respectively, on the hardware configurations mentioned above.

53. LOKI: a 0.266 pJ/SOP Digital SNN Accelerator with Multi-Cycle Clock-Gated SRAM in 22nm

Luiken, R., Pes, L., Gomony, M.D. and Stuijk, S., 2026, January. LOKI: a 0.266 pJ/SOP Digital SNN Accelerator with Multi-Cycle Clock-Gated SRAM in 22 nm. In 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC) (pp. 133-139). IEEE.

Abstract: Bio-inspired sensors like Dynamic Vision Sensors (DVS) and silicon cochleas are often combined with Spiking Neural Networks (SNNs), enabling efficient, event-driven processing similar to biological sensory systems. To realize the low-power constraints of the edge, the SNN should run on a hardware architecture that can exploit the sparse nature of the spikes. In this paper, we introduce LOKI, a digital architecture for Fully-Connected (FC) SNNs. By using Multi-Cycle Clock-Gated (MCCG) SRAMs, LOKI can operate at 0.59 V, while running at a clock frequency of 667 MHz. At full throughput, LOKI only consumes 0.266 pJ/SOP. We evaluate LOKI on both the Neuromorphic MNIST (N-MNIST) and the Keyword Spotting (KWS) tasks, achieving 98.0% accuracy at 119.8 nJ/inference and 93.0% accuracy at 546.5 nJ/inference, respectively.

54. Mixed-Precision Neural Networks on RISC-V Cores: ISA Extensions for Multi-Pumped Soft SIMD Operations

Armeniakos, G., Maras, A., Xydis, S. and Soudris, D., 2024, October. Mixed-Precision Neural Networks on RISC-V Cores: ISA Extensions for Multi-Pumped Soft SIMD Operations. In 2024 ACM/IEEE International Conference On Computer Aided Design (ICCAD) (pp. 1-9). IEEE.

Abstract: Recent advancements in quantization and mixed-precision approaches offer substantial opportunities to improve the speed and energy efficiency of Neural Networks (NN). Research has shown that individual parameters with varying low precision can attain accuracies comparable to full-precision counterparts. However, modern embedded microprocessors provide very limited support for mixed-precision NNs regarding both Instruction Set Architecture (ISA) extensions and their hardware design for efficient execution of mixed-precision operations, i.e., introducing several performance bottlenecks due to numerous instructions for data packing and unpacking, arithmetic unit under-utilization, etc. In this work, we bring together, for the first time, ISA extensions tailored to mixed-precision hardware optimizations, targeting energy-efficient DNN inference on leading RISC-V CPU architectures. We introduce a hardware-software co-design framework that supports cooperative hardware design, mixed-precision quantization, ISA extensions, and cycle-accurate emulations. At the hardware level, we expand the ALU in our micro-architecture for configurable mixed-precision arithmetic operations and implement multi-pumping to reduce execution latency, with soft SIMD optimization for 2-bit operations. At the ISA level, we encode three distinct MAC instructions extending the RISC-V ISA, each for different mixed-precision modes, and expose them to the compiler. Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 15× energy reduction for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores.

55. MaRVIn: A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration

Armeniakos, G., Maras, A., Xydis, S. and Soudris, D., 2025. MaRVIn: A Cross-Layer Mixed-Precision RISC-V Framework for DNN Inference, from ISA Extension to Hardware Acceleration. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

Abstract: The evolution of quantization and mixed-precision techniques has unlocked new possibilities for enhancing the speed and energy efficiency of Neural Networks (NNs). Several recent studies indicate that adapting precision levels across different parameters can maintain accuracy comparable to full-precision models while significantly reducing computational demands. However, existing embedded microprocessors lack sufficient architectural support for efficiently executing mixed-precision NNs, both in terms of ISA extensions and hardware design. This limitation results in inefficiencies such as excessive data packing/unpacking and underutilized arithmetic units, leading to performance bottlenecks. In this work, to address these challenges, we propose novel ISA extensions and the micro-architecture implementation specifically designed to optimize mixed-precision execution, enabling energy-efficient deep learning inference on RISC-V architectures. We introduce MaRVIn, a cross-layer hardware-software co-design framework that enhances power efficiency and performance through a combination of hardware improvements, mixed-precision quantization, ISA-level optimizations, and cycle-accurate emulation. At the hardware level, we enhance the ALU with configurable mixed-precision arithmetic (2-, 4-, and 8-bit) for weights and/or activations. To further improve execution efficiency, we employ multi-pumping to reduce execution latency and implement soft SIMD for efficient 2-bit operations. We also extend ISA to support these mixed-precision operations. At the software level, we integrate a pruning-aware fine-tuning method to optimize model compression. Additionally, we introduce a greedy-based design space exploration (DSE) approach to efficiently search for Pareto-optimal mixed-quantized models. Finally, we incorporate voltage scaling to boost the power efficiency of our system. 
Our extensive experimental evaluation over widely used DNNs and datasets, such as CIFAR10 and ImageNet, demonstrates that our framework can achieve, on average, 17.6× speedup for less than 1% accuracy loss and outperforms the ISA-agnostic state-of-the-art RISC-V cores, delivering up to 1.8 TOPs/W.

56. Hardware-Centric Analysis of DeepSeek’s Multi-Head Latent Attention

Geens, R. and Verhelst, M., 2025. Hardware‐Centric Analysis of DeepSeek’s Multi‐Head Latent Attention. Electronics Letters, 61(1), p.e70504.

Abstract: Multi-head latent attention (MLA), introduced in DeepSeek-V2, improves the efficiency of large language models by projecting query, key and value tensors into a compact latent space. This architectural change reduces the KV-cache size and significantly lowers memory bandwidth demands, particularly in the autoregressive decode phase. This letter presents the first hardware-centric analysis of MLA, comparing it to conventional multi-head attention (MHA) and evaluating its implications for accelerator performance. We identify two alternative execution schemes of MLA (reusing or recomputing the latent projection matrices), which offer distinct trade-offs between compute and memory access. Using the Stream design space exploration framework, we model their throughput and energy cost across a range of hardware platforms and find that MLA can shift attention workloads toward the compute-bound regime. Our results show that MLA not only reduces bandwidth usage but also enables adaptable execution strategies aligned with hardware constraints. Compared to MHA, it provides more stable and efficient performance, particularly on bandwidth-limited hardware platforms. These findings emphasize MLA’s relevance as a co-design opportunity for future AI accelerators.

57. TreeGRNG: Binary Tree Gaussian Random Number Generator for Efficient Probabilistic AI Hardware

Crols, J., Paim, G., Zhao, S. and Verhelst, M., 2024, March. TreeGRNG: Binary Tree Gaussian Random Number Generator for Efficient Probabilistic AI Hardware. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: Bayesian Neural Networks (BNNs) offer opportunities for greatly enhancing the trustworthiness of conventional neural networks by monitoring the uncertainties in decision-making. A significant drawback for BNN inference at the extreme edge, however, is the imperative need to incorporate Gaussian Random Number Generators (GRNG) within each neuron. State-of-the-art GRNG algorithms heavily depend on multiple arithmetic operations and the use of extensive look-up tables, posing significant implementation challenges for ultra-low power hardware implementations. To overcome this, this paper presents an innovative binary tree random number generator (TreeGRNG) allowing the use of ultra-low-cost constant comparators instead of arithmetic units. We further enhance the TreeGRNG proposal with a set of hardware-aware optimizations exploiting the Gaussian properties. The optimized TreeGRNG surpasses the State-of-the-Art (SoTA) in terms of distribution accuracy while achieving a 3.7× reduction in energy per sample and boosting the throughput per unit area by 5.8×. Moreover, our TreeGRNG proposal possesses a distinct advantage over the current SoTA in terms of flexibility, as it easily enables designers to adjust the shape of the sampled probability distribution, extending beyond the capabilities of traditional GRNGs, opening the horizon towards future probabilistic AI designs. The TreeGRNG design is available open-source at https://github.com/KULeuven-MICAS/TreeGRNG.
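To make the binary-tree sampling idea in this abstract more concrete, the following is a conceptual software sketch, not the published hardware design. It assumes a complete binary tree over the range [-4, 4], where each node stores one constant (the conditional probability of descending left) and a sample is produced by comparing a uniform random number against that constant at every level. The function names, the range, and the tree depth are all illustrative assumptions.

```python
import math
import random

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def build_thresholds(depth, lo=-4.0, hi=4.0):
    """Precompute one constant per internal node: P(go left | reached node)."""
    thresholds = {}

    def fill(node, a, b, level):
        if level == depth:
            return
        mid = 0.5 * (a + b)
        thresholds[node] = (phi(mid) - phi(a)) / (phi(b) - phi(a))
        fill(2 * node, a, mid, level + 1)       # left child covers [a, mid]
        fill(2 * node + 1, mid, b, level + 1)   # right child covers [mid, b]

    fill(1, lo, hi, 0)
    return thresholds

def sample(thresholds, depth, rng, lo=-4.0, hi=4.0):
    """Walk the tree using only comparisons against stored constants."""
    node, a, b = 1, lo, hi
    for _ in range(depth):
        mid = 0.5 * (a + b)
        if rng.random() < thresholds[node]:
            node, b = 2 * node, mid
        else:
            node, a = 2 * node + 1, mid
    return 0.5 * (a + b)  # centre of the selected bin

DEPTH = 6  # 64 output bins (illustrative choice)
thresholds = build_thresholds(DEPTH)
rng = random.Random(0)
samples = [sample(thresholds, DEPTH, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
```

Changing the stored constants changes the shape of the sampled distribution without touching the sampling loop, which is one way to read the flexibility claim in the abstract.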

58. An Open-Source HW-SW Co-Development Framework Enabling Efficient Multi-Accelerator Systems

Antonio, R.A., Dumoulin, J., Yi, X., Van Delm, J., Deng, Y., Paim, G. and Verhelst, M., 2025, August. An open-source hw-sw co-development framework enabling efficient multi-accelerator systems. In 2025 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (pp. 1-7). IEEE.

Abstract: Heterogeneous accelerator-centric compute clusters are emerging as efficient solutions for diverse AI workloads. However, current integration strategies often compromise data movement efficiency and encounter compatibility issues in hardware and software. This prevents a unified approach that balances performance and ease of use. To this end, we present SNAX, an open-source integrated HW-SW framework enabling efficient multi-accelerator platforms through a novel hybrid-coupling scheme, consisting of loosely coupled asynchronous control and tightly coupled data access. SNAX brings reusable hardware modules designed to enhance compute accelerator utilization, and its customizable MLIR-based compiler to automate key system management tasks, jointly enabling rapid development and deployment of customized multi-accelerator compute clusters. Through extensive experimentation, we demonstrate SNAX’s efficiency and flexibility in a low-power heterogeneous SoC. Accelerators can be easily integrated and programmed to achieve >10× improvement in neural network performance compared to other accelerator systems while maintaining accelerator utilization of >90% in full system operation.

59. Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge

Dumoulin, J., Houshmand, P., Jain, V. and Verhelst, M., 2024, May. Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge. In 2024 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1-5). IEEE.

Abstract: Hybrid vision transformers combine the elements of conventional neural networks (NN) and vision transformers (ViT) to enable lightweight and accurate detection. However, several challenges remain for their efficient deployment on resource-constrained edge devices. The hybrid models suffer from a widely diverse set of NN layer types and large intermediate data tensors, hampering efficient hardware acceleration. To enable their execution at the edge, this paper proposes innovations across the hardware-scheduling stack: a.) At the lowest level, a configurable PE array supports all hybrid ViT layer types; b.) temporal loop re-ordering within one layer, enabling hardware support for normalization and softmax layers, minimizing on-chip data transfers; c.) further scheduling optimization employs layer fusion across inverted bottleneck layers to drastically reduce off-chip memory transfers. The resulting accelerator is implemented in 28nm CMOS, achieving a peak energy efficiency of 1.39 TOPS/W at 25.6 GMACs/s.

60. Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration

Geens, R., Symons, A. and Verhelst, M., 2025, November. Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration. In 2025 34th International Conference on Parallel Architectures and Compilation Techniques (PACT) (pp. 281-291). IEEE.

Abstract: State Space Models (SSMs) offer a promising alternative to transformers for long-sequence processing. However, their efficiency remains hindered by memory-bound operations, particularly in the prefill stage. While MARCA, a recent first effort to accelerate SSMs through a dedicated hardware accelerator, achieves a great speedup over high-end GPUs, an analysis of the broader accelerator design space is lacking. This work systematically analyzes SSM acceleration opportunities from both the scheduling perspective, through fine-grained operator fusion, and the hardware perspective, through design space exploration, using an extended version of the Stream modeling framework. Our results demonstrate that the improved data locality stemming from our optimized fusion and scheduling strategy enables a speedup of up to 4.8× over unfused execution, while our adaptive memory-aware fusion approach reduces on-chip memory requirements by an order of magnitude without sacrificing performance. We further explore accelerator design trade-offs, showing that a fusion-aware hardware architecture can achieve 1.78× higher performance than the state-of-the-art MARCA accelerator, within the same area budget. These results establish operator fusion as a key enabler for next-generation SSM accelerators.

61. Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning

Cuyckens, S., Yi, X., Murthy, N.S., Fang, C. and Verhelst, M., 2025, August. Efficient precision-scalable hardware for microscaling (MX) processing in robotics learning. In 2025 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED) (pp. 1-7). IEEE.

Abstract: Autonomous robots require efficient on-device learning to adapt to new environments without cloud dependency. For this edge training, Microscaling (MX) data types offer a promising solution by combining integer and floating-point representations with shared exponents, reducing energy consumption while maintaining accuracy. However, the state-of-the-art continuous learning processor, namely Dacapo, faces limitations with its MXINT-only support and inefficient vector-based grouping during backpropagation. In this paper, we present, to the best of our knowledge, the first work that addresses these limitations with two key innovations: (1) a precision-scalable arithmetic unit that supports all six MX data types by exploiting sub-word parallelism and unified integer and floating-point processing; and (2) support for square shared exponent groups to enable efficient weight handling during backpropagation, removing storage redundancy and quantization overhead. We evaluate our design against Dacapo under iso-peak-throughput on four robotics workloads in TSMC 16nm FinFET technology at 400MHz, reaching a 51% lower memory footprint, and 4× higher effective training throughput, while achieving comparable energy efficiency, enabling efficient robotics continual learning at the edge.
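
The shared-exponent idea behind block-scaled formats can be illustrated with a small sketch. This is a simplified illustration only: it follows neither the exact MX element encodings nor the hardware described in the paper, and the function names, the 8-bit integer element width, and the rounding rule are assumptions.

```python
import math

def quantize_shared_exponent(block, elem_bits=8):
    """Quantize a block of floats to signed integers sharing one power-of-two scale.

    Returns (shared_exponent, integer_elements). Hypothetical helper for illustration.
    """
    qmax = 2 ** (elem_bits - 1) - 1            # largest representable magnitude (127 for 8 bits)
    max_abs = max(abs(x) for x in block)
    if max_abs == 0:
        return 0, [0] * len(block)
    # Smallest power-of-two scale for which the largest element still fits in qmax.
    exp = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exp
    q = [max(-qmax, min(qmax, round(x / scale))) for x in block]
    return exp, q

def dequantize(exp, q):
    """Reconstruct approximate float values from the shared exponent and integers."""
    return [v * 2.0 ** exp for v in q]

block = [1.0, -0.5, 0.25, 3.0]
shared_exp, ints = quantize_shared_exponent(block)
restored = dequantize(shared_exp, ints)
```

Only one exponent is stored per block, so the per-element cost is the narrow integer alone; values much smaller than the block maximum lose precision, which is the usual trade-off of block scaling.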

62. Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration

Cuyckens, S., Yi, X., Geens, R., Dumoulin, J., Wiesner, M., Fang, C. and Verhelst, M., 2026, January. Precision-Scalable Microscaling Datapaths with Optimized Reduction Tree for Efficient NPU Integration. In 2026 31st Asia and South Pacific Design Automation Conference (ASP-DAC) (pp. 611-617). IEEE.

Abstract: Emerging continual learning applications necessitate next-generation neural processing unit (NPU) platforms to support both training and inference operations. The promising Microscaling (MX) standard enables narrow bit-widths for inference and large dynamic ranges for training. However, existing MX multiply-accumulate (MAC) designs face a critical trade-off: integer accumulation requires expensive conversions from narrow floating-point products, while FP32 accumulation suffers from quantization losses and costly normalization. To address these limitations, we propose a hybrid precision-scalable reduction tree for MX MACs that combines the benefits of both approaches, enabling efficient mixed-precision accumulation with controlled accuracy relaxation. Moreover, we integrate an 8×8 array of these MACs into the state-of-the-art (SotA) NPU integration platform, SNAX, to provide efficient control and data transfer to our optimized precision-scalable MX datapath. We evaluate our design both on MAC and system level and compare it to the SotA. Our integrated system achieves an energy efficiency of 657, 1438-1675, and 4065 GOPS/W, respectively, for MXINT8, MXFP8/6, and MXFP4, with a throughput of 64, 256, and 512 GOPS.

63. XDMA: A Distributed, Extensible DMA Architecture for Layout-Flexible Data Movements in Heterogeneous Multi-Accelerator SoCs

Kong, F., Deng, Y., Yi, X., Antonio, R. and Verhelst, M., 2025, November. XDMA: A Distributed, Extensible DMA Architecture for Layout-Flexible Data Movements in Heterogeneous Multi-Accelerator SoCs. In 2025 IEEE 43rd International Conference on Computer Design (ICCD) (pp. 690-693). IEEE.

Abstract: As modern AI workloads increasingly rely on heterogeneous accelerators, ensuring high-bandwidth and layout-flexible data movements between accelerator memories has become a pressing challenge. Direct Memory Access (DMA) engines promise high bandwidth utilization for data movements but are typically optimal only for contiguous memory access, thus requiring additional software loops for data layout transformations. This, in turn, leads to excessive control overhead and underutilized on-chip interconnects. To overcome this inefficiency, we present XDMA, a distributed and extensible DMA architecture that enables layout-flexible data movements with high link utilization. We introduce three key innovations: (1) a data streaming engine as XDMA Frontend, replacing software address generators with hardware ones; (2) a distributed DMA architecture that maximizes link utilization and separates configuration from data transfer; (3) flexible plugins for XDMA enabling on-the-fly data manipulation during data transfers. XDMA demonstrates up to 151.2×/8.2× higher link utilization than software-based implementations in synthetic workloads and achieves 2.3× average speedup over accelerators with SoTA DMA in real-world applications. Our design incurs <2% area overhead over SoTA DMA solutions while consuming 17% of system power. XDMA proves that co-optimizing memory access, layout transformation, and interconnect protocols is key to unlocking heterogeneous multi-accelerator SoC performance.

64. Flexible Hardware Accelerators for Ultra-Low Power Edge AI: The CONVOLVE Approach

Chaidos, P., Maras, A., Alexandris, G., El Arrassi, A., Ahn, B., Garside, J.D., Deng, Y., Spyrou, T., Gebregiorgis, A., Taouil, M. and Gomony, M.D., 2025, July. Flexible Hardware Accelerators for Ultra-Low Power Edge AI: The CONVOLVE Approach. In 2025 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) (Vol. 1, pp. 1-6). IEEE.

Abstract: The prevalence of Artificial Intelligence (AI) applications has been undisputed in most fields of modern computing. As the paradigm shifts from high-performance Cloud computing infrastructure to decentralized smart Edge devices that offer higher reliability and lower latency, with no need for high-speed connectivity, more effort has been required to address resource and energy concerns. In this paper, we present the advancements of the HORIZON EU CONVOLVE project and focus on the development of flexible accelerators targeting several types of modern Edge AI applications. We outline the architectural decisions for the CONVOLVE SoC and highlight the various techniques employed in 4 of our developed accelerators. These efforts mark a significant step toward enabling Ultra-Low Power (ULP) AI processing at the edge. Finally, we present quantitative hardware results and discuss the design trade-offs and performance characteristics of the proposed micro-architectures.

65. Extracting Weights of CIM-Based Neural Networks Through Power Analysis on Adder-Tree

Mir, F.J., Aljuffri, A., Hamdioui, S. and Taouil, M., 2024, May. Extracting weights of CIM-based neural networks through power analysis of adder-trees. In 2024 IEEE European Test Symposium (ETS) (pp. 1-4). IEEE.

Abstract: Computation-in-Memory (CIM) architectures present a promising solution for efficient implementation of Neural Networks. Particularly, SRAM-based digital CIM architectures are optimal candidates to realize them. Recent studies have revealed potential weaknesses in these architectures, particularly against power attacks. This study introduces a novel attack method enabling weight extraction through the analysis of the adder tree component within the architecture. In our attack, the k-means clustering technique is employed to identify the Hamming weights of the CIM weights. Subsequently, we correlate traces belonging to known weights with traces belonging to Hamming groups with unknown weights in order to identify their weight values. As a case study, the attack was applied to an SRAM CIM implementation based on 40nm TSMC technology. The results indicate that the weights stored in the CIM crossbar can be retrieved with 100% accuracy purely by analyzing the power consumption.

66. Early Exiting Predictive Coding Neural Networks for Edge AI

Zniber, A., Ghogho, M., Karrakchou, O. and Zakroum, M., 2025, September. Early Exiting Predictive Coding Neural Networks for Edge AI. In 2025 33rd European Signal Processing Conference (EUSIPCO) (pp. 1687-1691). IEEE.

Abstract: The Internet of Things is transforming various fields, with sensors increasingly embedded in wearables, smart buildings, and connected equipment. While deep learning enables valuable insights from IoT data, conventional models are too computationally demanding for resource-limited edge devices. Moreover, privacy concerns and real-time processing needs make local computation a necessity over cloud-based solutions. Inspired by the brain’s energy efficiency, we propose a shallow bidirectional predictive coding network with early exiting, dynamically halting computations once a performance threshold is met. This reduces the memory footprint and computational overhead while maintaining high accuracy. We validate our approach using the CIFAR-10 dataset. Our model achieves performance comparable to deep networks with significantly fewer parameters and lower computational complexity, demonstrating the potential of biologically inspired architectures for efficient edge AI.
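
The early-exit mechanism mentioned in this abstract can be sketched generically. The following is not the authors' predictive coding network: the use of the maximum softmax probability as the exit criterion, the `threshold` value, and all names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(logit_steps, threshold=0.9):
    """Return (predicted_class, steps_used).

    logit_steps yields one logit vector per computation step; it is a
    hypothetical stand-in for the outputs of successive network passes.
    Computation halts as soon as the top class probability reaches threshold.
    """
    steps_used = 0
    probs = None
    for logits in logit_steps:
        steps_used += 1
        probs = softmax(logits)
        if max(probs) >= threshold:
            break                      # confident enough: skip remaining steps
    if probs is None:
        raise ValueError("logit_steps must yield at least one step")
    return probs.index(max(probs)), steps_used

# Toy example: confidence grows over steps; the fourth step is never evaluated.
toy_steps = [[0.1, 0.2, 0.0], [0.5, 2.0, 0.1], [0.2, 6.0, 0.1], [0.0, 9.0, 0.0]]
pred, used = early_exit_predict(toy_steps, threshold=0.9)
```

Easy inputs exit after few steps while hard inputs use the full budget, which is where the average compute saving comes from.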

67. AFSRAM-CIM: Adder Free SRAM-Based Digital Computation-in-Memory for BNN

El Arrassi, A., Yaldagard, M.A., Tao, X., Shahroodi, T., Mir, F., Biyani, Y., Gomony, M.D., Gebregiorgis, A., Joshi, R. and Hamdioui, S., 2024, October. AFSRAM-CIM: Adder free SRAM-based digital computation-in-memory for BNN. In 2024 IFIP/IEEE 32nd International Conference on Very Large Scale Integration (VLSI-SoC) (pp. 1-6). IEEE.

Abstract: Binary Neural Networks (BNNs) have demonstrated significant advantages in reducing computation and memory costs, all while maintaining acceptable accuracy on various image detection tasks. Thus, BNNs have the potential to support practical cognitive tasks on resource-constrained platforms, such as edge computing devices. To realize this, SRAM-based digital Computation-in-Memory (CIM) has gained growing attention as it overcomes the analog CIM architecture bottlenecks such as limited computing accuracy due to process variation, non-linearity, power and area-hungry Analog-to-Digital Converters (ADCs), etc. However, digital CIM architectures are highly dominated by power-hungry adder-trees, which can nullify the benefits of SRAM-based digital CIM. To address this issue, this paper proposes an adder free SRAM-based digital CIM, AFSRAM-CIM, for BNN acceleration. The proposed CIM architecture utilizes a multi-functional 10-T SRAM cell-based crossbar array and a new energy-efficient approach to perform the popcount operation. Simulation results using the MNIST dataset show that the proposed architecture maintains the state-of-the-art inference accuracy of 99.21% with only 11.86 fJ energy per operation. Moreover, AFSRAM-CIM achieves over 3× energy and ≈17× area savings when compared to the conventional digital CIM approaches.
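
For context on why popcount is central to BNN inference, the standard XNOR-popcount identity for a binary dot product can be sketched in software. This illustrates the arithmetic only, not the AFSRAM-CIM circuit; the bit encoding (1 for +1, 0 for -1) and the function name are assumptions.

```python
def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n vectors with entries in {-1, +1}.

    Each vector is packed into an integer: bit = 1 encodes +1, bit = 0 encodes -1.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 wherever the two signs agree
    matches = bin(xnor).count("1")     # popcount of agreeing positions
    return 2 * matches - n             # agreements minus disagreements

# [+1, -1, +1, +1] . [+1, +1, -1, +1] = 1 - 1 - 1 + 1 = 0
result = binary_dot(0b1011, 0b1101, 4)
```

The multiply-accumulate thus collapses to one bitwise operation plus a count of ones, so the cost of the counting step dominates the datapath.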

68. Collaborative P4-SDN DDoS Detection and Mitigation with Early-Exit Neural Networks

O. Karrakchou, A. Zniber, A. Sebbar and M. Ghogho, “Collaborative P4-SDN DDoS Detection and Mitigation with Early-Exit Neural Networks,” GLOBECOM 2025 – 2025 IEEE Global Communications Conference, Taipei, Taiwan, 2025, pp. 6081-6086, doi: 10.1109/GLOBECOM59602.2025.11431920.

Abstract: Distributed Denial of Service (DDoS) attacks pose a persistent threat to network security, requiring timely and scalable mitigation strategies. In this paper, we propose a novel collaborative architecture that integrates a P4-programmable data plane with an SDN control plane to enable real-time DDoS detection and response. At the core of our approach is a split early-exit neural network that performs partial inference in the data plane using a quantized Convolutional Neural Network (CNN), while deferring uncertain cases to a Gated Recurrent Unit (GRU) module in the control plane. This design enables high-speed classification at line rate with the ability to escalate more complex flows for deeper analysis. Experimental evaluation using real-world DDoS datasets demonstrates that our approach achieves high detection accuracy with significantly reduced inference latency and control plane overhead. These results highlight the potential of tightly coupled ML-P4-SDN systems for efficient, adaptive, and low-latency DDoS defense.

69. FlexiGen: An Automated AI Accelerator Generation Framework With Decoupled-Access-Execute and Dynamic Dataflows

Yi, X., Shi, M., Dumoulin, J., Sun, J., Deng, Y., Antonio, R., Kong, F., Zhang, Y., Jiang, Y. and Verhelst, M., 2026. FlexiGen: An Automated AI Accelerator Generation Framework With Decoupled-Access-Execute and Dynamic Dataflows. IEEE Transactions on Circuits and Systems I: Regular Papers.

Abstract: Modern tensor applications, especially artificial intelligence (AI) applications, are evolving rapidly, posing a significant demand for agile hardware design. While numerous hardware generators have been developed, they suffer from three significant limitations: 1) they are either limited to a single dataflow/data type generation, failing to cater to the computational requirements of diverse workloads; 2) or focus only on the array level optimization, omitting system-level effects, such as the influence of on-chip memory bandwidth and contention; and 3) customized workload mapping/configuration is needed, resulting in increased programming complexity. To address these challenges, we propose FlexiGen, a flexible and extensible hardware generation framework, which targets diverse deep neural networks (DNN) tensor applications and can generate a complete synthesizable acceleration system at the RTL level with arbitrary dataflow and its combinations. Our key contributions are threefold: 1) we incorporate decoupled-access-execute architecture inside FlexiGen, enabling full system generation while maintaining flexibility and efficiency; 2) we propose a versatile spatial core generator that supports dynamic spatial dataflows and multiple data precisions in the same array and a compatible data streaming engine generator that can support arbitrary temporal dataflows and N-dimensional data access patterns; and 3) we leverage a uniform programming interface and provide a customized kernel library, enabling agile configuration programming. We conduct an intensive evaluation to demonstrate the versatility of FlexiGen in dataflow accelerator generation and show the trade-offs of performance, area, and power across a wide range of dataflows and workloads at both the array level and system level. Our case study experiment shows FlexiGen’s usefulness as a hardware generator to rapidly generate desired dataflow acceleration systems. Compared with the state-of-the-art (SotA) hardware generation framework LEGO, FlexiGen achieves 36.79% and 57.16% less area and power when generating the same dual spatial dataflow design. FlexiGen is open-source and available at https://github.com/KULeuven-MICAS/snax_cluster

70. Synthetic data generation techniques for training deep acoustic siren identification networks

Damiano, S., Cramer, B., Guntoro, A. and van Waterschoot, T., 2024. Synthetic data generation techniques for training deep acoustic siren identification networks. Frontiers in Signal Processing, 4, p.1358532.

Abstract: Acoustic sensing has been widely exploited for the early detection of harmful situations in urban environments: in particular, several siren identification algorithms based on deep neural networks have been developed and have proven robust to the noisy and non-stationary urban acoustic scene. Although high classification accuracy can be achieved when training and evaluating on the same dataset, the cross-dataset performance of such models remains unexplored. To build robust models that generalize well to unseen data, large datasets that capture the diversity of the target sounds are needed, whose collection is generally expensive and time consuming. To overcome this limitation, in this work we investigate synthetic data generation techniques for training siren identification models. To obtain siren source signals, we either collect from public sources a small set of stationary, recorded siren sounds, or generate them synthetically. We then simulate source motion, acoustic propagation and Doppler effect, and finally combine the resulting signal with background noise. This way, we build two synthetic datasets used to train three different convolutional neural networks, then tested on real-world datasets unseen during training. We show that the proposed training strategy based on the use of recorded source signals and synthetic acoustic propagation performs best. In particular, this method leads to models that exhibit a better generalization ability, as compared to training and evaluating in a cross-dataset setting. Moreover, the proposed method loosens the data collection requirement and is entirely built using publicly available resources.

71. Performance Modeling & Mapping of LLM Inference on Heterogeneous Vectorized CGRAs

Kefallinos, D., Alexandris, G., Maras, A., Chaidos, P., Gomony, M.D., Corporaal, H., Soudris, D. and Xydis, S., 2026, April. Performance Modeling & Mapping of LLM Inference on Heterogeneous Vectorized CGRAs. In 17th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures 15th Workshop on Design Tools (p. 1).

Abstract: Since the emergence of transformer-based models, the computational demands for Large Language Model (LLM) inference have been increasing exponentially, primarily due to their compounding parameter sizes, their structural complexity, and the use of non-linear functions. This tendency leads to the necessity of deploying them on low-power edge devices and DNN accelerators, to fuel next-generation agentic AI systems. Coarse-Grained Reconfigurable Architectures (CGRAs) have proven to be a compelling paradigm for edge acceleration, combining the programmability of general-purpose platforms with the high performance and energy efficiency associated with ASICs. In this work, we introduce an end-to-end performance modeling and mapping framework for LLM inference on heterogeneous CGRAs. Our methodology enables rapid exploration of the micro-architectural design space parameters, i.e., the number of processing elements, vector sizes, and memory configurations, by providing an accurate, explainable, and analytical CGRA performance modeling methodology, with an average cycle error of 0.9%. Architecturally, we build upon R-Blocks, a heterogeneous CGRA platform, and extend it to support floating-point arithmetic operations as well as a full-stack compilation and mapping flow for both full (FP32) and quantized (INT8) Llama2 models. The proposed methodology, evaluated on a 22nm technology node, achieves superior peak performance per Watt compared to related works such as REVAMP and CFEACT (1.8× and 2.8× respectively).

72. CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration

Ahn, B., Tao, X., Gomony, M.D., Geilen, M. and Corporaal, H., 2025, October. CIMple: Standard-cell SRAM-based CIM with LUT-based split softmax for attention acceleration. In 2025 Cross-Disciplinary Conference on Memory-Centric Computing (CCMCC) (pp. 1-10). IEEE.

Abstract: Large Language Models (LLMs) such as LLaMA and DeepSeek are built on transformer architectures, which have become a standard model for achieving state-of-the-art performance in natural language processing tasks. Recently, there has been growing interest in deploying LLMs on edge devices. Although smaller LLM models are being proposed, they often still contain billions of parameters. Since edge devices are limited in their resources, this poses a significant challenge for edge deployment. Compute-in-memory (CIM) is a promising architecture that addresses this by reducing data movement through the integration of computational logic directly into memory. However, existing CIM architectures support only static Multiply-Accumulate (MAC) operations, which limits their configurability in supporting nonlinear operations and various types of transformer models. This paper presents CIMple, a fully digital standard-cell SRAM-based CIM accelerator for self-attention inside transformer models, designed to overcome these limitations. The key contributions of CIMple are: 1) A novel dual-banked CIM-based fully digital self-attention accelerator using 8-bit parallel weight feeding. 2) A look-up-table (LUT) based fixed-point implementation reducing latency with minimal accuracy degradation. 3) A performance evaluation of a 32kb CIM-based self-attention accelerator implemented in 28nm, which achieves 26.1 TOPS/W at 0.85V and 2.31 TOPS/mm^2 at 1.2V, both with INT8 precision.

73. First-Class Verification Dialects for MLIR

Fehr, M., Fan, Y., Pompougnac, H., Regehr, J. and Grosser, T., 2025. First-Class Verification Dialects for MLIR. Proceedings of the ACM on Programming Languages, 9(PLDI), pp.1466-1490.

Abstract: MLIR is a toolkit supporting the development of extensible and composable intermediate representations (IRs) called dialects; it was created in response to rapid changes in hardware platforms, programming languages, and application domains such as machine learning. MLIR supports development teams creating compilers and compiler-adjacent tools by factoring out common infrastructure such as parsers and printers. A major limitation of MLIR is that it is syntax-focused: it has no support for directly encoding the semantics of operations in its dialects. Thus, at present, the parts of MLIR tools that depend on semantics—optimizers, analyzers, verifiers, transformers—must all be engineered by hand.

Our work makes formal semantics a first-class citizen in the MLIR ecosystem. We designed and implemented a collection of semantics-supporting MLIR dialects for encoding the semantics of compiler IRs. These dialects support a separation of concerns between three domains of expertise when building formal-methods-based tooling for compilers. First, compiler developers define their dialect’s semantics as a lowering (compilation transformation) from their dialect to one or more of ours. Second, SMT solver experts provide tools to optimize domain-specific high-level semantics and lower them to SMT queries. Third, tool builders create dialect-independent verification tools.

We validate our work by defining semantics for five key MLIR dialects, defining a state-of-the-art SMT encoding for memory-based semantics, and building three dialect-agnostic tools, which we used to find five miscompilation bugs in upstream MLIR, verify a canonicalization pass, and also formally verify transfer functions for two dataflow analyses: “known bits” (that finds individual bits that are always zero or one in all executions) and “demanded bits” (that finds don’t-care bits). The transfer functions that we verify are improved versions of those in upstream MLIR; they detect on average 36.6% more known bits in real-world MLIR programs compared to the upstream implementation.

74. Certified Decision Procedures for Width-Independent Bitvector Predicates

75. Interactive Bitvector Reasoning using Verified Bit-Blasting

Böving, H., Bhat, S., Cicolini, L., Keizer, A., Frenot, L., Mohamed, A., Stefanesco, L., Khan, H., Clune, J., Barrett, C. and Grosser, T., 2025. Interactive bitvector reasoning using verified bit-blasting. Proceedings of the ACM on Programming Languages, 9(OOPSLA2), pp.3259-3285.

Abstract: Bit-blasting SMT solvers enable efficient automatic reasoning about bitvectors, which are fundamental for the verification of compiler backends, cryptographic algorithms, hardware designs and other soft- or hardware tasks. Despite the clear demand for efficient bitvector reasoning infrastructure and the impressive advancements in state-of-the-art bit-blasting SMT solvers such as Bitwuzla, effective bitvector reasoning within interactive theorem provers (ITPs) remains a challenge, hindering their use for mechanized proofs. Incomplete bitvector libraries, unavailable or only partially integrated decision procedures, complex and hard-to-bitblast operations, and limited integration with the host language prevent the wide adoption of bitvector reasoning in proving contexts. We introduce bv_decide: the first end-to-end verified bitblaster designed for interactive bitvector reasoning in a dependently-typed ITP. Our verified bitblaster is scalable, comes with a complete end-to-end proof (trusting only the Lean compiler and kernel), and is available as a proof tactic that allows interactive reasoning right from within a programming language, in our case Lean. We use Lean’s Functional But In-Place (FBIP) paradigm to efficiently encode our core data structures (e.g., AIGs), demonstrating that fast execution of an SMT solver need not come at the expense of rigorous formalization. We enable dependable interactive verification of user-written code by basing Lean’s C-Style standard datatypes UInt/SInt on our bitvector type, adding a lowering from enums and structs to bitvectors to enable transparent bit-blasting support for composed types, and by offering an interactive tactic that either solves a goal or provides a counter-example. Moreover, we present the design of Lean’s canonical bitvector library, which supports all operations (with reasoning principles) for the SMT-LIB 2.7 standard (including overflow modeling), is fast-to-execute, and offers a comprehensive API and automation for bit-width-independent reasoning. We thoroughly evaluate our bit-blaster on a comprehensive set of benchmarks, including the full SMT-LIB dataset, where bv_decide solves more theorems than the state-of-the-art in verified bit-blasting, CoqQFBV. We also verify over 7000 SMT statements extracted from LLVM, providing the largest mechanized verification of LLVM rewrites to date, to our knowledge. By making bit-blasting bitvector reasoning a polished, well-supported, and interactive feature of modern ITPs, we enable effective, dependable white-box reasoning for bitvector-level verification.

76. The Configuration Wall: Characterization and Elimination of Accelerator Configuration Overhead

Van Delm, J., Lydike, A., Dumoulin, J., Crols, J., Yi, X., Antonio, R., Woodruff, J., Grosser, T. and Verhelst, M., 2026, March. The Configuration Wall: Characterization and Elimination of Accelerator Configuration Overhead. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1 (pp. 265-280).

Abstract: Contemporary compute platforms increasingly offload compute kernels from CPU to integrated hardware accelerators to reach maximum performance per Watt. Unfortunately, the time the CPU spends on setup control and synchronization has increased with growing accelerator complexity. For systems with complex accelerators, this means that performance can be configuration-bound. Faster accelerators are more severely impacted by this overlooked performance drop, which we call the configuration wall. Prior work evidences this wall and proposes ad-hoc solutions to reduce configuration overhead. However, these solutions are not universally applicable, nor do they offer comprehensive insights into the underlying causes of performance degradation. In this work, we first introduce a widely-applicable variant of the well-known roofline model to quantify when system performance is configuration-bound. To move systems out of the performance-bound region, we subsequently propose a domain-specific compiler abstraction and associated optimization passes. We implement the abstraction and passes in the MLIR compiler framework to run optimized binaries on open-source architectures to prove its effectiveness and generality. Experiments demonstrate a geomean performance boost of 2× on the open-source OpenGeMM system, by eliminating redundant configuration cycles and by automatically hiding the remaining configuration cycles. Our work provides key insights into how accelerator performance is affected by setup mechanisms, thereby facilitating automatic code generation for circumventing the configuration wall.

77. Improving equilibrium propagation without weight symmetry through Jacobian homeostasis

Laborieux, A. and Zenke, F., 2024. Improving equilibrium propagation without weight symmetry through Jacobian homeostasis. In The Twelfth International Conference on Learning Representations.

Abstract: Equilibrium propagation (EP) is a compelling alternative to the backpropagation of error algorithm (BP) for computing gradients of neural networks on biological or analog neuromorphic substrates. Still, the algorithm requires weight symmetry and infinitesimal equilibrium perturbations, i.e., nudges, to estimate unbiased gradients efficiently. Both requirements are challenging to implement in physical systems. Yet, whether and how weight asymmetry affects its applicability is unknown because, in practice, it may be masked by biases introduced through the finite nudge. To address this question, we study generalized EP, which can be formulated without weight symmetry, and analytically isolate the two sources of bias. For complex-differentiable non-symmetric networks, we show that the finite nudge does not pose a problem, as exact derivatives can still be estimated via a Cauchy integral. In contrast, weight asymmetry introduces bias resulting in low task performance due to poor alignment of EP’s neuronal error vectors compared to BP. To mitigate this issue, we present a new homeostatic objective that directly penalizes functional asymmetries of the Jacobian at the network’s fixed point. This homeostatic objective dramatically improves the network’s ability to solve complex tasks such as ImageNet 32×32. Our results lay the theoretical groundwork for studying and mitigating the adverse effects of imperfections of physical networks on learning algorithms that rely on the substrate’s relaxation dynamics.

78. From Diet to Free Lunch: Estimating Auxiliary Signal Properties using Dynamic Pruning Masks in Speech Enhancement Networks

R. Miccini, C. Laroche, T. Piechowiak, X. Fafoutis and L. Pezzarossa, “From Diet to Free Lunch: Estimating Auxiliary Signal Properties Using Dynamic Pruning Masks in Speech Enhancement Networks,” ICASSP 2026 – 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2026, pp. 15427-15431

Abstract: Speech Enhancement (SE) in audio devices is often supported by auxiliary modules for Voice Activity Detection (VAD), SNR estimation, or Acoustic Scene Classification to ensure robust context-aware behavior and seamless user experience. Just like SE, these tasks often employ deep learning; however, deploying additional models on-device is computationally impractical, whereas cloud-based inference would introduce additional latency and compromise privacy. Prior work on SE employed Dynamic Channel Pruning (DynCP) to reduce computation by adaptively disabling specific channels based on the current input. In this work, we investigate whether useful signal properties can be estimated from these internal pruning masks, thus removing the need for separate models. We show that simple, interpretable predictors achieve up to 93% accuracy on VAD, 84% on noise classification, and an R² of 0.86 on F0 estimation. With binary masks, predictions reduce to weighted sums, inducing negligible overhead. Our contribution is twofold: on one hand, we examine the emergent behavior of DynCP models through the lens of downstream prediction tasks, to reveal what they are learning; on the other, we repurpose and re-propose DynCP as a holistic solution for efficient SE and simultaneous estimation of signal properties.

79. DualRes: Production-ready Dynamic Object Detection

El Hassani, J. and Verelst, T., 2026. DualRes: Production-ready Dynamic Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 7842-7851).

Abstract: Dynamic Neural Networks (DNNs) have emerged as a promising solution to improve the computational efficiency of deep neural networks by adaptively adjusting inference complexity based on input characteristics. Despite their advantages, the deployment of dynamic networks in real-world applications remains challenging because most methods are hard to adapt for practical use cases such as object detection, in combination with the lacking support of inference infrastructure. In this work, we present a dynamic neural network architecture specifically designed for object detection. Using our method, we build a variety of Pareto-optimal models for object detection on COCO for models in the 7-10 GFLOPs range. Additionally, to measure the routing efficacy, we introduce an evaluation metric that facilitates standardized benchmarking across different dynamic network approaches. Finally, we introduce an evaluation of a deployment pipeline utilizing the ONNX format, thus building a DNN that shows speedup in a realistic deployment scenario. Experimental results demonstrate the performance and practical viability of our approach for efficient object detection in resource-constrained scenarios.

80. The PMP Snapshot Engine: Fast and Fault-Resilient PMP Reconfiguration for RISC-V

Larmann, C., Aljuffri, A., Marotzke, A., Garza, A., Hamdioui, S. and Taouil, M., 2026, April. The PMP Snapshot Engine: Fast and Fault-Resilient PMP Reconfiguration for RISC-V. In 2026 Design, Automation & Test in Europe Conference (DATE) (pp. 1-7). IEEE.

Abstract: This paper presents a Physical Memory Protection Snapshot Engine (PSE), a lightweight hardware extension for RISC-V that addresses both performance and security challenges of Physical Memory Protection (PMP) reconfiguration. By storing and restoring full PMP configurations in a single cycle, the PSE drastically reduces the overhead of context switches typically used in Trusted Execution Environments (TEEs) and secure real-time systems. At the same time, the redundant storage and two-dimensional parity protection provide an efficient and effective defense against fault injection attacks that target PMP registers. In 100k randomized trials, our experimental results demonstrate that the PSE can reliably detect and prevent FI-induced privilege escalations, while incurring only 11.7% area overhead. This makes it a practical solution for embedded devices where both efficiency and trustworthiness are essential.

81. AIMC Modeling and Parameter Tuning for Layer-Wise Optimal Operating Point in DNN Inference

Dadras, I., Sarda, G.M., Laubeuf, N., Bhattacharjee, D. and Mallik, A., 2023. AIMC Modeling and Parameter Tuning for Layer-Wise Optimal Operating Point in DNN Inference. IEEE Access, 11, pp.87189-87199.

Abstract: Analog in-memory computing (AIMC) has been utilized in convolutional neural networks (CNNs) edge inference engines to solve the memory bottleneck problem and increase efficiency. However, the restricted resolution of AIMC analog-to-digital converters (ADCs) imposes quantization of output activations that can reduce the accuracy without meticulous optimization. A study conducted output quantization calibration and obtained configurations with which low-resolution ADCs did not affect the accuracy. The configurations were layer-specific. Therefore, a real-time quantization adjustment was required. AIMC output quantization is adjusted by controlling analog gain, entangling it with analog parameters and nonlinear functions. AIMC dynamic output quantization control without interrupting its operation has been an unsettled problem until now. This paper introduces a technique for imposing output quantization configurations obtained from calibration processes on AIMC through circuit parameters setup. The technique permits on-the-fly quantization adjustments enabling layer-wise calibration that increases achievable network accuracies on AIMC platforms. As a case study, we deployed the method on the AIMC macro of an artificial intelligence (AI) inference engine SoC platform with a RISC-V processor and hybrid DIgital-ANAlog accelerators (DIANA). We related its controllable circuit parameters with the quantization configuration in a look-up table. This case study has noteworthy side benefits in identifying platform limitations due to nonlinearities and design imperfections. These limitations are investigated, and design advice that is transferable to future AIMC designs is provided to avoid imperfections such as mismatch, bias voltage drop, and interconnect delay. In addition, the study of output quantization from different levels of abstraction leads to design guidelines to facilitate dynamic quantization control during the application phase.

82. On-Sensor Printed Machine Learning Classification via Bespoke ADC and Decision Tree Co-Design

Armeniakos, G., Duarte, P.L., Pal, P., Zervakis, G., Tahoori, M.B. and Soudris, D., 2024, March. On-sensor printed machine learning classification via bespoke adc and decision tree co-design. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: Printed electronics (PE) technology provides cost-effective hardware with unmet customization, due to their low non-recurring engineering and fabrication costs. PE exhibit features such as flexibility, stretchability, porosity, and conformality, which make them a prominent candidate for enabling ubiquitous computing. Still, the large feature sizes in PE limit the realization of complex printed circuits, such as machine learning classifiers, especially when processing sensor inputs is necessary, mainly due to the costly analog-to-digital converters (ADCs). To this end, we propose the design of fully customized ADCs and present, for the first time, a co-design framework for generating bespoke Decision Tree classifiers. Our comprehensive evaluation shows that our co-design enables self-powered operation of on-sensor printed classifiers in all benchmark cases.

83. Impact of Sliding Window Variation and Neuronal Time Constants on Acoustic Anomaly Detection Using Recurrent Spiking Neural Networks in Automotive Environment

Kshirasagar, S., Guntoro, A. and Mayr, C., 2024. Impact of sliding window variation and neuronal time constants on acoustic anomaly detection using recurrent spiking neural networks in automotive environment. Algorithms, 17(10), p.440.

84. Toward Attention-Based TinyML: A Heterogeneous Accelerated Architecture and Automated Deployment Flow

Wiese, P., İslamoğlu, G., Scherer, M., Macan, L., Jung, V.J.B., Burrello, A., Conti, F. and Benini, L., 2025. Toward attention-based TinyML: A heterogeneous accelerated architecture and automated deployment flow. IEEE Design & Test, 42(5), pp.63-72.

Abstract: One of the challenges for Tiny Machine Learning (tinyML) is keeping up with the evolution of Machine Learning models from Convolutional Neural Networks to Transformers. We address this by leveraging a heterogeneous architectural template coupling RISC-V processors with hardwired accelerators supported by an automated deployment flow. We demonstrate Attention-based models in a tinyML power envelope with an octacore cluster coupled with an accelerator for quantized Attention. Our deployment flow enables end-to-end 8-bit Transformer inference, achieving leading-edge energy efficiency and throughput of 2960 GOp/J and 154 GOp/s (0.65 V, 22 nm FD-SOI technology).

85. Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs

Bochem, S., Jung, V.J., Prasad, A.S., Conti, F. and Benini, L., 2025, March. Distributed Inference with Minimal Off-Chip Traffic for Transformers on Low-Power MCUs. In 2025 Design, Automation & Test in Europe Conference (DATE) (pp. 1-7). IEEE.

Abstract: Contextual Artificial Intelligence (AI) based on emerging Transformer models is predicted to drive the next technology revolution in interactive wearable devices such as new-generation smart glasses. By coupling numerous sensors with small, low-power Micro-Controller Units (MCUs), these devices will enable on-device intelligence and sensor control. A major bottleneck in this class of systems is the small amount of on-chip memory available in the MCUs. In this paper, we propose a methodology to deploy real-world Transformers on low-power wearable devices with minimal off-chip traffic exploiting a distributed system of MCUs, partitioning inference across multiple devices and enabling execution with stationary on-chip weights. We validate the scheme by deploying the TinyLlama-42M decoder-only model on a system of 8 parallel ultra-low-power MCUs. The distributed system achieves an energy consumption of 0.64 mJ, a latency of 0.54 ms per inference, a super-linear speedup of 26.1×, and an Energy Delay Product (EDP) improvement of 27.2×, compared to a single-chip system. On MobileBERT, the distributed system’s runtime is 38.8 ms, with a super-linear 4.7× speedup when using 4 MCUs compared to a single-chip system.

86. Correction Fault Attacks on Randomized CRYSTALS-Dilithium

Krahmer, E., Pessl, P., Land, G. and Güneysu, T., 2024. Correction fault attacks on randomized CRYSTALS-Dilithium. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(3), pp.174-199.

Abstract: After NIST’s selection of Dilithium as the primary future standard for quantum-secure digital signatures, increased efforts to understand its implementation security properties are required to enable widespread adoption on embedded devices. Concretely, there are still many open questions regarding the susceptibility of Dilithium to fault attacks. This is especially the case for Dilithium’s randomized (or hedged) signing mode, which, likely due to devastating implementation attacks on the deterministic mode, was selected as the default by NIST.
This work takes steps towards closing this gap by presenting two new key-recovery fault attacks on randomized/hedged Dilithium. Both attacks are based on the idea of correcting faulty signatures after signing. A successful correction yields the value of a secret intermediate that carries information on the key. After gathering many faulty signatures and corresponding correction values, it is possible to solve for the signing key via either simple linear algebra or lattice-reduction techniques. Our first attack extends a previously published attack based on an instruction-skipping fault to the randomized setting. Our second attack injects faults in the matrix A, which is part of the public key. As such, it is not sensitive to side-channel leakage and has, potentially for this reason, not seen prior analysis regarding faults.
We show that for Dilithium2, the attacks allow key recovery with as little as 1024 and 512 faulty signatures, with each signature generated by injecting a single targeted fault. We also demonstrate how our attacks can be adapted to circumvent several popular fault countermeasures with a moderate increase in the computational runtime and the number of required faulty signatures. These results are verified using both simulated faults and clock glitches on an ARM-based standard microcontroller. The presented attacks demonstrate that also randomized Dilithium can be subject to diverse fault attacks, that certain countermeasures might be easily bypassed, and that potential fault targets reach beyond side-channel sensitive operations. Still, many further operations are likely also susceptible, implying the need for increased analysis efforts in the future.

87. Memristor-Based Lightweight Encryption

Siddiqi, M.A., Hernández, J.A.G., Gebreziorgis, A., Bishnoi, R., Strydis, C., Hamdioui, S. and Taouil, M., 2023, September. Memristor-based lightweight encryption. In 2023 26th Euromicro Conference on Digital System Design (DSD) (pp. 634-641). IEEE.

Abstract: Next-generation personalized healthcare devices are undergoing extreme miniaturization in order to improve user acceptability. However, such developments make it difficult to incorporate cryptographic primitives using available target technologies since these algorithms are notorious for their energy consumption. Besides, strengthening these schemes against side-channel attacks further adds to the device overheads. Therefore, viable alternatives among emerging technologies are being sought. In this work, we investigate the possibility of using memristors for implementing lightweight encryption. We propose a 40-nm RRAM-based GIFT-cipher implementation using a 1T1R configuration with promising results; it exhibits roughly half the energy consumption of a CMOS-only implementation. More importantly, its non-volatile and reconfigurable substitution boxes offer an energy-efficient protection mechanism against side-channel attacks. The complete cipher takes 0.0034 mm² of area, and encrypting a 128-bit block consumes a mere 242 pJ.

88. Accelerating TinyML Inference on Microcontrollers Through Approximate Kernels

Armeniakos, G., Mentzos, G. and Soudris, D., 2024, November. Accelerating TinyML inference on microcontrollers through approximate kernels. In 2024 31st IEEE International Conference on Electronics, Circuits and Systems (ICECS) (pp. 1-4). IEEE.

Abstract: The rapid growth of microcontroller-based IoT devices has opened up numerous applications, from smart manufacturing to personalized healthcare. Despite the widespread adoption of energy-efficient microcontroller units (MCUs) in the Tiny Machine Learning (TinyML) domain, they still face significant limitations in terms of performance and memory (RAM, Flash). In this work, we combine approximate computing and software kernel design to accelerate the inference of approximate CNN models on MCUs. Our kernel-based approximation framework firstly unpacks the operands of each convolution layer and then conducts an offline calculation to determine the significance of each operand. Subsequently, through a design space exploration, it employs a computation skipping approximation strategy based on the calculated significance. Our evaluation on an STM32-Nucleo board and 2 popular CNNs trained on the CIFAR-10 dataset shows that, compared to state-of-the-art exact inference, our Pareto optimal solutions can feature on average 21% latency reduction with no degradation in Top-1 classification accuracy, while for lower accuracy requirements, the corresponding reduction becomes even more pronounced.

89. Deeploy: Enabling Energy-Efficient Deployment of Small Language Models on Heterogeneous Microcontrollers

Scherer, M., Macan, L., Jung, V.J., Wiese, P., Bompani, L., Burrello, A., Conti, F. and Benini, L., 2024. Deeploy: Enabling energy-efficient deployment of small language models on heterogeneous microcontrollers. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 43(11), pp.4009-4020

Abstract: With the rise of embodied foundation models (EFMs), most notably small language models (SLMs), adapting Transformers for the edge applications has become a very active field of research. However, achieving the end-to-end deployment of SLMs on the microcontroller (MCU)-class chips without high-bandwidth off-chip main memory access is still an open challenge. In this article, we demonstrate high efficiency end-to-end SLM deployment on a multicore RISC-V (RV32) MCU augmented with ML instruction extensions and a hardware neural processing unit (NPU). To automate the exploration of the constrained, multidimensional memory versus computation tradeoffs involved in the aggressive SLM deployment on the heterogeneous (multicore+NPU) resources, we introduce Deeploy, a novel deep neural network (DNN) compiler, which generates highly optimized C code requiring minimal runtime support. We demonstrate that Deeploy generates the end-to-end code for executing SLMs, fully exploiting the RV32 cores’ instruction extensions and the NPU. We achieve leading-edge energy and throughput of 490 μJ per token, at 340 tokens per second for an SLM trained on the TinyStories dataset, running for the first time on an MCU-class device without the external memory.

90. Marsellus: A Heterogeneous RISC-V AI-IoT End-Node SoC With 2–8 b DNN Acceleration and 30%-Boost Adaptive Body Biasing

Conti, F., Paulin, G., Garofalo, A., Rossi, D., Di Mauro, A., Rutishauser, G., Ottavi, G., Eggimann, M., Okuhara, H. and Benini, L., 2023. Marsellus: A heterogeneous RISC-V AI-IoT end-node SoC with 2–8 b DNN acceleration and 30%-boost adaptive body biasing. IEEE Journal of Solid-State Circuits, 59(1), pp.128-142.

Abstract: Emerging artificial intelligence-enabled Internet-of-Things (AI-IoT) system-on-chip (SoC) for augmented reality, personalized healthcare, and nanorobotics need to run many diverse tasks within a power envelope of a few tens of mW over a wide range of operating conditions: compute-intensive but strongly quantized deep neural network (DNN) inference, as well as signal processing and control requiring high-precision floating point. We present MARSELLUS, an all-digital heterogeneous SoC for AI-IoT end-nodes fabricated in GlobalFoundries 22 nm FDX that combines: 1) a general-purpose cluster of 16 RISC-V digital signal processing (DSP) cores attuned for the execution of a diverse range of workloads exploiting 4- and 2-bit arithmetic extensions (XpulpNN), combined with fused multiply accumulate (MAC) and LOAD operations and floating-point support; 2) a 2–8 bit reconfigurable binary engine (RBE) to accelerate 3×3 and 1×1 (pointwise) convolutions in DNNs; 3) a set of ON-chip monitoring (OCM) blocks connected to an adaptive body biasing (ABB) generator and a hardware control loop, enabling on-the-fly adaptation of transistor threshold voltages. MARSELLUS achieves up to 180 Gop/s or 3.32 Top/s/W on 2-bit precision arithmetic in software, and up to 637 Gop/s or 12.4 Top/s/W on hardware-accelerated DNN layers.

91. MemPool: A Scalable Manycore Architecture With a Low-Latency Shared L1 Memory

Riedel, S., Cavalcante, M., Andri, R. and Benini, L., 2023. MemPool: A scalable manycore architecture with a low-latency shared L1 memory. IEEE Transactions on Computers, 72(12), pp.3561-3575.

Abstract: Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg “Snitch” cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE’s independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 180 GOPS/W with less than 2% of execution stalls.

92. TransAxx: Efficient Transformers With Approximate Computing

Danopoulos, D., Zervakis, G., Soudris, D. and Henkel, J., 2025. Transaxx: Efficient transformers with approximate computing. IEEE Transactions on Circuits and Systems for Artificial Intelligence.

Abstract: Vision Transformer (ViT) models which were recently introduced by the transformer architecture have shown to be very competitive and often become a popular alternative to Convolutional Neural Networks (CNNs). However, the high computational requirements of these models limit their practical applicability especially on low-power devices. Current state-of-the-art employs approximate multipliers to address the highly increased compute demands of DNN accelerators but no prior research has explored their use on ViT models. In this work we propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic to seamlessly evaluate the impact of approximate computing on DNNs such as ViT models. Using TransAxx we analyze the sensitivity of transformer models on the ImageNet dataset to approximate multiplications and perform approximate-aware finetuning to regain accuracy. Furthermore, we propose a methodology to generate approximate accelerators for ViT models. Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations using a hardware-driven hand-crafted policy. Our evaluation demonstrates the efficacy of our methodology in achieving significant trade-offs between accuracy and power, resulting in substantial gains without compromising on performance.

93. Leveraging edge artificial intelligence for sustainable agriculture

El Jarroudi, M., Kouadio, L., Delfosse, P., Bock, C.H., Mahlein, A.K., Fettweis, X., Mercatoris, B., Adams, F., Lenné, J.M. and Hamdioui, S., 2024. Leveraging edge artificial intelligence for sustainable agriculture. Nature Sustainability, 7(7), pp.846-854.

Abstract: Effectively feeding a burgeoning world population is one of the main goals of sustainable agricultural practices. Digital technology, such as edge artificial intelligence (AI), has the potential to introduce substantial benefits to agriculture by enhancing farming practices that can improve agricultural production efficiency, yield, quality and safety. However, the adoption of edge AI faces several challenges, including the need for innovative and efficient edge AI solutions and greater investment in infrastructure and training, all compounded by various environmental, social and economic constraints. Here we provide a roadmap for leveraging edge AI at the intersection of food production and sustainability.

94. Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality With At-MRAM Neural Engine

Prasad, A.S., Scherer, M., Conti, F., Rossi, D., Di Mauro, A., Eggimann, M., Gómez, J.T., Li, Z., Sarwar, S.S., Wang, Z. and De Salvo, B., 2024. Siracusa: A 16 nm heterogenous RISC-V SoC for extended reality with At-MRAM neural engine. IEEE Journal of Solid-State Circuits, 59(7), pp.2055-2069.

Abstract: Extended reality (XR) applications are machine learning (ML)-intensive, featuring deep neural networks (DNNs) with millions of weights, tightly latency-bound (10–20 ms end-to-end), and power-constrained (low tens of mW average power). While ML performance and efficiency can be achieved by introducing neural engines within low-power systems-on-chip (SoCs), system-level power for nontrivial DNNs depends strongly on the energy of non-volatile memory (NVM) access for network weights. This work introduces Siracusa, a near-sensor heterogeneous SoC for next-generation XR devices manufactured in 16 nm CMOS. Siracusa couples an octa-core cluster of RISC-V digital signal processing (DSP) cores with a novel tightly coupled “At-Memory” integration between a state-of-the-art digital neural engine called N-EUREKA and an on-chip NVM based on magnetoresistive random access memory (MRAM), achieving 1.7× higher throughput and 3× better energy efficiency than XR SoCs using NVM as background memory. The fabricated SoC prototype achieves an area efficiency of 65.2 GOp/s/mm² and a peak energy efficiency of 8.84 TOp/J for DNN inference while supporting complex, heterogeneous application workloads, which combine ML with conventional signal processing and control.

95. HAETAE: Shorter Lattice-Based Fiat-Shamir Signatures

Cheon, J.H., Choe, H., Devevey, J., Güneysu, T., Hong, D., Krausz, M., Land, G., Möller, M., Stehlé, D. and Yi, M., 2024. HAETAE: Shorter lattice-based Fiat-Shamir signatures. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(3), pp.25-75.

Abstract: We present HAETAE (Hyperball bimodAl modulE rejecTion signAture schemE), a new lattice-based signature scheme. Like the NIST-selected Dilithium signature scheme, HAETAE is based on the Fiat-Shamir with Aborts paradigm, but our design choices target an improved complexity/compactness compromise that is highly relevant for many space-limited application scenarios. We primarily focus on reducing signature and verification key sizes so that signatures fit into one TCP or UDP datagram while preserving a high level of security against a variety of attacks. As a result, our scheme has signature and verification key sizes up to 39% and 25% smaller, respectively, compared to Dilithium. We provide a portable, constant-time reference implementation together with an optimized implementation using AVX2 instructions and an implementation with reduced stack size for the Cortex-M4. Moreover, we describe how to efficiently protect HAETAE against implementation attacks such as side-channel analysis, making it an attractive candidate for use in IoT and other embedded systems.

96. Voltage Aware Approximate CGRA Synthesis for Energy Efficient DNN Inference

Alexandris, G., Chaidos, P., Maras, A., de Bruin, B., Gomony, M.D., Corporaal, H., Soudris, D. and Xydis, S., 2026, April. Voltage Aware Approximate CGRA Synthesis for Energy Efficient DNN Inference. In 2026 Design, Automation & Test in Europe Conference (DATE) (pp. 1-7). IEEE.

Abstract: The ever-increasing complexity and operational diversity of modern Neural Networks (NNs) have caused the need for low-power and, at the same time, high-performance edge devices for AI applications. Coarse Grained Reconfigurable Architectures (CGRAs) form a promising design paradigm to address these challenges, delivering a close-to-ASIC performance while allowing for hardware programmability. In this paper, we introduce a novel end-to-end exploration and synthesis framework for approximate CGRA processors, enabling the transparent and optimized integration and mapping of approximate multiplication components into CGRAs. Our framework includes an exploration of state-of-the-art approximate multiplication units on the hardware side, along with a software exploration, based on a per-channel model analysis, that maps specific output features onto approximate components based on accuracy degradation constraints, utilizing also SW-based optimization techniques. This enables the optimization of the system’s energy consumption while retaining the accuracy above a certain threshold. At the circuit level, the integration of approximate components enables the creation of voltage islands that operate at reduced voltage levels, which is attributed to their inherently shorter critical paths. This key enabler allows us to effectively reduce the overall power consumption by an average of 30% across our analyzed architectures, compared to their baseline counterparts, while incurring only a minimal 2% area overhead. The proposed methodology was evaluated on the convolutional kernels of a widely used NN model, MobileNetV2, on the ImageNet dataset, demonstrating that the generated architectures can deliver up to 440 GOPS/W with relatively small output error during inference, outperforming several State-of-the-Art CGRA architectures in terms of throughput and energy efficiency.