
SCIENTIFIC PUBLICATIONS


Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

Rutishauser, G., Conti, F. and Benini, L., 2023, June. Free bits: Latency optimization of mixed-precision quantized neural networks on the edge. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 1-5). IEEE.

Abstract: Mixed-precision quantization, where a deep neural network’s layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to 28.6% reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline, even on systems with no hardware support for sub-byte arithmetic at negligible accuracy drop. Furthermore, we show the superiority of our approach with respect to differentiable search targeting reduced binary operation counts as a proxy for latency.

SALSA: Simulated Annealing-based Loop-Ordering Scheduler for DNN Accelerators

Jung, V.J., Symons, A., Mei, L., Verhelst, M. and Benini, L., 2023. SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators. arXiv preprint arXiv:2304.12931.

Abstract: To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, state-of-the-art (SotA) schedulers struggle to consistently provide optimal schedules in a reasonable time across all DNN-HW combinations.
This paper proposes SALSA, a fast dual-engine scheduler that generates optimal execution schedules for both even and uneven mappings. We introduce a new strategy combining exhaustive search with simulated annealing to address the dynamic nature of the loop-ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop, on 5 different DNNs; on average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively.
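
As a rough illustration of the simulated-annealing half of such a scheduler, the sketch below anneals over loop-order permutations against a stand-in cost function. This is our own minimal sketch; the loop names, parameters, and cost model are illustrative assumptions, not SALSA's actual dual-engine algorithm or hardware cost model.

```python
import math
import random

def anneal_loop_order(loops, cost, steps=2000, t0=1.0, alpha=0.995, seed=0):
    """Toy simulated-annealing search over loop orderings (permutations).

    `cost` maps an ordering (tuple) to a scalar; lower is better."""
    rng = random.Random(seed)
    order = list(loops)
    cur = cost(tuple(order))
    best, best_order = cur, tuple(order)
    t = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]      # propose a swap
        new = cost(tuple(order))
        if new <= cur or rng.random() < math.exp((cur - new) / max(t, 1e-9)):
            cur = new                                 # accept the move
            if cur < best:
                best, best_order = cur, tuple(order)
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        t *= alpha                                    # cool down
    return best_order, best

# Toy cost: distance to a fixed target ordering (stand-in for an energy model).
target = ("K", "C", "OY", "OX", "FY", "FX")
cost = lambda o: sum(a != b for a, b in zip(o, target))
order, c = anneal_loop_order(target[::-1], cost)
```

In a real scheduler the cost function would be an analytical energy/latency model of the accelerator rather than this toy distance.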

Dependability of Future Edge-AI Processors: Pandora’s Box

Gomony, M.D., Gebregiorgis, A., Fieback, M., Geilen, M., Stuijk, S., Richter-Brockmann, J., Bishnoi, R., Argo, S., Andradas, L.A., Güneysu, T. and Taouil, M., 2023, May. Dependability of Future Edge-AI Processors: Pandora’s Box. In 2023 IEEE European Test Symposium (ETS) (pp. 1-6). IEEE.

Abstract: This paper addresses one of the directions of the HORIZON EU CONVOLVE project, namely the dependability of smart edge processors based on computation-in-memory and emerging memristor devices such as RRAM. It discusses how this alternative computing paradigm will change the way manufacturing test is done. In addition, it describes how these emerging devices, which inherently suffer from many non-idealities, call for new solutions to ensure accurate and reliable edge computing. Moreover, the paper covers the security aspects of future edge processors and outlines the challenges and future directions.

PetaOps/W edge-AI µProcessors: Myth or reality?

Gomony, M.D., De Putter, F., Gebregiorgis, A., Paulin, G., Mei, L., Jain, V., Hamdioui, S., Sanchez, V., Grosser, T., Geilen, M. and Verhelst, M., 2023, April. PetaOps/W edge-AI µProcessors: Myth or reality?. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: With the rise of deep learning (DL), our world braces for artificial intelligence (AI) in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high-throughput, reliable, and secure AI processing at ultra-low power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SotA DL processing support and our project methodology, this paper presents 8 important design choices that largely impact the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.

Challenges and Opportunities of Security-Aware EDA

Feldtkeller, J., Sasdrich, P. and Güneysu, T., 2023. Challenges and Opportunities of Security-Aware EDA. ACM Transactions on Embedded Computing Systems, 22(3), pp.1-34.

Abstract: The foundation of every digital system is based on hardware in which security, as a core service of many applications, should be deeply embedded. Unfortunately, the knowledge of system security and efficient hardware design is spread over different communities and, due to the complex and ever-evolving nature of hardware-based system security, state-of-the-art security is not always implemented in state-of-the-art hardware. However, automated security-aware hardware design seems to be a promising solution to bridge the gap between the different communities. In this work, we systematize state-of-the-art research with respect to security-aware Electronic Design Automation (EDA) and identify a modern security-aware EDA framework. As part of this work, we consider threats in the form of information flow, timing and power side channels, and fault injection, which are the fundamental building blocks of more complex hardware-based attacks. Based on the existing research, we provide important observations and research questions to guide future research in support of modern, holistic, and security-aware hardware design infrastructures.

A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling

Krausz, M., Land, G., Richter-Brockmann, J. and Güneysu, T., 2023, May. A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling. In IACR International Conference on Public-Key Cryptography (pp. 94-124). Cham: Springer Nature Switzerland.

Abstract: The sampling of polynomials with fixed weight is a procedure required by round-4 Key Encapsulation Mechanisms (KEMs) for Post-Quantum Cryptography (PQC) standardization (BIKE, HQC, McEliece) as well as NTRU, Streamlined NTRU Prime, and NTRU LPRime. Recent attacks have shown in this context that side-channel leakage of sampling methods can be exploited for key recoveries. While countermeasures regarding such timing attacks have already been presented, still, there is no comprehensive work covering solutions that are also secure against power side channels. To close this gap, the contribution of this work is threefold: First, we analyze requirements for the different use cases of fixed weight sampling. Second, we demonstrate how all known sampling methods can be implemented securely against timing and power/EM side channels and propose performance-enhancing modifications. Furthermore, we propose a new, comparison-based methodology that outperforms existing methods in the masked setting for the three round-4 KEMs BIKE, HQC, and McEliece. Third, we present bitsliced and arbitrary-order masked software implementations and benchmark them for all relevant cryptographic schemes to be able to infer recommendations for each use case. Additionally, we provide a hardware implementation of our new method as a case study and analyze the feasibility of implementing the other approaches in hardware.

Combined Private Circuits – Combined Security Refurbished

Feldtkeller, J., Güneysu, T., Moos, T., Richter-Brockmann, J., Saha, S., Sasdrich, P. and Standaert, F.X., 2023, November. Combined private circuits-combined security refurbished. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (pp. 990-1004).

Abstract: Physical attacks are well-known threats to cryptographic implementations. While countermeasures against passive Side-Channel Analysis (SCA) and active Fault Injection Analysis (FIA) exist individually, protecting against their combination remains a significant challenge. A recent attempt at achieving joint security has been published at CCS 2022 under the name CINI-MINIS. The authors introduce relevant security notions and aim to construct arbitrary-order gadgets that remain trivially composable in the presence of a combined adversary. Yet, we show that all CINI-MINIS gadgets at any order are susceptible to a devastating attack with only a single fault and probe due to a lack of error correction modules in the compression. We explain the details of the attack, pinpoint the underlying problem in the constructions, propose an additional design principle, and provide new (fixed) provably secure and composable gadgets for arbitrary order. Luckily, the changes in the compression stage help us to save correction modules and registers elsewhere, making the resulting Combined Private Circuits (CPC) more secure and more efficient than the original ones. We also explain why the discovered flaws have been missed by the associated formal verification tool VERICA (TCHES 2022) and propose fixes to remove its blind spot. Finally, we explore alternative avenues to repair the compression stage without additional corrections based on non-completeness, i.e., constructing a compression that never recombines any secret. Yet, while this approach could have merit for low-order gadgets, it is, for now, hard to generalize and scales poorly to higher orders. We conclude that our refurbished arbitrary order CINI gadgets provide a solid foundation for further research.

Quantitative Fault Injection Analysis

Feldtkeller, J., Güneysu, T. and Schaumont, P., 2023, December. Quantitative Fault Injection Analysis. In International Conference on the Theory and Application of Cryptology and Information Security (pp. 302-336). Singapore: Springer Nature Singapore.

Abstract: Active fault injection is a credible threat to real-world digital systems computing on sensitive data. Arguing about security in the presence of faults is non-trivial, and state-of-the-art criteria are overly conservative and lack the ability of fine-grained comparison. However, comparing two alternative implementations for their security is required to find a satisfying compromise between security and performance. In addition, the comparison of alternative fault scenarios can help optimize the implementation of effective countermeasures. In this work, we use quantitative information flow analysis to establish a vulnerability metric for hardware circuits under fault injection that measures the severity of an attack in terms of information leakage. Potential use cases range from comparing implementations with respect to their vulnerability to specific fault scenarios to optimizing countermeasures. We automate the computation of our metric by integrating it into a state-of-the-art evaluation tool for physical attacks and provide new insights into the security under an active fault attacker.
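
The leakage-as-mutual-information idea can be made concrete on a toy masked circuit. The sketch below is our own minimal illustration of measuring fault severity in bits, not the paper's tool or its exact metric: a fault that disables a mask turns zero leakage into full leakage of a one-bit secret.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """I(S; O) in bits, estimated from equally likely (secret, observation) pairs."""
    n = len(pairs)
    ps = Counter(s for s, _ in pairs)            # marginal of the secret
    po = Counter(o for _, o in pairs)            # marginal of the observation
    pso = Counter(pairs)                         # joint distribution
    return sum((c / n) * math.log2((c / n) / ((ps[s] / n) * (po[o] / n)))
               for (s, o), c in pso.items())

# Toy masked circuit: observed output = secret XOR mask.
# Fault scenario: the mask register is forced (stuck-at) to 0.
secrets, masks = [0, 1], [0, 1]
fault_free = [(s, s ^ m) for s in secrets for m in masks]
faulted = [(s, s ^ 0) for s in secrets for _ in masks]   # mask stuck at 0

leak_ok = mutual_information_bits(fault_free)    # masking hides the secret
leak_fault = mutual_information_bits(faulted)    # the secret fully leaks
```

Here the metric cleanly separates the two scenarios: 0 bits of leakage without the fault, 1 bit (the full secret) with it.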

Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware

Land, G., Marotzke, A., Richter-Brockmann, J. and Güneysu, T., 2024. Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(1), pp.1-26.

Abstract: Streamlined NTRU Prime is a lattice-based Key Encapsulation Mechanism (KEM) that is, together with X25519, the default algorithm in OpenSSH 9. Based on lattice assumptions, it is assumed to be secure also against attackers with access to large-scale quantum computers. While Post-Quantum Cryptography (PQC) schemes have been subject to extensive research in recent years, challenges remain with respect to protection mechanisms against attackers that have additional side-channel information, such as the power consumption of a device processing secret data. As a countermeasure to such attacks, masking has been shown to be a promising and effective approach. For public-key schemes, including any recent PQC schemes, usually a mixture of Boolean and arithmetic techniques is applied on an algorithmic level. Our generic hardware implementation of Streamlined NTRU Prime decapsulation, however, follows an idea that until now was assumed to be efficiently applicable only to symmetric cryptography: gadget-based masking. The hardware design is transformed into a secure implementation by replacing each gate with a composable secure gadget that operates on uniform random shares of secret values. In our work, we show the feasibility of applying this approach also to PQC schemes and present the first Public-Key Cryptography (PKC) implementation, pre- and post-quantum, masked with the gadget-based approach, considering several trade-offs and design choices. By the nature of gadget-based masking, the implementation can be instantiated at arbitrary masking order. We synthesize our implementation both for Artix-7 Field-Programmable Gate Arrays (FPGAs) and 45nm Application-Specific Integrated Circuits (ASICs), yielding practically feasible results regarding area, randomness requirement, and latency. We verify the side-channel security of our implementation using formal verification on the one hand, and practically using Test Vector Leakage Assessment (TVLA) on the other.
Finally, we also analyze the applicability of our concept to Kyber and Dilithium, which will be standardized by the National Institute of Standards and Technology (NIST).

Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting

Miccini, R., Zniber, A., Laroche, C., Piechowiak, T., Schoeberl, M., Pezzarossa, L., Karrakchou, O., Sparsø, J. and Ghogho, M., 2023, September. Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.

Abstract: Although deep learning has made strides in the field of deep noise suppression, leveraging deep architectures on resource-constrained devices still proves challenging. Therefore, we present an early-exiting model based on nsNet2 that provides several levels of accuracy and resource savings by halting computations at different stages. Moreover, we adapt the original architecture by splitting the information flow to take the injected dynamism into account. We show the trade-offs between performance and computational complexity based on established metrics.

Differentiable Transportation Pruning

Li, Y., van Gemert, J.C., Hoefler, T., Moons, B., Eleftheriou, E. and Verhoef, B.E., 2023. Differentiable Transportation Pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16957-16967).

Abstract: Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities.
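
For intuition on how optimal transport can drive pruning, entropic-regularized transport yields a differentiable soft top-k selection via a few Sinkhorn iterations. The sketch below is a generic illustration under our own assumptions (parameter values, two-bin formulation), not the paper's exact scheme: each weight's unit of mass is transported to a "keep" or "prune" bin, and the keep-column of the transport plan acts as a soft mask.

```python
import numpy as np

def soft_topk_mask(scores, k, eps=0.05, iters=200):
    """Differentiable soft top-k via entropic optimal transport (Sinkhorn).

    Transports n unit-mass items to two bins: 'keep' (capacity k/n) and
    'prune' ((n - k)/n). The cost favors sending high-score items to 'keep'."""
    n = len(scores)
    C = np.stack([-scores, np.zeros(n)], axis=1)  # cost: rows=items, cols=(keep, prune)
    K = np.exp(-C / eps)                          # Gibbs kernel
    a = np.full(n, 1.0 / n)                       # source masses (one per item)
    b = np.array([k / n, (n - k) / n])            # bin capacities
    u, v = np.ones(n), np.ones(2)
    for _ in range(iters):                        # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]               # transport plan
    return n * P[:, 0]                            # soft mask in [0, 1], sums to k

w = np.array([0.9, -0.1, 0.5, 0.05, -0.7])
mask = soft_topk_mask(np.abs(w), k=2)             # soft-keeps the 2 largest |w|
```

Because every operation is smooth, gradients flow through the mask, which is the property that lets such a selection be trained end-to-end. Smaller `eps` makes the mask harder (closer to exact top-k).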

CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

Singh, S., Feliu, J., Acacio, M.E., Jimborean, A. and Ros, A., 2023, October. CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (pp. 1-13). IEEE.

Abstract: Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out-of-order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading processor (SMT), stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls when the LQ is full or its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ could be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2-way SMT processor with CELLO avoids almost all conservative searches to the LQ and significantly reduces its occupancy. CELLO allows i) to reduce the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) to shrink the LQ size from 192 to only 80 entries, reducing the LQ energy expenditure as much as 69% while performing on par with a mainstream LQ implementation.

Towards a tailored mixed-precision sub-8-bit quantization scheme for Gated Recurrent Units using Genetic Algorithms

Miccini, R., Cerioli, A., Laroche, C., Piechowiak, T., Sparsø, J. and Pezzarossa, L., 2024. Towards a tailored mixed-precision sub-8bit quantization scheme for Gated Recurrent Units using Genetic Algorithms. arXiv preprint arXiv:2402.12263.

Abstract: Despite the recent advances in model compression techniques for deep neural networks, deploying such models on ultra-low-power embedded devices still proves challenging. In particular, quantization schemes for Gated Recurrent Units (GRU) are difficult to tune due to their dependence on an internal state, preventing them from fully benefiting from sub-8bit quantization. In this work, we propose a modular integer quantization scheme for GRUs where the bit width of each operator can be selected independently. We then employ Genetic Algorithms (GA) to explore the vast search space of possible bit widths, simultaneously optimizing for model size and accuracy. We evaluate our methods on four different sequential tasks and demonstrate that mixed-precision solutions exceed homogeneous-precision ones in terms of Pareto efficiency. Our results show a model size reduction between 25% and 55% while maintaining an accuracy comparable with the 8-bit homogeneous equivalent.
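
A minimal single-objective genetic search over per-operator bit widths might look like the sketch below. This is our own illustration: the fitness function is a synthetic stand-in, whereas the paper evaluates real model size and task accuracy and performs multi-objective (Pareto) selection.

```python
import random

def ga_bitwidths(n_ops, fitness, pop=24, gens=40, pmut=0.2, seed=0):
    """Toy genetic search over per-operator bit widths in {2, 4, 6, 8}.

    `fitness` scores a bit-width tuple; higher is better."""
    rng = random.Random(seed)
    choices = (2, 4, 6, 8)
    P = [tuple(rng.choice(choices) for _ in range(n_ops)) for _ in range(pop)]
    for _ in range(gens):
        P.sort(key=fitness, reverse=True)
        elite = P[: pop // 2]                          # keep the best half
        children = []
        while len(children) < pop - len(elite):
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_ops)              # one-point crossover
            child = list(a[:cut] + b[cut:])
            for i in range(n_ops):                     # per-gene mutation
                if rng.random() < pmut:
                    child[i] = rng.choice(choices)
            children.append(tuple(child))
        P = elite + children
    return max(P, key=fitness)

# Synthetic fitness: an "accuracy" term that saturates at 4 bits minus a
# size penalty, so the optimum is all operators at 4 bits.
def fitness(bits):
    acc = sum(min(b, 4) for b in bits)
    size = sum(bits)
    return acc - 0.3 * size

best = ga_bitwidths(6, fitness)
```

Replacing the scalar fitness with non-dominated sorting over (size, accuracy) pairs turns this into the Pareto-front search the abstract describes.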

Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis

Sarda, G.M., Shah, N., Bhattacharjee, D., Debacker, P. and Verhelst, M., 2023, October. Optimising GPGPU Execution Through Runtime Micro-Architecture Parameter Analysis. In 2023 IEEE International Symposium on Workload Characterization (IISWC) (pp. 226-228). IEEE.

Abstract: GPGPU execution analysis has always been tied to closed-source, proprietary benchmarking tools that provide high-level, non-exhaustive, and/or statistical information, preventing a thorough understanding of bottlenecks and optimization possibilities. Open-source hardware platforms offer opportunities to overcome such limits and co-optimize the full hardware-mapping-algorithm compute stack. Yet, so far, this has remained under-explored. In this work, we exploit micro-architecture parameter analysis to develop a hardware-aware, runtime mapping technique for OpenCL kernels on the open Vortex RISC-V GPGPU. Our method is based on trace observations and targets optimal hardware resource utilization to achieve superior performance and flexibility compared to hardware-agnostic mapping approaches. The technique was validated on different architectural GPU configurations across several OpenCL kernels. Overall, our approach significantly enhances the performance of the open-source Vortex GPGPU, contributing to unlocking its potential and usability.

DeFiNES: Enabling Fast Exploration of the Depth-first Scheduling Space for DNN Accelerators through Analytical Modeling

Mei, L., Goetschalckx, K., Symons, A. and Verhelst, M., 2023, February. DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling. In 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (pp. 570-583). IEEE.

Abstract: DNN workloads can be scheduled onto DNN accelerators in many different ways: from layer-by-layer scheduling to cross-layer depth-first scheduling (a.k.a. layer fusion, or cascaded execution). This results in a very broad scheduling space, with each schedule leading to varying hardware (HW) costs in terms of energy and latency. To rapidly explore this vast space for a wide variety of hardware architectures, analytical cost models are crucial to estimate scheduling effects on the HW level. However, state-of-the-art cost models lack support for exploring the complete depth-first scheduling space, for instance focusing only on activations while ignoring weights, or modeling only DRAM accesses while overlooking on-chip data movements. These limitations prevent researchers from systematically and accurately understanding the depth-first scheduling space. After formalizing this design space, this work proposes a unified modeling framework, DeFiNES, for layer-by-layer and depth-first scheduling to fill in the gaps. DeFiNES enables analytically estimating the hardware cost for possible schedules in terms of both energy and latency, while considering data access at every memory level. This is done for each schedule and HW architecture under study by optimally choosing the active part of the memory hierarchy per unique combination of operand, layer, and feature map tile. The hardware costs are estimated, taking into account both data computation and data copy phases. The analytical cost model is validated against measured data from a taped-out depth-first DNN accelerator, DepFiN, showing good modeling accuracy at the end-to-end neural network level. A comparison with the generalized state-of-the-art demonstrates up to 10× better solutions found with DeFiNES.

Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms

Colleman, S., Symons, A., Jung, V.J. and Verhelst, M., 2024, April. Optimizing Layer-Fused Scheduling of Transformer Networks on Multi-accelerator Platforms. In 2024 25th International Symposium on Quality Electronic Design (ISQED) (pp. 1-6). IEEE.

Abstract: The impact of transformer networks is booming, yet they come with significant computational complexity. It is therefore essential to understand how to optimally map and execute these networks on modern neural processor hardware. So far, literature on transformer scheduling optimization has focused on deployment on GPUs and specific ASICs. This work enables extensive hardware/mapping exploration by extending the DSE framework Stream to support transformers across a wide variety of hardware architectures and different execution schedules. After validation, we explore the optimal schedule for transformer layers/attention heads and investigate whether layer fusion is beneficial to improve latency, energy, or memory requirements. Our study shows that the memory requirements for active feature data can be drastically reduced by adapting the execution schedule based on the input size of the attention head.

Analog or Digital In-Memory Computing? Benchmarking Through Quantitative Modeling

Sun, J., Houshmand, P. and Verhelst, M., 2023, October. Analog or Digital In-Memory Computing? Benchmarking Through Quantitative Modeling. In 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD) (pp. 1-9). IEEE.

Abstract: In-Memory Computing (IMC) has emerged as a promising paradigm for energy-efficient, throughput-efficient and area-efficient machine learning at the edge. However, the differences in hardware architectures, array dimensions, and fabrication technologies among published IMC realizations have made it difficult to grasp their relative strengths. Moreover, previous studies have primarily focused on exploring and benchmarking the peak performance of a single IMC macro rather than full system performance on real workloads. This paper aims to address the lack of a quantitative comparison of Analog In-Memory Computing (AIMC) and Digital In-Memory Computing (DIMC) processor architectures. We propose an analytical IMC performance model that is validated against published implementations and integrated into a system-level exploration framework for comprehensive performance assessments on different workloads with varying IMC configurations. Our experiments show that while DIMC generally has higher computational density than AIMC, AIMC with large macro sizes may have better energy efficiency than DIMC on convolutional layers and pointwise layers, which can exploit high spatial unrolling. On the other hand, DIMC with small macro sizes outperforms AIMC on depthwise layers, which feature limited spatial unrolling opportunities inside a macro.

CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories

Shi, M., Colleman, S., VanDeMieroop, C., Joseph, A., Meijer, M., Dehaene, W. and Verhelst, M., 2023, April. CMDS: Cross-layer Dataflow Optimization for DNN Accelerators Exploiting Multi-bank Memories. In 2023 24th International Symposium on Quality Electronic Design (ISQED) (pp. 1-8). IEEE.

Abstract: Deep neural networks (DNN) use a wide range of network topologies to achieve high accuracy within diverse applications. This model diversity makes it impossible to identify a single “dataflow” (execution schedule) that performs optimally across all possible layers and network topologies. Several frameworks support the exploration of the best dataflow for a given DNN layer and hardware. However, switching the dataflow from one layer to the next within one DNN model can result in hardware inefficiencies stemming from memory data layout mismatches among the layers. Unfortunately, all existing frameworks treat each layer independently and typically model memories as black boxes (one large monolithic wide memory), which ignores the data layout and cannot deal with the data layout dependencies of sequential layers. These frameworks are therefore incapable of cross-layer dataflow optimization. This work hence aims at cross-layer dataflow optimization, taking the data dependency and data layout reshuffling overheads among layers into account. Additionally, we propose to exploit the multi-bank memories typically present in modern DNN accelerators towards efficiently reshuffling data to support more dataflows at low overhead. These innovations are supported through the Cross-layer Memory-aware Dataflow Scheduler (CMDS). CMDS can model DNN execution energy/latency while considering the different data layout requirements due to the varied optimal dataflow of layers. Compared with the state-of-the-art (SOTA), which performs layer-optimized memory-unaware scheduling, CMDS achieves up to 5.5× energy reduction and 1.35× latency reduction with negligible hardware cost.

An Empirical Evaluation of Sliding Windows on Siren Detection Task using Spiking Neural Networks

Kshirasagar, S., Guntoro, A. and Mayr, C., 2024. An Empirical Evaluation of Sliding Windows on Siren Detection Task Using Spiking Neural Networks. Advances in Signal Processing and Artificial Intelligence, p.112.

Abstract: Anomalous acoustic cues like siren sounds, when undetected, can lead to road-safety issues such as collisions or accidents. Auditory perception systems are resource-bound when deployed on power-constrained sensory edge devices. Spiking neural networks (SNNs) promise brain-like computing with high energy efficiency. This work presents a quantitative analysis of how sliding-window variation affects performance on an acoustic anomaly detection task for siren sounds. We perform FFT-based pre-processing and feed Mel-spectrogram features as input to a recurrent spiking neural network. The SNN model in this work comprises leaky integrate-and-fire (LIF) neurons in the hidden layer and a single readout with a leaky integrator cell. The motivation of this research is to understand how sliding windows affect the encoding behavior of spiking neurons. We conduct experiments with different window sizes and overlap ratios within the windows. We report performance measures such as accuracy and onset latency to provide insight into the choice of an optimal window.
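
The windowing stage whose parameters are varied can be sketched as plain array slicing. This is our own illustration of sliding-window framing as it would feed FFT/Mel feature extraction; the window length and overlap ratio below are illustrative values, not the paper's settings.

```python
import numpy as np

def frame_signal(x, win_len, overlap):
    """Slice a 1-D signal into sliding windows with the given overlap ratio.

    Returns an array of shape (n_frames, win_len); trailing samples that do
    not fill a complete window are dropped."""
    hop = max(1, int(win_len * (1.0 - overlap)))     # samples between windows
    n = max(0, (len(x) - win_len) // hop + 1)        # number of full windows
    idx = np.arange(win_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

x = np.arange(10.0)                                  # stand-in audio signal
frames = frame_signal(x, win_len=4, overlap=0.5)     # hop = 2 samples
```

Each row of `frames` would then go through the FFT/Mel pipeline; sweeping `win_len` and `overlap` reproduces the kind of experiment grid the abstract describes.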

COAC: Cross-Layer Optimization of Accelerator Configurability for Efficient CNN Processing

Colleman, S., Shi, M. and Verhelst, M., 2023. COAC: Cross-Layer Optimization of Accelerator Configurability for Efficient CNN Processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

Abstract: To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer’s spatial (dataflow parallelization) and temporal unrolling (TU, execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This article, therefore, presents cross-layer optimization of accelerator configurability (COAC), a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against the resulting energy and latency savings for end-to-end inference. COAC not only provides a systematic analysis of the architectural overhead as a function of the supported spatial unrollings (SUs), but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% energy-delay-product (EDP) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.

ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators

Yin, J., Mei, L., Guntoro, A. and Verhelst, M., 2023, November. ACCO: Automated Causal CNN Scheduling Optimizer for Real-Time Edge Accelerators. In 2023 IEEE 41st International Conference on Computer Design (ICCD) (pp. 391-398). IEEE.

Abstract: Spatio-Temporal Convolutional Neural Networks (ST-CNN) allow extending CNN capabilities from image processing to consecutive temporal-pattern recognition. Generally, state-of-the-art (SotA) ST-CNNs inflate the feature maps and weights from well-known CNN backbones to represent the additional time dimension. However, edge computing applications would suffer tremendously from such large computation or memory overhead. Fortunately, the overlapping nature of ST-CNN enables various optimizations, such as the dilated causal convolution structure and Depth-First (DF) layer fusion to reuse the computation between time steps and CNN sliding windows, respectively. Yet, no hardware-aware approach has been proposed that jointly explores the optimal strategy from a scheduling as well as a hardware point of view. To this end, we present ACCO, an automated optimizer that explores efficient Causal CNN transformation and DF scheduling for ST-CNNs on edge hardware accelerators. By cost-modeling the computation and data movement on the accelerator architecture, ACCO automatically selects the best scheduling strategy for the given hardware-algorithm target. Compared to the fixed dilated causal structure, ST-CNNs with ACCO reach an ~8.4x better Energy-Delay-Product. Meanwhile, ACCO improves layer-fusion optima by ~20% compared to the SotA DF exploration toolchain. When jointly optimizing ST-CNN on the temporal and spatial dimensions, ACCO’s scheduling outcomes are on average 19x faster and 37x more energy-efficient than spatial DF schemes.
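The computation reuse between time steps that the abstract mentions can be pictured with a small sketch (an illustrative streaming dilated causal 1-D convolution, not ACCO itself; the weights and dilation are made up):

```python
from collections import deque

# Streaming dilated causal 1-D convolution (illustrative sketch):
# at each time step only the newest output is computed, reusing
# buffered past inputs instead of re-running the whole window.

def make_streaming_causal_conv(weights, dilation):
    k = len(weights)
    span = (k - 1) * dilation + 1            # receptive field in samples
    buf = deque([0.0] * span, maxlen=span)   # zero "padding" for warm-up

    def step(x_t):
        buf.append(x_t)                      # O(1) buffer update per time step
        # Taps are spaced `dilation` samples apart, ending at x_t.
        return sum(w * buf[i * dilation] for i, w in enumerate(weights))
    return step

# Hypothetical 3-tap kernel with dilation 2.
conv = make_streaming_causal_conv([0.5, -1.0, 2.0], dilation=2)
outputs = [conv(x) for x in [1.0, 2.0, 3.0, 4.0]]  # one output per new sample
```

Because each new sample triggers only one output computation over an already-buffered window, per-step work stays constant regardless of how long the stream runs.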

Reliable and Energy-Efficient Diabetic Retinopathy Screening Using Memristor-Based Neural Networks

Diware, S., Chilakala, K., Joshi, R.V., Hamdioui, S. and Bishnoi, R., 2024. Reliable and Energy-efficient Diabetic Retinopathy Screening using Memristor-based Neural Networks. IEEE Access.

Abstract: Diabetic retinopathy (DR) is a leading cause of permanent vision loss worldwide. It refers to irreversible retinal damage caused by elevated glucose levels and blood pressure. Regular screening for DR can facilitate its early detection and timely treatment. Neural network-based DR classifiers can be leveraged to achieve such screening in a convenient and automated manner. However, these classifiers suffer from a reliability issue, where they exhibit strong performance during development but degraded performance after deployment. Moreover, they do not provide supplementary information about the prediction outcome, which severely limits their widespread adoption. Furthermore, energy-efficient deployment of these classifiers on edge devices remains unaddressed, which is crucial to enhance their global accessibility. In this paper, we present reliable and energy-efficient hardware for DR detection, suitable for deployment on edge devices. We first develop a DR classification model using custom training data that incorporates diverse image quality and image sources along with improved class balance. This enables our model to effectively handle both on-field variations in retinal images and minority DR classes, enhancing its post-deployment reliability. We then propose a pseudo-binary classification scheme to further improve the model performance and provide supplementary information about the model prediction. Additionally, we present an energy-efficient hardware design for our model using memristor-based computation-in-memory, to facilitate its deployment on edge devices. Our proposed approach achieves reliable DR classification with three orders of magnitude reduction in energy consumption over state-of-the-art hardware platforms.

Alternate Path μ-op Cache Prefetching

S. Singh, A. Perais, A. Jimborean and A. Ros, “Alternate Path μ-op Cache Prefetching,” 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), Buenos Aires, Argentina, 2024, pp. 1230-1245, doi: 10.1109/ISCA59077.2024.00092.

Abstract: Datacenter applications are well-known for their large code footprints. This has caused frontend design to evolve by implementing decoupled fetching and large prediction structures – branch predictors, Branch Target Buffers (BTBs) – to mitigate the stagnating size of the instruction cache by prefetching instructions well in advance. In addition, many designs feature a micro-operation (μ-op) cache, which primarily provides power savings by bypassing the instruction cache and decoders once warmed up. However, this μ-op cache often has lower reach than the instruction cache, and it is not filled speculatively by the decoupled fetcher. As a result, the μ-op cache is often over-subscribed by datacenter applications, up to the point of becoming a burden. This paper first shows that because of this pressure, blindly prefetching into the μ-op cache using state-of-the-art standalone prefetchers would not provide significant gains. As a consequence, this paper proposes to prefetch only critical μ-ops into the μ-op cache, by focusing on execution points where the μ-op cache provides the most gains: pipeline refills. Concretely, we use hard-to-predict conditional branches as indicators that a pipeline refill is likely to happen in the near future, and prefetch into the μ-op cache the μ-ops that belong to the path opposed to the predicted path, which we call the alternate path. Identifying hard-to-predict branches requires no additional state if the branch predictor confidence is used to classify branches. With extra alternate branch predictors of limited budget (8.95 KB to 12.95 KB), our proposal provides average speedups of 1.9% to 2% and as high as 12% on a subset of CVP-1 traces.
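The triggering logic described in the abstract, prefetching the not-predicted path only when predictor confidence is low, can be sketched as follows (an illustrative software model, not the paper's hardware; the threshold, addresses, and helper names are hypothetical):

```python
# Confidence-gated alternate-path prefetch (illustrative sketch):
# when the branch predictor's confidence counter is low, the μ-ops
# on the path OPPOSITE the prediction are decoded into the μ-op
# cache, anticipating a likely pipeline refill on a misprediction.

CONF_THRESHOLD = 2  # hypothetical saturating-counter cutoff

def maybe_prefetch_alternate(branch, uop_cache, fetch_uops):
    """branch: dict with 'predicted_target', 'fallthrough',
    'prediction_taken', and 'confidence' fields (all hypothetical)."""
    if branch["confidence"] >= CONF_THRESHOLD:
        return None  # prediction likely correct; avoid polluting the μ-op cache
    # Alternate path = the target the predictor did NOT choose.
    alt = (branch["fallthrough"] if branch["prediction_taken"]
           else branch["predicted_target"])
    uop_cache[alt] = fetch_uops(alt)  # decode-and-fill ahead of a refill
    return alt

cache = {}
branch = {"predicted_target": 0x400, "fallthrough": 0x104,
          "prediction_taken": True, "confidence": 1}
alt = maybe_prefetch_alternate(branch, cache, lambda pc: f"uops@{pc:#x}")
```

The key design point mirrored here is that filtering by confidence, rather than prefetching blindly, is what keeps an over-subscribed μ-op cache from being thrashed.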

Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps

Nilsson, M., Miccini, R., Laroche, C., Piechowiak, T., Zenke, F. (2024) Resource-Efficient Speech Quality Prediction through Quantization Aware Training and Binary Activation Maps. Proc. Interspeech 2024, 2975-2979, doi: 10.21437/Interspeech.2024-1979

Abstract: As speech processing systems in mobile and edge devices become more commonplace, the demand for unintrusive speech quality monitoring increases. Deep learning methods provide high-quality estimates of objective and subjective speech quality metrics. However, their significant computational requirements are often prohibitive on resource-constrained devices. To address this issue, we investigated binary activation maps (BAMs) for speech quality prediction on a convolutional architecture based on DNSMOS. We show that the binary activation model with quantization aware training matches the predictive performance of the baseline model. It further allows using other compression techniques. Combined with 8-bit weight quantization, our approach results in a 25-fold memory reduction during inference, while replacing almost all dot products with summations. Our findings show a path toward substantial resource savings by supporting mixed-precision binary multiplication in hardware and software.
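A minimal sketch of why binary activation maps turn dot products into summations (illustrative only, not the paper's DNSMOS-based architecture; layer sizes and the hard-threshold binarization are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_activation(x):
    """Heaviside step: each activation becomes a 0/1 entry of a binary map."""
    return (x > 0).astype(np.float32)

# Hypothetical small layer, for illustration only.
x = rng.standard_normal(16).astype(np.float32)        # pre-activations
w = rng.standard_normal((8, 16)).astype(np.float32)   # next-layer weights

a = binary_activation(x)   # binary activation map
y_dot = w @ a              # ordinary matrix-vector dot product

# With binary activations, each output is just the SUM of the weights
# whose corresponding activation is 1 -- no multiplications needed.
y_sum = w[:, a == 1].sum(axis=1)
```

During quantization-aware training the hard threshold is typically paired with a straight-through gradient estimator; only the inference-time equivalence is shown here.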

Late Breaking Results: Language-level QoR modeling for High-Level Synthesis

Masouros, D., Ferikoglou, A., Zervakis, G., Xydis, S., & Soudris, D. (2024). Language-level QoR Modeling for High-Level Synthesis. Zenodo. https://doi.org/10.5281/zenodo.11582090

Abstract: This paper proposes a language-level modeling approach for High-Level Synthesis based on the state-of-the-art Transformer architecture. Our approach estimates the performance and required resources of HLS applications directly from the source code when different synthesis directives, in terms of HLS #pragmas, are applied. Results show that the proposed architecture achieves 96.02% accuracy for predicting the feasibility class of applications and average R² scores of 0.95 and 0.91 for predicting the actual performance and required resources, respectively.

ESAM: Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning

Huijbregts, L., et al., 2024. Energy-efficient SNN Architecture using 3nm FinFET Multiport SRAM-based CIM with Online Learning. In Proceedings of the 61st ACM/IEEE Design Automation Conference.

Abstract: Current Artificial Intelligence (AI) computation systems face challenges, primarily from the memory-wall issue, limiting overall system-level performance, especially for Edge devices with constrained battery budgets, such as smartphones, wearables, and Internet-of-Things sensor systems. In this paper, we propose a new SRAM-based Compute-In-Memory (CIM) accelerator optimized for Spiking Neural Network (SNN) inference. Our proposed architecture employs a multiport SRAM design with multiple decoupled Read ports to enhance the throughput and Transposable Read-Write ports to facilitate online learning. Furthermore, we develop an Arbiter circuit for efficient data processing and port allocation during computation. Results for a 128×128 array in 3nm FinFET technology demonstrate a 3.1× improvement in speed and a 2.2× enhancement in energy efficiency with our proposed multiport SRAM design compared to the traditional single-port design. At system level, a throughput of 44 MInf/s at 607 pJ/Inf and 29 mW is achieved.

Data-driven HLS optimization for reconfigurable accelerators

Ferikoglou, A., Kakolyris, A., Kypriotis, V., Masouros, D., Soudris, D. and Xydis, S., 2024, June. Data-driven HLS optimization for reconfigurable accelerators. In Proceedings of the 61st ACM/IEEE Design Automation Conference (pp. 1-6).

Abstract: High-Level Synthesis (HLS) has played a pivotal role in making FPGAs accessible to a broader audience by facilitating high-level device programming and rapid microarchitecture customization through the use of directives. However, manually selecting the right directives can be a formidable challenge for programmers lacking a hardware background. This paper introduces an ultra-fast, knowledge-based HLS design optimization method that automatically extracts and applies the most promising directive configurations to the original source code. This optimization approach is entirely data-driven, offering a generalized HLS tuning solution without reliance on Quality of Result (QoR) models or meta-heuristics. We design, implement, and evaluate our methodology using over 100 applications sourced from well-established benchmark suites and GitHub repositories, all running on a Xilinx ZCU104 FPGA. The results are promising, including geometric mean speedups of 7.2× and 1.35× compared to designer-optimized designs and resource over-provisioning strategies, respectively. Additionally, it demonstrates a high design feasibility score and maintains an average inference latency of 38 ms. Comparative analysis with traditional genetic algorithm-based Design Space Exploration (DSE) methods and State-of-the-Art (SoA) approaches reveals that it produces designs of similar quality but at speeds 2-3 orders of magnitude faster. This suggests that it is a highly promising solution for ultra-fast and automated HLS optimization.

Falcon: A Scalable Analytical Cache Model

Pitchanathan, A., Grover, K. and Grosser, T., 2024. Falcon: A scalable analytical cache model. Proceedings of the ACM on Programming Languages, 8(PLDI), pp. 1854-1878.

Abstract: Compilers often use performance models to decide how to optimize code. This is often preferred over using hardware performance measurements, since hardware measurements can be expensive, limited by hardware availability, and make the output of compilation non-deterministic. Analytical models, on the other hand, serve as efficient and noise-free performance indicators. Since many optimizations focus on improving memory performance, memory cache miss rate estimations can serve as an effective and noise-free performance indicator for superoptimizers, worst-case execution time analyses, manual program optimization, and many other performance-focused use cases. Existing methods to model the cache behavior of affine programs work on small programs such as those in the Polybench benchmark but do not scale to the larger programs we would like to optimize in production, which can be orders of magnitude bigger by lines of code. These analytical approaches hand off the whole program to a Presburger solver and perform expensive mathematical operations on the huge resulting formulas. We develop a scalable cache model for affine programs that splits the computation into smaller pieces that do not trigger the worst-case asymptotic behavior of these solvers. We evaluate our approach on 46 TorchVision neural networks, finding that our model has a geomean runtime of 44.9 seconds compared to over 32 minutes for the state-of-the-art prior cache model; the latter figure actually understates the true value, because the prior model hit our four-hour time limit on 54% of the networks, a limit our tool never reached. Our model exploits parallelism effectively: running it on sixteen cores is 8.2x faster than running it single-threaded. While the state-of-the-art model takes over four hours to analyze a majority of the benchmark programs, Falcon produces results in at most 3 minutes and 3 seconds; moreover, after a local modification to the program being analyzed, our model efficiently updates the predictions in 513 ms on average (geomean). Thus, we provide the first scalable analytical cache model.

Verifying Peephole Rewriting In SSA Compiler IRs

Bhat, S., Keizer, A., Hughes, C., Goens, A. and Grosser, T., 2024. Verifying Peephole Rewriting In SSA Compiler IRs. arXiv preprint arXiv:2407.03685.

Abstract: There is an increasing need for domain-specific reasoning in modern compilers. This has fueled the use of tailored intermediate representations (IRs) based on static single assignment (SSA), like in the MLIR compiler framework. Interactive theorem provers (ITPs) provide strong guarantees for the end-to-end verification of compilers (e.g., CompCert). However, modern compilers and their IRs evolve at a rate that makes proof engineering alongside them prohibitively expensive. Nevertheless, well-scoped push-button automated verification tools such as the Alive peephole verifier for LLVM-IR gained recognition in domains where SMT solvers offer efficient (semi-)decision procedures. In this paper, we aim to combine the convenience of automation with the versatility of ITPs for verifying peephole rewrites across domain-specific IRs. We formalize a core calculus for SSA-based IRs that is generic over the IR and covers so-called regions (nested scoping used by many domain-specific IRs in the MLIR ecosystem). Our mechanization in the Lean proof assistant provides a user-friendly frontend for translating MLIR syntax into our calculus. We provide scaffolding for defining and verifying peephole rewrites, offering tactics to eliminate the abstraction overhead of our SSA calculus. We prove correctness theorems about peephole rewriting, as well as two classical program transformations. To evaluate our framework, we consider three use cases from the MLIR ecosystem that cover different levels of abstraction: (1) bitvector rewrites from LLVM, (2) structured control flow, and (3) fully homomorphic encryption. We envision that our mechanization provides a foundation for formally verified rewrites on new domain-specific IRs.

HTVM: Efficient Neural Network Deployment On Heterogeneous TinyML Platforms

Van Delm, J., Vandersteegen, M., Burrello, A., Sarda, G.M., Conti, F., Pagliari, D.J., Benini, L. and Verhelst, M., 2023, July. HTVM: Efficient neural network deployment on heterogeneous TinyML platforms. In 2023 60th ACM/IEEE Design Automation Conference (DAC) (pp. 1-6). IEEE.

Abstract: Optimal deployment of deep neural networks (DNNs) on state-of-the-art Systems-on-Chips (SoCs) is crucial for tiny machine learning (TinyML) at the edge. The complexity of these SoCs makes deployment non-trivial, as they typically contain multiple heterogeneous compute cores with limited, programmer-managed memory to optimize latency and energy efficiency. We propose HTVM – a compiler that merges TVM with DORY to maximize the utilization of heterogeneous accelerators and minimize data movements. HTVM allows deploying the MLPerf™ Tiny suite on DIANA, an SoC with a RISC-V CPU, and digital and analog compute-in-memory AI accelerators, at 120x improved performance over plain TVM deployment.

Decoupled Access-Execute Enabled DVFS for TinyML Deployments on STM32 Microcontrollers

Alvanaki, E.L., Katsaragakis, M., Masouros, D., Xydis, S. and Soudris, D., 2024, March. Decoupled Access-Execute enabled DVFS for tinyML deployments on STM32 microcontrollers. In 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: Over the last few years, the number of Machine Learning (ML) inference applications deployed at the Edge has been growing rapidly. Internet of Things (IoT) devices and microcontrollers (MCUs) are becoming more and more mainstream in everyday activities. In this work we focus on the family of STM32 MCUs. We propose a novel methodology for CNN deployment on the STM32 family, focusing on power optimization through effective clock exploration and configuration and decoupled access-execute convolution kernel execution. Our approach is enhanced with optimization of power consumption through Dynamic Voltage and Frequency Scaling (DVFS) under various latency constraints, composing an NP-complete optimization problem. We compare our approach against the state-of-the-art TinyEngine inference engine, as well as TinyEngine coupled with the power-saving modes of the STM32 MCUs, showing that we can achieve up to 25.2% lower energy consumption for varying QoS levels.
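The DVFS trade-off at the heart of this abstract can be sketched in a few lines (an illustrative model with made-up operating points and constants, not STM32 measurements): among configurations whose latency meets the QoS deadline, pick the one with the lowest energy.

```python
# Illustrative DVFS operating-point selection (hypothetical numbers).
# (frequency in MHz, core voltage in V) -- assumed operating points
points = [(48, 1.0), (80, 1.1), (120, 1.2), (168, 1.3)]
CYCLES = 4.2e6   # cycles for one inference (assumed)
CAP = 1.1e-9     # effective switched capacitance in F (assumed)

def latency_s(f_mhz):
    return CYCLES / (f_mhz * 1e6)

def energy_j(f_mhz, vdd):
    # Dynamic energy: E = C * V^2 * cycles. Frequency cancels out of
    # power * time, so running at lower voltage (and hence lower
    # frequency) is what actually saves energy -- if the deadline allows.
    return CAP * vdd**2 * CYCLES

def best_point(deadline_s):
    feasible = [(f, v) for (f, v) in points if latency_s(f) <= deadline_s]
    return min(feasible, key=lambda p: energy_j(*p)) if feasible else None
```

Under a loose deadline the slowest, lowest-voltage point wins; tightening the deadline forces progressively faster, more energy-hungry points, which is the latency-constrained optimization the paper explores.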

Dynamic Early Exiting Predictive Coding Neural Networks

Zniber, A., Karrakchou, O. and Ghogho, M., 2023. Dynamic Early Exiting Predictive Coding Neural Networks. arXiv preprint arXiv:2309.02022.

Abstract: Internet of Things (IoT) sensors are nowadays heavily utilized in various real-world applications, ranging from wearables and health monitoring to smart buildings and agrotechnology. With the huge amounts of data generated by these tiny devices, Deep Learning (DL) models have been extensively used to enhance them with intelligent processing. However, with the push for smaller and more accurate devices, DL models have become too heavy to deploy. It is thus necessary to incorporate the hardware’s limited resources into the design process. Therefore, inspired by the human brain, known for its efficiency and low power consumption, we propose a shallow bidirectional network based on predictive coding theory and dynamic early exiting that halts further computation once a performance threshold is surpassed. We achieve accuracy comparable to VGG-16 in image classification on CIFAR-10 with fewer parameters and less computational complexity.
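The generic early-exit mechanism can be sketched as follows (an illustrative toy model, not the paper's predictive-coding network; the block count, sizes, and confidence threshold are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def early_exit_inference(x, blocks, exit_heads, threshold=0.9):
    """Run blocks sequentially; halt once an exit head is confident enough."""
    h = x
    for i, (block, head) in enumerate(zip(blocks, exit_heads)):
        h = block(h)
        probs = softmax(head(h))
        if probs.max() >= threshold:   # performance threshold surpassed
            return probs.argmax(), i   # skip the remaining computation
    return probs.argmax(), len(blocks) - 1

# Hypothetical toy model: three random "blocks" with linear exit heads.
rng = np.random.default_rng(1)
blocks = [lambda h, W=rng.standard_normal((8, 8)): np.tanh(W @ h)
          for _ in range(3)]
heads = [lambda h, W=rng.standard_normal((10, 8)): W @ h
         for _ in range(3)]

label, exit_idx = early_exit_inference(rng.standard_normal(8), blocks, heads)
```

Easy inputs exit at shallow depths and pay only a fraction of the full compute cost, which is exactly the property that makes early exiting attractive on resource-constrained IoT hardware.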

SECOMP: Formally Secure Compilation of Compartmentalized C Programs

Thibault, J., Blanco, R., Lee, D., Argo, S., Azevedo de Amorim, A., Georges, A.L., Hriţcu, C. and Tolmach, A., 2024, December. SECOMP: Formally Secure Compilation of Compartmentalized C Programs. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (pp. 1061-1075).

Abstract: Undefined behavior in C often causes devastating security vulnerabilities. One practical mitigation is compartmentalization, which allows developers to structure large programs into mutually distrustful compartments with clearly specified privileges and interactions. In this paper we introduce SECOMP, a compiler for compartmentalized C code that comes with machine-checked proofs guaranteeing that the scope of undefined behavior is restricted to the compartments that encounter it and become dynamically compromised. These guarantees are formalized as the preservation of safety properties against adversarial contexts, a secure compilation criterion similar to full abstraction, and this is the first time such a strong criterion is proven for a mainstream programming language. To achieve this we extend the languages of the CompCert verified C compiler with isolated compartments that can only interact via procedure calls and returns, as specified by cross-compartment interfaces. We adapt the passes and optimizations of CompCert as well as their correctness proofs to this compartment-aware setting. We then use compiler correctness as an ingredient in a larger secure compilation proof that involves several proof engineering novelties, needed to scale formally secure compilation up to a C compiler.