
OUTPUTS


CONVOLVE-related scientific publications from the project start until February 2024.

1. Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge

Rutishauser, G., Conti, F. and Benini, L., 2023, June. Free bits: Latency optimization of mixed-precision quantized neural networks on the edge. In 2023 IEEE 5th International Conference on Artificial Intelligence Circuits and Systems (AICAS) (pp. 1-5). IEEE.

Abstract: Mixed-precision quantization, where a deep neural network’s layers are quantized to different precisions, offers the opportunity to optimize the trade-offs between model size, latency, and statistical accuracy beyond what can be achieved with homogeneous-bit-width quantization. To navigate the intractable search space of mixed-precision configurations for a given network, this paper proposes a hybrid search methodology. It consists of a hardware-agnostic differentiable search algorithm followed by a hardware-aware heuristic optimization to find mixed-precision configurations latency-optimized for a specific hardware target. We evaluate our algorithm on MobileNetV1 and MobileNetV2 and deploy the resulting networks on a family of multi-core RISC-V microcontroller platforms with different hardware characteristics. We achieve up to 28.6% reduction of end-to-end latency compared to an 8-bit model at a negligible accuracy drop from a full-precision baseline on the 1000-class ImageNet dataset. We demonstrate speedups relative to an 8-bit baseline, even on systems with no hardware support for sub-byte arithmetic at negligible accuracy drop. Furthermore, we show the superiority of our approach with respect to differentiable search targeting reduced binary operation counts as a proxy for latency.

2. SALSA: Simulated Annealing-based Loop-Ordering Scheduler for DNN Accelerators

Jung, V.J., Symons, A., Mei, L., Verhelst, M. and Benini, L., 2023. SALSA: Simulated Annealing based Loop-Ordering Scheduler for DNN Accelerators. arXiv preprint arXiv:2304.12931.

Abstract: To meet the growing need for computational power for DNNs, multiple specialized hardware architectures have been proposed. Each DNN layer should be mapped onto the hardware with the most efficient schedule; however, SotA schedulers struggle to consistently provide optimum schedules in a reasonable time across all DNN-HW combinations. This paper proposes SALSA, a fast dual-engine scheduler to generate optimal execution schedules for both even and uneven mapping. We introduce a new strategy, combining exhaustive search with simulated annealing to address the dynamic nature of the loop ordering design space size across layers. SALSA is extensively benchmarked against two SotA schedulers, LOMA and Timeloop, on 5 different DNNs. On average, SALSA finds schedules with 11.9% and 7.6% lower energy while speeding up the search by 1.7x and 24x compared to LOMA and Timeloop, respectively.

3. Dependability of Future Edge-AI Processors: Pandora’s Box

Gomony, M.D., Gebregiorgis, A., Fieback, M., Geilen, M., Stuijk, S., Richter-Brockmann, J., Bishnoi, R., Argo, S., Andradas, L.A., Güneysu, T. and Taouil, M., 2023, May. Dependability of Future Edge-AI Processors: Pandora’s Box. In 2023 IEEE European Test Symposium (ETS) (pp. 1-6). IEEE.

Abstract: This paper addresses one of the directions of the HORIZON EU CONVOLVE project: the dependability of smart edge processors based on computation-in-memory and emerging memristor devices such as RRAM. It discusses how this alternative computing paradigm will change the way manufacturing test has been done so far. In addition, it describes how these emerging devices, which inherently suffer from many non-idealities, call for new solutions in order to ensure accurate and reliable edge computing. Moreover, the paper also covers the security aspects for future edge processors and shows the challenges and the future directions.

4. PetaOps/W edge-AI μProcessors: Myth or reality?

Gomony, M.D., De Putter, F., Gebregiorgis, A., Paulin, G., Mei, L., Jain, V., Hamdioui, S., Sanchez, V., Grosser, T., Geilen, M. and Verhelst, M., 2023, April. PetaOps/W edge-AI μProcessors: Myth or reality? In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE) (pp. 1-6). IEEE.

Abstract: With the rise of deep learning (DL), our world braces for artificial intelligence (AI) in every edge device, creating an urgent need for edge-AI SoCs. This SoC hardware needs to support high-throughput, reliable, and secure AI processing at ultra-low power (ULP), with a very short time to market. With its strong legacy in edge solutions and open processing platforms, the EU is well-positioned to become a leader in this SoC market. However, this requires AI edge processing to become at least 100 times more energy-efficient, while offering sufficient flexibility and scalability to deal with AI as a fast-moving target. Since the design space of these complex SoCs is huge, advanced tooling is needed to make their design tractable. The CONVOLVE project (currently in its initial stage) addresses these roadblocks. It takes a holistic approach with innovations at all levels of the design hierarchy. Starting with an overview of SOTA DL processing support and our project methodology, this paper presents 8 important design choices largely impacting the energy efficiency and flexibility of DL hardware. Finding good solutions is key to making smart-edge computing a reality.

5. Challenges and Opportunities of Security-Aware EDA

Feldtkeller, J., Sasdrich, P. and Güneysu, T., 2023. Challenges and Opportunities of Security-Aware EDA. ACM Transactions on Embedded Computing Systems, 22(3), pp.1-34.

Abstract: The foundation of every digital system is based on hardware in which security, as a core service of many applications, should be deeply embedded. Unfortunately, the knowledge of system security and efficient hardware design is spread over different communities and, due to the complex and ever-evolving nature of hardware-based system security, state-of-the-art security is not always implemented in state-of-the-art hardware. However, automated security-aware hardware design seems to be a promising solution to bridge the gap between the different communities. In this work, we systematize state-of-the-art research with respect to security-aware Electronic Design Automation (EDA) and identify a modern security-aware EDA framework. As part of this work, we consider threats in the form of information flow, timing and power side channels, and fault injection, which are the fundamental building blocks of more complex hardware-based attacks. Based on the existing research, we provide important observations and research questions to guide future research in support of modern, holistic, and security-aware hardware design infrastructures.

6. A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling

Krausz, M., Land, G., Richter-Brockmann, J. and Güneysu, T., 2023, May. A Holistic Approach Towards Side-Channel Secure Fixed-Weight Polynomial Sampling. In IACR International Conference on Public-Key Cryptography (pp. 94-124). Cham: Springer Nature Switzerland.

Abstract: The sampling of polynomials with fixed weight is a procedure required by round-4 Key Encapsulation Mechanisms (KEMs) for Post-Quantum Cryptography (PQC) standardization (BIKE, HQC, McEliece) as well as NTRU, Streamlined NTRU Prime, and NTRU LPRime. Recent attacks have shown in this context that side-channel leakage of sampling methods can be exploited for key recoveries. While countermeasures regarding such timing attacks have already been presented, there is still no comprehensive work covering solutions that are also secure against power side channels. To close this gap, the contribution of this work is threefold: First, we analyze requirements for the different use cases of fixed-weight sampling. Second, we demonstrate how all known sampling methods can be implemented securely against timing and power/EM side channels and propose performance-enhancing modifications. Furthermore, we propose a new, comparison-based methodology that outperforms existing methods in the masked setting for the three round-4 KEMs BIKE, HQC, and McEliece. Third, we present bitsliced and arbitrary-order masked software implementations and benchmark them for all relevant cryptographic schemes to be able to infer recommendations for each use case. Additionally, we provide a hardware implementation of our new method as a case study and analyze the feasibility of implementing the other approaches in hardware.

7. Combined Private Circuits – Combined Security Refurbished

Feldtkeller, J., Güneysu, T., Moos, T., Richter-Brockmann, J., Saha, S., Sasdrich, P. and Standaert, F.X., 2023, November. Combined private circuits-combined security refurbished. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (pp. 990-1004).

Abstract: Physical attacks are well-known threats to cryptographic implementations. While countermeasures against passive Side-Channel Analysis (SCA) and active Fault Injection Analysis (FIA) exist individually, protecting against their combination remains a significant challenge. A recent attempt at achieving joint security has been published at CCS 2022 under the name CINI-MINIS. The authors introduce relevant security notions and aim to construct arbitrary-order gadgets that remain trivially composable in the presence of a combined adversary. Yet, we show that all CINI-MINIS gadgets at any order are susceptible to a devastating attack with only a single fault and probe due to a lack of error correction modules in the compression. We explain the details of the attack, pinpoint the underlying problem in the constructions, propose an additional design principle, and provide new (fixed) provably secure and composable gadgets for arbitrary order. Luckily, the changes in the compression stage help us to save correction modules and registers elsewhere, making the resulting Combined Private Circuits (CPC) more secure and more efficient than the original ones. We also explain why the discovered flaws have been missed by the associated formal verification tool VERICA (TCHES 2022) and propose fixes to remove its blind spot. Finally, we explore alternative avenues to repair the compression stage without additional corrections based on non-completeness, i.e., constructing a compression that never recombines any secret. Yet, while this approach could have merit for low-order gadgets, it is, for now, hard to generalize and scales poorly to higher orders. We conclude that our refurbished arbitrary order CINI gadgets provide a solid foundation for further research.

8. Quantitative Fault Injection Analysis

Feldtkeller, J., Güneysu, T. and Schaumont, P., 2023, December. Quantitative Fault Injection Analysis. In International Conference on the Theory and Application of Cryptology and Information Security (pp. 302-336). Singapore: Springer Nature Singapore.

Abstract: Active fault injection is a credible threat to real-world digital systems computing on sensitive data. Arguing about security in the presence of faults is non-trivial, and state-of-the-art criteria are overly conservative and lack the ability of fine-grained comparison. However, comparing two alternative implementations for their security is required to find a satisfying compromise between security and performance. In addition, the comparison of alternative fault scenarios can help optimize the implementation of effective countermeasures. In this work, we use quantitative information flow analysis to establish a vulnerability metric for hardware circuits under fault injection that measures the severity of an attack in terms of information leakage. Potential use cases range from comparing implementations with respect to their vulnerability to specific fault scenarios to optimizing countermeasures. We automate the computation of our metric by integrating it into a state-of-the-art evaluation tool for physical attacks and provide new insights into the security under an active fault attacker.

9. Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware

Land, G., Marotzke, A., Richter-Brockmann, J. and Güneysu, T., 2024. Gadget-based Masking of Streamlined NTRU Prime Decapsulation in Hardware. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(1), pp.1-26.

Abstract: Streamlined NTRU Prime is a lattice-based Key Encapsulation Mechanism (KEM) that is, together with X25519, the default algorithm in OpenSSH 9. Based on lattice assumptions, it is assumed to be secure also against attackers with access to large-scale quantum computers. While Post-Quantum Cryptography (PQC) schemes have been subject to extensive research in recent years, challenges remain with respect to protection mechanisms against attackers that have additional side-channel information, such as the power consumption of a device processing secret data. As a countermeasure to such attacks, masking has been shown to be a promising and effective approach. For public-key schemes, including any recent PQC schemes, usually, a mixture of Boolean and arithmetic techniques is applied on an algorithmic level. Our generic hardware implementation of Streamlined NTRU Prime decapsulation, however, follows an idea that until now was assumed to be solely applicable efficiently to symmetric cryptography: gadget-based masking. The hardware design is transformed into a secure implementation by replacing each gate with a composable secure gadget that operates on uniform random shares of secret values. In our work, we show the feasibility of applying this approach also to PQC schemes and present the first Public-Key Cryptography (PKC) – pre- and post-quantum – implementation masked with the gadget-based approach considering several trade-offs and design choices. By the nature of gadget-based masking, the implementation can be instantiated at arbitrary masking order. We synthesize our implementation both for Artix-7 Field-Programmable Gate Arrays (FPGAs) and 45nm Application-Specific Integrated Circuits (ASICs), yielding practically feasible results regarding the area, randomness requirement, and latency. We verify the side-channel security of our implementation using formal verification on the one hand, and practically using Test Vector Leakage Assessment (TVLA) on the other. Finally, we also analyze the applicability of our concept to Kyber and Dilithium, which will be standardized by the National Institute of Standards and Technology (NIST).

10. Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting

Miccini, R., Zniber, A., Laroche, C., Piechowiak, T., Schoeberl, M., Pezzarossa, L., Karrakchou, O., Sparsø, J. and Ghogho, M., 2023, September. Dynamic nsNet2: Efficient Deep Noise Suppression with Early Exiting. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP) (pp. 1-6). IEEE.

Abstract: Although deep learning has made strides in the field of deep noise suppression, leveraging deep architectures on resource-constrained devices still proved challenging. Therefore, we present an early-exiting model based on nsNet2 that provides several levels of accuracy and resource savings by halting computations at different stages. Moreover, we adapt the original architecture by splitting the information flow to take into account the injected dynamism. We show the trade-offs between performance and computational complexity based on established metrics.

11. Differentiable Transportation Pruning

Li, Y., van Gemert, J.C., Hoefler, T., Moons, B., Eleftheriou, E. and Verhoef, B.E., 2023. Differentiable Transportation Pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 16957-16967).

Abstract: Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities.

12. CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

Singh, S., Feliu, J., Acacio, M.E., Jimborean, A. and Ros, A., 2023, October. CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions. In 2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT) (pp. 1-13). IEEE.

Abstract: Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out-of-order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading processor (SMT), stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls upon encountering the LQ full or when its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ could be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows removing DRF loads from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core 2-way SMT processor with CELLO avoids almost all conservative searches to the LQ and significantly reduces its occupancy. CELLO allows i) to reduce the LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) to shrink the LQ size from 192 to only 80 entries, reducing the LQ energy expenditure as much as 69% while performing on par with a mainstream LQ implementation.