Low-Latency Hardware Masking with Application to AES 📺
During the past two decades there has been a great deal of research published on masked hardware implementations of AES and other cryptographic primitives. Unfortunately, many hardware masking techniques can lead to increased latency compared to unprotected circuits for algorithms such as AES, due to the high-degree of nonlinear functions in their designs. In this paper, we present a hardware masking technique which does not increase the latency for such algorithms. It is based on the LUT-based Masked Dual-Rail with Pre-charge Logic (LMDPL) technique presented at CHES 2014. First, we show 1-glitch extended strong noninterference of a nonlinear LMDPL gadget under the 1-glitch extended probing model. We then use this knowledge to design an AES implementation which computes a full AES-128 operation in 10 cycles and a full AES-256 operation in 14 cycles. We perform practical side-channel analysis of our implementation using the Test Vector Leakage Assessment (TVLA) methodology and analyze univariate as well as bivariate t-statistics to demonstrate its DPA resistance level.
Fast Multi-precision Multiplication for Public-Key Cryptography on Embedded Microprocessors
Multi-precision multiplication is one of the most fundamental operations on microprocessors to allow public-key cryptography such as RSA and elliptic curve cryptography. In this paper, we present a novel multiplication technique that increases the performance of multiplication by sophisticated caching of operands. Our method significantly reduces the number of needed load instructions which is usually one of the most expensive operations on modern processors. We evaluate our new technique on an 8-b ATmega128 and a 32-b ARM7TDMI microcontroller and compare the results with existing solutions. For the ATmega128, our implementation needs only 2395 clock cycles for a 160-b multiplication. The number of required load instructions is reduced from 167 (needed for the best known hybrid multiplication) to only 80. On the ARM7TDMI, our implementation needs only 281 clock cycles as opposed to 357. For both platforms, the proposed technique outperforms related work by a factor of about 10–23%. We also show that the method scales very well even for larger integer sizes (required for RSA) and limited register sets. It fully complies with existing multiply–accumulate instructions that are integrated in most of the available processors.
A Side-Channel Assisted Cryptanalytic Attack Against QcBits
QcBits is a code-based public key algorithm based on a problem thought to be resistant to quantum computer attacks. It is a constant-time implementation for a quasi-cyclic moderate density parity check (QC-MDPC) Niederreiter encryption scheme, and has excellent performance and small key sizes. In this paper, we present a key recovery attack against QcBits. We first used differential power analysis (DPA) against the syndrome computation of the decoding algorithm to recover partial information about one half of the private key. We then used the recovered information to set up a system of noisy binary linear equations. Solving this system of equations gave us the entire key. Finally, we propose a simple but effective countermeasure against the power analysis used during the syndrome calculation.
- CHES 2016
- CHES 2015
- CHES 2014