This post is based on the paper https://eprint.iacr.org/2015/337

This paper proposes an instruction-set hardware architecture for all
building blocks required in polynomial ring based fully homomorphic schemes and
uses the hardware architecture to instantiate the somewhat homomorphic
encryption scheme YASHE on a large Virtex 7 FPGA. The building blocks present
in the architecture are sufficiently generic to allow implementation of other
homomorphic schemes based on RLWE or NTRU.

**System Setup**

The computations are performed in a modular polynomial rings of the
form

*R = Z[x]/(f(x))*where*f(x)*is a monic irreducible polynomial*Φ*of degree_{d}(x)*n*. To utilize single instruction multiple data (SIMD) operations*f(x)*is a cyclotomic polynomial. The plaintext space is*R*._{2}
The used parameter supports homomorphic evaluations of SIMON-64/128
(having a multiplicative depth of 44); in particular

*d*= 65535 (and thus the degree of*f(x)*is*n*= 2^{15}), log_{2}(*q*) = 1228 and the discrete Gaussian distribution with parameter is 8. The chosen parameter set meets at least 128-bit security and SIMD in 2048 slots.**High-level optimizations**

Two main operations in the YASHE scheme are the

*YASHE.Add*for homomorphic addition and the*YASHE.Mult*for homomorphic multiplication. While*YASHE.Add*is relatively simple,*YASHE.Mult*is*very costly since one first needs to compute**c*over the integers, then scale the result by_{1}· c_{2}*t/q*and round, before mapping back into the ring*R*. For the polynomial multiplication of such a large degree_{q}*n*= 2^{15}, an NTT is used over other polynomial multiplication algorithms thanks to its almost linear time complexity. Moreover to tackle the problem of long integer arithmetic CRT is used to split the coefficients into 30-bit small coefficients. Application of the CRT helps to achieve parallelization and to utilize the small DSP multipliers available in FPGAs.**High-level Architecture**

The instruction-set architecture shown in the figure has three main
components: a computer, a HE-coprocessor and an external memory. The
HE-coprocessor is implemented on an FPGA and supports NTT of polynomials,
coefficient wise additions, subtractions and multiplications, computation of
the residues using the CRT, computation of the coefficients modulo a large
modulus from the CRT-residues, and scaling of the coefficients. The computer
works in master-mode and instructs the HE-coprocessor to perform the required
operations. Since the FPGA has a limited amount of internal memory in the form
of block RAMs, only few residue polynomials can be kept during a computation. A
large external memory is used to store the set of cipher-text polynomials.

The HE-coprocessor has a set of parallel processors, where each
parallel processor is composed of several cores. The parallel processors are
independent of each other and supports a high degree of parallelism and
scalability. Since the architecture is very large, special care is taken to
reduce the routing complexity. For example, a routing-friendly parallel NTT algorithm
is used in the HE-coprocessor.

**Experimental Results**

The designed HE-coprocessor with 8 parallel processors, each
processor having 16 parallel cores, 16 small-CRT units and 2 large-CRT units was
compiled for the Xilinx Virtex-7 XC7V1140T-2 FPGA, the largest device of the
Virtex-7 FPGA family. The architecture consumes 23 % of the available
registers, 50 % of LUTs, 53 % of DSP slices, and 38 % of block RAMs. This
implementation evaluates SIMON-64/128 in approximately 157.7s (at 143MHz) and
it processes 2048 cipher-texts at once giving a relative time of only 77ms per
block. This is 26.6 times faster than the leading software implementation on a
4-core Intel Core-i7 processor running at 3.4GHz. The timing values do not
consider the overhead of data transfer between the external memory and the
HE-coprocessor.

**Future Work**

1. The architecture does not implement the required interface
between the external memory and the FPGA-based HE-coprocessor. The cost of data
transfer is very important considering the vast amount of data exchange during
various operations; and hence the interface should be properly designed. It
should be noted that most of the memory access can be performed in parallel
with the computation using a ping-pong scheme of two sets of block RAMs in the
FPGA. The FPGA and the master-computer operate on these two sets alternatively
between two consecutive instructions: when the FPGA operates on the first set,
the master-computer operates on the second, and vice versa. Since only 38% of
the available block RAMs are utilized in the proposed architecture, such a
ping-pong styled memory access seems practical in order to reduce the cost of
memory access.

2. The computation time can be reduced by performing low-level
optimizations in the architecture. For example, it might be possible to
increase the operating frequency beyond 200 MHz by optimizing the long critical
paths that are present in the architecture. The cycle requirement of the NTT
algorithm can also be reduced. Moreover, since the implemented architecture
consumes only ~50% of the available resources, it might be possible to put more
number of parallel processors in the FPGA. This will require area reduction of
each processor and a better placement-routing optimization.

3. The results are obtained from a single FPGA. Since the
HE-coprocessor is composed of independent parallel processors, several FPGAs
can be used to scale down the computation time.

Sujoy Sinha Roy