Building FPGA applications on AWS — and yes, for Deep Learning too
Field Programmable Gate Arrays (FPGAs) are not a shiny new technology: indeed, the first commercial product dates back to 1985. So how could they be relevant to a bleeding-edge topic like Deep Learning? Then again, neural networks themselves go back to the late Forties, so… there might be something afoot. Read on :)
“Grab onto my arm now. Hold tight. We are going into a number of dark places, but I think I know the way. Just don’t let go of my arm” — Stephen King
The case for non-CPU architectures
Until quite recently, the world of computing has been unequivocally ruled by CPUs. However, for a while now, there have been ever-growing doubts about how sustainable Moore's Law really is.
To prevent chips from melting, clock speeds have been stagnating for years. In addition, even though lithography processes still manage to carve smaller and smaller features, we're bound to hit technology limits sooner rather than later: in the latest Intel Skylake architecture, a transistor is about 100 atoms wide.
Gordon Moore himself publicly declared in 2015: "I guess I see Moore's Law dying here in the next decade or so, but that's not surprising".
Wait, there’s more.
With new workloads come new requirements
Another factor is helping foment the coming coup against King CPU: the emergence of new workloads, such as genomics, financial computing or Deep Learning. As it happens, these involve staggering amounts of mathematical computation which can greatly benefit from massive parallelism (think tens of thousands of cores). Sure, it's definitely not impossible to achieve this with CPU-based architectures — here's a mind-boggling example — but in recent years, a very serious contender has emerged: the Graphics Processing Unit (GPU), spearheaded by Nvidia.
The King is dead, long live the King (?)
Equipped with thousands of floating-point cores, a typical GPU is indeed a formidable number-crunching machine, able to deliver proper hardware parallelism at scale. Soon enough, researchers understood how these chips could be applied to Machine Learning and Deep Learning.
Patrice Y. Simard, Dave Steinkraus, Ian Buck, "Using GPUs for Machine Learning Algorithms", 2005
Dan C. Cireşan, Ueli Meier, Jonathan Masci, Luca M. Gambardella, Jürgen Schmidhuber, “High-Performance Neural Networks for Visual Object Classification”, 2011
And thus began the Age of the GPU, leading up to the design of computing monsters such as the Nvidia V100: 21.1 billion transistors, 815 square millimeters (about 1.26 square inches for my US friends), 5120 CUDA cores, 640 tensor cores. Surely, this should be enough for anyone… right?
A chink in the GPU armor?
When it comes to brute-force computing power, GPUs are unmatched.
However, for some applications, they don’t deliver the most bang for your buck. Here are some reasons why you might not want to use a GPU:
- Power consumption and, maybe more importantly, power efficiency (aka TeraOPS per watt). This matters a lot in the embedded and IoT worlds.
- The need to process custom data types (not everything is a float).
- Applications exhibiting irregular parallelism (alternating phases of sequential and parallel processing) or divergence (not all cores executing the same code at the same time).
What about Deep Learning specifically? Of course, we know that GPUs are great for training: their massive parallelism allows them to crunch large data sets in reasonable time. To optimize throughput and put all these cores to good use, we don’t forward single samples through the model: we use batches of samples instead.
However, training is only half the story: what about inference? Well, it depends. If your application can live with the latency required to collect enough samples to forward a full batch, then you should be fine. If not, then you’ll have to run inference on single samples and it’s likely that throughput will suffer.
In order to get the best inference performance, the logical step would be to use a custom chip. For decades, the choice has been pretty simple: either build an Application Specific Integrated Circuit (ASIC) or use an FPGA.
The ASIC way
An ASIC is a fully custom design, which is mass-produced and deployed in devices. Obviously, you get to tweak and optimize it in the way that works best for your application: best performance, best power efficiency, etc. However, designing, producing and deploying an ASIC is a long, expensive and risky process. You'll be lucky to complete it in less than 18 months.
This is the route that Google took for their TPU chip. They did it in 15 months, which is impressive indeed. Just imagine how long it would take you.
Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson, "In-Datacenter Performance Analysis of a Tensor Processing Unit", 2017
Of course, ASICs are inflexible: if your application requirements change significantly, you have to start all over again.
The FPGA way
As their name implies, FPGAs are (re)programmable logic circuits. A typical FPGA includes hundreds of thousands and sometimes millions of logic cells, thousands of Digital Signal Processing (DSP) "slices", as well as very fast on-chip memory.
Not everyone enjoys digital circuits and Boolean algebra, so let's keep this simple: FPGAs are the Lego bricks of digital design. By mixing and matching the right logic blocks, a system designer armed with the right tools can pretty much implement anything… even an Apple ][ :)
Building FPGA applications
Historically, building FPGA applications has required the purchase of costly software and hardware tools in order to design, simulate, debug, synthesize and route custom logic designs. Let's face it: this made it hard to scale engineering efforts.
Designing custom logic for FPGAs also required mastering esoteric languages like VHDL or Verilog… and your computer desktop would look something like this. Definitely not for everyone (myself included).
Fortunately, developers now have the option to build FPGA applications in C/C++ thanks to the SDAccel environment and OpenCL. The programming model won’t be unfamiliar to CUDA developers :)
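To give you a feel for it, here is a minimal sketch of what such a kernel could look like in OpenCL C. Everything here (kernel name, arguments) is hypothetical, and a real SDAccel design would add interface pragmas and more, but the shape of the code is representative: each work-item scales one input element and adds a bias.

```c
// Hypothetical OpenCL C kernel: scale each input by a weight and add a bias.
// SDAccel synthesizes this into custom logic on the FPGA rather than
// scheduling it on GPU threads.
__kernel void scale_add(__global const float *in,
                        __global float *out,
                        const float weight,
                        const float bias,
                        const int n)
{
    int i = get_global_id(0);   // one work-item per element, CUDA-style
    if (i < n) {
        out[i] = in[i] * weight + bias;
    }
}
```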
Deploying FPGA applications on AWS
About a year ago, AWS introduced Amazon EC2 F1 instances.
They rely on the Xilinx UltraScale+ VU9P chip, which packs over 2.5 million system logic cells and 6,840 DSP slices (specs — PDF). Yes, it's a beast!
In order to simplify FPGA development, AWS also provides an FPGA Developer AMI, which comes with the full Xilinx SDx 2017.1 tool suite… and a free license :) The AMI also includes the AWS FPGA SDK and HDK to help you build and manage your FPGA images: both are Open Source and available on GitHub.
The overall process would look something like this:
- Using the FPGA Developer AMI on a compute-optimized instance (such as a c4), design, simulate and build the Amazon FPGA Image (AFI).
- On an EC2 F1 instance, use the AWS FPGA SDK to load the AFI and access it from a host application running on the CPU.
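To make the second step a little more concrete, here is a heavily abridged host-side sketch in C, assuming the SDAccel/OpenCL flow and reusing the hypothetical scale_add kernel from earlier. The binary file name is made up and error handling is omitted; the point is simply that the host code looks like any other OpenCL application.

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    // Find the first OpenCL platform and the FPGA (accelerator) device.
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Load the pre-built FPGA binary from disk (the file name is hypothetical).
    FILE *f = fopen("scale_add.awsxclbin", "rb");
    fseek(f, 0, SEEK_END);
    size_t size = ftell(f);
    rewind(f);
    unsigned char *binary = malloc(size);
    fread(binary, 1, size, f);
    fclose(f);

    // Program the FPGA with the binary and grab a handle to our kernel.
    cl_program program = clCreateProgramWithBinary(
        ctx, 1, &device, &size, (const unsigned char **)&binary, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "scale_add", &err);

    // From here on, you would create buffers with clCreateBuffer(), set
    // arguments with clSetKernelArg() and enqueue the kernel with
    // clEnqueueNDRangeKernel(), exactly as you would on a GPU.
    printf("Kernel %s\n", kernel ? "loaded" : "not loaded");

    free(binary);
    return 0;
}
```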
Building Neural Networks with FPGAs
At the core of Neural Networks lies the "Multiply and Accumulate" operation, where we multiply inputs by their respective weights and add all the results together. This can be easily implemented using a DSP slice. Yes, I know it's a very simple example, but more complex operations like convolution or pooling could be implemented as well.
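As a rough illustration (plain C, not F1-specific code), here is what that operation boils down to for a single neuron. A high-level synthesis tool can map the multiply-add in the loop body onto DSP slices and unroll the loop to use many slices in parallel.

```c
// Multiply-and-accumulate for one neuron: the dot product of the inputs
// with the neuron's weights, plus a bias. An HLS tool can map the
// multiply-add in this loop onto DSP slices and unroll it for parallelism.
float neuron_output(const float *inputs, const float *weights,
                    float bias, int n)
{
    float acc = bias;
    for (int i = 0; i < n; i++) {
        acc += inputs[i] * weights[i];
    }
    return acc;
}
```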
Of course, modern FPGAs have tons of gates and they’re able to support very large models. However, in the interest of speed, latency and power consumption, it would make sense to try to minimize the number of gates.
Optimizing Deep Learning models for FPGAs
There’s a lot of ongoing research on simplifying and shrinking Deep Learning models with minimal loss of accuracy. The three most popular techniques are:
Quantization, i.e. using integer weights (8-bit, 4-bit or even 2-bit) instead of 32-bit floats. Fewer gates are needed to implement the model and integer operations are cheaper than floating-point ones, so power consumption drops. Memory usage drops too, since every weight now fits in far fewer bits and the model shrinks accordingly.
Pruning, i.e. removing connections that contribute little or nothing to successful predictions. Computation speeds up, latency goes down, and with fewer weights to store the model shrinks again.
Compression, i.e. encoding the weights, which works well now that they’re integers drawn from a small set of possible values. This shrinks the model even further.
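To make the first two techniques concrete, here is a toy sketch in plain C (nothing FPGA-specific, and the threshold and scaling scheme are deliberately simplistic): small weights are zeroed out, and the survivors are mapped to signed 8-bit integers with a single per-tensor scale factor.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Zero out weights whose magnitude falls below a threshold (pruning),
// then map the survivors to signed 8-bit integers (quantization).
void prune_and_quantize(const float *weights, int8_t *q_weights, int n,
                        float threshold, float *scale_out)
{
    // The largest magnitude determines the per-tensor scale factor.
    float max_abs = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(weights[i]);
        if (a > max_abs) max_abs = a;
    }
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    for (int i = 0; i < n; i++) {
        float w = (fabsf(weights[i]) < threshold) ? 0.0f : weights[i];
        q_weights[i] = (int8_t)lrintf(w / scale);
    }
    *scale_out = scale;
}

int main(void)
{
    float w[] = { 0.82f, -0.03f, 0.41f, -0.67f, 0.005f };
    int8_t q[5];
    float scale;

    prune_and_quantize(w, q, 5, 0.05f, &scale);
    for (int i = 0; i < 5; i++)
        printf("%+.3f -> %d\n", w[i], q[i]);
    return 0;
}
```

Every surviving weight now fits in a single byte and the zeros can be skipped entirely, which is exactly what makes integer DSP slices and small on-chip memories attractive.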
As a bonus, models may shrink so much that on-chip SRAM becomes a viable option. This would help save even more power (SRAM accesses are much more energy-efficient than DRAM accesses) as well as speed up computation (on-chip RAM is always faster to access than off-chip RAM).
Using these techniques and more, researchers have obtained spectacular results.
Song Han et al., “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”, 2016
- Optimizing CNNs on CPU and GPU
- AlexNet 35x smaller, VGG-16 49x smaller
- 3x to 4x speedup, 3x to 7x more energy-efficient
- No loss of accuracy
Song Han et al., “ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA”, 2017
- Optimizing LSTM networks on Xilinx FPGA
- FPGA vs CPU: 43x faster, 40x more energy-efficient
- FPGA vs GPU: 3x faster, 11.5x more energy-efficient
Nurvitadhi et al., “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks?”, 2017
- Optimizing CNNs on Intel FPGA
- FPGA vs GPU: 60% faster, 2.3x more energy-efficient
- <1% loss of accuracy
The next step: Deep Learning hardware
Some of you may still remember the ill-fated LISP machines and shiver at the thought of AI hardware. However, times have changed and researchers are moving fast here as well.
Song Han, “Deep Learning Tutorial and Recent Trends” (PDF), 2017
The topic picked up even more speed when Nvidia recently announced a new initiative, the NVIDIA Deep Learning Accelerator (NVDLA). In a nutshell, it provides Open Source hardware blocks implemented in Verilog that may be used to build Deep Learning accelerators for IoT applications:
- Convolution Core — optimized high-performance convolution engine
- Single Data Processor — single-point lookup engine for activation functions
- Planar Data Processor — planar averaging engine for pooling
- Channel Data Processor — multi-channel averaging engine for normalization functions
- Dedicated Memory and Data Reshape Engines — memory-to-memory transformation acceleration for tensor reshape and copy operations.
Although clearly targeted at IoT devices, these building blocks can be simulated and deployed to F1 instances :)
This FPGA-friendly initiative comes from the company that brought us GPUs in the first place, which should definitely raise a few eyebrows. In my humble opinion, we should pay attention: no one knows more than Nvidia about GPUs' strengths and weaknesses, or about speeding up Deep Learning computations. An exciting and clever move indeed.
“What now? Let me tell you what now”
— Marsellus Wallace (Pulp Fiction)
I don’t have a crystal ball, but here are a few closing predictions based on extensive analysis of my gut feelings :-P
- Deep Learning is shaping up to be a major workload for public clouds and IoT. No single hardware architecture can win both battles.
- Much more infrastructure will be used for inference than for training (I’d expect multiple orders of magnitude). Again, no single hardware architecture can win both battles.
- Cloud-based GPUs will dominate training for the foreseeable future.
- Cloud-based inference will be the mother of all battles. ASICs look good, but I just don’t see Nvidia letting go. Grab some popcorn and wait for the showdown.
- CPU inference will still be a thing for smaller IoT devices, which is why software acceleration solutions like Intel MKL or NNPACK are important.
- For larger IoT devices, we may witness an inference-driven FPGA renaissance. Current GPUs are too power-hungry and ASICs too inflexible.
Well, we made it, with only a handful of code sketches along the way. I hope you enjoyed this :)
Thanks for reading.