My Approach to Designing an AI Accelerator
Introduction
About a year ago I started work on my very first System on Chip (SoC): one that uses a Systolic Array (SA) based hardware accelerator to offload GEMM (General Matrix Multiplication) workloads. Working on it over the year taught me a lot of important lessons, and this article is an attempt to share those learnings and the ideology behind the design.
Idea Origin
Having worked on training AI / ML models for the past few years, I have had the chance to work with various hardware platforms such as CPUs, GPUs, VPUs, FPGAs and TPUs. I knew that designing a training accelerator would require a lot of time and effort, and since I was working under a time constraint I opted for an inference accelerator.
As I wanted to build an edge-oriented solution, area and power were two key constraints I had to manage. So I opted for a Systolic Array as the accelerator, since its area and power profile is favourable: the structure is quite simple and power efficient.
I had an ARM Cortex-M0 processor available, so I wanted to use it as the CPU in the SoC. With these things in mind I finalized the spec for the SoC: a Systolic Array based accelerator built around an ARM Cortex-M0 CPU.
Architecture
Part 1 — Nuts & Bolts
Now I had to design a system that would satisfy these requirements. Before working on the architecture there were some fundamental parameters to settle. The first was the bus width, which was fixed at 32 bits. The next was the data precision. To keep the architecture simple I opted for INT8 data and started working on the architecture.
Once I had finalized the bus width and data precision, I realized that a 4-element row of INT8 values fits beautifully into one 32-bit data packet, and hence opted for a 4x4 SA.
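To make the packing concrete, here is a small Python sketch of how four signed INT8 values fit into a single 32-bit bus word. The design itself is RTL; these helper names are my own, purely for illustration:

```python
def pack_row(values):
    """Pack four signed INT8 values into one 32-bit word (byte lane i
    holds element i; the lane ordering is my assumption)."""
    assert len(values) == 4
    word = 0
    for i, v in enumerate(values):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * i)  # two's-complement byte per lane
    return word

def unpack_row(word):
    """Recover the four signed INT8 values from a 32-bit word."""
    out = []
    for i in range(4):
        b = (word >> (8 * i)) & 0xFF
        out.append(b - 256 if b >= 128 else b)  # sign-extend each byte
    return out
```

One matrix row per bus transfer is what makes the 4x4 shape such a natural fit for a 32-bit bus.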
Then I started my design by drawing out the 4x4 SA dataflow to understand what I really needed to build to get the system working.
Very quickly I noticed that there were patterns in the SA dataflow requirements. Only 4 patterns were required for the data alignment, and these patterns were common to both the Left and Top input lanes.
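The four alignment patterns correspond to the classic input skew of a systolic array: lane i is delayed by i cycles so that matching operands meet inside the array at the right time. As a behavioural sketch (not the actual RTL; function and variable names are mine), a 4x4 output-stationary SA can be modelled like this:

```python
def systolic_matmul(A, B, n=4):
    """Behavioural model of an n x n output-stationary systolic array.
    A's rows enter from the Left lanes, B's columns from the Top lanes,
    with lane i delayed by i cycles (the alignment patterns). Operands
    shift one PE per cycle; each PE accumulates a * b in place."""
    cycles = 3 * n - 2  # time for the last operands to reach PE(n-1, n-1)
    # Skewed input streams: lane i is padded with i leading zeros.
    left = [[0] * i + list(A[i]) + [0] * (cycles - n - i) for i in range(n)]
    top = [[0] * j + [B[i][j] for i in range(n)] + [0] * (cycles - n - j)
           for j in range(n)]
    acc = [[0] * n for _ in range(n)]    # output-stationary accumulators
    a_reg = [[0] * n for _ in range(n)]  # operand registers flowing right
    b_reg = [[0] * n for _ in range(n)]  # operand registers flowing down
    for t in range(cycles):
        # Traverse PEs in reverse so in-place register updates model
        # the one-cycle delay between neighbouring PEs.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                a = left[i][t] if j == 0 else a_reg[i][j - 1]
                b = top[j][t] if i == 0 else b_reg[i - 1][j]
                acc[i][j] += a * b
                a_reg[i][j], b_reg[i][j] = a, b
    return acc
```

For n = 4 the skew uses exactly 4 delay patterns (0 to 3 cycles), shared between the Left and Top lanes, which is the observation above.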
So, I listed the generic components I would need for the system:
- A Standard Interface — AHB Lite — 32 Bit Wide
- A Master FIFO to store the data sent by the CPU — a FIFO, because the order of the data is important.
- A Data Switch — this would be important for sending each data packet to the appropriate SA input lane.
- An Alignment Generator — to generate the data alignment patterns required for the matrix multiplication.
- FIFOs to store the alignment sequence once it is generated.
- Systolic Array (4x4)
- An output memory to store the results generated by the SA.
Initially I decided that the Master FIFO would be a 64x4 configuration (4 entries of 64 bits each) to store the 256 bits required for an operation, but due to the internal handshake requirements I opted for a much simpler 32x8 configuration (8 entries of 32 bits each). This made the internal handshake mechanism a lot easier to design.
With this decision made, I started working on the delivery mechanism that would ensure each packet is delivered to the appropriate lane with the correct alignment. This was easily achieved by exporting the read pointer of the Master FIFO, which let the system switch to the appropriate data lane of the SA.
Each of the 8 locations in the FIFO was mapped to a particular SA lane, and the Alignment FIFO status was used as a handshake mechanism to control both the switch and the Master FIFO.
I also realized that the SA generated 256 bits at its outputs and I could not accommodate all of them on the bus at once. So I needed to create packets and put them out serially when the host CPU collected the data from the accelerator.
I then finalized a basic architecture that encapsulated all the ideas discussed above.
Part 2 — The Glue
Now that the basic nuts and bolts were in place I had to work on designing the core logic to make the components work together.
To build this I had to solve a few problems:
- Restricting the Bus from writing data during an operation.
- Internal lane handshake. To understand why this is an issue, consider one location of the Master FIFO and its Alignment FIFO. If the Alignment FIFO's full flag were used only to switch the Master FIFO read pointer, there would be no problem. But when the operation flushes all the data, we need a global read enable (GRE) to control all the Alignment FIFOs once all 8 lanes have data. If GRE is driven by levels, i.e. held high between full and empty, we run into issues: as soon as data is read from a location, its full flag drops to 0 and the system cannot keep GRE high.
- How to serially organize the results for the host to collect.
To solve the above 3 issues I introduced the following:
- A channel buffer that, once a positive edge on the Master FIFO's Full flag is detected, disables the lane until the end of the operation. The end of the operation is detected by a counter that counts operating + stabilization cycles once GRE goes high (7 + 4 = 11 for the 4x4 configuration).
- A positive-edge based handshake for GRE: the signal is asserted on the positive edge of global full (all 8 FIFOs full) and deasserted on the positive edge of global empty (all 8 FIFOs empty). This solved the problem, as the operation is triggered by the detected edge rather than depending on the level.
- To serialize the sixteen 16-bit result packets, a serializer that clubs two packets together into one 32-bit word, leaving us with 8 words.
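The edge-based GRE handshake can be modelled as a tiny state machine: GRE is set by the rising edge of "all 8 Alignment FIFOs full" and cleared by the rising edge of "all 8 empty", instead of following the level of either flag. A behavioural sketch, with class and signal names of my own invention:

```python
class GreController:
    """Set GRE on the rising edge of global-full, clear it on the
    rising edge of global-empty (edge-based, not level-based)."""

    def __init__(self):
        self.gre = 0
        self._prev_full = 0
        self._prev_empty = 1  # the FIFOs start out empty

    def step(self, fulls, empties):
        """Advance one clock; fulls/empties are the 8 per-lane flags."""
        all_full = int(all(fulls))
        all_empty = int(all(empties))
        if all_full and not self._prev_full:      # pos edge of global full
            self.gre = 1
        elif all_empty and not self._prev_empty:  # pos edge of global empty
            self.gre = 0
        self._prev_full, self._prev_empty = all_full, all_empty
        return self.gre
```

Because GRE is latched by the edge, it stays high while the FIFOs drain — exactly the case where the level-based scheme failed.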
Data Switch
Now that I had solutions for the key problems, I worked on the first component: the Data Switch. It had to take the Master FIFO read pointer and the status of the Alignment FIFOs, and control the read enable of the Master FIFO. It also had to keep the data stable during the alignment operation: only when an Alignment FIFO was full could it move on to the next lane.
This was easily achieved using the logic described above. Figure 4 describes the working of the Switch.
Sequencer
Once a lane was enabled, the alignment process needed to generate the required sequence. I designed a simple block that generated the 4 sequence patterns needed for the SA operation. This block also controlled the write enable of the Alignment FIFO.
Once the SA had performed the computation and the outputs were ready, the serializer would collect the data.
Serializer
This block clubbed two packets into one word. The packets (1, 2), (3, 4), … (15, 16) were clubbed into 8 words and stored in a FIFO. The serializer also generated the FIFO's write enable.
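As a sketch, the clubbing step amounts to placing the first packet of each pair in the low half-word and the second in the high half-word (the half-word ordering is my assumption; the article does not specify it):

```python
def serialize_results(packets):
    """Club sixteen 16-bit result packets into eight 32-bit words.
    Pair (1, 2) forms word 1, pair (3, 4) forms word 2, and so on;
    the first packet of each pair sits in the low half-word."""
    assert len(packets) == 16
    words = []
    for lo, hi in zip(packets[0::2], packets[1::2]):
        words.append((lo & 0xFFFF) | ((hi & 0xFFFF) << 16))
    return words
```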
With these components in place, the accelerator was working. The next key part was integrating it with the AHB-Lite bus.
Bus Integration
The AHB protocol requires some key output signals, such as HREADY and HRESP, to control transactions, and working on this was the most difficult part.
The accelerator had to hold HREADY low once the Master FIFO was full and could drive it high again once the operation was complete. Also, HRESP would signal an error if the host initiated a write transfer while the Output FIFO (OSF) was not empty.
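The wait-state and error rules can be summarised as a small combinational sketch. This is a behavioural model only, with hypothetical argument names; the real logic lives in the AHB-Lite slave interface RTL:

```python
def ahb_response(master_fifo_full, operation_done, osf_empty, is_write):
    """Hypothetical model of the accelerator's AHB-Lite response rule:
    hold HREADY low while the Master FIFO is full and the operation is
    still running; flag an ERROR response if the host starts a write
    while the output FIFO (OSF) still holds unread results."""
    hready = 0 if (master_fifo_full and not operation_done) else 1
    hresp_error = bool(is_write and not osf_empty)
    return hready, hresp_error
```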
After working on these pieces, I integrated and tested the system.
Testing
To test the system I devised a simple check: if we multiply two square matrices of all ones, we end up with a matrix in which every element equals the order of the matrix.
This test was used to debug all the key issues in the system, and once they were resolved I worked on developing other tests.
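The property behind the test is easy to confirm with a software golden model: each element of the product of two N x N all-ones matrices is a dot product of N ones, i.e. N. A minimal Python check:

```python
def matmul(A, B):
    """Plain nested-loop matrix multiply used as a golden reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

n = 4
ones = [[1] * n for _ in range(n)]
product = matmul(ones, ones)
# Every element of the product should equal the matrix order n.
assert all(v == n for row in product for v in row)
```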
Conclusion
I finally ended up with an IP capable of accelerating matrix multiplications on edge devices. The full version of the IP, along with the test SoC, is available in the GitHub repository below.
Repository : https://github.com/srimanthtenneti/Hell_Fire_SoC_Demo/