Toeplitz Transform

Srimanth Tenneti
5 min read · Jun 24, 2024


Introduction

In linear algebra, a Toeplitz matrix is a diagonal-constant matrix: the elements along each descending diagonal, from left to right, are the same. In this article we explore a use case for a matrix transform that converts a given matrix into a Toeplitz-structured matrix.
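For example, a Toeplitz matrix is fully determined by its first column and first row. A quick way to see the structure (SciPy's toeplitz helper is used here purely for illustration):

from scipy.linalg import toeplitz

# The first column and the first row fully determine a Toeplitz matrix.
T = toeplitz(c=[1, 2, 3, 4], r=[1, 5, 6, 7])
print(T)

>>> [[1 5 6 7]
     [2 1 5 6]
     [3 2 1 5]
     [4 3 2 1]]

Each descending diagonal holds a single value.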

The AI Hardware Problem

Traditionally, processors targeted general-purpose applications whose compute complexity was reasonably low. Performance was therefore improved through higher clock speeds, architectural optimizations, newer process technologies, and so on.

The introduction of AI completely changed the workload profile, bringing in new, highly parallel datapath requirements. The chart below depicts the compute requirements of modern-day AI models.

Robi Rahman, David Owen and Josh You (2024), "Tracking Large-Scale AI Models". Published online at epochai.org. Retrieved from https://epochai.org/blog/tracking-large-scale-ai-models

Researchers and industry professionals initially attempted to handle this challenge by improving the parallelism of traditional general-purpose processors.

This was enough at first, but as models evolved it became difficult to scale compute with this approach. So, to offload the compute from general-purpose processors, the idea of hardware accelerators was adopted.

Typically these accelerators take over the more computationally complex tasks, such as matrix multiplication and convolution, while the processor continues its normal routine. Because an accelerator is tailor-made for a particular operation, we would need a separate unit for each task. This delivers performance, but it increases the cost of the hardware considerably.

So we engineer hardware and software solutions that offload compute to the smallest possible set of accelerators, aiming to achieve the required performance at a reasonable cost.

Systolic Arrays

Most AI workloads are dominated by matrix multiplication (GEMM), and accelerating it helps the host processor considerably, since the operation in its vanilla form has a time complexity of O(n³). To offload this operation, researchers use a piece of hardware called the systolic array (SA), which uses pipelining and parallel computation to accelerate GEMM.
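For reference, the vanilla GEMM is just three nested loops of multiply-accumulate operations, which is where the O(n³) cost comes from (a minimal sketch, not how optimized libraries implement it):

import numpy as np

def naive_gemm(A, B):
    # C[i, j] = sum over k of A[i, k] * B[k, j]
    n, m = A.shape
    m2, p = B.shape
    assert m == m2, "inner dimensions must match"
    C = np.zeros((n, p))
    for i in range(n):          # n iterations
        for j in range(p):      # x n iterations
            for k in range(m):  # x n multiply-accumulates -> O(n^3) for square matrices
                C[i, j] += A[i, k] * B[k, j]
    return C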

4x4 Systolic Array performing the GEMM operation

Many modern accelerators and Deep Learning Processing Units (DPUs) use SAs to accelerate workloads. Some interesting solutions built around SAs are the Google TPU and, very likely, the NVIDIA H100.

Though SAs perform very well on matrix multiplication, they cannot readily be used for operations like 2D convolution, which is a key operation behind CNNs and many generative AI workloads.

The Toeplitz Transform

To reuse SA hardware for operations like 2D convolution, we have to transform the convolution into a GEMM operation. We can do this via the Toeplitz transform. To understand how this works, let us consider a simple example.

Let the input matrix be 4x4, the kernel 2x2, and the stride 1.

To perform the convolution, we slide the kernel over the 4x4 matrix and, at each position, perform a point-wise multiply-and-accumulate.

Input -> | a0  a1  a2  a3
           a4  a5  a6  a7
           a8  a9  a10 a11
           a12 a13 a14 a15 |

Kernel -> | k0 k1
            k2 k3 |

2D Convolution -> | (a0.k0+a1.k1+a4.k2+a5.k3)    (a1.k0+a2.k1+a5.k2+a6.k3)    (a2.k0+a3.k1+a6.k2+a7.k3)
                    (a4.k0+a5.k1+a8.k2+a9.k3)    (a5.k0+a6.k1+a9.k2+a10.k3)   (a6.k0+a7.k1+a10.k2+a11.k3)
                    (a8.k0+a9.k1+a12.k2+a13.k3)  (a9.k0+a10.k1+a13.k2+a14.k3) (a10.k0+a11.k1+a14.k2+a15.k3) |
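The same sliding-window computation can be written out directly. The sketch below assumes stride 1 by default and no padding; the function name is only illustrative:

import numpy as np

def conv2d_direct(A, K, stride=1):
    # Slide the kernel over the input and accumulate the point-wise products.
    out_h = (A.shape[0] - K.shape[0]) // stride + 1
    out_w = (A.shape[1] - K.shape[1]) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = A[i * stride:i * stride + K.shape[0], j * stride:j * stride + K.shape[1]]
            out[i, j] = np.sum(patch * K)
    return out

For the 4x4 input and 2x2 kernel above, this returns the same 3x3 result.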

With a stride of 1, we get the 3x3 result matrix. Now, to convert this into a GEMM operation, we use the stride to reorganize the input as follows.

A' -> | a0  a1  a4  a5
        a1  a2  a5  a6
        a2  a3  a6  a7
        a4  a5  a8  a9
        a5  a6  a9  a10
        a6  a7  a10 a11
        a8  a9  a12 a13
        a9  a10 a13 a14
        a10 a11 a14 a15 |

K' -> | k0
        k1
        k2
        k3 |

A' x K' gives the same values as Conv(A, K), laid out as a single column.

A' x K' -> | a0.k0+a1.k1+a4.k2+a5.k3
             a1.k0+a2.k1+a5.k2+a6.k3
             a2.k0+a3.k1+a6.k2+a7.k3
             .
             .
             .
             a10.k0+a11.k1+a14.k2+a15.k3 |

Now we reshape this column to get the final result. The output size along each axis is given by (I - K)/S + 1, where I is the input size, K the kernel size, and S the stride (with zero padding P, the general form is (I - K + 2P)/S + 1).

So using

I = 4
K = 2
S = 1

we have (4 - 2)/1 + 1 = 2 + 1 = 3. As the input is a square matrix, the other dimension is also 3.

A' x K' -> (9x4) x (4x1) -> (9x1).

So we need to reshape the 9x1 result to 3x3, which is the correct shape for the output.
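In code, the shape bookkeeping could look like this (a small helper mirroring the formula above; A_prime and K_prime stand for the A' and K' matrices built earlier):

def conv_output_dim(I, K, S=1, P=0):
    # (I - K + 2P) / S + 1
    return (I - K + 2 * P) // S + 1

out_h = conv_output_dim(4, 2)   # 3
out_w = conv_output_dim(4, 2)   # 3
# result = (A_prime @ K_prime).reshape(out_h, out_w)   # 9x1 column -> 3x3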

This is only one way to apply the transform. One could also build the matrix from the kernel, as a set of shifted kernel rows, and flatten the input instead, as follows.

A'' -> | k0 k1 0  0  k2 k3 0  0  0  0  0 .... 0
         0  k0 k1 0  0  k2 k3 0  0  0  0 .... 0
         .
         .
         .
         0  0  0  .... k0 k1 0  0  k2 k3 |

K'' -> | a0
         a1
         a2
         a3
         a4
         a5
         a6
         a7
         a8
         a9
         a10
         a11
         a12
         a13
         a14
         a15 |

A'' x K'' -> | a0.k0+a1.k1+a4.k2+a5.k3
               a1.k0+a2.k1+a5.k2+a6.k3
               .
               .
               .
               a10.k0+a11.k1+a14.k2+a15.k3 |

This gives us the ability to reuse SA-based hardware to perform 2D convolution.
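A minimal sketch of this second layout is shown below. The helper name is mine and the construction is deliberately dense for clarity; in practice this matrix is mostly zeros, so real implementations would keep it implicit or sparse:

import numpy as np

def kernel_toeplitz(K, in_h, in_w, stride=1):
    # Build A'' such that A'' @ A.flatten() equals Conv(A, K) flattened.
    kh, kw = K.shape
    out_h = (in_h - kh) // stride + 1
    out_w = (in_w - kw) // stride + 1
    T = np.zeros((out_h * out_w, in_h * in_w))
    for i in range(out_h):
        for j in range(out_w):
            row = i * out_w + j
            for p in range(kh):
                for q in range(kw):
                    # Place kernel value K[p, q] where it overlaps input element (i*stride + p, j*stride + q)
                    T[row, (i * stride + p) * in_w + (j * stride + q)] = K[p, q]
    return T

# For a 4x4 input and 2x2 kernel: A'' is 9x16, the flattened input is 16x1,
# and the product is the 9x1 convolution result from before.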

Implementation

import numpy as np

### Uses Approach 1: each row of A' is one flattened kernel-sized patch of the input
def toeplitz_transform(mat, ker, S=1, P=0):
    # Zero-pad the input if padding is requested
    x = np.pad(mat, P)
    y = ker
    # Output size along each axis: (I - K + 2P)/S + 1 (x already includes the padding)
    out_h = (x.shape[0] - y.shape[0]) // S + 1
    out_w = (x.shape[1] - y.shape[1]) // S + 1

    matrix = []
    for i in range(out_h):
        for j in range(out_w):
            matrix.append(x[i * S:i * S + y.shape[0], j * S:j * S + y.shape[1]].flatten())

    return np.array(matrix)

### Example input and kernel (3x3 input, 2x2 kernel, matching the validation below)
x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
y = np.array([[-1, 1],
              [1, 0]])

### Transforming Convolution to Matrix Multiplication
out_dim = (x.shape[0] - y.shape[0]) // 1 + 1   # (I - K)/S + 1 = 2 for this example
result = np.matmul(toeplitz_transform(x, y), y.flatten()).reshape(out_dim, out_dim)
print(result)

>>> [[5 6]
     [8 9]]

### Validating the result
import torch

c = torch.nn.Conv2d(1, 1, kernel_size=(2, 2), stride=(1, 1), padding=(0, 0), bias=False)

kernel = torch.tensor([[-1, 1],
                       [1, 0]], dtype=torch.float32)
kernel = kernel.reshape(1, 1, 2, 2)

# Load the same kernel into the Conv2d layer
c.weight = torch.nn.Parameter(kernel)

xp = torch.tensor(x, dtype=torch.float32)
xp = xp.reshape(1, 1, 3, 3)

print(c(xp))

>>> tensor([[[[5., 6.],
              [8., 9.]]]], grad_fn=<ConvolutionBackward0>)

Conclusion

Using matrix transforms, we can reuse hardware tailor-made for GEMM to perform 2D convolutions and hence offload those as well. The transform does, however, come with additional overhead: the reorganized data is larger than the original (overlapping patches are duplicated in A', and A'' is mostly zeros), which costs extra memory and bandwidth.

LinkedIn — https://www.linkedin.com/in/srimanth-tenneti-662b7117b/
