### MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based On Row-Wise Product

Nitish Srivastava<sup>\*</sup>, Hanchen Jin, Jie Liu, David Albonesi and Zhiru Zhang

> School of ECE, Cornell University \*now at Google

1

### **Graph Computing**

(e.g. database searches in graphs <sup>[1]</sup>)



**query:** MATCH (src)-[:friend] $\rightarrow$ (f)  $\rightarrow$ [:friend]  $\rightarrow$ (fof) RETURN fof



**query:** MATCH (src)-[:friend] $\rightarrow$ (f)  $\rightarrow$ [:friend]  $\rightarrow$ (fof) RETURN fof

<u>https://www.youtube.com/watch?v=xnez6tloNSQ&feature=youtu.b</u>e
 J. R. Gilbert, S. Reinhardt, and V. B. Shah, "A Unified Framework for Numerical and Combinatorial Computing," Computing in Science & Engineering

### **Graph Computing** (e.g. database searches in graphs <sup>[1]</sup>)



**query:** MATCH (src)-[:friend] $\rightarrow$ (f)  $\rightarrow$ [:friend]  $\rightarrow$ (fof) RETURN fof

#### Graph Traversals<sup>[2]</sup>



### **Compressed Neural Networks**





Simple Sentence:

John (PN) sees (V) the (Det)

man (N) with (P) telescope (N)

Transitive Closure

[1] <u>https://www.youtube.com/watch?v=xnez6tloNSQ&feature=youtu.b</u>e

0

0

[2] J. R. Gilbert, S. Reinhardt, and V. B. Shah, "A Unified Framework for Numerical and Combinatorial Computing," Computing in Science & Engineering [3] Penn, Gerald. "Efficient transitive closure of sparse matrices over closed semirings.", Theoretical Computer Science, 2006

0

0

0













$$C[i,j] = \sum_{k} A[i,k] \cdot B[k,j]$$

 $C_{00} += \square \times \square = 0$ 

**3** operations to produce a **0** 



Inner Product  $C[i,j] = \sum_{k} A[i,k] \cdot B[k,j]$   $C_{00} += \square \times \square = 0$ 

3 operations to produce a 0

#### Inconsistent formatting Row-major format for matrix A and columnmajor for B



Inner Product  $C[i,j] = \sum_{i} A[i,k] \cdot B[k,j]$   $C_{00} += \square \times \square = 0$ 

3 operations to produce a 0

#### **Inconsistent formatting**

Row-major format for matrix A and columnmajor for B

#### Inefficient index matching

Index matches are required even for zero output value



Inner Product  $C[i,j] = \sum_{k} A[i,k] \cdot B[k,j]$ 

$$C_{00} += \square \times \square = 0$$

3 operations to produce a 0

#### **Inconsistent formatting**

Row-major format for matrix A and columnmajor for B

#### **Inefficient index matching**

Index matches are required even for zero output value

No synchronization



**Inner Product**  $C[i,j] = \sum_{k} A[i,k] \cdot B[k,j]$   $c_{00} += \square \times \square = 0$ 

3 operations to produce a 0

#### **Inconsistent formatting**

Row-major format for matrix A and columnmajor for B

#### Inefficient index matching

Index matches are required even for zero output value

#### No synchronization

### Low on-chip memory requirements



# **Outer Product** $C[:,:] = \sum_{k} A[:,k] \cdot B[k,:]$



# **Outer Product** $C[:,:] = \sum_{k} A[:,k] \cdot B[k,:]$





**Outer Product**  $C[:,:] = \sum_{k} A[:,k] \cdot B[k,:]$ 

cartesian product using only non-zeros

#### **Inconsistent formatting**

Column-major format for A and row-major for B

Improvement in index matching

#### **Synchronization**

High on-chip memory requirements

### **Challenges with Sparse-Sparse MM Acceleration**

# **Challenges with Sparse-Sparse MM Acceleration**

# Challenges

#### Choice Of Algorithm

• Inner & outer product not efficient

#### Memory Bandwidth

Low data reuse

#### Sparse Output Matrix

- Position of non-zeros are unknown
- Needs to be produced in parallel

# **Challenges with Sparse-Sparse MM Acceleration**

# Challenges

- Choice Of Algorithm
  - Inner & outer product not efficient
- Memory Bandwidth
  - Low data reuse
- Sparse Output Matrix
  - Position of non-zeros are unknown
  - Needs to be produced in parallel

# Solution

- **MatRaptor:** An efficient hardware accelerator for sparse-sparse MM
- Key Idea: Co-design <u>algorithm</u> and <u>sparse format</u>



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$



**Row-wise Product** 

$$C[i,:] = \sum_{k} A[i,k] \cdot B[k,:]$$

**Medium/low on-chip memory requirements** 















**a**<sub>11</sub>

PE1





















**a**<sub>33</sub> PE1

**Compressed Sparse Row** (CSR)





#### **Bandwidth Utilization**

HBM can provide upto 128 GB/s



HBM with 8 pseudo-channels (4 physical channels)

- HBM can provide upto 128 GB/s
- Memory-Level Parallelism
  - Can be achieved by accessing in parallel
    - Memory channels
    - Memory banks
  - Is hampered by
    - Channel conflicts
    - Bank conflicts



HBM with 8 pseudo-channels (4 physical channels)

- HBM can provide upto 128 GB/s
- Memory-Level Parallelism
  - Can be achieved by accessing in parallel
    - Memory channels
    - Memory banks
  - Is hampered by
    - Channel conflicts
    - Bank conflicts

CSR



HBM with 8 pseudo-channels (4 physical channels)



Assuming parallel row access w/ two HBM channels

- HBM can provide upto 128 GB/s
- Memory-Level Parallelism
  - Can be achieved by accessing in parallel
    - Memory channels
    - Memory banks
  - Is hampered by
    - Channel conflicts
    - Bank conflicts







HBM with 8 pseudo-channels (4 physical channels)

- HBM can provide upto 128 GB/s
- Memory-Level Parallelism
  - Can be achieved by accessing in parallel
    - Memory channels
    - Memory banks
  - Is hampered by
    - Channel conflicts
    - Bank conflicts





HBM with 8 pseudo-channels (4 physical channels)

Assuming parallel row access w/ two HBM channels















**a**<sub>11</sub>

PE1

















Cyclic Channel Sparse Row (C<sup>2</sup>SR)



Streaming accesses in each channel



#### C<sup>2</sup>SR

- None of the row split across channels
- Row parallelization does not incur channel conflicts

Cyclic Channel Sparse Row (C<sup>2</sup>SR)



Streaming accesses in each channel



#### C<sup>2</sup>SR

- None of the row split across channels
  - Row parallelization does not incur channel conflicts

Producing the output matrix in C<sup>2</sup>SR format means different PEs can work independently

**a**<sub>33</sub>

PE1

**a**<sub>23</sub>

PE0

Cyclic Channel Sparse Row (C<sup>2</sup>SR)



**Streaming accesses in each channel** 



#### Cyclic Channel Sparse Row (C<sup>2</sup>SR)



**Streaming accesses in each channel** 

#### C<sup>2</sup>SR

- None of the row split across channels
  - Row parallelization
    does not incur channel
    conflicts

Producing the output matrix in C<sup>2</sup>SR format means different PEs can work independently

**a**<sub>33</sub>

PE1

**a**<sub>23</sub>

PE0



#### Bandwidth Utilization

### **MatRaptor Architecture**



- Reads/writes in C<sup>2</sup>SR format to exploit HBM bandwidth
- Two-level load balancing for high compute utilization
  - Round-robin execution across PEs
  - Decoupled access, multiply & merge for each PE



## **Evaluation Methodology**

#### Cycle-level simulation in gem5

- 8 PE array, Memory Width 128 bytes
- 10 4 KB Queues (RAMs) per PE
- HBM: 8 128-bit channels (128 GB/s peak bandwidth)

#### RTL Modeling of a PE using PyMTL

#### Baselines

- CPU: Intel(R) Xeon(R) CPU E7-8867
  - Intel MKL
- GPU: Titan XP
  - CuSparse
- Accelerator:
  - OuterSPACE<sup>[5]</sup>

#### Area and Power Breakdown

| Component        | Area $(mm^2)$ | %       | Power (mW) | %       |
|------------------|---------------|---------|------------|---------|
| PE               | 1.981         | 87.77 % | 1050.57    | 78.11 % |
| – Logic          | 0.080         | 3.54 %  | 43.08      | 3.20 %  |
| - Sorting Queues | 1.901         | 84.22 % | 1007.49    | 74.90 % |
| SpAL             | 0.129         | 5.71 %  | 144.15     | 10.71 % |
| SpBL             | 0.129         | 5.71 %  | 144.15     | 10.71 % |
| Crossbars        | 0.016         | 0.7 %   | 6.067      | 0.45 %  |
| Total            | 2.257         | 100 %   | 1344.95    | 100 %   |

#### Datasets

- SuiteSparse<sup>[6]</sup>

### **Speedup on Sparse-Sparse MM**



MatRaptor is 129.2x, 7.9x and 1.8x faster than CPU, GPU and OuterSPACE

### **Energy Efficiency on Sparse-Sparse MM**



MatRaptor is 480x, 570x and 12x more energy-efficient than CPU, GPU and OuterSPACE

### **Conclusions & Future Work**

- MatRaptor, a new sparse-sparse MM accelerator, exploiting the codesign of hardware and sparse storage format
  - First accelerator that uses row-wise product approach
  - A novel sparse storage format C<sup>2</sup>SR to achieve high-memory bandwidth
  - Significant speedup and energy efficiency over CPU, GPU, and OuterSPACE

### Future Work

- Integrate MatRaptor into a many-tiny-core system
- Demonstrate a real prototype on FPGAs or ASICs

# **Thank you! Questions?**

### MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based On Row-Wise Product

Nitish Srivastava<sup>\*</sup>, Hanchen Jin, Jie Liu, David Albonesi and Zhiru Zhang

> School of ECE, Cornell University \*now at Google