In Int’l Symp. on Microarchitecture (MICRO), 2020

In Int’l Symp. on High-Performance Computer Architecture (HPCA), 2020

In Int’l Symp. on Field-Programmable Custom Computing Machines (FCCM), 2019

In FPGA, 2018

In FPGA, 2017

In IEEE Comm Letters, 2013


MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product

Sparse-sparse matrix multiplication (SpGEMM) is a computation kernel widely used in application domains such as data analytics, graph processing, and scientific computing. In this work we propose MatRaptor, a novel SpGEMM accelerator that is high-performance and highly resource-efficient. Unlike conventional methods that use the inner or outer product as the meta operation for matrix multiplication, our approach is based on the row-wise product, which offers a better tradeoff between data reuse and on-chip memory requirements and achieves higher performance for large sparse matrices. We further propose a new hardware-friendly sparse storage format, which allows parallel compute engines to access the sparse data in a vectorized and streaming fashion, leading to high utilization of memory bandwidth. We prototype and simulate our accelerator architecture using gem5 on a diverse set of matrices.
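
As a rough software illustration of the row-wise product formulation mentioned in this abstract, the sketch below computes each output row C[i, :] as a sum of rows of B scaled by the nonzeros of A[i, :]. The dict-of-rows layout and the name `spgemm_row_wise` are assumptions made for this example; they are not the accelerator's hardware-friendly storage format or its implementation.

```python
from collections import defaultdict


def spgemm_row_wise(a_rows, b_rows):
    """Multiply sparse A by sparse B with the row-wise product.

    a_rows, b_rows: dict mapping row index -> {column index: value},
    holding only the nonzero entries (an illustrative layout).
    Returns C = A @ B in the same layout.
    """
    c_rows = {}
    for i, a_row in a_rows.items():
        # Accumulator for output row i; only touched columns are stored.
        acc = defaultdict(float)
        for k, a_ik in a_row.items():
            # Each nonzero A[i, k] scales row k of B and adds it into C[i, :].
            for j, b_kj in b_rows.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            c_rows[i] = dict(acc)
    return c_rows


# Small example: A is 2x3, B is 3x2.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm_row_wise(A, B)
```

In this formulation, only one output row is accumulated at a time, and the rows of A and B are read sequentially within each step, which is the property the abstract refers to when it describes the tradeoff between data reuse and on-chip memory.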

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

Tensor factorizations are powerful tools in many machine learning and data analytics applications. Tensors are often sparse, which makes sparse tensor factorizations memory bound. In this talk, I present a hardware accelerator, Tensaurus, that can accelerate both dense and sparse tensor factorizations. We co-design the hardware and a sparse storage format, which allows accessing the sparse data in a vectorized and streaming fashion and maximizes the utilization of the memory bandwidth. We also extract a common computation pattern that is found in numerous matrix and tensor operations and implement it in the hardware.

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time.


Pointer-Chase Prefetcher

Pointer-Chase Prefetcher for Linked Data Structures


I have been a TA for the following courses at Cornell University:

  • CS3420: Embedded Systems