Nitish

PhD candidate at Computer System Laboratory

Cornell University

Biography

I am a software engineer in the edge TPU group at Google. I completed my Ph.D. at Cornell University where I developed domain-specific language, T2S-Tensor, to productively generate high-performance accelerators for dense tensor computations and a domain-specific hardware, Tensaurus, to accelerate mixed sparse-dense tensor computations. I was advised by Prof. David Albonesi and Prof. Zhiru Zhang. Before coming to Cornell, I obtained my Bachelor of Technology degree in Electrical Engineering from Indian Institute of Technology, Kanpur, where I was awarded with the President’s Gold Medal. You can find my CV here.

I am interested in re-thinking algorithm, language and hardware design to accelerate sparse and dense tensor algebra.

Interests

Computer Architecture
High Level Synthesis
VLSI

Education

PhD in Computer Architecture

Cornell University
BTech in Electrical Engineering, 2014

Indian Institute of Technology, Kanpur

Publications

Design and Generation of Efficient Hardware Accelerators for Sparse and Dense Tensor Computations

Nitish Srivastava

Ph.D. Dissertation

Details PDF Slides

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product

Nitish Srivastava, Hanchen Jin, Jie Liu, David Albonesi, Zhiru Zhang

In Int’l Symp. on Microarchitecture (MICRO), 2020

Details PDF Slides Video

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations.

Nitish Srivastava, Hanchen Jin, Shaden Smith, Hongbo Rong, David Albonesi, Zhiru Zhang

In Int’l Symp. on High-Performance Computer Architecture (HPCA), 2020

Details PDF Slides Video

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Nitish Srivastava, Hongbo Rong, Zhiru Zhang, Pradeep Dubey, et al.

In Int’l Symp. of Field-Programmable Custom Computing Machines (FCCM), 2019

Details PDF Slides Video

Operation Dependent Frequency Scaling Using Desynchronization

Nitish Srivastava, Rajit Manohar

In TVLSI, 2019

Details PDF

Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Srivastava, et al.

In FPGA, 2018

Details PDF Code

Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

Nitish Srivastava, Steve Dai, Rajit Manohar, Zhiru Zhang

In FPGA, 2017

Details PDF Code

Flexible and Dynamic Power Allocation in Broadband Multi-Beam Satellites

Nitish Srivastava, Ajit Chaturvedi

In IEEE Comm Letters, 2013

Details PDF

Talks

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product

Sparse-sparse matrix multiplication (SpGEMM) is a computation kernel widely used in numerous application domains. MatRaptor, a novel SpGEMM accelerator, is high performance and highly resource efficient. Unlike conventional methods using inner or outer product as the meta operation for matrix multiplication, our approach is based on row-wise product, which offers a better tradeoff in terms of data reuse and on-chip memory requirements, and achieves higher performance for large sparse matrices. We further propose a new hardware-friendly sparse storage format, which allows parallel compute engines to access the sparse data in a vectorized and streaming fashion, leading to high utilization of memory bandwidth.

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

Tensor factorizations are powerful tools in many machine learning and data analytics applications. Tensors are often sparse, which makes sparse tensor factorizations memory bound. In this talk, I present a hardware accelerator, Tensaurus, that can accelerate both dense and sparse tensor factorizations. We co-design the hardware and a sparse storage format, which allows accessing the sparse data in vectorized and streaming fashion and maximizes the utilization of the memory bandwidth. We also extract a common computation pattern that is found in numerous matrix and tensor operations and implement it in the hardware.

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time.

Recent News

Inaccel announces record-breaking speed on facial detection test using an FPGA cluster

InAccel, a world-pioneer in the domain of FPGA-based accelerators, has released an integrated framework that allows to utilize the power of an FPGA cluster for face detection, see news Specifically, InAccel has presented a demo in which a cluster of 8 FPGAs are used to provide up to 1700 fps (supporting up to 56 cameras with 30 fps in a single server). The HLS implementation used unded the hood is the face detection design proposed in our FPGA 2017 and FPGA 2018 papers, see this. See the demo from inaccel below:

Projects

Pointer-Chase Prefetcher

Pointer-Chase Prefetcher for Linked Data Structures

Teaching

I have been TA for following courses at Cornell University:

CS3420: Embedded Systems

Contact

nks45@cornell.edu

Nitish

PhD candidate at Computer System Laboratory

Cornell University

Biography

Interests

Education

Publications

Design and Generation of Efficient Hardware Accelerators for Sparse and Dense Tensor Computations

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations.

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Operation Dependent Frequency Scaling Using Desynchronization

Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs

Accelerating Face Detection on Programmable SoC Using C-Based Synthesis

Flexible and Dynamic Power Allocation in Broadband Multi-Beam Satellites

Talks

MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product

Tensaurus: A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations

Recent News

Inaccel announces record-breaking speed on facial detection test using an FPGA cluster

Posts

Should I Buy a House?

A Tutorial on the Gem5 Minor CPU Model

Adding custom instruction to RISCV ISA and running it on gem5 and spike

Adding custom statistics to gem5

Common tips for technical writing

Gem5 Simulation Framework

Gem5 miscellaneous

Projects

Pointer-Chase Prefetcher

Teaching

Contact