#### **RISC-V Near-Memory Processing Accelerators**

Peter Hsu & Associates, S.L. Barcelona, Spain

Email: peter.hsu@phaa.eu LinkedIn: www.linkedin.com/in/peter-hsu-122a315

©2022 Peter Hsu & Associates, S.L.

Peter Hsu, Ph.D.

### Introduction

- RISC-V opens many new opportunities for innovation and enables smaller organizations to design computer hardware
- But no matter how creative, a new chip design using advanced process technology faces enormous development cost
- We propose a "mix and match" paradigm that combines state-of-the-art memory technology with mature ASIC logic to reduce the development cost of near-memory computing accelerators
- Examples: machine learning and supercomputing accelerators

# Why Stacked Memory?

#### **Technical**

Shorter SRAM access path, shorter node-to-node communication distance

#### **Business**

Reduce development cost







Multiple accelerators specialized for different applications



Save cost by building custom logic on mature ASIC technology

#### Reuse stacked memory for different designs







#### Agenda

1. Economics & technology
2. Accelerator architecture
3. Examples
4. Programming model & evaluation
5. Memory technologies
6. Conclusion

Many interesting accelerator architectures are possible

We propose a technical solution for near-memory computing accelerators that is economically feasible for smaller organizations with limited resources

More details in paper "In-Memory Accelerators Using Stacked Memory" (PDF)



## Why Specialize?

- Performance
- Energy







| Hardware                      | CPUs      | CPUs      | CPUs    | GPUs     | GPUs               | GPUs     | GPUs     |
|-------------------------------|-----------|-----------|---------|----------|--------------------|----------|----------|
| Manufacturer                  | Intel     | Intel     | Intel   | NVIDIA   | Sony, IBM, Toshiba | NVIDIA   | NVIDIA   |
| Model                         | Q9450     | Q9450     | Q9450   | 7900 GTX | PlayStation 3      | 8800 GTX | GTX 280  |
| # cores used                  | 1         | 4         | 4       | 4x96     | 2+6                | 4x128    | 4x240    |
| Implementation                | MATLAB    | MATLAB    | SSE2    | Cg       | Cell SDK           | CUDA     | CUDA     |
| Year                          | 2008      | 2008      | 2008    | 2006     | 2007               | 2007     | 2008     |
| Performance / Cost            |           |           |         |          |                    |          |          |
| Full System Cost<br>(approx.) | \$1,500** | \$2,700** | \$1,000 | \$3,000* | \$400              | \$3,000* | \$3,000* |
| Relative Speedup              | 1x        | 4x        | 80x     | 544x     | 222x               | 1544x    | 2712x    |
| Relative Perf. / \$           | 1x        | 2x        | 120x    | 272x     | 833x               | 772x     | 1356x    |

### **Economic Considerations**

#### Cost of silicon device

- Recurring cost (RE)
- Non-recurring cost (NRE)

5nm chip \$500M NRE

- 50M devices $\rightarrow$ \$10/device
- 100K devices $\rightarrow$ \$5,000/device
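The two per-device figures above follow from dividing the one-time cost by the production volume. A minimal sketch of that amortization, using only the \$500M NRE figure from the slide (the function name `nre_per_device` is illustrative, and recurring cost is left out because the slide gives no number for it):

```python
def nre_per_device(nre_dollars: float, units: int) -> float:
    """Share of the one-time (non-recurring) cost carried by each device."""
    if units <= 0:
        raise ValueError("units must be positive")
    return nre_dollars / units


NRE_5NM = 500e6  # $500M NRE for a 5nm chip, as quoted on the slide

for units in (50_000_000, 100_000):
    share = nre_per_device(NRE_5NM, units)
    print(f"{units:>11,} devices -> ${share:,.0f} NRE per device")
```

At low volume the NRE share dominates whatever the recurring cost is, which is the motivation for sharing an expensive memory design across several accelerators.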



### Problem

- RISC-V enables hardware innovation by smaller organizations
  - No architecture licensing fee, no limits on customizing ISA
  - Open source software ecosystem with compilers, OS...
- But creating a commercially competitive accelerator is challenging
  - Leading edge semiconductor design is extremely expensive
  - Novel ideas take time to gain momentum and volume in market

#### Solution

- "End of Moore's Law" $\rightarrow$ rapid advances in packaging technology
  - Multiple chips on substrate (Open Chiplet Initiative)
  - Chip/wafer stacking (TSMC 3DFabric<sup>™</sup>)
  - Example: GPU die with stacked HBM memories on substrate
- Mix and match advanced memory and mature logic technology
  - Multiple accelerators share design cost of stacked memory

#### Some Recent Accelerators





## In-, At-, Near-Memory Computing

**Global In-Memory Computing Market** Is Expected to Reach USD 41.53 Billion by 2028: Fior Markets





# Memory Chip Different from ASIC

FLASH, DDR, LPDDR, even HBM are high volume chips because

- Same chip used in multiple products
- Many chips in a single product

Memory cell density/flexibility tradeoff

- Dedicated memory fab $\rightarrow$ lowest cost, inflexible, standard parts
- Embedded logic fab $\rightarrow$ medium cost, customizable like ASIC

# Mix and Match Paradigm

Monolithic in-memory computing chips use a lot of area for memory

- Expensive advanced logic process not fully utilized
- Long wires over memory area between logic islands waste energy

We propose separating memory and logic onto stacked dies

- Differently optimized process for memory and logic gates
- Contiguous logic reduces wire length, improves energy efficiency

## Wafer Stacking

- Same size dies, yield = one big die

# Chip Stacking

- Accommodate different size dies, different processes
- Test before stacking (KGD)  $\rightarrow$  large logic chip is possible
- Short vertical interconnect using TSV  $\rightarrow$  good performance, power
- Higher cost

#### Direct copper-to-copper bond

Figure: AMD 3D chiplet technology ("A packaging breakthrough for high-performance computing"). A 64MB L3 cache die and structural silicon are stacked on an up to 8-core "Zen 3" CCD, with through-silicon vias (TSVs) for silicon-to-silicon communication.

InFO: Integrated Fan-Out. CoWoS: Chip on Wafer on Substrate.

## **Memory Stack**

TSMC 7nm SRAM optimized process  $\approx 2$  MB/mm<sup>2</sup>

- Chose 66 mm<sup>2</sup> stack area for reasonable yield
- 1 GB in 8 layers = 531 mm<sup>2</sup> silicon area
- 4 TB/s bandwidth using 10% TSV area overhead
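The stack figures above can be sanity-checked with a few multiplications. This is a rough sketch using the rounded numbers from the slide; the count of 8 stacks is taken from the Platform Architecture slide, and no TSV overhead is subtracted because the slide does not say how it is accounted for.

```python
DENSITY_MB_PER_MM2 = 2   # TSMC 7nm SRAM-optimized process (slide figure)
STACK_AREA_MM2 = 66      # footprint of one stack
LAYERS = 8               # memory dies per stack
STACK_BW_TBPS = 4        # bandwidth of one stack
NUM_STACKS = 8           # from the Platform Architecture slide

raw_mb_per_layer = DENSITY_MB_PER_MM2 * STACK_AREA_MM2  # 132 MB
raw_mb_per_stack = raw_mb_per_layer * LAYERS            # 1056 MB, roughly 1 GB
silicon_mm2_per_stack = STACK_AREA_MM2 * LAYERS         # 528 mm^2 of SRAM silicon

total_gb = NUM_STACKS * 1                               # 1 GB per stack
total_bw_tbps = NUM_STACKS * STACK_BW_TBPS              # 32 TB/s

print(f"raw capacity per stack : {raw_mb_per_stack} MB")
print(f"silicon area per stack : {silicon_mm2_per_stack} mm^2")
print(f"{NUM_STACKS} stacks: {total_gb} GB, {total_bw_tbps} TB/s")
```

The rounded inputs give 528 mm² rather than the 531 mm² quoted; the small gap presumably comes from rounding of the 66 mm² footprint.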

Multiple stacks on large logic chip

More capacity, bandwidth 



Compute-in-Memory IC



# Logic Chip

Mature technology

- GlobalFoundries 22FDX, TSMC 22ULP...
- Low CAD tools cost, many IP available

66 mm<sup>2</sup> stack  $\rightarrow$  128 logic tiles each 0.52 mm<sup>2</sup>

- 1.8 MGE (million gate equivalent) per tile  $\approx$  12 IEEE 64-bit FP multiply-add units
- 230 MGE per stack (1500 DP FPMAC)
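The per-stack totals above are the per-tile figures multiplied by 128. A short check, using only numbers from the slide (the variable names are illustrative):

```python
STACK_AREA_MM2 = 66
TILES_PER_STACK = 128
MGE_PER_TILE = 1.8      # million gate equivalents per tile
FPMAC_PER_TILE = 12     # 64-bit FP multiply-add units that fit in 1.8 MGE

tile_area_mm2 = STACK_AREA_MM2 / TILES_PER_STACK      # about 0.52 mm^2
mge_per_stack = TILES_PER_STACK * MGE_PER_TILE        # about 230 MGE
fpmac_per_stack = TILES_PER_STACK * FPMAC_PER_TILE    # 1536, quoted as ~1500

print(f"tile area       : {tile_area_mm2:.2f} mm^2")
print(f"gates per stack : {mge_per_stack:.0f} MGE")
print(f"FPMAC per stack : {fpmac_per_stack}")
```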






### **Platform Architecture**

8 SRAM stacks on 672 mm<sup>2</sup> logic chip (24×28 mm)

- 1024 tiles
- 1 GHz logic in 22 nm
- 2D torus NoC, 32-bit links
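To make the 2D torus concrete, the sketch below computes a tile's four neighbours and the minimum hop count between two tiles. The slide gives 1024 tiles but not the grid shape, so the 32×32 layout is an assumption, as are the function names.

```python
WIDTH, HEIGHT = 32, 32  # assumed layout: 32 x 32 = 1024 tiles


def torus_neighbors(x, y, width=WIDTH, height=HEIGHT):
    """Four neighbours of tile (x, y); edges wrap around in both dimensions."""
    return [((x + 1) % width, y), ((x - 1) % width, y),
            (x, (y + 1) % height), (x, (y - 1) % height)]


def torus_hops(a, b, width=WIDTH, height=HEIGHT):
    """Minimum hop count between tiles a and b, taking the shorter way round."""
    dx = abs(a[0] - b[0])
    dy = abs(a[1] - b[1])
    return min(dx, width - dx) + min(dy, height - dy)


print(torus_neighbors(0, 0))          # corner tile still has four neighbours
print(torus_hops((0, 0), (31, 31)))   # opposite corner is two hops away
print(torus_hops((0, 0), (16, 16)))   # worst case on a 32 x 32 torus
```

The wrap-around links halve the worst-case distance compared with a plain mesh of the same size.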

Tile is a complete computer

- Shared memory multicore architecture (SMP)
- Eight 32-bit memory accesses per cycle (32 GB/sec)
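The 32 GB/sec per-tile figure, and the aggregate bandwidth it implies, can be reproduced as follows. This is a sketch from the slide's numbers (1 GHz clock, 1024 tiles); decimal units are assumed.

```python
ACCESSES_PER_CYCLE = 8
BYTES_PER_ACCESS = 4       # 32-bit accesses
CLOCK_HZ = 1e9             # 1 GHz logic in 22 nm
TILES = 1024

tile_bw_gbs = ACCESSES_PER_CYCLE * BYTES_PER_ACCESS * CLOCK_HZ / 1e9
chip_bw_tbs = tile_bw_gbs * TILES / 1e3

print(f"per tile : {tile_bw_gbs:.0f} GB/s")
print(f"per chip : {chip_bw_tbs:.1f} TB/s")
```

The aggregate of about 32.8 TB/s is consistent with 8 stacks at roughly 4 TB/s each.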





# Manufacturing Ecosystem

 TSMC has been developing stacked memory technology for some time



Mixing in a mature logic chip is a business decision






#### **GLink-3D Application: SRAM on Top of Processor**

- SRAM is separated from integrated chip and located on top of Processor
- GLink-3D interface allows low area/power/latency connection



#### AMD Ryzen 7 5800X3D shipped out of factory, first CPU with 3D V-Cache

AMD's first consumer CPU with its next-gen 3D V-Cache technology is now shipping, Ryzen 7 5800X3D will be in-hands this month.






#### We illustrate with examples chosen for simplicity of explanation

**Real commercial designs** could do better

# HPC Example (SpMV++)

Specialized sparse matrix accelerator

- 8 GB, 4 TFLOPS (DP), 250 W PCIe card
- 8 bytes/FLOP memory bandwidth
- Logic 147.5 W (144 pJ per tile, 22nm)
- SRAM 25.6 W (25.6 pJ per tile, 7nm)
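The peak throughput and logic power above are consistent with the platform numbers if the energy figure is read as picojoules per tile per clock cycle. That reading is an assumption, as is reusing the 1024-tile, 1 GHz, 32 bytes/cycle platform from the Platform Architecture slide.

```python
TILES = 1024
CLOCK_HZ = 1e9
BYTES_PER_CYCLE_PER_TILE = 32      # eight 32-bit accesses per cycle
BYTES_PER_FLOP = 8                 # memory bandwidth target from the slide
LOGIC_PJ_PER_TILE_CYCLE = 144      # assumed to mean energy per tile per cycle

flops_per_cycle_per_tile = BYTES_PER_CYCLE_PER_TILE / BYTES_PER_FLOP   # 4.0
peak_tflops = flops_per_cycle_per_tile * TILES * CLOCK_HZ / 1e12       # ~4.1
logic_watts = LOGIC_PJ_PER_TILE_CYCLE * 1e-12 * TILES * CLOCK_HZ       # ~147.5

print(f"FLOP per tile per cycle : {flops_per_cycle_per_tile:.0f}")
print(f"peak DP throughput      : {peak_tflops:.3f} TFLOPS")
print(f"logic power             : {logic_watts:.1f} W")
```

Four FLOP per cycle corresponds to two fused multiply-adds per tile per cycle, which is an inference from the numbers rather than something the slide states.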

Evaluate using HPCG benchmark



Tile

# SpMV++ vs. GPU

#### NVIDIA A100 is today's premier HPC accelerator

- 7 nm
- 826 mm<sup>2</sup>
- HBM2E
- 250 W

| Metric           | A100  | SpMV++ | Improvement |
|------------------|-------|--------|-------------|
| Capacity (GB)    | 80    | 8      | 0.1x        |
| Peak TFLOPS (DP) | 19.5  | 4.1    | 0.21x       |
| Bandwidth (TB/s) | 2     | 32.8   | 16x         |
| Bytes/FLOP       | 0.10  | 8.0    | 78x         |
| HPCG TFLOPS      | 0.227 | 2.9    | 13x         |
| FPU Utilization* | 1.16% | 70%    | 60x         |
| Power (W)        | 250   | 250    | 1x          |
| GFLOPS/W         | 0.91  | 11.5   | 13x         |

*estimate for SpMV++
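The derived rows of the comparison (bytes/FLOP, FPU utilization, GFLOPS/W) follow from the raw rows. The sketch below recomputes them from the table's own inputs; the dictionary keys and the `derived` helper are illustrative, and the SpMV++ inputs are the presentation's estimates, not measurements.

```python
a100 = {"peak_tflops": 19.5, "bw_tbs": 2.0, "hpcg_tflops": 0.227, "power_w": 250}
spmv = {"peak_tflops": 4.1, "bw_tbs": 32.8, "hpcg_tflops": 2.9, "power_w": 250}


def derived(d):
    """Compute the table's derived metrics from its raw rows."""
    return {
        "bytes_per_flop": d["bw_tbs"] / d["peak_tflops"],
        "fpu_utilization": d["hpcg_tflops"] / d["peak_tflops"],
        "gflops_per_w": d["hpcg_tflops"] * 1000 / d["power_w"],
    }


da, ds = derived(a100), derived(spmv)
for key in da:
    print(f"{key:16s} A100={da[key]:8.4f}  SpMV++={ds[key]:8.4f}  "
          f"ratio={ds[key] / da[key]:6.1f}x")
```

Recomputing from the rounded inputs gives about 11.6 GFLOPS/W for SpMV++ instead of 11.5; the difference is within the rounding of the table entries.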



**NVIDIA A100 GPU** 



SpMV++

### ML Example

Specialized machine learning accelerator

- 4-bit precision (logarithmic numbers); see "Ultra-Low Precision 4-bit Training of Deep Neural Networks," IBM Research
- 8 GB, 262 TOPS 4-bit multiply, higher precision accumulate
- 500 MHz, low VDD 0.45V vs. 0.85V for 1 GHz operation
- 3.9 TOPS/W energy efficiency
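Two pieces of arithmetic sit behind these bullets: how many 4-bit multiplies each tile must perform per cycle, and how much lowering voltage and frequency reduces switching power. The sketch assumes the same 1024-tile platform as the HPC example and uses the standard CMOS dynamic power model $P \propto C V^2 f$, which ignores leakage; neither assumption is stated on the slide.

```python
TILES = 1024            # assumed: same platform as the HPC example
CLOCK_HZ = 500e6
TOTAL_TOPS = 262
TOPS_PER_WATT = 3.9

ops_per_tile_cycle = TOTAL_TOPS * 1e12 / (TILES * CLOCK_HZ)   # about 512
implied_power_w = TOTAL_TOPS / TOPS_PER_WATT                  # about 67 W


def dynamic_power_ratio(v_new, f_new, v_old, f_old):
    """Ratio of switching power under P ~ C * V^2 * f (leakage ignored)."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


ratio = dynamic_power_ratio(0.45, 500e6, 0.85, 1e9)

print(f"4-bit multiplies per tile per cycle : {ops_per_tile_cycle:.0f}")
print(f"implied power at 3.9 TOPS/W         : {implied_power_w:.0f} W")
print(f"switching power vs 0.85 V at 1 GHz  : {ratio:.2f}x")
```

Under this model, halving the clock and dropping VDD from 0.85 V to 0.45 V cuts switching power to roughly one seventh, while the energy per operation falls by the voltage-squared term alone.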



## Mainstream Programming Model

Cluster of computers  $\rightarrow$  array of tiles

Multicores with shared coherent memory (SMP)

Network  $\rightarrow$  network on chip (NoC)

- Application specific interconnection topology
- Front-end computers → host processors

Same application programming interface (API)

- Linux threads, sockets, RDMA...



# Codesign Methodology

1. Develop algorithm on standard cluster of SMP servers
   - Standard Linux API, but respecting target node memory size
2. Simulate near-memory RISC-V SMP cluster
   - Parallel "thread per core, process per node" simulator (Cavatools)
3. Develop co-processor with custom instructions
   - Refine codesign and validate performance improvement

#### Cavatools

- Caveat RISC-V user-mode Linux virtual machine
  - Thread-per-core execution-driven simulator, $\approx 100$ MIPS
  - Shared memory (e.g. OpenMP), multiple nodes (e.g. MPI)

Custom instruction definition

- Spec $\rightarrow$ compiler intrinsic, asm, sim

**Open source** (this work was partially supported by BSC)

More details in ICS Conference presentation "Cavatools: Parallel Architecture Simulator for RISC-V" (PDF)





### **Caveat Simulation Paradigm**

Array of tiles  $\rightarrow$  Linux processes

- SMP cores  $\rightarrow$  Linux threads
- RISC-V AMO  $\rightarrow$  x86 CMPXCHG

Network on Chip → Linux sockets

Messages → Linux read(), write()
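The mapping above can be illustrated with a toy example in which a NoC link becomes a connected socket pair and each message becomes a `write()` followed by a `read()`. This is not Cavatools code: the `tile` function and the message format are hypothetical, a POSIX system is assumed, and for brevity both tiles run as threads of one Python process, whereas in the paradigm above each tile would be its own Linux process.

```python
import os
import socket
import struct
import threading


def tile(tile_id, link, results):
    """One simulated tile: send our id over the link, then read the peer's id."""
    fd = link.fileno()
    os.write(fd, struct.pack("<I", tile_id))        # message -> write()
    # A 4-byte message on a local socket pair arrives in one piece.
    (peer_id,) = struct.unpack("<I", os.read(fd, 4))  # message -> read()
    results[tile_id] = peer_id


# One NoC link between two tiles, modelled as a connected socket pair.
end_a, end_b = socket.socketpair()
results = {}
workers = [
    threading.Thread(target=tile, args=(0, end_a, results)),
    threading.Thread(target=tile, args=(1, end_b, results)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
end_a.close()
end_b.close()

print(results)  # each tile has learned the id of its neighbour
```

Because the simulated program only sees ordinary file descriptors, the same application code could in principle run unchanged on a real cluster, which is the point of keeping the Linux API.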

**Erised** – realtime visualization






Accelerator





#### **Erised** Performance Visualizer

Within a tile

- Pipeline stalls
- Instruction buffer misses

Across chip

- System calls (IPC)
- Message queues

Screenshot: Erised terminal view of a running simulation. The header lines report per-core instruction, ecall, and cycle counts (about 1.9B instructions, CPI = 3.85). The body is an annotated RISC-V disassembly of `cannonomp_fn.0` with per-instruction execution counts, CPI, and percentage columns.






**Optimized SRAM process is less expensive than ASIC**, but 7nm SRAM main memory is still quite expensive

We need a path to more affordable memory technology

# **Magnetoresistive Memory**

- Leverage mature process for low cost
  - MTJ extra module in standard CMOS
  - Similar to RF, Analog process modules

Attractive characteristics

- Read time, energy $\approx$ SRAM (but write is higher)
- Wear-out is no longer a problem

#### **SOT-MRAM To Challenge SRAM**



Spin-orbit torque memory adds endurance and faster write speeds, but displacing existing memories is still not easy.

JANUARY 13TH, 2022 - BY: BRYON MOYER



Conferences > 2018 IEEE International Solid...

A 1Mb 28nm STT-MRAM with 2.8ns read access time at 1.2V VDD using single-cap offset-cancelled sense amplifier and in-situ self-write-termination

# **Reducing Memory Cost**

- SRAM 7nm ≈ 2 MB/mm<sup>2</sup> (TSMC)
  - Scaling less than logic gates
- MRAM 28nm > 1 MB/mm<sup>2</sup> today
  - Nonvolatile → mobile devices
  - 3D (like XPoint<sup>™</sup>) in future

#### Samsung Demonstrates the World's First MRAM Based In-Memory Computing

Korea on January 13, 2022




## Summary

We provide

- RISC-V platform architecture
- High bandwidth, low latency SRAM memory
- Standard Linux programming environment
- Simulation tools to validate performance




You design arithmetic, interconnect, software for your application

Figure labels: SRAM banks; customized cores in "stunt box"

### Conclusion

- RISC-V enables hardware innovation but development cost limits creativity
- Proposed paradigm for near-memory accelerator using state-of-the-art memory with mature logic technology
- Enable commercially competitive accelerators based on novel ideas by smaller organizations with limited resources



LET A THOUSAND FLOWERS BLOOM



### Thank You

**Abstract:** The RISC-V architecture has opened new opportunities for many people to innovate in computer design. However, designing a chip that can compete in the marketplace against veteran industry computer designers with their vast resources is still a formidable challenge. We propose a solution for specialized accelerators with near-memory processing architectures. We observe that the critical technology is the embedded memory, because it consumes most of the silicon area and determines the power and bandwidth of the chip. If memory is instead stacked on top of the logic chip, then a less dense, lower-cost mature technology can be used for the logic. Communication wire power will be lower because the through-silicon via (TSV) interconnect traverses a much smaller distance, offsetting the lower power efficiency of mature logic technology. The design cost of the reusable high-tech memory chip is amortized across multiple accelerators. We believe this approach can help smaller organizations with limited resources design commercially competitive novel accelerators.

**Bio:** Peter Hsu received his Ph.D. from the University of Illinois Urbana-Champaign. He started work at the IBM T. J. Watson Research Center on the 801 Project. He joined SGI in 1990 as architect of the MIPS R8000 (TFP) microprocessor; the chip powered fifty TOP500 supercomputers in 1994. Peter co-founded ArtX in 1997 to develop the Nintendo GameCube video game console. He joined Oracle Labs in 2011 as Architect and built a fifty-thousand-core parallel SQL database accelerator. Peter moved to Europe in 2018 and was a visiting professor at EPFL in Switzerland, then a senior researcher at the Barcelona Supercomputing Center. In 2022 he started a consulting company in Spain, Peter Hsu & Associates, S.L.U.