

# SPIN and DRAIN: A new approach to address deadlocks in interconnection networks

#### Paul V. Gratz

Computer Engineering and Systems Group Department of Electrical and Computer Engineering Department of Computer Science and Engineering Texas A&M University



#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

## First, a history lesson...



- 4-way Out-of-Order Superscalar
- High performance
  branch predictor
- Execution trace cache
- "Hyper"-pipelining
- "Hyper"-threading

Intel Pentium 4 Circa 2001

- Double-pumped ALU
- 126 instructions in flight
- 6-uOps issue/cycle
- SSE Instructions
- 96 (48)-entry Load (Store) buffer
- 2\*128-entry regfile



#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

## First, a history lesson...



• 4-way Out-of-Order Superscalar

Intel Pentium 4 Circa 2001

- High performance
  branch predictor
  DeepioSpeculation
  - "Hyper"-pipelining
  - "Hyper"-threading

- Double-pumped ALU
- 126 instructions in flight
- 6-uOps issue/cycle



• 2\*128-entry regfile







Pentium 4

#### 2006



Core 2 Duo



2008

Core i7 (Nehalem)

2014



Core i7 (Haswell)

- Dual-core
- Pentium III-like
  microarchitecture
- 2-level memory hierarchy
- Simple bus
  interconnect

- Six-core
- Similar microarchitecture to Core-2
- 3-level memory hierarchy
- Crossbar interconnect

Eight-core

٠

- Incrementally advanced microarchitecture
- 3-level memory
- Hierarchical interconnect





2001

Pentium 4

#### 2006



Core 2 Duo

### 2008



Core i7 (Nehalem)

2014



### Core i7 (Haswell)

### VLSI Trends:

- Continued transistor scaling
- Constrained Power and Energy budgets

- Six-core
- Similar microarchitecture to Core-2
- 3-level memory hierarchy
- Crossbar interconnect

- Eight-core
- Incrementally advanced microarchitecture
- 3-level memory
- Hierarchical interconnect

۰





### Pentium 4

Core 2 Duo

Core i7 (Nehalem)

Core i7 (Haswell)

### **VLSI** Trends:

- Continued transistor scaling
- Constrained Power and \_ Energy budgets

### **Outcomes:**

- Core counts increasing
  - Core complexity stalled (mostly)
- Memory hierarchy system complexity increasing
- Interconnect between cores increasingly critical



Era of Chip-multiprocessors (CMPs): Complexity moves from the cores up the memory system hierarchy and interconnect.





- Increasing core counts
- Multi-level hierarchies
  - Private lower levels
  - Large, shared last-level
- Multi-threaded apps
  - Data shared via caches
  - Synchronization critical

- All is not well:
  - Caches don't help for first access to a particular location
    - ( A focus of much of my other work)
  - Huge latencies for shared data/synchronization due to coherence
    - Interconnect between cores an increasingly critical design component



## **Network Routing**





#### LECTRICAL & COMPUTER NGINEERING EXAS A&M UNIVERSITY

# **Routing Deadlocks**

 A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible.





# **Routing Deadlocks**

- A Routing Deadlock is a cyclic buffer dependency chain that renders forward progress impossible.
  - Renders the chip non-functional.
- Deadlocks are a fundamental problem in both off-chip and on-chip interconnection networks.
  - Deadlocks are hard to detect during functional verification.
    - Manifest after a long use time.
    - Depend on : traffic pattern, injection rate, congestion.
- Existing approaches to prevent deadlocks require high hardware costs or impose significant performance penalties
  - Need a low cost solution for functional correctness !!

# Focus of this work is to achieve low-cost, high performance deadlock freedom



#### LECTRICAL & COMPUTER NGINEERING EXAS A&M UNIVERSITY

# Outline

- Introduction
- Background: Routing Deadlocks
  - Dally's Theory
  - Duato's Theory
  - Flow Control Routing
  - Deflection Routing
- SPIN : Synchronized Progress in Interconnection Networks
- DRAIN : Deadlock Removal for Arbitrary Irregular Networks
- Conclusion



### **Solution I: Dally's Theory**

• A *strict order* in acquisition of links and/or buffers ensures a cyclic dependency can never created.





### Solution I: Dally's Theory

- A *strict order* in acquisition of links and/or buffers ensures a cyclic dependency can never created.
- Implementations: Turn model [Glass and Ni, ISCA'92] XY routing, Up-Down routing [Schroeder et al, ICPP'91].
- Limitations:
  - Routing Restrictions: Increased Latency, Throughput loss, Energy overhead
  - Require large no. of VCs for fully adaptive routing.



### **Solution II: Duato's Theory**

- Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks.
- *Implementation:* turn restrictions in escape-VC.





### **Solution II: Duato's Theory**

- Adds buffers to create a deadlock free escape path that can be used to avoid/recover from deadlocks.
- Implementation: Requires an extra "escape" VC in each router. Turn restrictions in escape-VC.
- Limitations:

& COMPUTER

- *Energy* and *Area overhead* of escape VCs.
- Additional routing tables/logic for routing within escape-VC.



#### LECTRICAL & COMPUTER NGINEERING EXAS A&M UNIVERSITY

# **Other Solutions**

- Solution III: Flow Control
  - Restrict injection when no. of empty buffers fall below a threshold
  - Implementation: Bubble Flow Control [Carrion et al., HIPC'97]
  - <u>Limitation</u>: Implementation Complexity, Throughput Loss.
- Solution IV: Deflection Routing
  - Assign every flit to some output port even if they get *misrouted*.
  - Implementation: BLESS [Moscibroda and Mutlu, ISCA'09]
     CHIPPER [Fallin et al., HPCA'11]
  - Limitation: Livelocks, non-minimal routing



#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

### **Deadlock Freedom Theories**





# Outline

- Introduction
- Background: Routing Deadlocks
- SPIN : Synchronized Progress in Interconnection Networks
  - Key idea
  - Case study: implementation/microarch
  - FAVORS: Fully Adaptive, One-vc Routing with SPIN
  - Evaluation
- DRAIN : Deadlock Removal for Arbitrary Irregular Networks
- Conclusion



# **SPIN : Key Idea**

Deadlocks are the result of a *lack of coordination* not a lack of resources.



• *Simultaneous Synchronized Movement* of all deadlocked packets in the loop is called a *spin*.



# SPIN : Key Idea

- *Simultaneous Synchronized Movement* of all deadlocked packets in the loop is called a *spin*.
  - Each spin leads to one hop of forward movement for all deadlocked packets.
  - One spin may not resolve the deadlock. If so, spin can be repeated
  - Deadlock is *guaranteed to be resolved* in a *finite* number of spins [proof in paper, Sec.
    III]



ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

# SPIN : Key Idea





ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

# SPIN : Key Idea





## **SPIN: Implementation**

- SPIN is a generic deadlock freedom theory that can have *multiple implementations*.
- We choose a *recovery approach* as *deadlocks* are *rare* scenarios (See Sec. II-F).
- **Our Implementation**:
  - Detect the Deadlock.

& COMPUTER

- Coordinate a time for spin.
- Execute the spin.



### **Example : Detect Deadlocks**

- Use counters.
- Placed at every node at design time.
  - Optimize by exploiting topology symmetry (See Static Bubble).
- If packet does not leave in *threshold time* (configurable), it indicates a *potential deadlock*.
- Counter expired? Send probe to verify deadlock.



#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

## Example : Probe Msg.





## Example : Probe Msg.

- **Probe** is a special message that **tracks** the buffer dependency.
  - Uses "blocked" links, no extra network required
- **Probe returns** to sender:

& COMPUTER

- Cyclic buffer dependence, hence *deadlock*.
- Next, send a *move* msg. to convey the *spin time*
  - Upon receiving move msg., router sets its counter to count to spin cyle.



#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

## Example : Move Msg.





#### ELECTRICAL & COMPUTER ENGINEERING TEXAS A&M UNIVERSITY

# Example : spin





# Example : spin





# **SPIN Optimization**

- Resolving a deadlock may require *multiple spins*
  - After spin, router can resume normal operation.
  - Counter expires again, process repeated.
- <u>Optimization</u>: send <u>probe\_move</u> after spin is complete.
  - probe\_move checks if deadlock still exists and if so, sets the time for the next spin.
- Details in paper (Sec. IV-B).



#### ELECTRICAL & COMPUTER ENGINEERING DEXAS A&M UNIVERSITY

# Microarchitecture

- *No additional links:* Spl. Msgs. use the same links as regular flits.
  - Spl. Msgs. have higher priority in link usage over regular flits.
  - Links are idle during deadlocks (by definition).
- **Bufferless Forwarding:** Spl. Msgs. are not buffered anywhere (either forwarded or dropped).
- **Distributed Design:** any router can initiate the recovery.
- 4% area overhead compared to traditional mesh router in 15nm.



# **FAvORS** Routing

- SPIN is the *first scheme* that enables *true one-VC* fully adaptive deadlock-free routing for *any topology*.
- FAVORS : Fully Adaptive One-vc Routing with SPIN.
  - Algorithm has two flavors:
    - Minimal Adaptive
    - Non-minimal Adaptive
  - Route Selection Metrics:
    - Credit turn-around time
    - Hop Count
  - More details in paper (Sec. V).





# **Evaluation**

## **Network Configuration**

| Simulator       | gem5 simulator + Garnet 2.0 Network model |                                              |  |  |
|-----------------|-------------------------------------------|----------------------------------------------|--|--|
| Topologies      | 8x8 Mesh                                  | 1024 node Off-chip<br>Dragon-fly             |  |  |
| Link<br>Latency | 1-cycle                                   | Inter-group: 3-cycle<br>Intra-group: 1-cycle |  |  |
| Traffic         | Synthetic +<br>Multi-threaded (PARSEC)    | Synthetic                                    |  |  |



**Baselines** 

### 8x8 Mesh

| Design                | Routing<br>Adaptivity | Minimal | Theory       | Deadlock<br>Freedom Type |
|-----------------------|-----------------------|---------|--------------|--------------------------|
| West-first<br>Routing | Partial               | Yes     | Dally        | Avoidance                |
| Escape-VC             | Full                  | Yes     | Duato        | Avoidance                |
| Static-Bubble [6]     | Full                  | Yes     | Flow-Control | Recovery                 |

### 1024 Node Off-chip Dragon-fly

| Design    | Routing<br>Adaptivity | Minimal | Theory | Deadlock Freedom<br>Type |
|-----------|-----------------------|---------|--------|--------------------------|
| UGAL [37] | Full                  | No      | Dally  | Avoidance                |



# Throughput

### 1024-node Off-chip Dragon-fly





Throughput

### 8x8 On-chip Mesh





#### **LECTRICAL & COMPUTER NGINEERING** YEXAS A&M UNIVERSITY

# **Energy Delay**



## 3VC\_Duato vs. 2VC\_SPIN Adaptive

- Runtime effectively equivalent
- 2VC\_SPIN 18% less energy and EDP



- **Deadlocks** are a **fundamental problem** in Interconnection Networks.
  - **SPIN** is a new deadlock freedom theory
  - Simultaneous packet movement for deadlock recovery
  - No routing restrictions or escape-VCs required
  - Enables *true one-VC fully adaptive routing* for any topology
- Salient Features of our Implementation:
  - Scalable: Distributed Deadlock Resolution
  - *Plug-n-Play:* topology agnostic
  - 68% higher (Mesh) & 62% higher (dragon-fly) saturation throughput.
- Can we do better?



# Outline

- Introduction
- Background: Routing Deadlocks
- SPIN : Synchronized Progress in Interconnection Networks
- DRAIN : Deadlock Removal for Arbitrary Irregular Networks
- Conclusion



- SPIN drawbacks
  - Somewhat complicated deadlock detection hardware

DRAIN

- Built around Static Bubble<sub>[Ramrakhyani and Krishna,HPCA'17]</sub> framework
  - Difficult to adapt to wear-out induced network topology changes
- DRAIN : Deadlock Removal for Arbitrary Irregular Networks
  - No need for deadlock detection
  - Adapt to changing network topology



# **DRAIN: Key Idea**

- DRAIN : Deadlock Removal for Arbitrary Irregular Networks
  - No need for deadlock detection
    - Pre-emptively SPIN (*DRAIN*) the whole network
      Deadlocks are rare so spin every >100K cycles
    - Deadlocks will be broken even if packets are mis-routed
  - Adapt to changing network topology
    - Link/router breaks change network topology to be spun
    - Developed algorithm to find *minimum set of DRAIN paths* which cover whole network
    - *Regen DRAIN paths* when link/router breaks
    - Route adaptively w/o worry of network deadlock



# Conclusion

- **Deadlocks** are a **fundamental problem** in Interconnection Networks.
  - SPIN and DRAIN represent a new approach to deadlock freedom
    - Deadlocks not a lack of resources, a lack of coordination
    - *Simultaneous packet movement* for deadlock recovery
  - Low Overheads: No routing restrictions or escape-VCs required
  - High Performance: Equal or better performance with less hardware cost



## Acknowledgments

- SPIN Collaborators:
  - Aniruddh Ramrakhyani Apple/Georgia Tech
  - Tushar Krishna Georgia Tech
- **DRAIN** Collaborators:
  - Mayank Parasar Georgia Tech
  - Joshua San Miguel University of Wisconsin
  - Tushar Krishna Georgia Tech
  - Natalie Enright-Jerger University of Toronto