# Low Voltage Design and Production in 28HPM of Bitcoin Mining ASICs

Authors: Assaf Gilboa (Spondoolies), Amnon Parnass (Verisense), Michael Chen (GUC), Igor Elkanovich (Verisense, presenter)




### Spondoolies, Verisense & GUC

- Spondoolies:
  - Develops power-efficient Bitcoin mining equipment, Kiryat-Gat
- Verisense:
  - ASIC and FPGA development services, Jerusalem
- GUC:
  - IC implementation and manufacturing services, rich IP portfolio, Hsinchu, Taiwan
- Bitcoin ASICs responsibility:
  - Spondoolies: system, SW, ASIC design
  - Verisense: ASIC design, verification, synthesis, STA, netlist handoff
  - GUC: IPs, libraries, backend, package, test, production

**GUC** May 6, 2015

## **Bitcoin Mining Application**

- Architecture:
  - Bitcoin calculation is based on double SHA256
  - Many 128-stage pipelined engines, each generating a result every clock
  - Random data: high toggle rate
- Optimization: system cost/performance
  - Chip cost/performance: mostly silicon area
  - Power/performance: power affects system cost
    - Dynamic power is dominant
  - Performance: GigaHash/sec

- Short lifetime: a new generation every 6 months

**One Pipeline stage** 
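The double SHA-256 calculation that each engine implements in hardware can be sketched in software as follows. This is only an illustration: the all-zero header prefix and the easy target are placeholders, not real block data, and `search_nonce` is a hypothetical helper name.

```python
import hashlib
import struct


def double_sha256(data: bytes) -> bytes:
    """Return SHA-256(SHA-256(data)), the hash used in Bitcoin mining."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def search_nonce(header_prefix: bytes, target: int, max_nonce: int):
    """Try nonces 0..max_nonce-1; return (nonce, digest) for the first hash below target.

    header_prefix stands for the 76 header bytes that precede the 4-byte nonce.
    Returns None if no nonce in the range satisfies the target.
    """
    for nonce in range(max_nonce):
        header = header_prefix + struct.pack("<I", nonce)  # little-endian nonce
        digest = double_sha256(header)
        # The digest is interpreted as a little-endian integer for the comparison.
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None


if __name__ == "__main__":
    prefix = bytes(76)       # placeholder, not a real block header
    easy_target = 1 << 248   # illustrative: roughly 1 in 256 hashes passes
    print(search_nonce(prefix, easy_target, 100_000))
```

A hardware engine unrolls the two hashes (64 rounds each) into the 128 pipeline stages, so a new nonce enters every clock instead of being processed in a loop.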

## 2<sup>nd</sup> gen Mining Chip: RockerBox

- Process: TSMC 28HPM
- 246 Mgates, no SRAMs
- Power: 80W (typical, 0.63V)
- Voltage range: 0.55V...0.8V
- Die: 116 mm<sup>2</sup>
- Package: FCBGA 19mmx19mm
- High volume production since July/2014
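As a rough worked example, the per-chip hash rate follows from the 193 engines listed for this die and the one-result-per-clock pipeline. The clock frequency below is a hypothetical value chosen for illustration; the slide does not state it.

```python
ENGINES = 193         # double SHA-256 engines per die (from the slide)
CHIP_POWER_W = 80.0   # typical power at 0.63 V (from the slide)


def chip_hash_rate_ghs(freq_mhz: float, working_engines: int = ENGINES) -> float:
    """Hash rate in GH/s, assuming one hash per engine per clock."""
    return working_engines * freq_mhz * 1e6 / 1e9


def energy_per_ghash(power_w: float, hash_rate_ghs: float) -> float:
    """Energy efficiency in J/GH (equivalently W per GH/s)."""
    return power_w / hash_rate_ghs


if __name__ == "__main__":
    freq_mhz = 700.0  # hypothetical clock frequency, not given on the slide
    rate = chip_hash_rate_ghs(freq_mhz)
    print(f"{rate:.1f} GH/s, {energy_per_ghash(CHIP_POWER_W, rate):.3f} J/GH")
```

The `working_engines` parameter allows the same arithmetic for a die on which some engines have been disabled.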

Die floorplan (figure): I/Os, PLL, Temperature Sensor, management logic; 193 Double SHA-256 engines; no I/Os on sides; ESDs are spread through the die.

### Key Development

- Optimization of a whole system of 30 chips
- Cost efficiency:
  - Logic redundancy for high yield
  - Proprietary logic BIST instead of Scan
  - Process shift for higher performance
- Power efficiency:
  - Operating voltage 30% below 28HPM nominal
  - Triple-loop Dynamic Voltage Frequency Scaling (DVFS)
  - Accurate dynamic power analysis and toggle rate spreading

### Logic Redundancy for High Yield

- System is tolerant to faulty SHA-256 engines
- Proprietary logic BISTs to identify faulty engines
  - The BIST uses SHA-256 pipeline itself
  - LFSR-based BIST for a strict test
  - Vector-based BIST for statistical system test
- Scan wasn't inserted to reduce area/power overhead
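The idea of an LFSR-based BIST can be sketched in software: a linear-feedback shift register produces pseudo-random patterns, the engine under test processes them, and the responses are checked against a reference. The slides do not describe the actual implementation, so everything below is an assumption: the 16-bit LFSR polynomial, the toy `good_engine` and `faulty_engine` functions, and the direct response comparison (real hardware would more likely compact responses into a signature).

```python
def lfsr16(seed: int = 0xACE1):
    """Yield successive states of a 16-bit maximal-length Fibonacci LFSR.

    Feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1.
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("LFSR seed must be non-zero")
    while True:
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state


def good_engine(x: int) -> int:
    """Hypothetical stand-in for a fault-free engine."""
    return (x * 0x9E37 + 0x1234) & 0xFFFF


def faulty_engine(x: int) -> int:
    """Same logic with output bit 3 stuck at 0."""
    return good_engine(x) & ~(1 << 3)


def run_bist(engine, reference, n_vectors: int = 1000, seed: int = 0xACE1) -> bool:
    """Return True if the engine matches the reference on every LFSR pattern."""
    patterns = lfsr16(seed)
    for _ in range(n_vectors):
        x = next(patterns)
        if engine(x) != reference(x):
            return False  # engine is marked faulty and can be disabled
    return True


if __name__ == "__main__":
    print("good engine passes:", run_bist(good_engine, good_engine))
    print("faulty engine passes:", run_bist(faulty_engine, good_engine))
```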

- Fault coverage tool was developed by Ilia Greenblat to check BIST coverage
- Final product yield: 99%
- Natural yield is about 90% (Die: 116 mm<sup>2</sup>)
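The way redundancy lifts yield can be illustrated with a simple Poisson defect model. This is a sketch under stated assumptions, not the authors' yield model: defects are assumed to be Poisson-distributed, each defect in the engine area is assumed to disable exactly one engine, and both the engine area fraction and the number of tolerated faulty engines are hypothetical parameters.

```python
import math


def yield_with_redundancy(natural_yield: float,
                          engine_area_fraction: float,
                          max_faulty_engines: int) -> float:
    """Estimate die yield when up to max_faulty_engines engines may be disabled.

    natural_yield: probability of a defect-free die (Poisson zero-defect term).
    engine_area_fraction: hypothetical share of die area occupied by engines.
    """
    lam = -math.log(natural_yield)             # mean defects per die
    lam_engines = lam * engine_area_fraction   # defects landing in engines
    lam_other = lam - lam_engines              # defects landing elsewhere
    p_other_clean = math.exp(-lam_other)       # non-redundant logic must be clean
    p_engines_ok = sum(
        math.exp(-lam_engines) * lam_engines ** j / math.factorial(j)
        for j in range(max_faulty_engines + 1)
    )
    return p_other_clean * p_engines_ok


if __name__ == "__main__":
    for tolerated in range(3):
        y = yield_with_redundancy(0.90, 0.95, tolerated)
        print(f"tolerating {tolerated} faulty engine(s): {y:.4f}")
```

With no tolerated faults the function returns the natural yield, which is a quick sanity check of the model.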

### Dynamic Voltage Frequency Scaling (DVFS)

- Voltage regulator (DC2DC) per ASIC
- Slow and Fast corners are compensated by voltage adjustment



### **DVFS Voltage Target Definition**

- Trends:
  - Frequency vs. Voltage → linear
  - Power vs. Voltage → V<sup>2</sup>
  - Power/frequency vs. Voltage → linear
  - Conclusion: use the lowest possible voltage
- Linearity range low limit:
  - At around Vtl N + Vtl P (NMOS plus PMOS threshold voltages)
- Selected DVFS target at TT/125C: 0.63V
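The three trends are mutually consistent, which a short derivation makes explicit. As a sketch, assume fitted forms with constants $a$, $b$ and $V_0$; these symbols are illustrative and do not appear on the slide:

$$
f(V) = a\,(V - V_0), \qquad \frac{P(V)}{f(V)} = b\,V
\quad\Longrightarrow\quad
P(V) = a\,b\,V\,(V - V_0)
$$

Power is then second order in $V$, matching the V<sup>2</sup> trend, while the energy per hash $P/f$ falls as $V$ is lowered. Within the range where these fits hold, the most efficient operating point is therefore the lowest usable voltage.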





### Triple Loop DVFS

- DVFS loops
  - Frequency loop per chip: searching for max frequency
  - Temperature loop per chip: voltage is reduced at 125°C
  - Total system power loop: increase/decrease chip voltages to meet the total system power budget
- DVFS performance
  - Speed sensor correlation vs. critical path is key
    - Full correlation is achieved by using logic BIST (pipeline itself)
  - Voltage granularity: 1 mV, frequency granularity: 10 MHz

Hysteresis at every action point
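One control step of the three loops might look like the sketch below. The step sizes, the 125°C action point and the 550-800 mV range come from the slides; the control policy itself, the hysteresis band widths, and all names (`Chip`, `dvfs_step`, and so on) are assumptions made for illustration.

```python
from dataclasses import dataclass

V_STEP_MV = 1                  # voltage granularity (from the slide)
F_STEP_MHZ = 10                # frequency granularity (from the slide)
TEMP_LIMIT_C = 125             # temperature action point (from the slide)
V_MIN_MV, V_MAX_MV = 550, 800  # chip voltage range (from the RockerBox slide)
TEMP_HYST_C = 5                # hypothetical hysteresis band
POWER_HYST_W = 20.0            # hypothetical hysteresis band


@dataclass
class Chip:
    voltage_mv: int
    freq_mhz: int
    temp_c: float
    power_w: float
    bist_pass: bool            # logic BIST result at the current V/f point
    throttled: bool = False


def clamp_voltage(mv: int) -> int:
    return max(V_MIN_MV, min(V_MAX_MV, mv))


def frequency_loop(chip: Chip) -> None:
    """Search for max frequency: step up while BIST passes, back off on failure."""
    chip.freq_mhz += F_STEP_MHZ if chip.bist_pass else -F_STEP_MHZ


def temperature_loop(chip: Chip) -> None:
    """Lower voltage at the limit; release only after cooling by the hysteresis band."""
    if chip.temp_c >= TEMP_LIMIT_C:
        chip.throttled = True
    elif chip.temp_c <= TEMP_LIMIT_C - TEMP_HYST_C:
        chip.throttled = False
    if chip.throttled:
        chip.voltage_mv = clamp_voltage(chip.voltage_mv - V_STEP_MV)


def power_loop(chips, budget_w: float) -> None:
    """Nudge all chip voltages so that total power tracks the system budget."""
    total = sum(c.power_w for c in chips)
    if total > budget_w:
        step = -V_STEP_MV
    elif total < budget_w - POWER_HYST_W:
        step = V_STEP_MV
    else:
        return  # inside the dead band: no action
    for c in chips:
        if step > 0 and c.throttled:
            continue  # do not raise the voltage of a thermally throttled chip
        c.voltage_mv = clamp_voltage(c.voltage_mv + step)


def dvfs_step(chips, budget_w: float) -> None:
    for c in chips:
        frequency_loop(c)
        temperature_loop(c)
    power_loop(chips, budget_w)
```

In a real system the temperature, power and BIST fields would be refreshed from measurements between steps; here they are plain inputs.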

### DVFS Operation in System

- Achieved robust and stable DVFS system operation
- Every chip and its DC2DC report: voltage, frequency, power, temperature

Figure: system monitoring log with one line per chip and its DC2DC, for example `22: DC2DC/-1/:[vlt1:695 vlt2:698(DC1:794 T1:794 U1:727) 18W 27A 34c] ASIC:` (ASIC fields truncated).
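The DC2DC part of such a status line can be parsed with a regular expression. This is a sketch only: the field meanings (`vlt1`/`vlt2` in millivolts, `W`/`A`/`c` as watts, amperes and degrees Celsius) are guesses based on the quantities the slide says are reported, and the power field is treated as optional because it is absent from some lines of the extracted log.

```python
import re

# Matches e.g. "22: DC2DC/-1/:[vlt1:695 vlt2:698(DC1:794 T1:794 U1:727) 18W 27A 34c]"
STATUS_RE = re.compile(
    r"(?P<index>\d+):\s*DC2DC/[^:]*:\s*"
    r"\[vlt1:(?P<vlt1>\d+)\s+vlt2:(?P<vlt2>\d+)\([^)]*\)\s*"
    r"(?:(?P<power>\d+)W\s+)?(?P<current>\d+)A\s+(?P<temp>\d+)c\]"
)


def parse_status(line: str):
    """Return the numeric DC2DC fields of one status line, or None if it does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {
        name: (int(value) if value is not None else None)
        for name, value in match.groupdict().items()
    }


if __name__ == "__main__":
    sample = "22: DC2DC/-1/:[vlt1:695 vlt2:698(DC1:794 T1:794 U1:727) 18W 27A 34c] ASIC:"
    print(parse_status(sample))
```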

### Correlation: production test vs. system test




### Library Selection for Low Voltage


- 7T Libraries were selected
  - 20% area/power reduction
  - Negligible performance impact
- Dynamic vs. leakage vs. performance trade-off:
  - SVT, 35 nm: 85% (Synthesis)
  - LVT, 40 nm: 14% (Synthesis)
  - LVT, 35 nm: 1% (Timing closure)
- Only 18% area growth from pre-layout to post-layout
  - Covers clock tree, hold, setup and transition fixes






### **Timing Closure**

- P&R optimization corner: TT, 0.63V, 125C, Cmax
- Setup corners: SS, 0.72V, 0C/125C, Cmax/RCmax
  - 5 corners
- Hold time corners: full matrix 0.63V-0.88V
  - 13 corners
- OCV and uncertainty: defined for every corner by Monte-Carlo spice simulations
- All used libraries were re-characterized for all defined corners
- Production tests were defined according to timing closure corners





### Low Voltage Methodology

- Cells with 3-4 transistors in series were excluded from the libraries
- Max crosstalk glitch and max transition parameters were tightened
- Extracted LO SPICE simulations:
  - All clock trees to check transitions
  - Critical path to check correlation vs. STA
  - Libraries' FFs were simulated to check metastability convergence
- Separate 0.9V power domain for PLL, TS and I/Os



### **Dynamic Power Analysis**

- Accurate dynamic power estimation flow was developed
  - Within 10% of post-silicon measurements
- For power analysis accuracy:
  - Representative activity from simulation
  - GL simulation resolution = gate delay (20-30 ps)
  - State dependent SAIF
- Allows accurate comparison of arithmetic architectures
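The core of an activity-based estimate is the switching energy per toggle, 0.5·C·V², summed over the toggle counts recorded during simulation. The sketch below shows only that first-order term; the net capacitances and toggle counts are made-up inputs, and the state-dependent internal power that a real SAIF-driven flow accounts for is left out.

```python
def switching_power_w(nets, vdd: float, duration_s: float) -> float:
    """First-order switching power from per-net toggle counts.

    nets: iterable of (capacitance_farads, toggle_count) pairs.
    Each toggle charges or discharges the net and costs 0.5 * C * Vdd^2.
    """
    energy_j = sum(0.5 * cap * vdd ** 2 * toggles for cap, toggles in nets)
    return energy_j / duration_s


if __name__ == "__main__":
    # Hypothetical activity: (capacitance in F, toggles during a 1 us window)
    nets = [(1.0e-15, 1000), (2.5e-15, 400), (0.8e-15, 2500)]
    p_low = switching_power_w(nets, vdd=0.63, duration_s=1e-6)
    p_nom = switching_power_w(nets, vdd=0.90, duration_s=1e-6)
    print(f"0.63 V: {p_low:.3e} W, 0.90 V: {p_nom:.3e} W, ratio {p_low / p_nom:.2f}")
```

At equal activity, running 30% below a 0.9 V nominal supply leaves about 49% of the switching energy per toggle, which is why dynamic power dominates the voltage choice.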



### **Current Peak Challenge**

- Original toggle rate: FFs 50%, arithmetic logic 200%-300%
  - Random data flows through the arithmetic pipeline
- New architecture:
  - Reduced FF toggle rate to 34% (Spondoolies patent)
  - Divided toggling into 4 clock phases
- Master-slave DLL was developed to spread clock edges
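The effect of spreading clock edges over 4 phases can be illustrated with a toy model: each group of logic draws a short current pulse after its clock edge, and the pulses either coincide or are offset by a quarter period. The triangular pulse shape, its width and the arbitrary units are assumptions chosen only to show the principle.

```python
def current_waveform(phase_offsets, period=40, pulse_width=8, cycles=3):
    """Sum triangular current pulses, one per group per clock cycle.

    phase_offsets: start sample of each group's pulse within the clock period.
    Returns the total current per sample in arbitrary units.
    """
    n = period * cycles
    total = [0.0] * n
    half = pulse_width / 2
    for offset in phase_offsets:
        for cycle in range(cycles):
            start = cycle * period + offset
            for k in range(pulse_width + 1):
                height = 1.0 - abs(k - half) / half  # triangle peaking at 1.0
                total[(start + k) % n] += height
    return total


if __name__ == "__main__":
    aligned = current_waveform([0, 0, 0, 0])    # all groups on one clock edge
    spread = current_waveform([0, 10, 20, 30])  # four phases, a quarter period apart
    print("peak aligned:", max(aligned), "peak spread:", max(spread))
    print("charge aligned:", sum(aligned), "charge spread:", sum(spread))
```

The total charge per cycle is unchanged; only the peak is lower, which is what relaxes the demands on the power grid and the regulator.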



### **Spreading Current Peaks**

Current peaks were reduced to an acceptable level



### Metal Stack for Low IRdrop

- At low voltage and high supply current (130A), low IRdrop is critical
- Traditional power grid metal stack:
  - X direction: Z layer (8.5 KÅ copper)
  - Y direction: AP layer (14 KÅ aluminum)
- We added a U layer (4x lower resistance than Z layer):
  - X direction: U layer (35 KÅ copper)
  - Y direction: Z layer (8.5 KÅ copper) + UT-AP layer (28 KÅ Al)

- TSMC provided tech files for 5x1z1u1UT-AP stack
- Disadvantage: U layer metal density is limited to 50%
  - Z and AP layer densities are up to 70%
- Achieved static IRdrop 2%, dynamic IRdrop 5%
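The numbers on this slide can be put in perspective with a little arithmetic. The sketch assumes the IRdrop percentages are relative to the 0.63 V DVFS target, and uses the first-order rule that sheet resistance of the same metal scales inversely with thickness; both are assumptions rather than statements from the slide.

```python
VDD_V = 0.63              # DVFS target voltage (from the slides)
SUPPLY_CURRENT_A = 130.0  # supply current (from the slide)


def ir_drop_volts(fraction: float, vdd: float = VDD_V) -> float:
    """Absolute IR drop for a given fraction of the supply voltage."""
    return fraction * vdd


def effective_resistance_ohm(drop_v: float, current_a: float = SUPPLY_CURRENT_A) -> float:
    """Lumped grid resistance that would produce drop_v at current_a."""
    return drop_v / current_a


def thickness_ratio(thick_ka: float, thin_ka: float) -> float:
    """Ratio of layer thicknesses, approximating the sheet resistance ratio."""
    return thick_ka / thin_ka


if __name__ == "__main__":
    static_drop = ir_drop_volts(0.02)
    dynamic_drop = ir_drop_volts(0.05)
    print(f"static drop: {static_drop * 1e3:.1f} mV, dynamic drop: {dynamic_drop * 1e3:.1f} mV")
    print(f"lumped grid resistance: {effective_resistance_ohm(static_drop) * 1e6:.0f} micro-ohm")
    print(f"U vs. Z copper thickness: {thickness_ratio(35.0, 8.5):.1f}x")
```

A budget of a few millivolts at 130A corresponds to a lumped resistance on the order of 0.1 mΩ, which is why the thick U layer matters.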

### **Process Shift**

- 28HPM was shifted by 2 sigma toward the fast corner
  - 20% performance increase
- 98% yield due to redundancy and hold time margins
- More than 300K units were produced in 6 months



### Summary

- Optimization for entire multi-chip system
- For cost and power efficiency:
  - Redundancy, logic BIST, triple-loop DVFS, process shift
- 28HPM process was used at low voltage and wide DVFS range
  - Methodology was proven in high volume production

