

UCLA

Samueli

School of Engineering

## Signaling for wafer-scale systems

Subramanian S. Iyer (s.s.iyer@ucla.edu)



Center for Heterogeneous Integration and Performance Scaling chips.ucla.edu

Discussion with ODSA group on November 20, 2020



# UCLA CHIPS

A UCLA-led partnership to develop the applications, enablement, and core technologies and the ecosystem required for continuing Moore's Law at the package and system integration levels, and to develop our students and scholars to lead this effort.

Simplify hardware development through novel architectures, integration methods, technologies, and devices.







## What we do @UCLA CHIPS



### Silicon and Package scaling



## Why is heterogeneity assuming sudden importance?

 Packaging has always been about assembling heterogeneous dies/chips onto a Printed Circuit Board



**High-Performance Disk** 

• The problem with PCBs has to do with Latency and Bandwidth between the chips as well as energy per bit transferred



### Packaging and AI - A one-page illustrative primer



Neural networks are central to AI. Accuracy requires these networks to be extremely deep (many hidden layers); e.g., Residual Net (ResNet) has ~1000+ layers. The width of these hidden layers can also be quite large.

Vector multiplications are a key operation in neural networks, and the vector multiply and accumulate (MAC) function is central. The bit precision of the inputs, weights and outputs can exceed 16 bits, leading to unprecedented computational complexity.

Even today's very powerful processors need to time multiplex, **constantly** moving the inputs, weights and outputs of each layer between the processor and memories. So the memory bottleneck is quite severe.

This is where packaging comes in: bandwidth (BW) and energy per bit transferred.
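To make the data-movement argument concrete, here is a minimal sketch that counts MAC operations and operand bits for one fully connected layer. The layer sizes, the link energies, the function names, and the simplification that every input, weight and output crosses the processor-memory link exactly once are all illustrative assumptions, not figures from the talk.

```python
def dense_layer_traffic(n_in: int, n_out: int, bits: int = 16):
    """Return (MAC count, bits moved) for one fully connected layer.

    Assumes each input, weight and output crosses the link exactly once.
    """
    macs = n_in * n_out                          # one MAC per weight
    values_moved = n_in + n_in * n_out + n_out   # inputs + weights + outputs
    return macs, values_moved * bits


def transfer_energy_joules(bits_moved: float, pj_per_bit: float) -> float:
    """Energy spent moving bits_moved bits at pj_per_bit picojoules per bit."""
    return bits_moved * pj_per_bit * 1e-12


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 layer at 16-bit precision.
    macs, bits_moved = dense_layer_traffic(4096, 4096)
    print(f"MACs: {macs}, bits moved: {bits_moved}")
    for pj in (0.1, 1.0, 10.0):  # hypothetical link energies in pJ/b
        print(f"{pj} pJ/b -> {transfer_energy_joules(bits_moved, pj):.3e} J")
```

The weight term dominates the traffic, which is why the energy per bit of the link between processor and memory matters so much for deep, wide networks.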



### Some observations

- If Moore's law has enabled miniaturization, why have chips gotten larger ?
  - More complex problems
  - More cores @ higher clock speeds
  - More cache memory
- Main memory capacity and access limits performance
- Power density challenges: more "dark" silicon
- I/Os take up more space and power as system size increases (>30%)





NVIDIA A100: 54 billion transistors, 826 mm<sup>2</sup> (2020), TSMC 7 nm node

~17,000× more transistors

Intel Pentium CPU: 3.1 million transistors, ~300 mm<sup>2</sup> (1993), 0.8 μm technology
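The comparison above can be checked with a few lines of arithmetic. This is a small sketch using only the numbers quoted on the slide; the Pentium area is approximate (~300 mm<sup>2</sup>), so the area and density ratios are rough, and the variable names are illustrative.

```python
# Figures quoted on the slide (Pentium area is approximate).
a100 = {"transistors": 54e9, "area_mm2": 826.0}
pentium = {"transistors": 3.1e6, "area_mm2": 300.0}

# How many times more transistors the A100 has.
count_ratio = a100["transistors"] / pentium["transistors"]

# How many times larger the A100 die is.
area_ratio = a100["area_mm2"] / pentium["area_mm2"]

# Transistors per mm^2 improved by count_ratio / area_ratio.
density_ratio = count_ratio / area_ratio

if __name__ == "__main__":
    print(f"transistor count ratio: {count_ratio:,.0f}x")
    print(f"die area ratio:         {area_ratio:.2f}x")
    print(f"density ratio:          {density_ratio:,.0f}x")
```

The transistor count grew by roughly four orders of magnitude while the die also grew, which is the point of the question about why chips have become larger despite miniaturization.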



#### Can this be done practically?



Some more observations:

- Interposers are getting bigger
- 3D stacks are getting taller
- Interposers are an additional level in the packaging hierarchy

Going to a silicon-like board with fine-pitch interconnect and short die-to-die spacing will allow us to build massive systems.

But many issues need to be addressed.



# The "Right" Rigid Interconnect Fabric

#### **Requirements:**

- Mechanically robust (flat, stiff, tough...)
- Processability: fine pitch wiring, & interconnects
- Thermally conductive
- Can have passive (and active) built-in components



Silicon





#### Hybrid approaches (EMIB by Intel)

| Material | Young's modulus (GPa) | Tensile strength (MPa) | CTE (ppm/K) | Thermal conductivity (W/m-K) |
|----------|-----------------------|------------------------|-------------|------------------------------|
| Organic  | 0.1 to 20             | 2000-3000              | 14-70       | 0.3-1                        |
| Glass    | 50-90                 | 33-3500                | 4-9         | 1-2                          |
| Silicon  | 130-185               | 5000-9000              | 3-5         | 148                          |
| Steel    | 190-200               | 400-500                | 11-13       | 16-25                        |
| Copper   |                       |                        |             | 400                          |



#### Going to a silicon wafer scale is not new - there is a



## Important Questions

- What is the optimal pitch at which dies should be interconnected?
- What is the optimal dielet size?
- How close should we assemble dies?
- What level of heterogeneity should we aim for?

#### Hint: how do we make a SOW look like a ginormous SOC?





# What is the optimal I/O pitch?

| Chip                       | Area (mm²) | Transistor count (×10<sup>9</sup>) | Technology node (nm) |
|----------------------------|------------|------------------------------------|----------------------|
| IBM POWER9 [26]            | 695        | 8                                  | 14                   |
| AMD Zen [27]               | 44         | 1.4                                | 14                   |
| IBM POWER8 [28]            | 649        | 4.2                                | 22                   |
| Intel Xeon Haswell E5 [29] | 663        | 5.56                               | 22                   |
| IBM POWER7 + 80 MB [30]    | 567        | 2.1                                | 32                   |
| Intel Itanium Poulson [31] | 544        | 3.1                                | 32                   |
| IBM POWER7 + 32 MB [32]    | 567        | 1.9                                | 45                   |
| Intel Xeon 7400 [33]       | 503        | 1.9                                | 45                   |

$$P^{x} = \frac{4\sqrt{A}}{t \cdot \left(\frac{A}{A_{T}^{x}}\right)^{p}} = \frac{4\,(A_{T}^{x})^{p}}{t \cdot A^{p-\frac{1}{2}}}$$

Figure: average I/O pitch (μm) versus technology node (14 nm, 22 nm, 32 nm, 45 nm). Actual pitches are ~100s of μm.
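The slide does not define the symbols in the pitch expression. One reading, which is an assumption here, is a Rent's-rule style estimate: a square die of area A has perimeter 4√A, A/A_T is the number of transistors (A_T being the area per transistor at node x), and t·(A/A_T)^p is the number of I/O terminals that must fit on that perimeter. The sketch below evaluates both forms of the expression; the values of `t` and `p` are hypothetical placeholders, not values from the talk.

```python
import math


def io_pitch_um(area_mm2: float, transistor_count: float, t: float, p: float) -> float:
    """First form: die perimeter divided by the terminal count, in micrometres."""
    perimeter_um = 4.0 * math.sqrt(area_mm2) * 1000.0  # 1 mm = 1000 um
    terminals = t * transistor_count ** p              # A / A_T = transistor count
    return perimeter_um / terminals


def io_pitch_um_closed_form(area_mm2: float, transistor_count: float, t: float, p: float) -> float:
    """Second form: 4 * A_T^p / (t * A^(p - 1/2)), converted from mm to micrometres."""
    area_per_transistor = area_mm2 / transistor_count  # A_T in mm^2
    pitch_mm = 4.0 * area_per_transistor ** p / (t * area_mm2 ** (p - 0.5))
    return pitch_mm * 1000.0


if __name__ == "__main__":
    # Area and transistor count from the table; t and p are hypothetical.
    chips = {"IBM POWER9": (695.0, 8e9), "Intel Xeon 7400": (503.0, 1.9e9)}
    for name, (area, count) in chips.items():
        print(name, io_pitch_um(area, count, t=1.0, p=0.45))
```

Both functions return the same value for any input, which is a quick check that the two sides of the equation are algebraically identical.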






# Practical limits in heterogeneous integration

- Fine pitch?
  - Like "fat wires" on a silicon wafer: 2-10 μm bump pitch (BGA pitch is >500 μm)
  - Trace pitch <1 μm (compared to ~30 μm on PCB)
- Precision alignment?
  - Similar to fat-wire alignment: <0.2 μm (bump alignment accuracy is several μm)
- Close spacing
  - As close as possible: <20 μm (dies on a PCB are spaced at least a few 10's of mm apart)
- Block sizes on an SoC are typically a few ~100 μm on a side
  - So dielets should be small (1 to 100 mm<sup>2</sup> in area)
- Heterogeneity:
  - Multiple nodes: use the node that is optimal from a performance, area and cost perspective
  - Multiple technologies: logic, DRAM, sensors, etc.
  - Multiple materials: Si, III-Vs, ...
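The pitch numbers in this list translate directly into connection density. A minimal sketch, assuming connections are placed on a full square grid (an idealization that real layouts will not fully reach); the function names are illustrative.

```python
def edge_connections_per_mm(pitch_um: float) -> float:
    """Connections that fit along 1 mm of die edge at the given pitch."""
    return 1000.0 / pitch_um


def area_array_density_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for a full square grid at the given pitch."""
    return edge_connections_per_mm(pitch_um) ** 2


if __name__ == "__main__":
    # Pitches quoted above: 2-10 um fine pitch versus >500 um BGA.
    for label, pitch in (("fine pitch, 2 um", 2.0),
                         ("fine pitch, 10 um", 10.0),
                         ("BGA, 500 um", 500.0)):
        print(f"{label}: {area_array_density_per_mm2(pitch):,.0f} per mm^2")
```

Because density scales with the inverse square of pitch, moving from a 500 μm BGA pitch to a 10 μm pitch raises the idealized connection density by a factor of 2,500.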



## A versatile Fine pitch wafer-scale assembly (Si IF)



Direct Cu-Cu Thermal Compression Bonding using formic acid vapor

X-ray tomograph of die-to-wafer Cu-Cu connections at 10 µm pitch







#### 55 μm inter-die spacing



#### DARPA Established CHIPS metrics using SuperCHIPS macros

Figure: SuperCHIPS Macro 1 and SuperCHIPS Macro 2 connected through the Si-IF, with active and passive chains, a programmable ring oscillator clock, PRNGs, and pads.

- Continuity check
- Latency characterization
  - Reference and Si-IF ring oscillator: 3-4 GHz
  - On-chip frequency divider (2<sup>12</sup>) and cycle counter
- High-speed data transfer and bit error rate (BER)
  - Programmable ring oscillator clock: 0.5-3 GHz
  - Pseudo Random Number Generator (PRNG)
  - On-chip comparator and error counter
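The BER test structure (PRNG, comparator, error counter) can be modelled in software. This is a minimal sketch, assuming a PRBS7 linear-feedback shift register (polynomial x^7 + x^6 + 1) as the PRNG; the slides do not say which generator the macros use, and the `flip_positions` argument is a hypothetical stand-in for link errors.

```python
def prbs7(seed: int = 0x7F):
    """Yield bits of a PRBS7 sequence (x^7 + x^6 + 1), period 127."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1  # feedback taps
        state = ((state << 1) | new_bit) & 0x7F      # shift in the new bit
        yield new_bit


def count_bit_errors(n_bits: int, flip_positions=frozenset(), seed: int = 0x7F) -> int:
    """Send n_bits over a modelled link and count mismatches at the receiver.

    The receiver regenerates the same sequence from the same seed and
    compares bit by bit, as an on-chip comparator and error counter would.
    """
    transmitter = prbs7(seed)
    reference = prbs7(seed)
    errors = 0
    for i in range(n_bits):
        sent = next(transmitter)
        received = sent ^ 1 if i in flip_positions else sent  # modelled link
        if received != next(reference):
            errors += 1
    return errors


if __name__ == "__main__":
    print(count_bit_errors(10_000))               # clean link
    print(count_bit_errors(10_000, {3, 500}))     # two injected errors
```

Dividing the error count by the number of bits sent gives the measured BER; with zero observed errors only an upper bound can be stated.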





#### Results: SuperCHIPS macros (GF 22FDX, TSMC 16FF)

**ODSA 2020** 

- Successfully passed continuity tests of both passive and active daisy chains
- Measured latency verified with on-chip counter
  - Latency comparable to on-chip buffer delays
  - Overall latency is <30 ps
- Demonstrated data transfer up to 3 Gbps
  - Bandwidth: 1200 Gbps/mm for 2-layer Si-IF
  - No errors were observed even after 43 hrs
  - BER: <10<sup>-14</sup> with 99% confidence (estimate: <10<sup>-25</sup>)
- Measured energy/bit: 0.028 pJ/b
- No electrostatic discharge (ESD) protection used
  - For ESD protection of 50 fF: latency and energy increase by
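The BER figure quoted above is consistent with the standard zero-error confidence bound, BER < -ln(1 - CL) / N, where N is the number of bits tested without error. A minimal sketch, assuming the link ran continuously at 3 Gbps for the full 43 hours (the slides do not state the exact test conditions):

```python
import math


def ber_upper_bound(bits_tested: float, confidence: float) -> float:
    """Upper bound on BER when zero errors are seen in bits_tested bits."""
    return -math.log(1.0 - confidence) / bits_tested


if __name__ == "__main__":
    data_rate_bps = 3e9          # 3 Gbps, from the results above
    duration_s = 43 * 3600       # 43 hours
    bits = data_rate_bps * duration_s
    print(f"bits tested: {bits:.3e}")
    print(f"BER bound at 99% confidence: {ber_upper_bound(bits, 0.99):.2e}")
```

Under these assumptions about 4.6 × 10<sup>14</sup> bits were transferred, which puts the 99% bound just under 10<sup>-14</sup>.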





| Die       | Frequency [kHz] | Frequency [GHz] | Latency of Si-IF links [ps] |
|-----------|-----------------|-----------------|-----------------------------|
| TSMC 16FF | 921.1           | 3.77            | NA                          |
| TSMC 16FF | 836.8           | 3.43            | 6.67                        |
| TSMC 16FF | 762.3           | 3.12            | 13.80                       |
| GF 22FDX  | 1033.9          | 4.23            | NA                          |
| GF 22FDX  | 877.6           | 3.59            | 10.51                       |
| GF 22FDX  | 760.3           | 3.11            | 21.26                       |

Figure: measured waveforms for the TSMC 16FF die assembly (ring oscillator, 500 µm).
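The frequency and latency values in the table can be related with a short calculation. The kHz values multiplied by the 2<sup>12</sup> on-chip divider give the GHz ring-oscillator frequencies. For latency, dividing the increase in oscillation period relative to the reference (NA) row by 4 reproduces the tabulated values to within about 0.02 ps; that factor of 4 is inferred from the numbers and is an assumption, not something stated in the slides.

```python
DIVIDER = 2 ** 12  # on-chip frequency divider quoted in the test description


def ring_frequency_ghz(measured_khz: float) -> float:
    """Ring-oscillator frequency recovered from the divided-down output."""
    return measured_khz * 1e3 * DIVIDER / 1e9


def period_ps(measured_khz: float) -> float:
    """Ring-oscillator period in picoseconds."""
    return 1e12 / (measured_khz * 1e3 * DIVIDER)


def link_latency_ps(reference_khz: float, link_khz: float, crossings: int = 4) -> float:
    """Latency per link crossing; crossings=4 is an inferred assumption."""
    return (period_ps(link_khz) - period_ps(reference_khz)) / crossings


if __name__ == "__main__":
    rows = {"TSMC 16FF": (921.1, [836.8, 762.3]),
            "GF 22FDX": (1033.9, [877.6, 760.3])}
    for die, (ref, links) in rows.items():
        print(die, f"reference {ring_frequency_ghz(ref):.2f} GHz")
        for khz in links:
            print(f"  {ring_frequency_ghz(khz):.2f} GHz, "
                  f"latency {link_latency_ps(ref, khz):.2f} ps")
```

This ties the measurement method (divider plus cycle counter) to the reported sub-30 ps link latencies.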

### SuperChips - a versatile communication protocol



Micrograph of the fabricated SuperCHIPS interface



Schematic of the SuperCHIPS I/O

Figure labels: Die 1, Die 2, Die 3, Die 4, macros; 8 mm dimension.

| Technology / Interface protocol | Si-IF / SuperCHIPS (Async) | Si-IF / SuperCHIPS (Sync) |
|---------------------------------|----------------------------|---------------------------|
| Interconnect pitch              | 10 µm                      | 10 µm                     |
| Overall latency (ps)            | 30                         | 1 clock cycle             |
| Data rate/link (Gbps)           | 10                         | 4                         |
| Energy/bit (pJ/b)               | <0.03                      | <0.15                     |
| Maximum bandwidth/mm (Gbps/mm)  | 8000<sup>a</sup>           | 2560<sup>a,b</sup>        |

8,111 inter-die I/O connections; 22,291 power connections

Longer-range connections can be made by:

- Daisy chaining through intervening dies, using porosity rules and multiple buffer stages (for a few dies over)
- Using pico-SerDes for longer (~cm) lengths
- Using "utility dies", which may also provide redundant routing options to manage assembly defects





#### Technology Comparison using s-FOM<sub>k</sub>

| Tech / Interface protocol | Si-IF / SuperCHIPS (Async) | Si-IF / SuperCHIPS (Sync) | Interposer / AIB          | PCB / SerDes (neighbor) | PCB / SerDes (long reach) | Improvement |
|---------------------------|----------------------------|---------------------------|---------------------------|-------------------------|---------------------------|-------------|
| Reach                     | Neighbor                   | Neighbor                  | Neighbor                  | Neighbor                | Long reach                |             |
| Overall latency (ps)      | 30                         | 500                       | 1500<sup>[1]</sup>        | ~2000                   | ~6000                     | 3-65X       |
| Energy/bit (pJ/b)         | <0.03                      | <0.15                     | 0.8-0.85<sup>[3,4]</sup>  | 1.17<sup>[7]</sup>      | 6.9<sup>[13]</sup>        | 5-40X       |
| Bandwidth/mm (Gbps/mm)    | 8000<sup>a</sup>           | 2560<sup>a,b</sup>        | 707.7<sup>b</sup>         | 354                     | 149-298<sup>c</sup>       | 4-23X       |

<sup>a</sup> 4 wiring levels, <sup>b</sup> Assuming 20% overhead, <sup>c</sup> Estimated from data in [10-13]
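The Si-IF bandwidth-density entries follow from data rate per wire, wires per millimetre of die edge, number of wiring levels, and protocol overhead. A minimal sketch; the 5 µm wire pitch (200 wires per mm per level) is an assumption inferred because it reproduces the quoted numbers, and is not stated in the slides.

```python
def bandwidth_density_gbps_per_mm(rate_gbps: float, wire_pitch_um: float,
                                  layers: int, overhead: float = 0.0) -> float:
    """Edge bandwidth density: rate x wires per mm x levels x (1 - overhead)."""
    wires_per_mm = 1000.0 / wire_pitch_um * layers
    return rate_gbps * wires_per_mm * (1.0 - overhead)


if __name__ == "__main__":
    WIRE_PITCH_UM = 5.0  # assumed, see lead-in
    # Async: 10 Gbps per link, 4 wiring levels, no overhead.
    print(bandwidth_density_gbps_per_mm(10, WIRE_PITCH_UM, 4))
    # Sync: 4 Gbps per link, 4 wiring levels, 20% overhead.
    print(bandwidth_density_gbps_per_mm(4, WIRE_PITCH_UM, 4, overhead=0.2))
    # Measured demonstration: 3 Gbps per link on a 2-layer Si-IF.
    print(bandwidth_density_gbps_per_mm(3, WIRE_PITCH_UM, 2))
```

With that assumed pitch the three cases evaluate to approximately 8000, 2560 and 1200 Gbps/mm, matching the async and sync table entries and the measured 2-layer result.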





Jangam & Iyer T-CPMT (2020)





#### Technology Comparison s-FoM<sub>ucla</sub> - shows the benefit of technology





#### CHIPS Project Goals and Milestones

| Metric                                     | Phase 1                | Phase 2                | Phase 3                              | SuperCHIPS on Si-IF (current)                                   |
|--------------------------------------------|------------------------|------------------------|--------------------------------------|-----------------------------------------------------------------|
| **Design level**                           |                        |                        |                                      |                                                                 |
| IP reuse                                   | > 50% public IP blocks | > 50% public IP blocks | > 50% public IP blocks               | Feasible                                                        |
| Modular design                             | -                      | -                      | > 80% reused, > 50% prefabricated IP | Feasible                                                        |
| Access to IP                               | > 2 sources of IP      | > 2 sources of IP      | > 3 sources of IP                    | 2 sources of IP                                                 |
| Heterogeneous integration                  | > 2 technologies       | > 2 technologies       | > 3 technologies                     | Feasible                                                        |
| NRE reduction                              | -                      | > 50%                  | > 70%                                | Feasible                                                        |
| Turnaround time reduction                  | -                      | > 50%                  | > 70%                                | Feasible                                                        |
| Performance benchmarks (performer defined) | -                      | > 95% benchmark        | > 100% benchmark                     | See s-FoM                                                       |
| **Digital interfaces**                     |                        |                        |                                      |                                                                 |
| Data rate (scalable)                       | 10 Gbps                | 10 Gbps                | 10 Gbps                              | 10 Gbps                                                         |
| Energy efficiency                          | < 1 pJ/bit             | < 1 pJ/bit             | < 1 pJ/bit                           | < 0.4 pJ/bit                                                    |
| Latency                                    | ≤ 5 nsec               | ≤ 5 nsec               | ≤ 5 nsec                             | ≤ 0.1 nsec                                                      |
| Bandwidth density                          | > 1,000 Gbps/mm        | > 1,000 Gbps/mm        | > 1,000 Gbps/mm                      | > 1,000 Gbps/mm                                                 |
| **Analog interfaces**                      |                        |                        |                                      |                                                                 |
| Insertion loss (across full bandwidth)     | < 1 dB                 | < 1 dB                 | < 1 dB                               | < 0.6 dB at 30 GHz (measured)<br>< 0.8 dB at 50 GHz (estimated) |
| Bandwidth                                  | ≥ 50 GHz               | ≥ 50 GHz               | ≥ 50 GHz                             | ≥ 50 GHz                                                        |
| Power handling                             | ≥ 20 dBm               | ≥ 20 dBm               | ≥ 20 dBm                             | ≥ 20 dBm (EM limited)                                           |




# So what are the issues?

- Developing the assembly technology: fine pitch, close spacing, tight alignment, etc.
- Establishing a communication protocol for both near and far dielets
- Communicating with the outside world
- Delivering power: huge amounts of power!
- Extracting heat: huge amounts of heat!
- Making such systems reliable
- Ensuring the costs are economical



## Communicating with the outside world

 Flexible high speed wired connectors (FlexTrate<sup>tm</sup>)







 RF links using embedded fused quartz or PDMS and III-V drivers



Antenna on PDMS substrate





Photonic Interconnect:




# Summary

- Packaging has scaled significantly in the last few years
  - Driven by need, more investment, and more silicon-like processing
  - Silicon as a base packaging material has significant potential
- The challenges are
  - Assembly especially at high throughput
  - Connections to the outside world
  - Power delivery and heat extraction
  - Reliability and yield
  - Supply chain for bare dies
- We can extend this concept to flexible hybrid electronics (did not talk about it much today)



### Selected Bibliography

- S. Jangam and S. S. Iyer, "A Signaling Figure of Merit (s-FoM) for Advanced Packaging," in IEEE Transactions on Components, Packaging and Manufacturing Technology, doi: 10.1109/TCPMT.2020.3022760.
- S. Jangam, U. Rathore, S. Nagi, D. Markovic and S. S. Iyer, "Demonstration of a Low Latency (<20 ps) Fine-pitch (≤10 μm) Assembly on the Silicon Interconnect Fabric," 2020 IEEE 70th Electronic Components and Technology Conference (ECTC), Orlando, FL, USA, 2020, pp. 1801-1805, doi: 10.1109/ECTC32862.2020.00281.
- S. S. Iyer, S. Jangam, and B. Vaisband, "Silicon interconnect fabric: A versatile heterogeneous integration platform for AI systems," in IBM Journal of Research and Development, vol. 63, no. 6, pp. 5:1-5:16, 1 Nov.-Dec. 2019.
- Boris Vaisband and S. S. Iyer, "Global and Semi-Global Communication on Silicon Interconnect Fabric," Proceedings of the IEEE/ACM International Symposium on Networks-on-Chip, pp. 15:1-15:5, October 2019.
- P. Gupta and S. S. Iyer, "Goodbye, motherboard. Bare chiplets bonded to silicon will make computers smaller and more powerful: Hello, silicon-interconnect fabric," in IEEE Spectrum, vol. 56, no. 10, pp. 28-33, Oct. 2019, doi: 10.1109/MSPEC.2019.8847587.
- Boris Vaisband and S. S. Iyer, "Communication Considerations for Silicon Interconnect Fabric," Proceedings of the ACM/IEEE International Workshop on System Level Interconnect Prediction, June 2019.
- Kannan K. Thankappan, B. Vaisband, S. S. Iyer, "On-Chip ESD Monitor," IEEE 69th Electronic Components and Technology Conference (ECTC), May 28-31, 2019, Las Vegas, NV.
- Saptadeep Pal, D. Petrisko, M. Tomei, P. Gupta, S. S. Iyer, and R. Kumar, "Architecting Waferscale Processors: A GPU Case Study," in 25th IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 16-20, 2019, Washington D.C., USA.
- SivaChandra Jangam, A. Bajwa, K. K. Thankappan, P. Kittur, and S. S. Iyer, "Electrical Characterization of High Performance Fine Pitch Interconnects in Silicon-Interconnect Fabric," IEEE 68th Electronic Components and Technology Conference (ECTC), May 29-June 1, 2018, San Diego, CA.
- Saptadeep Pal, D. Petrisko, A. Bajwa, P. Gupta, S. S. Iyer, and R. Kumar, "A Case for Packageless Processors," 24th IEEE International Symposium on High-Performance Computer Architecture (HPCA), February 24-28, 2018, Vienna, Austria.
- Saptadeep Pal, S. S. Iyer, and P. Gupta, "Advanced packaging and heterogeneous integration to reboot computing," in IEEE International Conference on Rebooting Computing (ICRC), November 8-9, 2017, Washington, DC, USA. (Invited)
- Arvind Kumar, Z. Wan, W. Wilcke, and S. S. Iyer, "Towards Human-Scale Brain Computing Using 3D Wafer Scale Integration," ACM Journal of Emerging Technologies in Computing, vol. 13, no. 3, article no. 45, Apr. 2017.
- Subramanian S. Iyer, "Heterogeneous Integration for Performance and Scaling," in IEEE Transactions on Components, Packaging and Manufacturing Technology, vol. 6, no. 7, pp. 973-982, Jul. 2016, doi: 10.1109/TCPMT.2015.2511626.


© S. S. Iyer 2020