

### Beyond *Just* Hardware

Full-stack Optimization Towards Efficient AI Inference

Hyunsik Choi, Head of SW Platform, Jihoon Yoon, Product Marketing Manager



2017-2021

FuriosaAI founded & Gen 1 vision NPU launched

2021

GPT-3 inspired RNGD

2022

RNGD development kickoff

2024 May

RNGD raw silicon sample arrival

2024 July

First LLM demo

## **Key Points**

01 Mass AI adoption is bottlenecked

02 Energy efficient AI inference

03 Full-stack optimization for achieving efficiency



#### AI has broken energy efficiency













Data center cooling infrastructure (2024)



"Average server rack densities are increasing but remain below 8 kW. The majority of facilities do not have racks above 30 kW, and those that do have only a few."

- Uptime Institute Global Datacenter Summary 2024



## FuriosaAI's Mission

Make AI computing sustainable, enabling access to powerful AI for everyone on Earth



#### RNGD: Powerfully Efficient AI Inference

Data center AI accelerator built for the era of LLM and other generative AI models



512 TFLOPS

64 TFLOPS (FP8) x 8 Processing Elements

48 GB

Memory Capacity

256 MB SRAM
384 TB/s On-chip Bandwidth

1.5 TB/s

Memory Bandwidth

150 W TDP

targeting air-cooled data centers

2 x HBM3

CoWoS-S

INT8 (512 TOPS), BF16 (256 TFLOPS), INT4 (1 POPS), FP8 (512 TFLOPS)

Features

- For LLMs: PCIe P2P support
- For cloud: multiple-instance support, virtualization, secure boot & model encryption



## Early performance numbers: 60% higher perf/watt than current inference solutions

GPT-J 6B MLPerf Benchmark Scenario (99% accuracy)

|                                | RNGD          | NVIDIA L40S   | Intel Gaudi 2 | Google TPU v5e |
|--------------------------------|---------------|---------------|---------------|----------------|
| Performance<br>(queries / sec) | 11.5<br>(FP8) | 12.3<br>(FP8) | 10.51         | 2.5            |
| Power<br>(watt)                | 185           | 320           | Unknown       | Unknown        |
| Data source                    | measured      | measured      | MLPerf 3.1    | MLPerf 4.0     |

Disclaimer: As of Aug 2024, unverified by MLPerf
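The headline figure can be checked against the table. The sketch below is a minimal calculation, assuming the measured queries/sec and power values in the table are directly comparable. Only RNGD and the L40S have a power figure, so the comparison is limited to those two. The names `results` and `perf_per_watt` are illustrative and not part of any Furiosa tool.

```python
# Figures copied from the table above (GPT-J 6B, queries/sec and watts).
# Gaudi 2 and TPU v5e are left out because their power draw is listed as unknown.
results = {
    "RNGD": {"qps": 11.5, "watts": 185},
    "NVIDIA L40S": {"qps": 12.3, "watts": 320},
}


def perf_per_watt(entry):
    """Queries per second delivered per watt of measured power."""
    return entry["qps"] / entry["watts"]


rngd = perf_per_watt(results["RNGD"])
l40s = perf_per_watt(results["NVIDIA L40S"])
advantage = rngd / l40s - 1  # fractional perf/watt advantage of RNGD over L40S

print(f"RNGD: {rngd:.4f} qps/W")
print(f"L40S: {l40s:.4f} qps/W")
print(f"RNGD advantage: {advantage:.0%}")
```

With these inputs the ratio comes out at roughly 1.6, which is consistent with the 60% claim in the heading.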

3.5X compute per rack

Lower total cost of ownership, with less energy usage and fewer racks. Compatible with the air-cooled data centers of today.

Most data center racks today are below 15 kW. Data above is for running Llama 3 70B.

#### Beyond hardware

Full-stack innovation and optimization for maximized efficiency in AI inference



Model Execution
Serving
Utilization



#### Tensor Contraction, The Core Computation in Deep Learning





Flop analysis for BERT\*

"Data movement is the major bottleneck for efficiency"

Source: "Data Movement is All You Need," MLSYS'21

#### TCP: A Tensor Contraction Processor for AI Workloads

[Figure: first page of the paper "TCP: A Tensor Contraction Processor for AI Workloads"]

"TCP aims at exploiting the rich parallelism and data locality inherent in tensor contractions, thereby enhancing both efficiency and performance of AI workloads."

- "TCP: A Tensor Contraction Processor for AI Workloads," presented at ISCA (International Symposium on Computer Architecture), 2024

TCP (Tensor Contraction Processor)



#### Tensor Contraction, not Matmul, as a Primitive

**Tensor contraction** is a higher dimensional generalization of matrix multiplication.

#### Tensor contraction is declarative

No explicit memory layout for data
No explicit scheduling for computation
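To make the idea concrete, here is a minimal pure-Python sketch of a tensor contraction written in index notation. It is only an illustration of the concept and says nothing about how the TCP hardware or the Furiosa compiler executes it. The helper names (`shape_of`, `get`, `contract`) and the axis-label convention are assumptions made for this example. Note that the caller states only which axes are shared and which are kept. The loop order and data layout are left to the implementation, which is the declarative property described above.

```python
from itertools import product


def shape_of(t):
    """Return the shape of a rectangular nested list."""
    shape = []
    while isinstance(t, list):
        shape.append(len(t))
        t = t[0]
    return tuple(shape)


def get(t, idx):
    """Read one element of a nested list by a sequence of indices."""
    for i in idx:
        t = t[i]
    return t


def contract(a, a_axes, b, b_axes, out_axes):
    """Contract a and b; axis labels absent from out_axes are summed over."""
    sizes = {}
    for axes, t in ((a_axes, a), (b_axes, b)):
        for label, n in zip(axes, shape_of(t)):
            if sizes.setdefault(label, n) != n:
                raise ValueError(f"size mismatch on axis {label!r}")
    summed = [label for label in sizes if label not in out_axes]

    def element(env):
        # Sum the product of matching elements over every summed axis.
        total = 0
        for values in product(*(range(sizes[label]) for label in summed)):
            env.update(zip(summed, values))
            total += get(a, [env[l] for l in a_axes]) * get(b, [env[l] for l in b_axes])
        return total

    def build(env, remaining):
        # Recursively build the nested output, one output axis at a time.
        if not remaining:
            return element(dict(env))
        label, rest = remaining[0], remaining[1:]
        return [build({**env, label: i}, rest) for i in range(sizes[label])]

    return build({}, out_axes)


# Matrix multiplication is the special case "ik,kj->ij".
matmul = contract([[1, 2], [3, 4]], "ik", [[5, 6], [7, 8]], "kj", "ij")

# A higher-dimensional case: per-head attention scores, "hsd,htd->hst".
q = [[[1, 0], [0, 1]]]  # 1 head, 2 query positions, head dimension 2
k = [[[1, 2], [3, 4]]]  # 1 head, 2 key positions, head dimension 2
scores = contract(q, "hsd", k, "htd", "hst")
print(matmul, scores)
```

The same `contract` call covers both cases, which is the sense in which matrix multiplication is one instance of the more general primitive.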



#### DNN Graph Compiler: End-to-End Model Efficiency

- Optimal memory layout and operation scheduling for maximum data reusability
- Temporal pipeline opportunities
- Operator fusion and memory allocation, split/merge scheduling



#### Quantization Becomes More Critical as Model Sizes Grow

#### Efficiency gains through quantization

- Inference latency
- Computation time
- Memory footprint
- Energy consumption
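The memory footprint gain is easy to estimate. The sketch below is a back-of-the-envelope calculation, assuming a nominal 70 billion parameters for a 70B-class model and counting weights only (KV cache, activations, and quantization scales are ignored). The 48 GB figure is the RNGD memory capacity quoted earlier in this deck. All names are illustrative.

```python
import math

PARAMS = 70e9        # assumed nominal parameter count for a 70B-class model
CARD_MEMORY_GB = 48  # RNGD memory capacity quoted earlier in this deck

BITS_PER_WEIGHT = {"BF16": 16, "FP8 / INT8": 8, "INT4": 4}


def weight_footprint_gb(params, bits):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring scales and metadata."""
    return params * bits / 8 / 1e9


footprints = {
    name: weight_footprint_gb(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()
}
# Minimum number of cards needed just to hold the weights.
cards = {name: math.ceil(gb / CARD_MEMORY_GB) for name, gb in footprints.items()}

for name, gb in footprints.items():
    print(f"{name:<10} {gb:6.1f} GB  -> at least {cards[name]} card(s) for weights")
```

Halving the bit width halves the weight storage, which in this rough estimate also reduces the number of cards needed to hold the model.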

#### **Energy Consumption**

(Numbers are rough approximations for 45nm)

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |
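The relative costs follow directly from the energy figures. The sketch below normalizes each entry to the cheapest operation. Using 8b Add as the baseline is an assumption made for illustration, and the variable names are hypothetical.

```python
# Energy per operation in picojoules, copied from the table above.
ENERGY_PJ = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b Mult": 3.1,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5,
    "32b DRAM Read": 640,
}

BASELINE = "8b Add"  # assumed baseline for the relative comparison

relative = {op: pj / ENERGY_PJ[BASELINE] for op, pj in ENERGY_PJ.items()}
for op, ratio in relative.items():
    print(f"{op:<20} {ENERGY_PJ[op]:>6} pJ  {ratio:>9.1f}x")

# Two ratios that matter for inference efficiency.
dram_vs_sram = ENERGY_PJ["32b DRAM Read"] / ENERGY_PJ["32b SRAM Read (8KB)"]
fp32_vs_int8_mult = ENERGY_PJ["32b FP Mult"] / ENERGY_PJ["8b Mult"]
print(f"DRAM read vs SRAM read: {dram_vs_sram:.0f}x")
print(f"32b FP mult vs 8b mult: {fp32_vs_int8_mult:.1f}x")
```

On these numbers a DRAM read costs far more than any arithmetic operation, so lower precision helps twice: cheaper arithmetic and fewer bytes moved.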

#### Furiosa Quantizer: Graph-Based Automated Tool

#### **End-to-end automated quantization**

Supports arbitrary customized LLM models using graph pattern search

BF16, INT8 Weight-Only (W8A16), FP8 (W8A8), INT8 SmoothQuant (W8A8), INT4 Weight-Only (W4A16 AWQ / GPTQ)
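As a rough illustration of what a weight-only INT8 scheme does, the sketch below applies symmetric per-tensor quantization to a small list of weights. It is not the Furiosa Quantizer API and not its algorithm. The function names and the per-tensor scale are assumptions chosen for brevity (production schemes typically use per-channel or per-group scales).

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    # Fall back to a scale of 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(q, scale)
print(f"max round-trip error: {max_error:.6f}")
```

Each weight is stored in 8 bits plus one shared scale, and the round-trip error is bounded by about half the scale.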



Model Execution
Serving
Utilization



#### **Generative Inference Basics**



#### Challenges in Generative Model Serving



Challenges of auto-regressive execution in serving

#### Furiosa LLM: High-throughput Serving Engine for LLMs

#### High throughput serving with SOTA optimization

- Continuous batching starts incoming requests as soon as resources are available.
- PagedAttention eliminates compute and IO waste.
- Blocked KV cache significantly reduces memory waste.
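Two of these ideas can be shown with a toy model. The sketch below is a simplified simulation and not the Furiosa LLM implementation: it counts decode steps for static versus continuous batching, and reserved KV cache slots for contiguous versus blocked allocation. The function names, the block size of 16 tokens, and the 2048-token reservation are hypothetical values chosen for the example.

```python
import math


def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(
        max(output_lens[i:i + batch_size])
        for i in range(0, len(output_lens), batch_size)
    )


def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when a waiting request takes a slot as soon as one frees up."""
    pending = list(output_lens)
    running = []  # tokens still to generate for each active request
    steps = 0
    while pending or running:
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1  # one decode step produces one token per active request
        running = [left - 1 for left in running if left > 1]
    return steps


def kv_slots_contiguous(seq_lens, max_seq_len):
    """Token slots reserved when every request pre-allocates max_seq_len."""
    return len(seq_lens) * max_seq_len


def kv_slots_blocked(seq_lens, block_tokens):
    """Token slots reserved when the cache grows in fixed-size blocks."""
    return sum(math.ceil(n / block_tokens) for n in seq_lens) * block_tokens


output_lens = [2, 8, 3, 8]  # tokens to generate per request
static_steps = static_batching_steps(output_lens, batch_size=2)
continuous_steps = continuous_batching_steps(output_lens, batch_size=2)
print(static_steps, continuous_steps)

seq_lens = [37, 512, 1200, 90]  # current sequence length per request
wasted_contiguous = kv_slots_contiguous(seq_lens, max_seq_len=2048) - sum(seq_lens)
wasted_blocked = kv_slots_blocked(seq_lens, block_tokens=16) - sum(seq_lens)
print(wasted_contiguous, wasted_blocked)
```

In this toy run continuous batching finishes the same work in fewer decode steps, and blocked allocation wastes at most one partially filled block per request.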

**6X** increase in inference performance



Furiosa Generator

Furiosa Runtime



Model Execution
Serving
Utilization



A single RNGD has 8 Processing Elements (PEs)

An RNGD can be spatially partitioned into many individual NPUs




Up to 4 PEs can operate together as a single NPU

Furiosa RNGD supports **SR-IOV** (Single Root IO Virtualization) **for multiple isolated access from VMs** 
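The partitioning rules stated above can be expressed as a small check. This is a hypothetical sketch and not a Furiosa API: `validate_partition` and its constants are names invented for the example. It enforces only what the slides state (8 PEs per RNGD, at most 4 PEs per NPU). Whether every group size from 1 to 4 is actually supported is not stated here, so the sketch accepts all of them as an assumption.

```python
TOTAL_PES = 8        # PEs per RNGD, from the slides
MAX_PES_PER_NPU = 4  # largest group of PEs that can act as one NPU, from the slides


def validate_partition(npu_sizes):
    """Return the number of idle PEs, or raise ValueError for an invalid layout."""
    for size in npu_sizes:
        if not 1 <= size <= MAX_PES_PER_NPU:
            raise ValueError(f"an NPU must use 1 to {MAX_PES_PER_NPU} PEs, got {size}")
    used = sum(npu_sizes)
    if used > TOTAL_PES:
        raise ValueError(f"layout needs {used} PEs, only {TOTAL_PES} available")
    return TOTAL_PES - used


print(validate_partition([4, 4]))     # two large NPUs
print(validate_partition([1] * 8))    # eight independent single-PE NPUs
print(validate_partition([4, 2, 1]))  # mixed layout with one PE left idle
```

A scheduler could use a check like this to pack workloads of different sizes onto one card.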



#### Furiosa Software Stack Key Features

PyTorch 2.0 integration

Quantization toolkit (FP8, INT8, INT4, ...)

3D model parallelism support

Graph compiler for DNN models

Performance profiling tools

LLM serving framework compatible with vLLM

Kubernetes device plugin and NPU operator

Virtual machine support



#### In summary

## Delivering peak AI performance with high efficiency requires

#### Maximized model efficiency

The RNGD Chip, Compiler, and Furiosa Quantizer deliver peak performance with low-precision inference for speed and efficiency.

#### Enhanced serving capabilities

Boost throughput and reduce latency in production with PagedAttention, Blocked KV cache, and continuous batching.

#### Flexible resource utilization

RNGD's spatial partitioning and SR-IOV ensure optimal resource allocation, maximizing NPU utilization in virtualized and containerized environments.



To solve for mass AI adoption, we have to think beyond *just* hardware.