Resilient AI: Building Fault-Tolerant AI Systems | Kisaco Research

Building ever-larger AI clusters hinges on creating reliable, fault-tolerant systems. This keynote will explore how Meta's work on Llama 3: Herd of Models informs strategies for building robust AI infrastructure. We will highlight the Open AI Systems Initiative and Rack-scale Alignment for accelerator diversity, focusing on power, compute, and liquid cooling. Our discussion will emphasize the critical role of community engagement in setting open standards and enabling interoperability. This session will provide a roadmap for developing AI systems that can withstand the unpredictability of real-world applications.

Session Topics: 
Infrastructure
Hardware
Systems
Speaker(s): 

Author:

Dan Rabinovitsj

VP, Infrastructure
Meta

Dan has 30+ years' experience developing technology that connects people, with a particular focus on market disruption and innovation. He has served in executive leadership roles at Silicon Labs, NXP, Atheros, Qualcomm, Ruckus Networks and Facebook/Meta. Dan joined Meta in 2018 to lead Facebook Connectivity, a team focused on bringing more people online at faster speeds and changing the telecom industry through the Telecom Infra Project. He now supports a team developing and sustaining data center hardware and AI systems.
