BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:503-504
DTSTART;TZID=America/Denver:20231116T110000
DTEND;TZID=America/Denver:20231116T113000
UID:submissions.supercomputing.org_SC23_sess254_exforum128@linklings.com
SUMMARY:Strong Scaling of State-of-the-Art LLM Inference with Groq Softwar
 e-Scheduled Deterministic Networks
DESCRIPTION:Igor Arsovski (Groq Inc)\n\nIn this talk, we will
  demonstrate Groq’s approach to synchronous, software-scheduled AI
  accelerator networks and showcase how we use it to unlock
  state-of-the-art performance and latency on Large Language Models
  (LLMs), including Llama-2 70B, scaled to over 500 GroqChip™ Language
  Processors™.\n\nTraditional HPC systems and data centers use dynamic
  time- and space-sharing, where platforms dynamically coordinate the
  use of compute, memory, and network resources among threads or
  workloads. This is a natural solution for arbitrary compute
  workloads, whose unpredictability makes such mediation a
  prerequisite. Unfortunately, this results in compounding inefficiency
  and complexity at all layers of the stack: processor architecture,
  memory, networking, and more. Modern AI workloads, however, have a
  predictable structure that allows efficient static scheduling of
  compute and network resources.\n\nGroq is turning this theory into
  practice by making components deterministic from the ground up to
  stand up large-scale synchronous compute platforms and empower
  software to make more orchestration decisions statically. Unlike
  traditional networks, where packets can collide and congestion can
  develop, all traffic in the Groq network is completely pre-planned by
  the Groq™ Compiler with zero network collisions. This maximizes not
  only the utilization of the links but also the number of minimal
  paths that can be taken between chips. Deterministic compute and
  static orchestration do introduce new software and hardware
  challenges and co-optimization opportunities, which we will discuss
  in this talk.\n\nOvercoming these challenges unlocks the opportunity
  for greater compute and power efficiency on AI workloads. Groq’s
  software-scheduled networks offer key advantages, including: (1)
  global network load balancing via compiler-driven network traffic
  scheduling; (2) high network bandwidth efficiency via low control
  overhead; and (3) low-latency chip-to-chip communication via a
  router-less, handshake-less direct topology. We showcase these
  advantages by demonstrating state-of-the-art performance on LLMs,
  including Llama-2 70B, scaled to over 500 Language Processors.\n\nTag:
  Accelerators, Artificial Intelligence/Machine Learning, Architecture
  and Networks, Hardware Technologies\n\nRegistration Category: Tech
  Program Reg Pass, Exhibits Reg Pass\n\nSession Chair: Jay Lofstead
  (Sandia National Laboratories, University of New Mexico)\n\n
END:VEVENT
END:VCALENDAR
