BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:503-504
DTSTART;TZID=America/Denver:20231116T113000
DTEND;TZID=America/Denver:20231116T120000
UID:submissions.supercomputing.org_SC23_sess254_exforum134@linklings.com
SUMMARY:From Stencils to Tensors: Running 3D Finite Difference Seismic Ima
 ging on the Groq AI Inference Accelerator
DESCRIPTION:Tobias Becker (Groq Inc)\n\nGroqChip™ is an AI accelerator opt
 imized for running large-scale inference workloads with high throughput an
 d ultra-low latency. It features a Tensor Streaming architecture optimized
  for matrix-oriented operations commonly found in AI, but the chip can als
 o efficiently compute other applications such as HPC workloads that can be
  expressed as large-scale matrix multiplication. GroqChip uses a determini
 stic dataflow execution model that results in predictable and repeatable p
 erformance without runtime variation, and its RealScale™ chip-to-chip inte
 rconnect technology makes it possible to scale applications across cards i
 n a node, or nodes in a rack, without hitting the bottlenecks of PCIe or t
 he network.\n\nHere, we explore how GroqChip and its architecture can be u
 sed to deliver high performance for linear algebra-based applications in H
 PC. Seismic imaging typically involves a 3D finite difference solver, whic
 h performs 3D stencil computations on a volume of data. The original stenc
 il algorithm is not well-suited to run on a tensor-based architecture, but
  we outline how stencil operations can be transformed into tensor
  operations by decomposing the stencil and recomposing it into matrices.
  The finite difference step can now be solved by matrix multiplications
  and matrix tra
 nspositions. A single GroqChip can run the finite difference step for a su
 b-cube of data which is fully kept in on-chip memory, while larger volumes
  are computed by mapping the computation to a full rack or several racks. 
 Halo data is exchanged between GroqChip processors via RealScale interconn
 ect, enabling the scaling of the application’s domain size without PCIe or
  internode communication becoming the bottleneck. The deterministic datafl
 ow model supports efficient orchestration of data movements within the chi
 p and between chips without ever stalling the compute units. Finally, nume
 rical analysis and optimization allow us to leverage Groq TruePoint™ a
 rithmetic to satisfy the numerical requirements of seismic imaging.\n\nTag
 : Accelerators, Artificial Intelligence/Machine Learning, Architecture and
  Networks, Hardware Technologies\n\nRegistration Category: Tech Program Re
 g Pass, Exhibits Reg Pass\n\nSession Chair: Jay Lofstead (Sandia National 
 Laboratories, University of New Mexico)
END:VEVENT
END:VCALENDAR
