BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:503-504
DTSTART;TZID=America/Denver:20231116T133000
DTEND;TZID=America/Denver:20231116T140000
UID:submissions.supercomputing.org_SC23_sess255_exforum137@linklings.com
SUMMARY:Accelerating Scientific Workflows with the NVIDIA Grace Hopper Pla
 tform
DESCRIPTION:Mathias Wagner (NVIDIA Corporation)\n\nNVIDIA Grace Hopper Sup
 erchips are a scale-up architecture ideal for scientific computing workflo
 ws involving CPUs and GPUs. Building on a decade of GPU acceleration, Grac
 e-Hopper realizes NVIDIA NVLink C2C, a 900 GB/s interconnect between the G
 race CPU and the Hopper H100 GPU. C2C enables coherent memory at 7x the ba
 ndwidth of PCIe across Hopper’s 96GB HBM3 and Grace’s up to 480GB LPDDR5X.
  This removes the conceptual CPU/GPU memory divide and lowers barriers for
  scientists accelerating their applications with ever faster GPUs, e.g., H
 100 delivering up to 67 FP64 teraflops and 4 TB/s memory bandwidth. Wit
 h more application code executing on GPUs, workload performance becomes in
 creasingly susceptible to non-GPU limiters like data movement and CPU perf
 ormance (Amdahl’s Law). C2C and the Grace CPU, ideal for single-thread or 
 multi-core CPU workloads, restore the required balance. Grace combines 72
  Arm Neoverse-V2 cores with NVIDIA Scalable Coherency Fabric, a distribu
 ted cache and mesh fabric with 3.2 TB/s bisection bandwidth. This high-ba
 ndwidth mesh enables one NUMA node for all 72 CPU cores, simplifying multi
 -core programming. Each core implements a 512-bit SVE2 SIMD pipeline for a
  total CPU FP64 theoretical peak of 7.1 teraflops. When combined with the 
 up to 500 GB/s memory bandwidth of the LPDDR5X DRAM, Grace delivers twice 
 the performance-per-Watt of conventional x86-64 CPUs. This session present
 s HPC and AI workload performance results with a technical deep-dive into 
 the specific features of Grace-Hopper that accelerate each workload. We di
 scuss how Grace-Hopper's distinctive coupling of the CPU/GPU hardware and 
 the accompanying software stack create a platform that increases develope
 r productivity, accelerates existing applications, and facilitates new sta
 ndard programming models in C++, Fortran, and Python. Attendees will gain 
 a deeper understanding of how to extract the performance offered by Grace-
 Hopper and realize the potential of this innovative, energy-efficient plat
 form for science and industry.\n\nTag: Artificial Intelligence/Machine Lea
 rning, Architecture and Networks, Hardware Technologies\n\nRegistration Ca
 tegory: Tech Program Reg Pass, Exhibits Reg Pass\n\nSession Chair: Eishi A
 rima (Technical University of Munich)\n\n
END:VEVENT
END:VCALENDAR
