BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:503-504
DTSTART;TZID=America/Denver:20231114T143000
DTEND;TZID=America/Denver:20231114T150000
UID:submissions.supercomputing.org_SC23_sess249_exforum106@linklings.com
SUMMARY:Supercluster-Scale ML Training with Oracle Cloud Infrastructure
DESCRIPTION:Kevin Jorissen (Oracle)\n\nWe have seen a substantial
  increase in the use of Oracle Cloud Infrastructure (OCI) for training
  large language models (LLMs), as more and more startups and
  established companies seek to gain an edge with increasingly large and
  more accurate models. These models share a need for efficient GPU
  cluster computing – that is, the ability to scale training to hundreds
  or thousands of GPUs for an extended period of time while maintaining
  performance and efficiency. Performance is crucial both at the level
  of the individual GPU and in scaling efficiently across the network.
  Scaling these large training runs can be very complex and difficult to
  tune, requiring a cost-effective infrastructure that provides
  availability, resiliency, and performance at scale.\n\nIn this talk,
  we will discuss our approach to supporting the needs of these large
  language models, building on years of experience running HPC on
  bare-metal instances with a very low-latency network. We will present
  Oracle’s “SuperCluster”, which scales to thousands or tens of
  thousands of NVIDIA A100 and H100 GPUs with low latency and
  inter-node bandwidth of up to 3,200 Gbps. This time-tested bare-metal
  instance platform is combined with intelligent job placement, locality
  awareness, and additional tuning to make ML work at the largest
  scales. Oracle’s SuperClusters have been rigorously tested on
  well-known public benchmarks such as Megatron, where they reach very
  high throughput, as well as on proprietary cutting-edge models that
  are commonly used in machine learning. We will show examples of use
  from various companies and discuss the challenges that were addressed
  to run these models at such scale. We will finish the presentation
  with a discussion of some of the open research problems that still
  need to be addressed in this area.\n\nTag: Accelerators, Artificial
  Intelligence/Machine Learning\n\nRegistration Category: Tech Program
  Reg Pass, Exhibits Reg Pass\n\nSession Chair: Jane Herriman (Lawrence
  Livermore National Laboratory (LLNL))\n\n
END:VEVENT
END:VCALENDAR
