BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:710
DTSTART;TZID=America/Denver:20231112T165000
DTEND;TZID=America/Denver:20231112T170000
UID:submissions.supercomputing.org_SC23_sess427_misc322@linklings.com
SUMMARY:Lightning Talk: Toward Efficient Asynchronous Checkpointing for La
 rge-Language Models
DESCRIPTION:Avinash Maurya (Rochester Institute of Technology)\n\nLarge-la
 nguage models (LLMs) have been rapidly and widely adopted across research,
  academia, and enterprises for endeavors ranging from scientific and educa
 tional pursuits to financial and legal assistance. Uns
 urprisingly, training such sophisticated LLMs requires large-scale infrast
 ructures, typically consisting of accelerators such as GPUs, and spans acr
 oss multiple months depending on the size of the model and training data. 
 Unfortunately, GPU memory is limited to tens of GBs and cannot ho
 ld multi-billion parameter models which are typically hundreds of GBs in s
 ize. Therefore, a combination of data, model, and tensor parallel techniqu
 es is applied to enable training such LLMs, which shard and distribute th
 e model and its associated states across different GPUs. In this context, 
 there is a frequent need to roll back the training of such LLMs to past st
 able states. This can happen for various reasons: failure of components wh
 en running at scale, the need to fine-tune the model and try a different t
 raining direction, the need to inspect the evolution of the training to
  understand how it converges, etc. To this end, state-of-the-art LLM train
 ing runtimes s
 uch as DeepSpeed, PyTorch, etc., use synchronous or partially synchronous
  c
 heckpointing strategies, which lead to runtime overheads of up to 41% due 
 to I/O bottlenecks. In this talk, we discuss the challenges of adopting ex
 isting multi-level checkpointing libraries for distributed LLM training an
 d present novel strategies to perform efficient asynchronous multi-level c
 heckpointing of distributed LLMs to minimize the checkpointing overheads. 
 In particular, our approach is driven by key design ideas such as (1) blo
 cking training only when attempting to overwrite unflushed tensors; (2) u
 sing pinned host memory for faster device-to-host transfers via GPU copy
  engine
 s; (3) consistently capturing and serializing model states distributed acr
 oss device and host memory; and (4) selectively flushing checkpoints to mi
 nimize storage and I/O bottlenecks.\n\nTag: Fault Handling and Tolerance\n
 \nRegistration Category: Workshop Reg Pass\n\nSession Chairs: Gene Cooperm
 an (Northeastern University); Donglai Dai (Advanced Micro Devices (AMD)); 
 Rebecca Hartman-Baker (National Energy Research Scientific Computing Cente
 r (NERSC), Lawrence Berkeley National Laboratory (LBNL)); and Bogdan Nicol
 ae (Argonne National Laboratory (ANL), Illinois Institute of Technology)\n
 \n
END:VEVENT
END:VCALENDAR
