BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:401-402
DTSTART;TZID=America/Denver:20231115T133000
DTEND;TZID=America/Denver:20231115T140000
UID:submissions.supercomputing.org_SC23_sess168_pap262@linklings.com
SUMMARY:EasyScale: Elastic Training with Consistent Accuracy and Improved 
 Utilization on GPUs
DESCRIPTION:Mingzhen Li (Beihang University), Wencong Xiao (Unaffiliated),
  Hailong Yang and Biao Sun (Beihang University), Hanyu Zhao and Shiru Ren 
 (Unaffiliated), Zhongzhi Luan (Beihang University), Xianyan Jia (Unaffilia
 ted), Yi Liu (Beihang University), Yong Li and Wei Lin (Unaffiliated), and
 Depei Qian (Beihang University)\n\nDistributed synchronized GPU training 
 is commonly used for deep learning. The resource constraint of using a 
 fixed number of GPUs makes large-scale training jobs suffer from long 
 queuing times for resource allocation and lowers cluster utilization. 
 Adapting to resource elasticity can alleviate this, but it often 
 introduces inconsistent model accuracy because the model training 
 procedure cannot be decoupled from resource allocation. We propose 
 EasyScale, an elastic training system that achieves consistent model 
 accuracy under resource elasticity for both homogeneous and 
 heterogeneous GPUs. EasyScale strictly preserves data-parallel training 
 behaviors, carefully traces the consistency-relevant factors, and 
 exploits deep learning characteristics through its EasyScaleThread 
 abstraction and fast context switching. To utilize heterogeneous 
 clusters, EasyScale dynamically assigns workers via intra-job and 
 inter-job schedulers, minimizing load imbalance and maximizing 
 aggregated job throughput. Deployed in an online serving cluster, 
 EasyScale enables training jobs to utilize idle GPUs opportunistically, 
 improving overall cluster utilization by 62.1%.\n\nTag: Artificial 
 Intelligence/Machine Learning\n\nRegistration Category: Tech Program 
 Reg Pass\n\nReproducibility Badges: Artifact Available, Artifact 
 Functional, Results Reproduced\n\nSession Chair: Shuaiwen Leon Song 
 (Microsoft Corporation)\n\n
END:VEVENT
END:VCALENDAR
