BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:605
DTSTART;TZID=America/Denver:20231112T163300
DTEND;TZID=America/Denver:20231112T164000
UID:submissions.supercomputing.org_SC23_sess434_ws_ftxs105@linklings.com
SUMMARY:Dynamic Selective Protection of Sparse Iterative Solvers via ML Pr
 ediction of Soft Error Impacts
DESCRIPTION:Zizhao Chen (University of Kansas); Thomas Verrecchia (Nationa
 l Institute of Advanced Technology (ENSTA Paris)); Hongyang Sun (Universit
 y of Kansas); Joshua Booth (University of Alabama, Huntsville); and Padma 
 Raghavan (Vanderbilt University)\n\nSoft errors occur frequently on large 
 computing platforms due to the increasing scale and complexity of HPC syst
 ems. Various resilience techniques have been proposed to protect scientifi
 c applications from soft errors. Among them, system-level replication ofte
 n involves duplicating or triplicating the entire computation, resulting i
 n high resilience overhead. This paper proposes dynamic selective protecti
 on for sparse iterative solvers, in particular for the Preconditioned Conj
 ugate Gradient (PCG) solver, at the system level to reduce the resilience 
 overhead. We leverage machine learning (ML) to predict the impact of soft
  errors that strike different elements of a key computation at different i
 terations of the solver. Based on the result of the prediction, we design 
 a dynamic strategy to selectively protect those elements that result in a 
 large performance degradation if struck by soft errors. An experimental ev
 aluation demonstrates that our dynamic protection strategy reduces the res
 ilience overhead compared to existing algorithms.\n\nTag: Fault Handling a
 nd Tolerance, Large Scale Systems\n\nRegistration Category: Workshop Reg P
 ass\n\nSession Chairs: John Daly (US Department of Defense), Scott Levy (S
 andia National Laboratories), and Keita Teranishi (Oak Ridge National Labo
 ratory (ORNL))
END:VEVENT
END:VCALENDAR
