BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000605Z
LOCATION:DEF Concourse
DTSTART;TZID=America/Denver:20231116T100000
DTEND;TZID=America/Denver:20231116T170000
UID:submissions.supercomputing.org_SC23_sess300_spostg104@linklings.com
SUMMARY:Job Level Communication-Avoiding Detection and Correction of Silen
 t Data Corruption in HPC Applications
DESCRIPTION:Laslo Hunhold (University of Cologne)\n\nDetecting and correct
 ing Silent Data Corruption (SDC) is of high interest for many HPC applicat
 ions due to the dramatic consequences such undetected computation errors c
 an have. Moreover\, as computing enters the exascale era\, SDC error rat
 es keep increasing with growing system sizes. State-of-the-art methods b
 ased on instruction duplication suffer from only partial error cover
 age\, significant synchronization overhead\, and strong coupling of comp
 utation and validation.\n\nThis work proposes a novel communication-avoid
 ing approach to detecting and mitigating SDCs at the job level within th
 e workload manager\, assuming a directed acyclic graph (DAG) job model. E
 ach job on
 ly communicates a locally generated output data hash. Computation and va
 lidation are decoupled as separately schedulable jobs\, and dependency s
 talling is avoided with a special error recovery method. The implementat
 ion of this project within the SLURM workload manager is in progress\, a
 nd key design aspects are outlined.\n\nRegistration Category: Tech Progr
 am Reg Pass\, Exhibits Reg Pass\n\n
END:VEVENT
END:VCALENDAR
