BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:403-404
DTSTART;TZID=America/Denver:20231115T103000
DTEND;TZID=America/Denver:20231115T110000
UID:submissions.supercomputing.org_SC23_sess161_pap436@linklings.com
SUMMARY:Understanding the Effects of Permanent Faults in GPU’s Parallelism
  Management and Control Units
DESCRIPTION:Juan David Guerrero Balaguera and Josie Esteban Rodriguez Cond
 ia (Politecnico di Torino); Fernando Fernandes dos Santos (University of R
 ennes, Inria Rennes - Bretagne Atlantique Research Centre); Matteo Sonza R
 eorda (Politecnico di Torino); and Paolo Rech (University of Trento)\n\nMo
 dern Graphics Processing Units (GPUs) demand life expectancy extended to m
 any years, exposing the hardware to aging (i.e., permanent faults arising 
 after the end-of-manufacturing test). Hence, techniques to assess permanen
 t fault impacts in GPUs are strongly required, especially in safety-critic
 al domains.\n\nThis paper presents a method to evaluate permanent faults i
 n the GPU's scheduler and control units, together with the first figures t
 o quantify these effects. We inject 5.83x10^5 permanent faults in the gate
 -level units of a GPU model. Then, we map the observed error categories as
  software errors by instrumenting 13 applications and two convolutional
  neural networks, injecting more than 1.65x10^5 permanent errors (1,000
  errors per application), reducing evaluation times from several years
  to hundreds of hours. Our results highlight that faults in GPU
  parallelism management units impact software execution parameters.
  Moreover, errors in resource management or instruction codes hang the
  code, while 45% of errors induce silent data corruption.\n\nTag:
  Accelerators, Architecture and Networks, Data Analysis, Visualization,
  and Storage, Fault Handling and Tolerance\n\nRegistration Category: Tech
  Program Reg Pass\n\nAward Finalist: Best Student Paper
  Finalist\n\nReproducibility Badges: Artifact Available, Artifact
  Functional, Results Reproduced\n\nSession Chair: Ignacio Laguna
  (Lawrence Livermore National Laboratory (LLNL))\n\n
END:VEVENT
END:VCALENDAR
