BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:401-402
DTSTART;TZID=America/Denver:20231114T140000
DTEND;TZID=America/Denver:20231114T143000
UID:submissions.supercomputing.org_SC23_sess180_pap374@linklings.com
SUMMARY:Prodigy: Toward Unsupervised Anomaly Detection in Production HPC S
 ystems
DESCRIPTION:Burak Aksar (Boston University, Sandia National Laboratories);
  Efe Sencan (Boston University); Benjamin Schwaller, Omar Aaziz, Vitus J. 
 Leung, and Jim Brandt (Sandia National Laboratories); and Brian Kulis, Man
 uel Egele, and Ayse K. Coskun (Boston University)\n\nPerformance variation
 s caused by anomalies in modern High Performance Computing (HPC) systems l
 ead to decreased efficiency, impaired application performance, and increas
 ed operational costs. While machine learning (ML)-based frameworks for aut
 omated anomaly detection (often based on time series telemetry data) are g
 aining popularity in the literature, practical deployment challenges are o
 ften overlooked. Some ML-based frameworks require extensive customization,
  while others need a rich set of labeled samples, neither of which is feas
 ible for a production HPC system.\n\nThis paper introduces a variational
  aut
 oencoder-based anomaly detection framework, Prodigy, that outperforms the 
 state-of-the-art alternatives by achieving a 0.95 F1-score when detecting 
 performance anomalies. The paper also provides a real system implementatio
 n of Prodigy that enables easy integration with monitoring frameworks and 
 rapid deployment. We deploy Prodigy on a production HPC system and demonst
 rate 88% accuracy in detecting anomalies. Prodigy includes an interface to
  provide job- and node-level analysis and explanations for anomaly predict
 ions.\n\nTag: Architecture and Networks, Performance Measurement, Modeling
 , and Tools, Resource Management\n\nRegistration Category: Tech Program Re
 g Pass\n\nReproducibility Badges: Artifact Available, Artifact Functional,
  Results Reproduced\n\nSession Chair: Ann Gentile (Sandia National Laborat
 ories)\n\n
END:VEVENT
END:VCALENDAR
