BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:401-402
DTSTART;TZID=America/Denver:20231114T153000
DTEND;TZID=America/Denver:20231114T160000
UID:submissions.supercomputing.org_SC23_sess179_pap535@linklings.com
SUMMARY:Scalable Tuning of (OpenMP) GPU Applications via Kernel Record and
  Replay
DESCRIPTION:Konstantinos Parasyris and Giorgis Georgakoudis (Lawrence Live
 rmore National Laboratory (LLNL)), Esteban Rangel (Argonne National Labora
 tory (ANL)), and Ignacio Laguna and Johannes Doerfert (Lawrence Livermore 
 National Laboratory (LLNL))\n\nHPC is a heterogeneous world in which host 
 and device code are interleaved throughout the application. Given the sign
 ificant performance advantage of accelerators, device code execution time 
 is becoming the new bottleneck. Tuning the accelerated parts is consequent
 ly highly desirable but often impractical due to the large overall applica
 tion runtime which includes unrelated host parts.\n\nWe propose a Record-R
 eplay (RR) mechanism to facilitate auto-tuning of large (OpenMP) offload a
 pplications. RR dissects the application, effectively isolating GPU kernel
 s into independent executables. These comparatively small code-lets are am
 enable to various forms of post-processing, including elaborate auto-tunin
 g. By eliminating the resource requirements and application dependencies, 
 massively parallel and distributed auto-tuning becomes feasible.\n\nUsing
  RR, we run scalable Bayesian Optimization to determine optimal kernel lau
 nch parameters. LULESH showcases an end-to-end speedup of up to 1.53x, whi
 le RR enables 102x faster tuning compared to existing approaches using the
  entire application.\n\nTag: Accelerators, Distributed Computing, Middlewa
 re and System Software, Performance Measurement, Modeling, and Tools, Post
 -Moore Computing\n\nRegistration Category: Tech Program Reg Pass\n\nAward 
 Finalist: Best Paper Finalist\n\nReproducibility Badges: Artifact Availabl
 e, Artifact Functional, Results Reproduced\n\nSession Chair: Hari Subramon
 i (The Ohio State University)
END:VEVENT
END:VCALENDAR
