BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000602Z
LOCATION:E Concourse
DTSTART;TZID=America/Denver:20231114T100000
DTEND;TZID=America/Denver:20231114T170000
UID:submissions.supercomputing.org_SC23_sess290_drs103@linklings.com
SUMMARY:Overcoming the Gap between Compute and Memory Bandwidth in Modern 
 GPUs
DESCRIPTION:Lingqi Zhang (Tokyo Institute of Technology)\n\nThe imbalance 
 between compute and memory bandwidth has been a long-standing issue. Despi
 te efforts to address it, the gap between them is still widening. This has
  led to the categorization of many applications as memory-bound kernels.\n
 \nThis dissertation centers on memory-bound kernels, with a particular em
 phasis on Graphics Processing Units (GPUs), given their rising prevalence 
 in High-Performance Computing (HPC) systems.\n\nWe first examine trends
  in GPU development over the last decades. Examples include cooperative
  groups (i.e., device-wide barriers), asynchronous copies to shared
  memory (i.e., hardware prefetching), lower latency of operations, and a
  larger volume of on-chip resources (register files and L1 cache).\n\n
 This dissertation seeks to utilize the latest GPU featur
 es to optimize memory-bound kernels. Specifically, we propose extending
  the kernel's lifetime across time steps and using the large volume of
  on-chip resources (i.e., register files and scratchpad memory) to reduce
  or eliminate traffic to device memory. Furthermore, we advocate a
  minimum level of parallelism to maximize the available on-chip
  resources.\n\nBased on these strategies, we propose a general execution
  model for running memory-bound iterative GPU kernels, PERsistent KernelS
  (PERKS), and a novel temporal blocking method, EBISU. Evaluations on the
  latest GPU architectures show outstanding performance compared with
  state-of-the-art counterpart implementations.\n\nTag: Accelerators,
  Artificial I
 ntelligence/Machine Learning, Applications, Cloud Computing, Distributed C
 omputing, Data Analysis, Visualization, and Storage, Data Compression, Het
 erogeneous Computing, I/O and File Systems, Quantum Computing, Reproducibi
 lity, Security, Software Engineering\n\nRegistration Category: Tech Progra
 m Reg Pass, Exhibits Reg Pass
END:VEVENT
END:VCALENDAR
