BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000711Z
LOCATION:403-404
DTSTART;TZID=America/Denver:20231116T103000
DTEND;TZID=America/Denver:20231116T110000
UID:submissions.supercomputing.org_SC23_sess178_pap582@linklings.com
SUMMARY:Optimizing Direct Convolutions on ARM Multi-Cores
DESCRIPTION:Pengyu Wang, Weiling Yang, Jianbin Fang, Dezun Dong, Chun Huan
 g, Peng Zhang, and Tao Tang (National University of Defense Technology (NU
 DT), China) and Zheng Wang (University of Leeds, School of Computing, UK)\
 n\nConvolution kernels are widely seen in deep learning workloads and are 
 often responsible for performance bottlenecks. Recent research has demonst
 rated that a direct convolution approach can outperform the traditional co
 nvolution implementation based on tensor-to-matrix conversions. However, e
 xisting approaches for direct convolution still have room for performance 
 improvement. We present NDIRECT, a new direct convolution approach that ta
 rgets ARM-based multi-core CPUs commonly found in smartphones and HPC syst
 ems. NDIRECT is designed to be compatible with the data layout formats use
 d by mainstream deep learning frameworks but offers new optimizations for 
 the computational kernel, data packing, and parallelization. We evaluate N
 DIRECT by applying it to representative convolution kernels and demonstrat
 ing its performance on four distinct ARM multi-core CPU platforms. We comp
 are NDIRECT against state-of-the-art convolution optimization techniques. 
 Experimental results show that NDIRECT gives the best overall performance 
 across evaluation scenarios and platforms.\n\nTag: Artificial Intelligence
 /Machine Learning, Codesign, Performance Optimization, Programming Framewo
 rks and System Software\n\nRegistration Category: Tech Program Reg Pass\n\
 nReproducibility Badges: Artifact Available, Artifact Functional, Results 
 Reproduced\n\nSession Chair: Aparna Chandramowlishwaran (University of Cal
 ifornia, Irvine)
END:VEVENT
END:VCALENDAR
