BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000713Z
LOCATION:507
DTSTART;TZID=America/Denver:20231113T165000
DTEND;TZID=America/Denver:20231113T171000
UID:submissions.supercomputing.org_SC23_sess455_ws_risc101@linklings.com
SUMMARY:Automatic Generation of Micro-Kernels for Performance Portability 
 of Matrix Multiplication on RISC-V Vector Processors
DESCRIPTION:Francisco Igual and Luis Piñuel (Complutense University of Mad
 rid); Sandra Catalán (Jaume I University, Spain); Héctor Martínez (Univers
 idad de Córdoba); and Adrián Castelló and Enrique Quintana-Ortí (Universid
 ad Politecnica de Valencia)\n\nIn this paper, we propose and evaluate seve
 ral optimized implementations of the general matrix multiplication (Gemm) 
 on two different RISC-V architecture cores implementing the RISC-V vector 
 extension (RVV): C906 and C910 from T-HEAD. Specifically, we address the p
 erformance portability problem across these processor cores by means of an
  automatic assembly code generator, written in Python, capable of emitting
  RVV code for high performance computing (HPC), with a variety of combinat
 ions of specific and general optimizations.\n\nOur experimental results us
 ing a number of automatically-generated micro-kernels for Gemm, on both RI
 SC-V architectures, reveal different impact of each optimization, dependin
 g on the target architecture, and highlight the importance of automaticall
 y generating HPC RVV code to achieve performance portability while reducin
 g the developers' effort. In addition, these optimizations show important 
 performance gains with respect to a state-of-the-art tuned BLAS library
  (OpenBLAS), reaching 3x and 1.3x speed-ups for the C910 and C906, respect
 ively.\n\nTag: Architecture and Networks, Hardware Technologies\n\nRegistr
 ation Category: Workshop Reg Pass\n\nSession Chairs: Nick Brown (Edinburgh
  Parallel Computing Centre (EPCC); University of Edinburgh, Scotland); Joh
 n Davis (Openchip); Andy Gothard (Siemens); John Leidel (Tactical Computin
 g Laboratories LLC, Texas Tech University); and Michael Wong (Codeplay Sof
 tware Ltd, Khronos Group Inc)\n\n
END:VEVENT
END:VCALENDAR
