BEGIN:VCALENDAR
VERSION:2.0
PRODID:Linklings LLC
BEGIN:VTIMEZONE
TZID:America/Denver
X-LIC-LOCATION:America/Denver
BEGIN:DAYLIGHT
TZOFFSETFROM:-0700
TZOFFSETTO:-0600
TZNAME:MDT
DTSTART:19700308T020000
RRULE:FREQ=YEARLY;BYMONTH=3;BYDAY=2SU
END:DAYLIGHT
BEGIN:STANDARD
TZOFFSETFROM:-0600
TZOFFSETTO:-0700
TZNAME:MST
DTSTART:19701101T020000
RRULE:FREQ=YEARLY;BYMONTH=11;BYDAY=1SU
END:STANDARD
END:VTIMEZONE
BEGIN:VEVENT
DTSTAMP:20260422T000712Z
LOCATION:506
DTSTART;TZID=America/Denver:20231112T170000
DTEND;TZID=America/Denver:20231112T171500
UID:submissions.supercomputing.org_SC23_sess430_ws_cafcw116@linklings.com
SUMMARY:Scalable Lead Prediction with Transformers Using HPC Resources
DESCRIPTION:Archit Vasan, Rick Stevens, Arvind Ramanathan, and Venkatram V
 ishwanath (Argonne National Laboratory (ANL))\n\nA promising direction in 
 cancer drug discovery is high-throughput screening of extensive compound d
 atasets to identify advantageous properties, including their ability to in
 teract with relevant biomolecules such as proteins. However, traditional s
 tructural approaches for assessing binding affinity, such as free energy m
 ethods or molecular docking, pose significant computational bottlenecks wh
 en dealing with such vast datasets. To address this, we have developed a d
 ocking surrogate called the SMILES transformer (ST), which learns molecula
 r features from the SMILES representation of compounds and approximates th
 eir binding affinity. SMILES data is first tokenized using a well-establis
 hed SMILES-pair tokenizer and fed into a BERT-like Transformer model to ge
 nerate vector embeddings for each molecule, effectively capturing the esse
 ntial information. These extracted embeddings are then fed into a regressi
 on model to predict the binding affinity. Leveraging the high-performance 
 computing resources at Argonne National Laboratory, we devised a workflow 
 to scale model training and inference across multiple supercomputing 
 nodes. To ev
 aluate the performance and accuracy of our workflow, we conducted experime
 nts using molecular docking binding affinity data on multiple receptors, c
 omparing ST with another state-of-the-art docking surrogate. Impressively,
  both surrogates yielded comparable validation R-squared (val-r2) values 
 between 70% and 90%, affirming the capability of ST to learn molecular 
 features directly f
 rom language-based data. Furthermore, one significant advantage of the ST 
 approach is its notably faster tokenization preprocessing compared to the 
 alternative method, which requires generating molecular descriptors using 
 Mordred. Our workflow facilitated screening of ~3 billion compounds on 48
  nodes of the Polaris supercomputer in approximately an hour. In summary, 
 our approach presents an efficient means to screen extensive compound 
 databases for molecules whose properties could make them lead compounds 
 targeting cancer. Looking ahead, an important future direction for our w
 orkflow involves integrating de novo drug design, enabling us to scale our
  efforts to explore the limits of synthesizable compounds within chemical 
 space.\n\nTag: Applications, State of the Practice\n\nRegistration Categor
 y: Workshop Reg Pass\n\nSession Chairs: Lynn Borkon (Frederick National La
 boratory for Cancer Research); Sally Ellingson (University of Kentucky); S
 ean Hanlon (National Institutes of Health (NIH), National Cancer Institute
  (NCI)); Patricia Kovatch (Icahn School of Medicine at Mount Sinai); and E
 ric Stahlberg (MD Anderson Cancer Center, University of Texas)\n\n
END:VEVENT
END:VCALENDAR
