Efficient Execution of OpenMP on GPUs

dc.authorid: Georgakoudis, Giorgis/0000-0001-6542-3555
dc.authorwosid: Georgakoudis, Giorgis/AAF-6033-2020
dc.contributor.author: Huber, Joseph
dc.contributor.author: Cornelius, Melanie
dc.contributor.author: Georgakoudis, Giorgis
dc.contributor.author: Tian, Shilei
dc.contributor.author: Diaz, Jose M. Monsalve
dc.contributor.author: Dinel, Kuter
dc.contributor.author: Chapman, Barbara
dc.date.accessioned: 2023-07-26T11:50:35Z
dc.date.available: 2023-07-26T11:50:35Z
dc.date.issued: 2022
dc.department: Rectorate, Units Affiliated with the Rectorate
dc.description: 20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO), April 2-6, 2022, Seoul, South Korea
dc.description.abstract: OpenMP is the preferred choice for CPU parallelism in High-Performance Computing (HPC) applications written in C, C++, or Fortran. As HPC systems became heterogeneous, OpenMP introduced support for accelerator offloading via the target directive. This allowed porting of existing (CPU) code onto GPUs, including well-established CPU parallelism paradigms. However, architectural differences between CPU and GPU execution make common patterns, such as forking and joining threads, single-threaded execution, or the sharing of local (stack) variables, generally costly on the latter. So far, it has been left to the user to identify and avoid inefficient code patterns, most commonly by writing OpenMP offloading code in a kernel-language style that resembles CUDA more than it does traditional OpenMP. In this work, we present OpenMP-aware program analyses and optimizations that allow efficient execution of the generic, CPU-centric parallelism model provided by OpenMP on GPUs. Our implementation in LLVM/Clang maps various common OpenMP patterns found in real-world applications efficiently to the GPU. As static analysis is inherently limited, we provide actionable and informative feedback to the user about the performed and missed optimizations, together with ways for the user to annotate the program for better results. Our extensive evaluation using several HPC proxy applications shows significantly improved GPU kernel times and reduced resource requirements, such as GPU registers.
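The abstract's concern can be made concrete with a small example. The following C sketch is illustrative only (it is not taken from the paper; the names N, scale, and data are assumptions): it shows the generic, CPU-centric OpenMP offloading style the abstract describes, combining single-threaded setup, a local (stack) variable shared into a parallel region, and fork-join parallelism inside a target region.

    #include <stdio.h>
    #define N 1024

    int main(void) {
        double data[N];
        for (int i = 0; i < N; ++i)
            data[i] = (double)i;

        /* Generic, CPU-centric offload region: the patterns the paper's
           analyses aim to execute efficiently on GPUs. */
        #pragma omp target map(tofrom: data[0:N])
        {
            double scale = 2.0;  /* local (stack) variable shared below */
            scale += 1.0;        /* single-threaded setup inside the kernel */

            #pragma omp parallel for  /* fork-join parallelism on the device */
            for (int i = 0; i < N; ++i)
                data[i] *= scale;
        }

        printf("data[%d] = %f\n", N - 1, data[N - 1]);
        return 0;
    }

Built with an offloading-capable compiler (e.g., clang -O2 -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda), this is the style of kernel whose overheads the paper's LLVM/Clang optimizations are designed to remove.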
dc.description.sponsorship: IEEE, Association for Computing Machinery (ACM), ACM SIGPLAN, ACM SIGMICRO, IEEE Computer Society, Arm, Meta, Huawei, Microsoft, Google, Samsung, Seoul National University
dc.description.sponsorship: Exascale Computing Project, U.S. Department of Energy organization (Office of Science) [17-SC-20-SC]; Lawrence Livermore National Security, LLC (LLNS) via MPO [B642066]; DOE Office of Science User Facility [DE-AC05-00OR22725]; LLNL under the LLNL-LDRD Program [DE-AC52-07NA27344 (LLNL-CONF-826728), 21-ERD-018]; Exascale Computing Project, U.S. Department of Energy organization (National Nuclear Security Administration) [17-SC-20-SC]
dc.description.sponsorship: Part of this research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative. Part of this research was supported by Lawrence Livermore National Security, LLC (LLNS) via MPO No. B642066. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory. This research used resources of the Oak Ridge Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC05-00OR22725. This work was partially supported by LLNL under Contract DE-AC52-07NA27344 (LLNL-CONF-826728) through the LLNL-LDRD Program, Project No. 21-ERD-018.
dc.identifier.doi: 10.1109/CGO53902.2022.9741290
dc.identifier.endpage: 52
dc.identifier.isbn: 978-1-6654-0584-3
dc.identifier.issn: 2164-2397
dc.identifier.scopus: 2-s2.0-85128418491
dc.identifier.scopusquality: N/A
dc.identifier.startpage: 41
dc.identifier.uri: https://doi.org/10.1109/CGO53902.2022.9741290
dc.identifier.uri: https://hdl.handle.net/20.500.12684/12381
dc.identifier.wos: WOS:000827636600004
dc.identifier.wosquality: N/A
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.institutionauthor: Dinel, Kuter
dc.language.iso: en
dc.publisher: IEEE Computer Society
dc.relation.ispartof: CGO '22: Proceedings of the 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)
dc.relation.publicationcategory: Conference Item - International - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.subject: OpenMP; Offloading; Optimization; LLVM; GPU
dc.title: Efficient Execution of OpenMP on GPUs
dc.type: Conference Object

Files

Original bundle
Listing 1 - 1 of 1
Name: 12381.pdf
Size: 241.23 KB
Format: Adobe Portable Document Format
Description: Full Text