3rd OpenMP Users Developer Conference

Conference Archive: 1-2 December 2020

Serving the European OpenMP Community

The 3rd OpenMP Users Conference took place on 1-2 December 2020 as an online event and included a tutorial, several technical talks, a panel discussion and a Q&A session, all aimed at furthering collaboration and knowledge sharing among the growing community of high-performance computing specialists using OpenMP.

Program of Events

Tuesday 1st December – Invited Tutorial

OpenMP for Computational Scientists: From serial Fortran to thousand-way parallelism on GPUs using OpenMP

Presented by: Tom Deakin, Lecturer and Senior Research Associate, Department of Computer Science, University of Bristol, UK

This two-part tutorial will introduce OpenMP 4.5 to Fortran programmers. OpenMP is an open standard with widespread support from compilers and vendors alike. As such, the OpenMP parallel programming model is one viable way of writing performance-portable programs for heterogeneous systems. OpenMP supports C, C++ and Fortran, and in this tutorial we will learn to write OpenMP programs in Fortran. A working knowledge of writing simple Fortran programs for HPC will be required for the course. Those already familiar with OpenMP who wish only to join for the GPU part are welcome to do so.

9:30 – 12:30 OpenMP for Computational Scientists – Part One
In the first part, we’ll introduce shared memory programming for multi-core CPUs using OpenMP. The most common parts of OpenMP will be explained alongside hands-on exercises for attendees to try for themselves. We’ll discuss some important performance optimisations to consider when writing shared memory programs.
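The tutorial itself is taught in Fortran, but the directives are the same in both languages. As a minimal sketch of the shared-memory style covered in Part One, here is a parallel reduction in C syntax (the function name `dot` is our own illustration, not tutorial material):

```c
#include <stddef.h>

/* Dot product of two vectors, parallelised across CPU threads.
 * The reduction(+:sum) clause gives each thread a private partial
 * sum and combines the partials safely when the loop ends. */
double dot(const double *x, const double *y, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Without the reduction clause, concurrent updates to `sum` would race; this pattern, and why it matters for performance, is exactly the kind of optimisation the hands-on exercises explore.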

13:30 – 16:30 OpenMP for Computational Scientists – Part Two
In the second part, we’ll have a whistle-stop tour of the features in OpenMP for writing programs for heterogeneous nodes with GPUs. We’ll walk through the target directive for offloading both data and parallel execution to GPUs. At the end, you’ll be able to write programs using OpenMP for massively parallel GPUs.
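The target directive mentioned above moves both data and execution to an attached device. A minimal C-syntax sketch (the function name `vadd` is our own; when no offload device is configured, the construct falls back to running on the host):

```c
/* Offload a vector update to a GPU. The map clauses describe the
 * data movement: b is copied to the device, a is copied to the
 * device and back again after the loop completes. */
void vadd(double *a, const double *b, int n) {
    #pragma omp target teams distribute parallel for \
        map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}
```

The combined `teams distribute parallel for` construct creates the thousand-way parallelism of the tutorial title: a league of teams across the GPU, with the iterations distributed among them.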

All times are shown in GMT

Wednesday 2nd December – Talks and Panel Discussion

13:30 Introduction

13:35 Automatic Generation of OpenMP for Weather and Climate Applications

The Met Office is developing a new weather and climate model to exploit exascale computing. This new model, named LFRic, is being written using a Domain Specific Language (DSL) API, which allows a domain-specific compiler, PSyclone, to generate parallel code for different programming models, such as OpenMP. The DSL approach allows the science developer to be productive by writing code that looks “like the maths”, while the code generation makes the model portable across different programming models. We will describe how OpenMP is being used for CPU shared-memory parallelism and how accelerators and other architectures can be targeted, and present an analysis of performance.

14:00 LB4OMP: A Load Balancing Library for OpenMP

The OpenMP standard specifies only three loop scheduling techniques. This hinders research on novel scheduling techniques, since a fair comparison with other existing techniques from the literature is not feasible for multithreaded applications using OpenMP, and the three standard techniques may not achieve the highest performance for arbitrary application-system pairs.

We will start the talk by introducing LB4OMP, an extended LLVM OpenMP runtime library that implements fourteen dynamic loop self-scheduling (DLS) techniques in addition to those in the standard OpenMP runtime implementations. With these algorithms, LB4OMP offers improved performance to application-system pairs in the presence of unpredictable variations in the execution environment. LB4OMP also includes performance measurement features that collect detailed information about the execution of parallel loops with OpenMP. These features facilitate understanding the impact of load balancing of OpenMP loops on application performance, and support measuring the parallel loop execution time, each thread’s execution time, the mean and standard deviation of the iteration execution times, and the chunk of iterations self-scheduled by each thread in every scheduling step.

LB4OMP is open source (available at https://github.com/unibas-dmi-hpc/LB4OMP) and can easily be used, and even extended, by OpenMP developers. We will illustrate the use of the scheduling techniques implemented in LB4OMP and the library's features, and present performance results and their analysis for a selection of the experiments we conducted using 1 application and 3 systems. We will show that the newly implemented scheduling techniques outperform the standard ones by up to 13.33% (with the default chunk size). For a given application-system pair, we will use LB4OMP to identify the highest-performing combination of scheduling techniques. We will also show performance improvements that are unachievable with the standard OpenMP scheduling options alone, bridging the gap between the state of the art according to the literature and the state of the practice of load balancing with LB4OMP. We will conclude the talk by discussing potential extensions of this work, such as automatically load balancing OpenMP applications.
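Standard OpenMP already provides the hook that makes this kind of experimentation possible: `schedule(runtime)` defers the choice of scheduling technique to the `OMP_SCHEDULE` environment variable, so the loop can be re-scheduled without recompilation. As a hedged sketch (we assume, without confirming the exact mechanism, that an extended runtime such as LB4OMP can expose its additional techniques through this same hook; the function name `scale` is our own):

```c
/* The scheduling technique for this loop is chosen at run time,
 * e.g. OMP_SCHEDULE="dynamic,16" or OMP_SCHEDULE="guided".
 * With a standard runtime this selects one of the standard kinds;
 * an extended runtime could accept further technique names here. */
void scale(double *a, int n, double factor) {
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        a[i] *= factor;
}
```

Running the same binary under different `OMP_SCHEDULE` settings is how one would compare techniques fairly on a fixed application-system pair.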

14:25 Short Break – Discussion
14:30 Hybrid Parallelisation Challenges in a Complex CFD Application

In the development of parallel applications that combine multiple parallelisation techniques, including third-party parallel libraries, the correct use of OpenMP can be very challenging. We make use of MPI, threads, OpenMP, SIMD, CUDA and SYCL to provide efficient parallelism and the widest hardware support. We will present the challenges we faced in developing zCFD to operate efficiently at scale on multiple architectures.

14:55 Exploring Functional OpenMP Performance Across Arm Based Platforms

In this work we shall discuss the usefulness of OpenMP across several readily available Arm HPC architectures. Two well-known benchmark suites will be considered: the NAS Parallel Benchmarks and the EPCC OpenMP micro-benchmark suite.

The NAS Parallel Benchmarks (https://www.nas.nasa.gov/publications/npb.html) were designed to help evaluate both MPI and OpenMP parallel performance. They consist of eight kernels, representative of those which are used commonly throughout computational fluid dynamics applications. The kernels cover memory access, Conjugate Gradient, Multi-Grid, Discrete Fourier Transforms and Linear Solvers.

The EPCC OpenMP micro-benchmark suite (https://www.epcc.ed.ac.uk/research/computing/performance-characterisation-and-benchmarking/epcc-openmp-micro-benchmark-suite) is intended to measure the overheads of synchronisation, loop scheduling and array operations in the OpenMP runtime library. As such it consists of four kernels: arraybench, schedbench, syncbench and taskbench.

By choosing these benchmarks we can demonstrate the effectiveness of OpenMP run-times for HPC applications across several Arm-based HPC architectures, including the Fujitsu A64FX, the Marvell® ThunderX2® and the Neoverse N1-based AWS Graviton2.

The Arm HPC ecosystem has matured significantly over the last few years, including the number of available compilers and their associated OpenMP run-time libraries. For this work, we will consider GCC, the Arm Compiler for Linux (ACfL), the NVIDIA HPC Software Development Kit (SDK) and the HPE/Cray Compiling Environment (CCE).

For the main results, we will discuss performance for the OpenMP-only variant of the NAS Parallel Benchmarks across the different architectures. In particular we will investigate the effectiveness of the compilers and their OpenMP run-times. Further investigation will be aided by the EPCC OpenMP micro-benchmarks, which allow a closer look at thread-level behaviour.

On the Marvell® ThunderX2® (Figure 1), using 64 OpenMP threads (pinned to the physical cores in node order), we compare the three compilers that generate the most interesting comparison: NVHPC, GCC and ACfL. It is clear that there is no overall winner: GCC wins four benchmarks (CG.C, EP.C, IS.C, LU.C), ACfL wins three (BT.C.BLK, FT.C, MG.C) and NVHPC wins on SP.C.BLK. The performance difference is not solely down to the OpenMP run-times; other compiler optimisations help to varying degrees between benchmarks. However, the compiler’s ability to generate optimised code within OpenMP regions does play an important role.

On the Fujitsu FX700 (HPE/Cray Apollo 80) (Figure 2), this time using 48 OpenMP threads, we compare three compilers: ACfL, GCC and CCE. The first noticeable difference from the previous results is that BT.C.BLK, EP.C, LU.C and SP.C.BLK exhibit much lower Mop/s than on the ThunderX2, whereas CG.C, FT.C, IS.C and MG.C are better (mostly due to the increased memory bandwidth available to each OpenMP thread). This time GCC wins on five benchmarks (BT.C.BLK, IS.C, LU.C, MG.C, SP.C.BLK) and CCE wins on the remaining three (CG.C, EP.C, FT.C). Again, a compiler’s ability to handle OpenMP regions is what helps.

The overall aim of this presentation is to guide a user’s choice of compiler and Arm HPC based platform, given they have a certain algorithm in mind which has already been parallelized with OpenMP.

15:20 Best Practices for OpenMP on NVIDIA GPUs

The OpenMP target constructs extend OpenMP beyond the realm of CPU computing and allow codes to run on a variety of heterogeneous devices. Running on diverse devices, however, requires writing code in a manner that scales across all of them. In this talk we will present an update on NVIDIA’s OpenMP compiler and best practices for writing OpenMP code that performs well on NVIDIA GPUs.
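One widely cited practice for GPU offload, independent of vendor, is keeping data resident on the device across several kernel launches rather than transferring it on every loop. A hedged sketch in C (the kernel and the function name `iterate` are our own illustration, not material from the talk; on a system without an offload device the code falls back to the host):

```c
/* A target data region maps the array onto the device once;
 * the inner target loop then runs each step without paying a
 * host<->device transfer per iteration. The array is copied
 * back when the data region ends. */
void iterate(double *a, int n, int steps) {
    #pragma omp target data map(tofrom: a[0:n])
    for (int s = 0; s < steps; s++) {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < n; i++)
            a[i] = a[i] * 0.5 + 1.0;
    }
}
```

Without the enclosing `target data` region, each `target` construct would implicitly map `a` to and from the device, and the transfer cost could easily dominate the kernel time.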

15:45 Short Break – Discussion
15:50 OpenMP API Version 5.1 – Features and Roadmap

16:05 Panel Discussion and Q&A Session

Members from our panel of experts will share their views on the key challenges facing HPC developers using OpenMP and will then answer questions from those attending and the chair.

16:45 Close