OpenMP

Saurabh Mishra

Introduction to OpenMP

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports cross-platform shared-memory multiprocessing programming in C, C++, and Fortran on most platforms, including our own HPC. The shared-memory programming model is based on the concept of threads: private data can only be accessed by the thread that owns it.
Different threads can follow different control flows through the same program, and each thread has its own program counter. Typically one thread runs per CPU/core, although some hardware supports multiple threads per core.

PROGRAMMING API

OpenMP is designed for multi-processor/multi-core, shared-memory machines; the underlying architecture can be UMA or NUMA shared memory. It is an Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism. It comprises three primary API components:

Compiler Directives
Runtime Library Routines
Environment Variables

OpenMP compiler directives are used for various purposes, as the sketch after this list illustrates:

Spawning a parallel region
Dividing blocks of code among threads
Distributing loop iterations between threads
Serializing sections of code
Synchronization of work among threads
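
As a minimal illustrative sketch (not taken from the article), the fragment below uses one directive from each of these categories; the loop bound and variable names are arbitrary.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int i, total = 0;

    #pragma omp parallel                     /* spawn a parallel region */
    {
        #pragma omp for reduction(+:total)   /* distribute loop iterations between threads */
        for (i = 0; i < 100; i++)
            total += i;

        #pragma omp single                   /* serialise this section: executed by one thread only */
        printf("loop done, reported once by thread %d\n", omp_get_thread_num());

        #pragma omp barrier                  /* synchronise: all threads wait here */
    }
    printf("total = %d\n", total);
    return 0;
}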

PROGRAMMING MODEL

Within the shared-memory model we use the concept of a thread that can share memory with all the other threads. Threads also have the following properties (a short sketch follows the list):

Private data can only be accessed by the thread that owns it.
Each thread can run concurrently with the other threads, but it runs asynchronously, so beware of race conditions.
Typically there is one thread per processor core, but there may be more if the hardware supports it (e.g. hyperthreading).
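
A minimal sketch of these ideas (illustrative only, not from the article) is shown below: my_id is private, so every thread gets its own copy, while the results array is shared; the array size of 64 is an assumed upper bound on the thread count.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int results[64] = {0};   /* shared: visible to every thread              */
    int my_id;               /* listed as private below: one copy per thread */

    #pragma omp parallel private(my_id) shared(results)
    {
        my_id = omp_get_thread_num();    /* each thread sees a different value   */
        results[my_id] = my_id * my_id;  /* threads write to different elements, */
                                         /* so they never touch the same data    */
    }

    printf("thread 0 stored %d\n", results[0]);
    return 0;
}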

THREAD SYNCHRONIZATION

As already mentioned, threads run asynchronously: each thread executes program instructions independently of the others. This makes for a very flexible system, but you have to be very careful to perform operations on shared variables in the correct order. For example, if thread 1 reads a variable before thread 2 has written to it, thread 1 gets a stale or incorrect value. Similarly, if different threads update a shared variable concurrently, one of the updates may be overwritten. To avoid this, either make the threads work on independent data (e.g. different parts of an array) or add some form of synchronization to your code so that threads cannot interfere with each other's updates.
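
As an illustration of the update problem (a sketch, not from the article), the fragment below increments a shared counter from every thread; without the atomic directive two threads could read the same old value and one increment would be lost.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int counter = 0;    /* shared by all threads */

    #pragma omp parallel
    {
        for (int i = 0; i < 100000; i++)
        {
            /* Without this directive two threads can read the same old
               value and both write back "old + 1", losing an update.   */
            #pragma omp atomic
            counter++;
        }
    }

    /* With the atomic protection this always prints nthreads * 100000. */
    printf("counter = %d\n", counter);
    return 0;
}

A critical section or a reduction clause (covered later) would work equally well here.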

FIRST THREAD PROGRAM

The most basic C program looks like this:

#include <stdio.h>
int main()
{
printf("hello world\n");
return 0;
}

To thread this we must tell the compiler which parts of the program to turn into threads:

#include <omp.h>
#include <stdio.h>
int main()
{
#pragma omp parallel
{
printf("hello ");
printf("world\n");
}
return 0;
}

Let us look at the extra components that make this a parallel, threaded program:

  • We have an OpenMP include file (#include <omp.h>)
  • We use the #pragma omp parallel directive, which tells the compiler that the following region within the { } is going to be executed by multiple threads

To compile this we use the command:

$ gcc -fopenmp myprogram.c -o myprogram ( for the gcc compiler), or

$ icc -fopenmp myprogram.c -o myprogram (for the Intel compiler)

And when we run this we would get something like the following

$ ./myprogram
hello hello world
world
hello hello world
world

The output is not very consistent, but remember that the threads are all running independently, so their output can interleave in any order. Since execution is asynchronous, you have to be very careful about communication between different threads of the same program.

SECOND THREADED PROGRAM

Although the previous program is threaded, it does not represent a real-world example:

#include<omp.h>
#include<stdio.h>
int main()
{
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
pooh(ID, A); /* pooh() is assumed to be defined elsewhere; each thread works on A */
}
}

Here each thread runs the same code independently; the only difference is the OpenMP thread ID passed to each call of pooh(ID, A). All threads wait at the closing brace of the parallel region before continuing (an implicit synchronization barrier). This program requests 4 threads from the underlying operating system, but that request is not guaranteed: you are given whatever the scheduler is willing to provide, which can cause serious problems for programs that rely on a fixed number of threads. You therefore need to ask the OpenMP library (at runtime) for the number of threads actually obtained. This is done with the following code:

#include<omp.h>
#include<stdio.h>
int main()
{
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
int nthrds = omp_get_num_threads(); /* number of threads actually obtained */
pooh(ID, A);
}
}

Each thread calls pooh(ID,A) for ID = 0 to nthrds-1.
This program hard-codes the requested number of threads to 4, which is not good programming practice: you have to recompile every time you want to change it. A better approach is to set the OMP_NUM_THREADS environment variable and remove the omp_set_num_threads(4) call.

$ export OMP_NUM_THREADS=4
$ ./myprogram

PARALLEL LOOPS

Loops are the primary source of parallelism in many applications. If the loop iterations are independent (can be executed in any order), the iterations can be split into different threads. OpenMP has native calls to do this efficiently.

#pragma omp parallel
{
#pragma omp for
for (loop=0;loop<N;loop++)
{
do_threaded_task(loop);
}
}

This is a much cleaner method and allows the compiler to optimize for you automatically (unless told otherwise). The loop variable loop is private to each thread by default. All threads must also wait at the end of the parallel loop before execution continues past the end of this region. An OpenMP shortcut is to combine the parallel and for pragmas into a single parallel for directive; this just makes the code easier to read.

#define MAX 100

double data[MAX];
int loop;
#pragma omp parallel for
for (loop=0;loop< MAX; loop++)
{
data[loop] = process_data(loop);
}

There is a side effect of threading called false sharing that can make code scale poorly. If independent data items used by different threads sit in the same cache line, the cache line will "slosh" back and forth between threads on every update, forcing it to be reloaded from memory each time. One solution is to pad the array so that elements used by different threads fall on different cache lines; another is to give each thread a whole cache line to itself. A better approach is to rewrite the program to avoid the effect altogether, as in the pi example further below.
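
As a rough sketch of the padding idea (not code from the article; the 64-byte cache line, the 8-thread cap, and the function name are assumptions), each thread's partial sum can be kept on its own cache line:

#include <omp.h>

#define CACHE_LINE 64                 /* assumed cache-line size in bytes          */
#define MAX_THREADS 8                 /* assumed upper bound on the thread count   */

/* Each thread updates only its own row, so no two threads share a cache line. */
static double partial[MAX_THREADS][CACHE_LINE / sizeof(double)];

double padded_sum(const double *data, int n)
{
    double total = 0.0;
    for (int t = 0; t < MAX_THREADS; t++)
        partial[t][0] = 0.0;             /* reset all padded slots */

    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nthrds = omp_get_num_threads();
        for (int i = id; i < n; i += nthrds)
            partial[id][0] += data[i];   /* only element 0 of each padded row is used */
    }

    for (int t = 0; t < MAX_THREADS; t++)
        total += partial[t][0];
    return total;
}

The scalar-accumulation version shown next avoids the wasted memory entirely, which is why it is usually preferred.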

#include <omp.h>
#include <stdio.h>
static long num_steps = 100000; double step;

#define NUM_THREADS 2
int main ()
{
int nthreads; double pi=0.0;
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);

#pragma omp parallel
{
int i, id, nthrds; double x, sum;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();

if (id == 0)
nthreads = nthrds; /* record how many threads we actually got */
for (i=id, sum=0.0;i< num_steps; i=i+nthrds)
{
x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
#pragma omp critical /* one thread at a time adds its partial sum */
pi += sum * step;
}
printf("pi = %f using %d threads\n", pi, nthreads);
return 0;
}

Here each thread builds a local scalar sum and accumulates its partial result into pi inside a critical section. Because there are no shared arrays, there is no false sharing, and this solution scales much better than an array-based design. Since this pattern occurs so often in loops, OpenMP also provides a reduction clause.

double ave=0.0, A[MAX]; 
int i;

#pragma omp parallel for reduction (+:ave)
for (i=0;i< MAX; i++)
{
ave += A[i];
}
ave = ave/MAX;

Other reduction operators are provided as well, including subtraction, multiplication, min, max, and the bitwise and logical operators.
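
For instance (an illustrative sketch, not from the article), the max operator, available since OpenMP 3.1, follows the same pattern; A and MAX are the same hypothetical names used above:

double biggest = 0.0;   /* assumes the data are non-negative */
int i;

#pragma omp parallel for reduction(max:biggest)
for (i = 0; i < MAX; i++)
{
    if (A[i] > biggest)
        biggest = A[i];
}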

BARRIERS

#pragma omp parallel
{
somecode();
// all threads wait until all threads get here
#pragma omp barrier
othercode();
}

These are very useful (and sometimes essential) to ensure that all threads reach the same point before continuing. However, they come at a cost in efficiency, because every thread has to sit idle until the slowest one arrives at the barrier.
The implied barrier at the end of a parallel for loop may not always be needed, and can be removed with the nowait clause:

#pragma omp parallel
{
int id = omp_get_thread_num();
#pragma omp for nowait
for(i=0;i<N;i++)
{
B[i]=big_calc2(C, i);
}
A[id] = big_calc4(id); // because of the nowait, each thread starts this as soon as it finishes its loop iterations
}

PERFORMANCE THREADING

Avoid opportunities for false sharing, as it places a hard limit on scaling. Make sure the work justifies the overhead of creating threads: parallelizing only a small for loop does not justify the overhead and can actually increase processing time. The optimal schedule selection may also vary from system to system.
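
As an example of schedule selection (a sketch with hypothetical function names, not from the article), the schedule clause controls how loop iterations are handed out to threads:

/* Static: iterations are divided into chunks of 16 and assigned round-robin
   up front; lowest overhead when every iteration costs about the same.     */
#pragma omp parallel for schedule(static, 16)
for (i = 0; i < N; i++)
    work_even(i);

/* Dynamic: threads grab the next chunk of 16 as they finish one; better
   load balance when iteration costs vary, but more scheduling overhead.    */
#pragma omp parallel for schedule(dynamic, 16)
for (i = 0; i < N; i++)
    work_uneven(i);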

C EXAMPLE

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>

/* compile with gcc -o test2 -fopenmp test2.c */

int main(int argc, char** argv)
{
int i = 0;
int size = 20;
int* a = (int*) calloc(size, sizeof(int));
int* b = (int*) calloc(size, sizeof(int));
int* c;

for ( i = 0; i < size; i++ )
{
a[i] = i;
b[i] = size-i;
printf("[BEFORE] At %d: a=%d, b=%d\n", i, a[i], b[i]);
}

#pragma omp parallel shared(a,b) private(c,i)
{
c = (int*) calloc(3, sizeof(int));

#pragma omp for
for ( i = 0; i < size; i++ )
{
c[0] = 5*a[i];
c[1] = 2*b[i];
c[2] = -2*i;
a[i] = c[0]+c[1]+c[2];

c[0] = 4*a[i];
c[1] = -1*b[i];
c[2] = i;
b[i] = c[0]+c[1]+c[2];
}

free(c);
}

for ( i = 0; i < size; i++ )
{
printf("[AFTER] At %d: a=%d, b=%d\n", i, a[i], b[i]);
}
}

FORTRAN LANGUAGE

program omp_par_do
implicit none

integer, parameter :: n = 100
real, dimension(n) :: dat, result
integer :: i

!$OMP PARALLEL DO
do i = 1, n
result(i) = my_function(dat(i))
end do
!$OMP END PARALLEL DO

contains

function my_function(d) result(y)
real, intent(in) :: d
real :: y

! do something complex with data to calculate y
end function my_function

end program omp_par_do

TIPS FOR PROGRAMMING

  • Typing the wrong sentinel (such as !OMP or #pragma opm) usually does not produce an error message.
  • Private variables are not initialized on entry to a parallel region.
  • You may run out of stack space if you have large private data structures; the stack size of threads other than the master thread can be controlled with the OMP_STACKSIZE environment variable.
  • Write code that also works without OpenMP; the _OPENMP macro is defined whenever the code is compiled with the OpenMP switch (see the sketch after this list).
  • The overhead of starting a parallel region is typically tens of microseconds, depending on the compiler, the hardware, and the number of threads.
  • It can be difficult to tune the chunk size for static or dynamic schedules, because the optimal chunk size depends strongly on the number of threads.
  • Make sure your timer actually measures wall-clock time: use omp_get_wtime().
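
To illustrate the _OPENMP and omp_get_wtime() tips (a sketch, not from the article; the loop is arbitrary), the same source can be built with or without OpenMP and timed with the wall clock:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>       /* only needed when building with -fopenmp */
#endif

int main(void)
{
    long i, n = 100000000;
    double sum = 0.0;

#ifdef _OPENMP
    double start = omp_get_wtime();          /* wall-clock time in seconds */
#endif

    #pragma omp parallel for reduction(+:sum)   /* silently ignored without -fopenmp */
    for (i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);

#ifdef _OPENMP
    printf("sum = %f in %f seconds\n", sum, omp_get_wtime() - start);
#else
    printf("sum = %f (serial build)\n", sum);
#endif
    return 0;
}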

COMPILATION

The program would be compiled in the following way (the Intel compiler is optionally available too):

For C
[username@login01 ~]$ module add gcc/10.2.0
[username@login01 ~]$ gcc -o test2 -fopenmp test2.c

For Fortran
[username@login01 ~]$ module add gcc/10.2.0
[username@login01 ~]$ gfortran -o test2 -fopenmp test2.f90

MODULES AVAILABLE

The following modules are available for OpenMP:

module add gcc/10.2.0 (GNU compiler)

module add intel/2018 (Intel compiler)

USAGE EXAMPLES

  • For OpenMP programs, add the line export OMP_NUM_THREADS=<number of threads> to your job script, because SLURM cannot determine how many threads you need before runtime:
#!/bin/bash

#SBATCH -J openmpi-single-node
#SBATCH -N 1
#SBATCH --ntasks-per-node 28
#SBATCH -o %N.%j.%a.out
#SBATCH -e %N.%j.%a.err
#SBATCH -p compute
#SBATCH --exclusive

echo $SLURM_JOB_NODELIST

module purge
module add gcc/10.2.0

export I_MPI_DEBUG=5
export I_MPI_FABRICS=shm:tmi
export I_MPI_FALLBACK=no

export OMP_NUM_THREADS=28

/home/user/CODE_SAMPLES/OPENMP/demo
[username@login01 ~]$ sbatch demo.job
Submitted batch job 289552

HYBRID OPENMP/MPI CODES

This programming model runs on a single node, but it can also be combined with MPI in a hybrid model. An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and the Message Passing Interface (MPI), such that OpenMP handles parallelism within a (multi-core) node while MPI handles parallelism between nodes. There have also been efforts to run OpenMP on software distributed shared memory systems. There are four possible performance reasons for mixed OpenMP/MPI codes (a minimal sketch follows the list):

Replicated data
Poorly scaling MPI codes
Limited MPI process numbers
MPI implementation not tuned for SMP clusters
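
As a minimal hybrid sketch (illustrative only, not from the article; it assumes an MPI installation and is built with something like mpicc -fopenmp), MPI is initialized with a threading level and each rank then opens its own OpenMP parallel region:

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* Ask for FUNNELED: only the master thread of each rank makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI handles parallelism between nodes (one rank per node, say),
       while OpenMP handles parallelism within each node.             */
    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}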
