omp0 — a minimal OpenMP implementation
We're going to implement omp0, a subset of the OpenMP runtime library and a source-to-source compiler that transforms OpenMP pragmas into calls to our runtime. We'll target multi-core CPUs using pthreads as the threading backend, and evaluate performance against the reference GNU OpenMP (GOMP) implementation on standard parallel benchmarks.
A key focus will be exploring different synchronization primitive implementations and their cache coherence effects: ticket locks vs. test-and-set vs. MCS locks, centralized vs. tree barriers, and per-thread vs. tree reductions. We'll measure how these choices impact performance at different thread counts and contention levels using hardware performance counters.
Parallel Regions (Fork-Join)
OpenMP follows the fork-join model. When a thread encounters a parallel region, it forks a team of worker threads that execute the region concurrently, then joins back to sequential execution:
#pragma omp parallel
{
// Each thread executes this block
int tid = omp_get_thread_num();
do_work(tid);
}
// Implicit barrier and join here
The runtime maintains a persistent thread pool to amortize pthread_create overhead. Threads sleep when idle and wake on region entry.
Work Scheduling: Static, Dynamic, Guided
Static scheduling partitions iterations at region entry. Thread i gets iterations [i*N/P, (i+1)*N/P). Zero runtime overhead, but poor load balance if iteration costs vary.
Dynamic scheduling uses a shared counter. Threads atomically claim chunks:
while ((start = atomic_fetch_add(&next, chunk)) < N)
for (i = start; i < min(start + chunk, N); i++)
process(i);
Small chunks improve load balance but increase contention on the counter. Guided scheduling starts with large chunks and shrinks them, balancing both.
Barriers
A centralized counter barrier has all threads increment a shared counter and spin until it reaches P:
atomic_inc(&arrive_count); while (arrive_count < num_threads);
This causes O(P) invalidations as each increment bounces the cache line. A sense-reversing barrier alternates a flag to avoid resetting the counter. A tree barrier organizes threads hierarchically: threads first synchronize within groups, then group leaders synchronize, reducing contention to O(log P).
Critical Sections and Mutex Implementations
Test-and-set (TAS) acquires by spinning on an atomic swap:
while (atomic_swap(&lock, 1) == 1); // spin until we get 0
Every spin iteration issues a write, causing cache invalidations across all cores even when the lock is held. Test-and-test-and-set (TTAS) improves this by spinning on a read first:
while (1) {
while (lock); // spin on cached read (no bus traffic)
if (atomic_swap(&lock, 1) == 0) // try to acquire
break;
}
Ticket locks provide FIFO fairness. Each thread takes a ticket number and waits for its turn:
int my_ticket = atomic_fetch_add(&next_ticket, 1); while (now_serving != my_ticket);
When the lock is released (now_serving++), all waiters' cache lines are invalidated even though only one can proceed. MCS locks solve this with a linked list where each thread spins on its own node.
Atomic Operations
The atomic directive maps directly to hardware atomics for simple updates:
#pragma omp atomic
counter++; // becomes: atomic_fetch_add(&counter, 1)

This avoids mutex overhead for single-word updates but still causes cache line bouncing under contention.
Reductions
A critical-section reduction serializes all updates:
#pragma omp critical
global_sum += local_sum; // O(P) serial time
Per-thread reduction gives each thread a private accumulator, combined at the barrier:
partial[tid] = local_sum;
barrier();
if (tid == 0)
for (i = 0; i < P; i++) global_sum += partial[i];
This is O(P) but only the master does the work. A tree reduction parallelizes the combine phase in O(log P) steps, which matters at high thread counts.
The main challenge here is navigating the tradeoffs between correctness, performance, and cache coherence. Each synchronization primitive has multiple implementations with different characteristics under varying contention levels.
For mutexes, TAS is simple but generates excessive bus traffic. TTAS reduces traffic but causes thundering herd on release. Ticket locks are fair but suffer the same invalidation storm. MCS locks are optimal but complex. We want to measure which of these actually matters for typical OpenMP workloads.
For barriers, centralized barriers are simple but O(P) in coherence traffic. Tree barriers reduce traffic but add latency from multiple synchronization phases. The right choice depends on thread count and how often barriers occur.
For reductions, the combining strategy matters at scale. Per-thread arrays with linear combining are simple but serialize the final phase. Tree reductions parallelize combining but require log(P) barrier-like synchronizations.
We'll use hardware performance counters (cache misses, bus transactions) to understand the cache coherence impact of each implementation choice.
Hardware: GHC cluster machines, PSC Bridges-2 for scaling tests
Software: Starting from scratch using pthreads. Will use Clang's lexer for pragma parsing or a simple custom parser.
Plan to achieve:
#pragma omp parallel — thread team creation with persistent pool
#pragma omp for — static and dynamic scheduling
reduction clause — thread-local accumulators with tree reduce
#pragma omp barrier — sense-reversing barrier
#pragma omp critical — mutex-protected regions
Hope to achieve:
private/shared clauses
#pragma omp atomic — map to hardware atomics
firstprivate/lastprivate clauses
guided scheduling
Benchmarks: We'll evaluate omp0 against GOMP on:
parallel region
parallel for, barriers
We chose multi-core x86 CPUs with pthreads because: