omp0 — a minimal OpenMP implementation
We're going to implement omp0, a subset of the OpenMP runtime library and a source-to-source compiler that transforms OpenMP pragmas into calls to our runtime. We'll target multi-core CPUs using pthreads as the threading backend, and evaluate performance against the reference GNU OpenMP (GOMP) implementation on standard parallel benchmarks.
A key focus will be exploring different synchronization primitive implementations and their cache coherence effects: ticket locks vs. test-and-set vs. MCS locks, centralized vs. tree barriers, and per-thread vs. tree reductions. We'll measure how these choices impact performance at different thread counts and contention levels using hardware performance counters.
Parallel Regions (Fork-Join)
OpenMP follows the fork-join model. When a thread encounters a parallel region, it forks a team of worker threads that execute the region concurrently, then joins back to sequential execution:
#pragma omp parallel
{
// Each thread executes this block
int tid = omp_get_thread_num();
do_work(tid);
}
// Implicit barrier and join here
The runtime maintains a persistent thread pool to amortize pthread_create overhead. Threads sleep when idle and wake on region entry.
Work Scheduling: Static, Dynamic, Guided
Static scheduling partitions iterations at region entry. Thread i gets iterations [i*N/P, (i+1)*N/P). Zero runtime overhead, but poor load balance if iteration costs vary.
Dynamic scheduling uses a shared counter. Threads atomically claim chunks:
while ((start = atomic_fetch_add(&next, chunk)) < N)
for (i = start; i < min(start + chunk, N); i++)
process(i);
Small chunks improve load balance but increase contention on the counter. Guided scheduling starts with large chunks and shrinks them, balancing both.
Barriers
A centralized counter barrier has all threads increment a shared counter and spin until it reaches P:
atomic_inc(&arrive_count); while (arrive_count < num_threads);
This causes O(P) invalidations as each increment bounces the cache line. A sense-reversing barrier alternates a flag to avoid resetting the counter. A tree barrier organizes threads hierarchically: threads first synchronize within groups, then group leaders synchronize, reducing contention to O(log P).
Critical Sections and Mutex Implementations
Test-and-set (TAS) acquires by spinning on an atomic swap:
while (atomic_swap(&lock, 1) == 1); // spin until we get 0
Every spin iteration issues a write, causing cache invalidations across all cores even when the lock is held. Test-and-test-and-set (TTAS) improves this by spinning on a read first:
while (1) {
while (lock); // spin on cached read (no bus traffic)
if (atomic_swap(&lock, 1) == 0) // try to acquire
break;
}
Ticket locks provide FIFO fairness. Each thread takes a ticket number and waits for its turn:
int my_ticket = atomic_fetch_add(&next_ticket, 1); while (now_serving != my_ticket);
When the lock is released (now_serving++), all waiters' cache lines are invalidated even though only one can proceed. MCS locks solve this with a linked list where each thread spins on its own node.
Atomic Operations
The atomic directive maps directly to hardware atomics for simple updates:
#pragma omp atomic
counter++; // becomes: atomic_fetch_add(&counter, 1)

This avoids mutex overhead for single-word updates but still causes cache line bouncing under contention.
Reductions
A critical-section reduction serializes all updates:
#pragma omp critical
global_sum += local_sum; // O(P) serial time
Per-thread reduction gives each thread a private accumulator, combined at the barrier:
partial[tid] = local_sum;
barrier();
if (tid == 0)
for (i = 0; i < P; i++) global_sum += partial[i];
This is O(P) but only the master does the work. A tree reduction parallelizes the combine phase in O(log P) steps, which matters at high thread counts.
The main challenge here is navigating the tradeoffs between correctness, performance, and cache coherence. Each synchronization primitive has multiple implementations with different characteristics under varying contention levels.
For mutexes, TAS is simple but generates excessive bus traffic. TTAS reduces traffic but causes thundering herd on release. Ticket locks are fair but suffer the same invalidation storm. MCS locks are optimal but complex. We want to measure which of these actually matters for typical OpenMP workloads.
For barriers, centralized barriers are simple but O(P) in coherence traffic. Tree barriers reduce traffic but add latency from multiple synchronization phases. The right choice depends on thread count and how often barriers occur.
For reductions, the combining strategy matters at scale. Per-thread arrays with linear combining are simple but serialize the final phase. Tree reductions parallelize combining but require log(P) barrier-like synchronizations.
We'll use hardware performance counters (cache misses, bus transactions) to understand the cache coherence impact of each implementation choice.
Hardware: GHC cluster machines, PSC Bridges-2 for scaling tests
Software: Starting from scratch using pthreads. Will use Clang's lexer for pragma parsing or a simple custom parser.
Plan to achieve:
#pragma omp parallel — thread team creation with persistent pool
#pragma omp for — static and dynamic scheduling
reduction clause — thread-local accumulators with tree reduce
#pragma omp barrier — sense-reversing barrier
#pragma omp critical — mutex-protected regions
Hope to achieve:
private/shared clauses
#pragma omp atomic — map to hardware atomics
firstprivate/lastprivate clauses
guided scheduling
Benchmarks: We'll evaluate omp0 against GOMP on:
parallel region
parallel for, barriers
We chose multi-core x86 CPUs with pthreads because: