Library-based solutions for algorithms with complex patterns of parallelism

Carlos Hugo González Vázquez

Advisor: Basilio B. Fraguela

Introduction

Evolution of microprocessor technology.

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Herb Sutter, Dr. Dobb's Journal, 30(3), March 2005

Classes of parallelism

Process interaction

  • Shared memory

  • Distributed memory

Problem decomposition

  • Data parallel

  • Task Based

Data structures

  • Regular

  • Irregular

Purpose of the thesis

To study complex parallel programming patterns, abstract their components, and develop solutions that ease the implementation of programs that use the patterns.

Tools

C++

Tools

Index

  1. Divide-and-conquer

  2. Amorphous data-parallelism

  3. Dependent tasks

  4. Conclusions and future work

Divide-and-conquer parallel pattern

parallel_recursion

Divide-and-conquer

Components

subdivision

base case

reduction

Skeletons

Definition

High-level parallel programming models for parallel and distributed computing. Algorithmic skeletons take advantage of common programming patterns to hide the complexity of parallel and distributed applications. Starting from a basic set of patterns (skeletons), more complex patterns can be built by combining the basic ones.

Skeletons

Properties

Template for:

parallel_recursion

parallel_recursion interface.

parallel_recursion

parallel_recursion pseudocode.

parallel_recursion

parallel_recursion

Example of usage.

fib with TBB

Fibonacci implementation using the Threading Building Blocks.

Evaluation

Name Description Arity Assoc
fib recursive computation of 43rd Fibonacci number 2 Yes
merge merge two sorted sequences of 100 million integers each 2 -
qsort quicksort of 10 million integers 2 -
nqueens N Queens solution count in 14 × 14 board var Yes
treeadd add values in binary tree with 24 levels 2 Yes
bisort sort balanced binary tree with 22 levels 2 No
health 2000 simulation steps in 4-ary tree with 6 levels 4 No
tsp traveling salesman problem on binary tree with 23 levels 2 No
System with 4 Intel Xeon hexa-core 2.40GHz E7450 CPUs

icpc compiler v11.0, optimization level 3.

Evaluation

Programming effort.

Evaluation

Cyclomatic number.

Evaluation

Source Lines Of Code.

Evaluation

Speedups relative to a sequential implementation.

Evaluation

Best-case performance difference relative to TBB and OpenMP implementations.

Amorphous data parallelism

parallel_domain_proc

Amorphous data parallelism

Definition

Given a set of active nodes and an ordering on active nodes, amorphous data-parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints.

Amorphous Data-parallelism in Irregular Algorithms, Pingali et al.

Example

Delaunay Mesh Refinement algorithm.

Example

Generic algorithm main loop

Common amorphous data parallel algorithm main loop.

Domains

The properties of a workitem are defined on a domain, that can be used to classify each workitem

Domains

Uses

Decomposition

Interface

Required Worklist interface.
Required Domain interface.
Required Operation interface.

Interface

Entry point of the library

parallel_domain_proc main function.

Evaluation

Name Description
Boruvka Minimal spanning tree through successive edge-contraction
Delaunay Mesh Refinement Triangulation of a set of points that fulfills quality constraints
Spanning Tree Simple spanning tree of an unweighted graph
Graph Labelling Identify separated clusters in a graph
System with 12 AMD Opteron cores at 2.2 GHz

icpc v12, optimization level 3.

Evaluation

Example of an algorithm implemented with our library.

Evaluation

% of programmability complexity compared with the serial implementations.

Evaluation

Speedups relative to optimized serial implementations.

Evaluation

% of out-of-domain elements for 8 cores and 16 bottom level subdomains.

Configuration capabilities

Configuration capabilities

Redirect

False

Each thread processes its assigned subdomain from the start

True

Threads with active workitems start processing. New workitems are sent to idle threads

Configuration capabilities

Over-decomposition

Create more subdomains than threads to make better use of the scheduler. The Threading Building Blocks scheduler implements task-stealing, so it can profit from the excess parallelism.

Configuration capabilities

Clustered Domain

Configuration capabilities

Clustered Tree Domain

Configuration capabilities

DomainND

Domain comparison

Comparison of the speedups of different domains with vector or list worklist containers. Lists use a custom pool allocator.

Overdecomposition

Performance of different levels of overdecomposition for 16 cores.

Dependent Tasks

DepSpawn

Introduction

Concepts

Introduction

Function analysis and risks

Dependency tree

Dependency tree

Formal model

Definitions

T is requested when spawn is called.

T is an ancestor of U if T requests U or an ancestor of U.

T is a descendant of U if U is an ancestor of T.

T logically precedes U if it is invoked before U in the sequential execution.

T dynamically precedes U if T is requested before U.

T has a dependency with U if they both access a memory position that one or both of them modify.

Formal model

Definitions

A task T waits for the completion of task U if:

 
  1. U dynamically precedes T, U is not an ancestor of T, and T has a dependency with U or

  2. U is a descendant of a task that fulfills condition 1.

Example

Dependencies

Tasks-inside-tasks

Dependencies

Overlaps

Dependencies

Arrays

Array support

Blitz++

Array support

Arrays are supported by a modified Array class from Blitz++.

Manual synchronization

Evaluation

Name Description
Barnes-Hut N-Body simulation Simulates a dynamical system of bodies under the influence of physical forces
LU decomposition Factorizes a matrix as a lower triangular matrix and an upper triangular matrix
Cholesky decomposition Factorizes a matrix as a lower triangular matrix and its conjugate transpose
Sylverster equations resolution Solves the matricial equation AX + XB = C on X
System with 2 Intel Xeon E5620 quad-core processors

g++ 4.6, optimization level 3

GotoBLAS2 for numeric computations.

Evaluation

Comparison of the programmability metrics using DepSpawn relative to OpenMP.

Evaluation

Speedups of the different benchmarks relative to a sequential implementation.

Conclusions

Conclusions

Future work

Publications

  1. Carlos H. González, Basilio B. Fraguela, Enhancing and Evaluating the Configuration Capability of a Skeleton for Irregular Computations, 23rd Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 119–127, 2015

  2. Carlos H. González, Basilio B. Fraguela, An Algorithm Template for Domain-Based Parallel Irregular Algorithms, International Journal of Parallel Programming, 42 (6), pp. 948–967, 2014

  3. Carlos H. González, Basilio B. Fraguela, A framework for argument-based task synchronization with automatic detection of dependencies, Parallel Computing, 39 (9), pp. 445–489, 2013

  4. Carlos H. González, Basilio B. Fraguela, An Algorithm Template for Domain-Based Parallel Irregular Algorithms Proceedings of the International Symposium on High-level Parallel Programming and Applications (HLPP2013), 2013

  5. Carlos H. González, Basilio B. Fraguela, A framework for argument-based task synchronization, Proceedings of the 16th Workshop on Compilers for Parallel Computing (CPC’12), 2012

  6. Carlos H. González, Basilio B. Fraguela, An algorithm template for parallel irregular algorithms, Proceedings of the ACACES 2011, pp. 163–166., 2011

  7. Carlos H. González, Basilio B. Fraguela, A generic algorithm template for divide-and-conquer in multicore systems, Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications 2010 (HPCC’10), pp. 79–88, 2010

  8. Carlos H. González, Basilio B. Fraguela, Una plantilla genérica para el patrón de paralelismo divide-y-vencerás en sistemas multinúcleo, Actas das XXI Jornadas de Paralelismo (JP’10), pp. 19–26, 2010