MassiveThreads User’s Guide

Copyright 2010-2014 Jun Nakashima (Read COPYRIGHT for detailed information.)

Copyright 2010-2014 Kenjiro Taura (Read COPYRIGHT for detailed information.)


1 Getting Started



2 MassiveThreads Library

TODO: write about the library API itself.



3 Higher-Level Interfaces



3.1 Higher Level Interfaces Overview

The MassiveThreads API described so far is still low-level and a bit burdensome as a parallel programming interface. MassiveThreads therefore also provides higher-level APIs that are easier and more convenient for programmers.

One is what we call the TBB-compatible interface, which provides a subset of the functionality of Intel Threading Building Blocks (TBB). It not only provides a TBB-compatible interface, but also allows you to switch among various lightweight thread libraries underneath that same interface. Currently supported libraries include MassiveThreads, Qthreads, Nanos++, and what we call a dummy scheduler. The last one elides task parallel primitives (i.e., executes everything serially).

The other interface is what we call the task parallel switcher, with which you can write a single program that runs on top of an even wider set of task parallel systems, including OpenMP, Cilk, and TBB.

Besides providing a uniform API over various runtime systems, these interfaces serve another important purpose: they allow you to trace your task parallel programs with DAG Recorder, a tracing tool described later in this manual (see DAG Recorder). By programming against these APIs, rather than the native API of the respective runtime system, you are freed from the burden of manually instrumenting your programs for tracing. To this end, we also provide headers that facilitate the instrumentation of OpenMP and Cilk. These headers do not make OpenMP and Cilk any more convenient or uniform; they simply make instrumenting OpenMP and Cilk programs easier.

Here is a summary of the available APIs and runtime systems.

API                      Runtime system   Header file               Flags
TBB-compatible           None (dummy)     mtbb/task_group.h         -DTO_SERIAL
TBB-compatible           Intel TBB        mtbb/task_group.h         -DTO_TBB -ltbb
TBB-compatible           MassiveThreads   mtbb/task_group.h         -lmyth-native
TBB-compatible           Qthreads         mtbb/task_group.h         -DTO_QTHREAD -lqthread
TBB-compatible           Nanos++          mtbb/task_group.h         -DTO_NANOX -lnanox-c
OpenMP-like              OpenMP           tpswitch/omp_dr.h
Cilk-like                Cilk             tpswitch/cilk_dr.cilkh
CilkPlus-like            CilkPlus         tpswitch/cilkplus_dr.h
Task Parallel Switcher   None (dummy)     tpswitch/tpswitch.h       -DTO_SERIAL
Task Parallel Switcher   Intel TBB        tpswitch/tpswitch.h       -DTO_TBB -ltbb
Task Parallel Switcher   MassiveThreads   tpswitch/tpswitch.h       -DTO_MTHREAD_NATIVE -lmyth-native
Task Parallel Switcher   Qthreads         tpswitch/tpswitch.h       -DTO_QTHREAD -lqthread
Task Parallel Switcher   Nanos++          tpswitch/tpswitch.h       -DTO_NANOX -lnanox-c
Task Parallel Switcher   OpenMP           tpswitch/tpswitch.h       -DTO_OMP
Task Parallel Switcher   Cilk             tpswitch/tpswitch.h       -DTO_CILK
Task Parallel Switcher   CilkPlus         tpswitch/tpswitch.h       -DTO_CILKPLUS


3.2 TBB-Compatible Interface



3.2.1 TBB-Compatible Interface Overview

As of this writing, it supports the task_group class, the parallel_for template function, and the parallel_reduce template function. See the respective sections of the TBB reference manual for these classes. We will see examples using the task_group class below.



3.2.2 Installing TBB-Compatible Interface

The TBB-compatible interface is distributed as a part of MassiveThreads, so you do not need to do anything beyond the normal MassiveThreads installation procedure to install it.

After installation, the files constituting the API are installed as:

PREFIX/include/mtbb/task_group.h
PREFIX/include/mtbb/parallel_for.h
PREFIX/include/mtbb/parallel_reduce.h

where PREFIX is the path you specified with --prefix at the configure command line. Note that they are under the mtbb directory, instead of the tbb directory as in the original TBB.



3.2.3 Writing Programs Using TBB-Compatible Interface

Using the TBB-compatible interface is a lot like using regular TBB. You include mtbb/{task_group,parallel_for,parallel_reduce}.h instead of tbb/{task_group,parallel_for,parallel_reduce}.h, and use the namespace mtbb instead of the namespace tbb.

Here is a simple example (bin_mtbb.cc).

#include <stdio.h>
#include <stdlib.h>
#include <mtbb/task_group.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    mtbb::task_group tg;
    long x, y;
    tg.run([=,&x] { x = bin(n - 1); });
    y = bin(n - 1);
    tg.wait();
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x = bin(n);
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

I hope you agree that the changes are minimal. The original TBB version would look like this (the only differences are the name of the include file and the namespace prefix of the task_group class).

#include <stdio.h>
#include <stdlib.h>
#include <tbb/task_group.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    tbb::task_group tg;
    long x, y;
    tg.run([=,&x] { x = bin(n - 1); });
    y = bin(n - 1);
    tg.wait();
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x = bin(n);
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

Without DAG Recorder, you would compile bin_mtbb.cc as follows.

$ g++ --std=c++0x bin_mtbb.cc -lmyth-native

Remark 1: --std=c++0x is given to use the C++ lambda expression passed to tg.run, proposed for C++0x and standardized in C++11. GCC has supported it since version 4.5, when one of the command line options --std=c++0x, --std=gnu++0x, --std=c++11, or --std=gnu++11 is supplied. If your GCC does not support it, you can instead pass any callable object (any object supporting operator()). We use lambda expressions for brevity in this manual.
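For example, under the assumption that mtbb::task_group::run, like its TBB counterpart, accepts and copies any such callable, the recursive step of the earlier example could be written without lambdas roughly as follows (bin_call is a hypothetical helper class, not part of the library):

#include <mtbb/task_group.h>

long bin(int n);                        /* forward declaration */

/* hypothetical callable: bin_call(&x, n)() performs x = bin(n - 1) */
struct bin_call {
  long * x; int n;
  bin_call(long * x_, int n_) : x(x_), n(n_) {}
  void operator()() const { *x = bin(n - 1); }
};

long bin(int n) {
  if (n == 0) return 1;
  else {
    mtbb::task_group tg;
    long x, y;
    tg.run(bin_call(&x, n));            /* instead of a lambda */
    y = bin(n - 1);
    tg.wait();
    return x + y;
  }
}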

Remark 2: Depending on your configuration, you may need to add -I, -L, and -Wl,-R options to the command line. For example, if you installed MassiveThreads under /home/you/local (i.e., gave /home/you/local to --prefix of the configure command), the command line will be:

$ g++ --std=c++0x -I/home/you/local/include -L/home/you/local/lib -Wl,-R/home/you/local/lib bin_mtbb.cc -lmyth-native


3.2.4 Choosing Schedulers Beneath the TBB-Compatible Interface

With the above command, you get a program that uses the TBB-compatible API with MassiveThreads as the underlying scheduler. Roughly speaking, task_group’s run method creates a MassiveThreads thread via myth_create, and its wait method waits, via myth_join, for all threads associated with the task group object to finish.
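The following sketch illustrates that mapping. It is not the actual mtbb implementation (sketch_task_group, run_trampoline, and the header path are hypothetical, and real code must handle functor lifetimes and errors more carefully); it only shows how run and wait can be reduced to myth_create and myth_join:

#include <myth/myth.h>   /* MassiveThreads API; header path may vary by version */
#include <cstddef>
#include <vector>

/* hypothetical helper: run a heap-allocated callable, then free it */
template <typename Callable>
static void * run_trampoline(void * arg) {
  Callable * c = static_cast<Callable *>(arg);
  (*c)();
  delete c;
  return 0;
}

class sketch_task_group {  /* hypothetical name, NOT the real mtbb::task_group */
  std::vector<myth_thread_t> threads;
public:
  template <typename Callable>
  void run(const Callable & c) {
    /* copy the callable to the heap so it outlives the caller's frame,
       then create a MassiveThreads thread that executes it */
    threads.push_back(myth_create(run_trampoline<Callable>, new Callable(c)));
  }
  void wait() {
    /* join every thread created through this task group */
    for (std::size_t i = 0; i < threads.size(); i++)
      myth_join(threads[i], 0);
    threads.clear();
  }
};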

The mtbb/task_group.h file allows you to use threading libraries other than MassiveThreads by defining a compile time flag TO_XXX. Currently, you can choose from the original Intel TBB, MassiveThreads, Qthreads, Nanos++, or none. The flags to give are listed below.

Runtime system   Flag
Intel TBB        -DTO_TBB
MassiveThreads   -DTO_MTHREAD_NATIVE (or nothing)
Qthreads         -DTO_QTHREAD
Nanos++          -DTO_NANOX
None             -DTO_SERIAL

The last one, None, elides all tasking primitives; run(c) executes c() serially and wait() is a no-op.

To use mtbb/task_group.h with the scheduler of your choice, you of course need to install the respective scheduler and link your program against it, as in the examples below.
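For example, the compile commands for bin_mtbb.cc would look like this (a sketch following the flag table in the overview; it assumes each library is installed where the compiler and linker can find it, possibly with the -I/-L/-Wl,-R options of Remark 2 added):

$ g++ --std=c++0x bin_mtbb.cc -lmyth-native              # MassiveThreads (default)
$ g++ --std=c++0x -DTO_TBB bin_mtbb.cc -ltbb             # Intel TBB
$ g++ --std=c++0x -DTO_QTHREAD bin_mtbb.cc -lqthread     # Qthreads
$ g++ --std=c++0x -DTO_NANOX bin_mtbb.cc -lnanox-c       # Nanos++
$ g++ --std=c++0x -DTO_SERIAL bin_mtbb.cc                # serial (dummy)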



3.3 Task Parallel Switcher

The TBB-compatible interface unifies various schedulers under the same, TBB-compatible API. The task parallel switcher goes one step further by defining an API that can be mapped onto OpenMP and Cilk as well.

OpenMP, Cilk, and TBB’s task_group interfaces are all conceptually very similar; after all, they all define ways to create tasks and to wait for outstanding tasks to finish.

Yet there are idiosyncrasies that make defining truly uniform APIs difficult.

TODO: detail the following



4 DAG Recorder



4.1 DAG Recorder Overview

DAG Recorder is a tracing tool for analyzing the execution of task parallel programs. It records all relevant events in an execution of the program, such as task starts, task creations, and task synchronizations, and stores them in a manner that allows the computational DAG of the execution to be reconstructed.



4.2 Installing DAG Recorder

DAG Recorder is distributed as a part of MassiveThreads, so installing MassiveThreads automatically installs DAG Recorder too. DAG Recorder does not internally depend on MassiveThreads in any way, however; you can, for example, use DAG Recorder to analyze TBB or OpenMP programs.

After installation, the files directly visible to the user are the following:

PREFIX/include/dag_recorder.h
the DAG Recorder library under PREFIX/lib, linked with -ldr

where PREFIX is the path you specified with --prefix at the configure command line.

In most cases, you do not have to include dag_recorder.h directly; the TBB-compatible interface and the aforementioned wrappers (omp_dr.h and cilk_dr.h) include it automatically.



4.3 Writing Programs That Use DAG Recorder



4.3.1 Common Basics

Currently, DAG Recorder supports the following task parallel APIs:

the TBB-compatible interface (mtbb/task_group.h and friends)
OpenMP tasks (via tpswitch/omp_dr.h)
MIT Cilk and CilkPlus (via tpswitch/cilk_dr.cilkh and tpswitch/cilkplus_dr.h)
the task parallel switcher (tpswitch/tpswitch.h)

Making your programs ready for DAG Recorder involves replacing the original task parallel primitives with equivalent, instrumented versions. You also need to specify where to start/stop the instrumentation and where to dump the result. We provide header files that make the instrumentation nearly automatic, or at least quite mechanical. What exactly you need to do depends on the programming model you choose; the details are given in the following subsections.



4.3.2 Using DAG Recorder with TBB-Compatible Interface

If you are using the TBB-compatible interface (see Writing Programs Using TBB-Compatible Interface), the instrumentation is the most straightforward and least intrusive. Let’s say you have a program that includes mtbb/task_group.h, such as this.

#include <stdio.h>
#include <stdlib.h>
#include <mtbb/task_group.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    mtbb::task_group tg;
    long x, y;
    tg.run([=,&x] { x = bin(n - 1); });
    y = bin(n - 1);
    tg.wait();
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x = bin(n);
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

Instrumentation is turned on simply by giving -DDAG_RECORDER=2 on the command line. All else you need to do is insert calls to dr_start, dr_stop, and dr_dump at appropriate places, like this (bin_mtbb_dr.cc).

#include <stdio.h>
#include <stdlib.h>
#include <mtbb/task_group.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    mtbb::task_group tg;
    long x, y;
    tg.run([=,&x] { x = bin(n - 1); });
    y = bin(n - 1);
    tg.wait();
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  dr_start(0);
  long x = bin(n);
  dr_stop();
  dr_dump();
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

As you can see, you should insert:

a call to dr_start right before the part of the execution you want to trace,
a call to dr_stop right after it, and
a call to dr_dump where you want the recorded data written to files.

dr_start takes as its argument a pointer to a dr_options data structure, which may be zero. See Controlling the Behavior of DAG Recorder for the options you can specify.

Here are the command lines to compile this program, with and without DAG Recorder:

$ g++ --std=c++0x bin_mtbb_dr.cc -DDAG_RECORDER=2 -ldr -lmyth-native
$ g++ --std=c++0x bin_mtbb_dr.cc -lmyth-native

The reason you set DAG_RECORDER to “2” is historical; there was a version 1, which has since become obsolete.

You can switch to other schedulers in the way already described (see Choosing Schedulers Beneath the TBB-Compatible Interface). For example, you get the original TBB scheduler with the following command line.

g++ --std=c++0x bin_mtbb_dr.cc -DTO_TBB -DDAG_RECORDER=2 -ldr -ltbb 


4.3.3 Using DAG Recorder with OpenMP

OpenMP uses directives (pragma omp task and pragma omp taskwait) to express task parallel programs, and almost always uses pragma omp parallel and pragma omp single (or pragma omp master) to enter a task parallel section. Here is a program equivalent to our example, written in regular OpenMP.

#include <stdio.h>
#include <stdlib.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    long x, y;
#pragma omp task shared(x)
    x = bin(n - 1);
#pragma omp task shared(y)
    y = bin(n - 1);
#pragma omp taskwait
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
#pragma omp parallel 
#pragma omp single
  {
    long x = bin(n);
    printf("bin(%d) = %ld\n", n, x);
  }
  return 0;
}

We need to instrument these pragmas, for which we define equivalent macros (not pragmas) in the header file tpswitch/omp_dr.h. This is not as straightforward as we would hope, but we do not know of any good mechanism to introduce a new pragma or to redefine existing pragmas.

tpswitch/omp_dr.h defines the following macros, whose names mirror the pragmas they replace:

pragma_omp_task(options, statement)
pragma_omp_taskwait
pragma_omp_parallel_single(options, statement)

Without DAG Recorder, they are expanded into equivalent OpenMP pragmas in an obvious manner:
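For example, the first task creation of the program below would expand roughly as follows (a sketch; the actual macros presumably emit the pragma with the C99 _Pragma operator and, when DAG Recorder is on, additionally wrap the statement with recording calls):

/* what you write */
pragma_omp_task(shared(x),
                x = bin(n - 1));

/* roughly what it becomes without DAG Recorder */
#pragma omp task shared(x)
x = bin(n - 1);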

So, here is the DAG Recorder-ready version of the above program.

#include <stdio.h>
#include <stdlib.h>
#include <tpswitch/omp_dr.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    long x, y;
    pragma_omp_task(shared(x), 
		    x = bin(n - 1));
    pragma_omp_task(shared(y), 
		    y = bin(n - 1));
    pragma_omp_taskwait;
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  pragma_omp_parallel_single(, {
      dr_start(0);
      long x = bin(n);
      dr_stop();
      printf("bin(%d) = %ld\n", n, x);
      dr_dump();
    });
  return 0;
}

This source code can be compiled with and without DAG Recorder.
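For example, with GCC the command lines would look like this (a sketch; bin_omp_dr.c is a hypothetical name for the file above, and the -DDAG_RECORDER=2 and -ldr flags are as in the TBB-compatible case):

$ gcc -fopenmp bin_omp_dr.c                          # without DAG Recorder
$ gcc -fopenmp -DDAG_RECORDER=2 bin_omp_dr.c -ldr    # with DAG Recorder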



4.3.4 Using DAG Recorder with Cilk and CilkPlus

There are two versions of Cilk: the original MIT Cilk and CilkPlus. The former is implemented as a source-to-source translator (cilkc) and is strictly a C extension (C++ is not supported). The latter is natively supported by the Intel C++ Compiler and by GCC version 4.9 or later, and supports both C and C++ for writing serial parts. DAG Recorder supports both Cilk and CilkPlus. Hereafter, “Cilk” means the original MIT Cilk.

CilkPlus uses the _Cilk_spawn and _Cilk_sync statements. Here is our example in CilkPlus.

#include <stdio.h>
#include <stdlib.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    long x, y;
    x = _Cilk_spawn bin(n - 1);
    y = _Cilk_spawn bin(n - 1);
    _Cilk_sync;
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x;
  x = _Cilk_spawn bin(n);
  _Cilk_sync;
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

Alternatively, you can include <cilk/cilk.h> and use the more human-friendly cilk_spawn and cilk_sync keywords.

#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>

long bin(int n) {
  if (n == 0) return 1;
  else {
    long x, y;
    x = cilk_spawn bin(n - 1);
    y = cilk_spawn bin(n - 1);
    cilk_sync;
    return x + y;
  }
}

int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x;
  x = cilk_spawn bin(n);
  cilk_sync;
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

Cilk uses the spawn and sync statements to create and synchronize tasks. Here is our example in Cilk.

#include <stdio.h>
#include <stdlib.h>

cilk long bin(int n) {
  if (n == 0) return 1;
  else {
    long x, y;
    x = spawn bin(n - 1);
    y = spawn bin(n - 1);
    sync;
    return x + y;
  }
}

cilk int main(int argc, char ** argv) {
  int n = atoi(argv[1]);
  long x;
  x = spawn bin(n);
  sync;
  printf("bin(%d) = %ld\n", n, x);
  return 0;
}

There is a subtle but important difference between Cilk and CilkPlus. In Cilk, a function that spawns a task must be explicitly marked as a cilk procedure with the cilk keyword at its declaration; and once a procedure is marked as a cilk procedure, it cannot be called with the regular function call syntax; it must always be spawned. That is, in our example, the following is prohibited.

int x = bin(n);

It must instead be written as

int x;
x = spawn bin(n);
sync;

As a result, the enclosing function must also be marked as a cilk procedure.

Whether you use Cilk or CilkPlus, the modifications necessary to make these programs DAG Recorder-ready are summarized as follows.

  1. include tpswitch/cilk_dr.cilkh (Cilk) or tpswitch/cilkplus_dr.h (CilkPlus)
  2. enclose all spawn, cilk_spawn, and _Cilk_spawn statements with the spawn_(...) macro. E.g.,
    y = cilk_spawn f(x);
    

    should be rewritten to:

    spawn_(y = cilk_spawn f(x));
    
  3. replace all sync and cilk_sync statements with sync_ and cilk_sync_, respectively.
  4. any function that spawns a task needs to begin with cilk_begin. This indicates the beginning of a task. If you forget this, a compilation error should result, complaining “no such variable __cilk_begin__”.
  5. replace return statements with either cilk_return(val) or cilk_void_return, depending on whether the return statement returns a value. This indicates the end of a task.

    (TODO: we wish to fix this) If you forget this, compilation succeeds, but DAG Recorder fails.

  6. (TODO: get rid of this restriction) As of this writing, if you insert cilk_begin into a function, that function always needs to be spawned. That is, such a function cannot be called with the normal function call syntax. This is prohibited in MIT Cilk anyway and is flagged as a compilation error. It is up to you to obey this rule when you use CilkPlus, which allows task-spawning functions to be called serially without the spawn keyword; if you violate it, there are no compilation errors and DAG Recorder will be confused.

Here is the modified CilkPlus program.

#include <stdio.h>
#include <stdlib.h>
#include <tpswitch/cilkplus_dr.h>

long bin(int n) {
  cilk_begin;
  if (n == 0) cilk_return(1);
  else {
    long x, y;
    spawn_(x = cilk_spawn bin(n - 1));
    spawn_(y = cilk_spawn bin(n - 1));
    cilk_sync_;
    cilk_return(x + y);
  }
}

int main(int argc, char ** argv) {
  cilk_begin;
  int n = atoi(argv[1]);
  dr_start(0);
  long x;
  spawn_(x = cilk_spawn bin(n));
  cilk_sync_;
  dr_stop();
  printf("bin(%d) = %ld\n", n, x);
  dr_dump();
  cilk_return(0);
}

And here is the Cilk version. Following the rules above, it includes tpswitch/cilk_dr.cilkh and uses cilk_begin, spawn_, sync_, and cilk_return.

#include <stdio.h>
#include <stdlib.h>
#include <tpswitch/cilk_dr.cilkh>

cilk long bin(int n) {
  cilk_begin;
  if (n == 0) cilk_return(1);
  else {
    long x, y;
    spawn_(x = spawn bin(n - 1));
    spawn_(y = spawn bin(n - 1));
    sync_;
    cilk_return(x + y);
  }
}

cilk int main(int argc, char ** argv) {
  cilk_begin;
  int n = atoi(argv[1]);
  dr_start(0);
  long x;
  spawn_(x = spawn bin(n));
  sync_;
  dr_stop();
  printf("bin(%d) = %ld\n", n, x);
  dr_dump();
  cilk_return(0);
}

This source code can be compiled with and without DAG Recorder.

Instrumenting Cilk or CilkPlus programs is admittedly more burdensome than instrumenting OpenMP or TBB. The main reason is that Cilk’s spawn statement and CilkPlus’s cilk_spawn statement create a task that executes the body of a procedure, rather than an entire procedure call statement, so we need to mark the beginning of the called procedure as the beginning of the task. That is why you need to insert cilk_begin. The difference between the two is subtle; consider the following example.

spawn f(g(x));

In this Cilk code, the evaluation of g(x) is not performed by the spawned task, so there is no way to mark the beginning of the task by tweaking macros that receive the entire procedure call statement.

In contrast, a similar TBB code:

tg.run([=] { f(g(x)); });

spawns a task that performs f(g(x)) entirely. To make matters even simpler, the task spawning primitive is just another method, rather than builtin syntax, so we can transparently instrument it by providing another class that implements the run method.



4.3.5 Using DAG Recorder with tpswitch.h

Just give -DDAG_RECORDER=2 and the respective linker options (e.g., -lmyth-native -ldr -lpthread) on the command line.
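For example, with MassiveThreads as the underlying scheduler, the command line would look like this (a sketch; your_prog.cc is a hypothetical program written against tpswitch/tpswitch.h, with the flags taken from the table in the overview):

$ g++ --std=c++0x -DTO_MTHREAD_NATIVE -DDAG_RECORDER=2 your_prog.cc -lmyth-native -ldr -lpthread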

TODO: more detailed and reader-friendly description.



4.4 Running Your Programs with DAG Recorder



4.4.1 Basics of Running Your Programs with DAG Recorder

Once you have obtained an executable compiled and linked with DAG Recorder, you can run it just like a normal program.

$ ./bin_mtbb_dr 20
bin(20) = 1048576

You will find the following three files generated in the current directory: 00dr.dag (the recorded DAG), 00dr.stat (the summary stat file), and 00dr.gpl (the parallelism profile for gnuplot).

Setting the environment variable DR=0 runs the program with DAG Recorder turned off.

$ DR=0 ./bin_mtbb_dr 20
bin(20) = 1048576

It still imposes a small overhead (essentially, a global variable lookup plus a branch) on each tasking primitive. We believe this overhead is rarely an issue, but if you want to eliminate it completely, compile the program without DAG_RECORDER=2.



4.4.2 Controlling the Behavior of DAG Recorder

The behavior of DAG Recorder can be controlled either from within the program or via environment variables. You can pass a pointer to a dr_options structure to dr_start; this pointer has been 0 in the examples shown so far. If the argument to dr_start is null (zero), options can be set via environment variables. We illustrate how both methods work below.

First, the environment variables. Run the program with the environment variable DR_VERBOSE set to 1, and you will see the list of environment variables and their values printed by dr_start, as well as messages about the files generated by dr_dump.

$ DR_VERBOSE=1 ./bin_mtbb_dr 10
DAG Recorder Options:
dag_file_prefix (DAG_RECORDER_DAG_FILE_PREFIX,DR_PREFIX) : 00dr
dag_file_yes (DAG_RECORDER_DAG_FILE,DR_DAG) : 1
stat_file_yes (DAG_RECORDER_STAT_FILE,DR_STAT) : 1
gpl_file_yes (DAG_RECORDER_GPL_FILE,DR_GPL) : 1
dot_file_yes (DAG_RECORDER_DOT_FILE,DR_DOT) : 0
text_file_yes (DAG_RECORDER_TEXT_FILE,DR_TEXT) : 0
gpl_sz (DAG_RECORDER_GPL_SIZE,DR_GPL_SZ) : 4000
text_file_sep (DAG_RECORDER_TEXT_FILE_SEP,DR_TEXT_SEP) : |
dbg_level (DAG_RECORDER_DBG_LEVEL,DR_DBG) : 0
verbose_level (DAG_RECORDER_VERBOSE_LEVEL,DR_VERBOSE) : 1
chk_level (DAG_RECORDER_CHK_LEVEL,DR_CHK) : 0
uncollapse_min (DAG_RECORDER_UNCOLLAPSE_MIN,DR_UNCOLLAPSE_MIN) : 0
collapse_max (DAG_RECORDER_COLLAPSE_MAX,DR_COLLAPSE_MAX) : 1152921504606846976
node_count_target (DAG_RECORDER_NODE_COUNT,DR_NC) : 0
prune_threshold (DAG_RECORDER_PRUNE_THRESHOLD,DR_PRUNE) : 100000
alloc_unit_mb (DAG_RECORDER_ALLOC_UNIT_MB,DR_ALLOC_UNIT_MB) : 1
pre_alloc_per_worker (DAG_RECORDER_PRE_ALLOC_PER_WORKER,DR_PRE_ALLOC_PER_WORKER) : 0
pre_alloc (DAG_RECORDER_PRE_ALLOC,DR_PRE_ALLOC) : 0
dag_recorder: writing dag to 00dr.dag
dr_pi_dag_dump: 28648 bytes
dag recorder: writing stat to 00dr.stat
dag recorder: writing parallelism to 00dr.gpl
bin(10) = 1024

The uppercase names within parentheses are the environment variables you might want to set. They start with the prefix DAG_RECORDER_, and many of them have a shorter alias beginning with DR_. The list will change as our experience accumulates. Below is a list of frequently used variables (consider the other variables still experimental).

DR_DAG_PREFIX (default: 00dr)
    Prefix of all the files below.

DR_DAG (default: 1)
    Generate a DAG file (DR_DAG_PREFIX.dag) if 1.

DR_STAT (default: 1)
    Generate a summary stat file (DR_DAG_PREFIX.stat) if 1.

DR_GPL (default: 1)
    Generate a parallelism profile file (DR_DAG_PREFIX.gpl) if 1.

DR_DOT (default: 0)
    Generate a DAG file in graphviz format (DR_DAG_PREFIX.dot) if 1; it can be converted into viewable images by the dot command. You need the graphviz package installed on your system.

DR_TEXT (default: 0)
    Generate a human-readable, text-formatted DAG file (DR_DAG_PREFIX.txt) if 1. Specify this when you want to inspect the raw data.

DR_TEXT_SEP (default: |)
    The field delimiter used in the text-formatted DAG file.

DR_VERBOSE (default: 0)
    Set the verbosity level.

DR_COLLAPSE_MAX (default: a huge value)
    Determines how aggressively DAG Recorder collapses subgraphs. Specifically, the value is an upper bound on the time (in clock cycles) that any single node resulting from collapsing a subgraph can span. In other words, any single node in the DAG represents either a true single node (i.e., one that performed no tasking primitives) or a subgraph that took less than this number of clocks. The default is a huge value, which lets the system collapse subgraphs as much as it can. Set it to a small value to guarantee a minimum resolution.

Let us move on to the second method, controlling the behavior from within your program. As briefly noted above, this is done by passing a pointer to a dr_options structure to dr_start. See PREFIX/include/dag_recorder.h for the full list of fields. Note that the field names are also displayed in the DR_VERBOSE=1 output above. For example, the line:

dag_file_prefix (DAG_RECORDER_DAG_FILE_PREFIX,DR_PREFIX) : 00dr

tells you that dag_file_prefix is the field to set to change the prefix of the generated files.

When you change some of these fields, you will usually want to leave the other fields at their default values. dr_options_default(opts) is the function that fills the structure pointed to by opts with default values, overridden by any environment variable settings. So, the typical sequence is:

dr_options opts[1];
dr_options_default(opts);
opts->dag_file_prefix = ...;
opts->whatever_you_want_to_change = ...;
   ...
dr_start(opts);
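Here is a complete sketch of that sequence. It assumes, based on the DR_VERBOSE=1 output above, that dag_file_prefix is a string field and dot_file_yes an integer flag; check PREFIX/include/dag_recorder.h for the actual types, and note that the prefix "mytrace" is just a hypothetical example.

#include <dag_recorder.h>

int main() {
  dr_options opts[1];
  dr_options_default(opts);          /* defaults + environment overrides */
  opts->dag_file_prefix = "mytrace"; /* files become mytrace.dag, etc. */
  opts->dot_file_yes = 1;            /* also emit a graphviz .dot file */
  dr_start(opts);
  /* ... the part of the program to trace ... */
  dr_stop();
  dr_dump();
  return 0;
}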


4.5 dag2any: DAG to Any Data Converter

TODO: describe dag2any.



4.6 Viewing Recorded Data

Tools for viewing DAG Recorder data are still ad hoc; ideally, there should be a single tool that views the same data from many angles. As of this writing, there is instead an interactive tool that shows the parallelism profile, plus a set of files derived from the DAG data that can be viewed with standard tools such as gnuplot. We will continue to develop tools that analyze DAG data from many angles and to unify their user interfaces.



4.6.1 Viewing Parallelism Profile with gnuplot

By default, a program traced by DAG Recorder generates a parallelism profile as a gnuplot file, which you can view simply with gnuplot. A parallelism profile looks like this.

[Figure: a parallelism profile of a TBB run, plotted with gnuplot]

The horizontal axis represents time (in clock cycles) and the vertical axis the number of tasks in various states, indicated by colors.

For example, consider the following program:

#include <mtbb/task_group.h>

void a(); void b(); void c(); void d(); /* some serial functions */

int main() {
  mtbb::task_group tg;
  a();
  tg.run([=] { b(); });
  c();
  tg.wait();
  d();
}

and the DAG resulting from executing this program.

[Figure: the DAG resulting from an execution of this program]

The label of an edge indicates how the node it points to is classified once its source node has finished. For example, node q is counted as “create” from the time p finished (i.e., the task entered tg.run([=] { b(); })) to the time q started.

Node p” becomes available when both q and p’ have finished, so how it is classified depends on which of them finished last. If q finished later than p’, it is classified as “end”; otherwise, as “wait cont”.

In most cases, your primary interest will be in “running.” If this stays roughly constant at the number of workers used, it means that many cores are maximally utilized (as long as the operating system runs each worker on a distinct core). If that is not the case, that is, if there are intervals in which the number of running tasks is lower than the number of workers, you should check whether there are enough available tasks.

If there are no or few available tasks in an interval, your program did not have enough parallelism in that interval, so you may have to consider increasing it. In some cases you have simply left a section of your code not parallelized at all, which is easily visible in the parallelism profile. The drview tool helps you spot the source code locations where this happens; see Viewing DAG file with drview.

If, on the other hand, available tasks are abundant, it means the runtime system, for whatever reasons, was not able to fully exploit available parallelism. There are several possible reasons for this.



4.6.2 Visualizing the DAG via graphviz

You can dump the DAG captured by DAG Recorder by setting the environment variable DAG_RECORDER_DOT_FILE (or DR_DOT) to the file name you want it written to. The file is a text file in the graphviz dot format, which can then be transformed into various graphics formats by the graphviz tool dot.

Since a program easily generates a DAG with millions of nodes or more, this feature is useful only for short runs.

For example, you can see the DAG by any SVG viewer by the following procedure.

$ DR_DOT=00dr.dot ./a.out
$ dot -Tsvg -o 00dr.svg 00dr.dot 
$ any-svg-viewer 00dr.svg

See graphviz package and dot manual for further information about the dot tool.

When you use this feature to visualize the true topology of the DAG your program generated, you may want to turn off the subgraph contraction algorithm that DAG Recorder implements to save space. To this end, set the DR_COLLAPSE_MAX environment variable to zero.

$ DR_COLLAPSE_MAX=0 DR_DOT=00dr.dot ./a.out
$ dot -Tsvg -o 00dr.svg 00dr.dot 
$ any-svg-viewer 00dr.svg


4.6.3 Understanding Stat File

By default, a program traced by DAG Recorder generates a small text file that summarizes various pieces of information about the execution. You can view it with any text editor. Here is an example.

create_task           = 1048575
wait_tasks            = 1048575
end_task              = 1048576
work (T1)             = 1313026836
delay                 = 9031849743
no_work               = 11285973
critical_path (T_inf) = 91285263
n_workers (P)         = 4
elapsed               = 2589040638
T1/P                  = 328256709.000
T1/P+T_inf            = 419541972.000
T1/T_inf              = 14.384
greedy speedup        = 3.130
observed speedup      = 0.507
observed/greedy       = 0.162
task granularity      = 9601.938
interval granularity  = 3200.645
dag nodes             = 5242877
materialized nodes    = 351
compression ratio     = 0.000067
end-parent edges:
 266182 7 7 1
 1 253506 16 2
 0 8 280326 5
 1 4 9 248486
create-child edges:
 266204 0 0 0
 0 253527 0 0
 0 0 280342 0
 0 0 0 248502
create-cont edges:
 266187 7 6 4
 2 253514 9 2
 3 7 280329 3
 4 3 0 248495
wait-cont edges:
 266183 0 1 0
 0 253531 0 0
 1 0 280361 0
 0 0 0 248498
other-cont edges:
 0 0 0 0
 0 0 0 0
 0 0 0 0
 0 0 0 0
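For reference, the derived lines in this file can be reproduced from the raw counters (here T1 is the total work, T_inf the critical path length, and P the number of workers, following the usual work/span notation):

T1/P             = 1313026836 / 4                   = 328256709
T1/P + T_inf     = 328256709 + 91285263             = 419541972
T1/T_inf         = 1313026836 / 91285263            = 14.384   (average parallelism)
greedy speedup   = T1 / (T1/P + T_inf)              = 3.130
observed speedup = T1 / elapsed
                 = 1313026836 / 2589040638          = 0.507
observed/greedy  = 0.507 / 3.130                    = 0.162

The greedy speedup is the speedup a greedy scheduler is guaranteed to achieve against the classic T1/P + T_inf execution time bound, so an observed/greedy ratio close to 1 means the runtime did about as well as such a scheduler could. The small value in this example indicates that most of the elapsed time went into overheads (cf. the large delay figure).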


4.6.4 Viewing DAG file with drview

drview is a tool that shows the parallelism profile of an execution and allows you to zoom into any interval within it. In this way, it helps you pinpoint the tasks that were executing when parallelism was low.

Prerequisites: drview is a Python script that relies on the matplotlib and gtk libraries. Please make sure you can import the respective Python modules (matplotlib and gtk).

To use drview, you first need to convert the .dag file generated by DAG Recorder into SQLite3 format using the dag2any tool described above. You then pass the resulting SQLite3 file to drview.

TODO: We are planning to improve this crude interface so that you can give a .dag file directly to drview.

$ dag2any 00dr.dag 
writing sqlite3 to 00dr.sqlite
basics:  ........................................
nodes:   ........................................
edges:   ........................................
strings: ........................................
committing
$ drview 00dr.sqlite

This will bring up the user interface window.

BUG: The initial configuration of panes is far from satisfactory. Please adjust their sizes manually by grabbing borders between panes. I am still trying to figure out how to configure their sizes.

After manually adjusting pane sizes, you will obtain something like this.

[Figure: drview screenshot after resizing the panes]

On the leftmost pane, you see the parallelism profile, the same information you can see in the gnuplot-formatted parallelism profile; see Viewing Parallelism Profile with gnuplot.

The center pane shows the list of DAG nodes executed. Each row represents a group of nodes that share the same start and end positions. The rows are ordered by the total number of cycles spent in the group of tasks. If you double-click a row, the right pane shows the source code of the corresponding location. By clicking somewhere in the “start” or “end” column, you make the source code pane display the group’s start or end position, respectively.

The most useful feature of this tool is that you can zoom into an interval of interest in the parallelism pane. Hold down the left mouse button and drag out a rectangular region in the parallelism pane, and the parallelism and task panes are redrawn to reflect the tasks executed in the selected interval. This way, you can easily find the source locations responsible for low parallelism.



4.7 Querying Recorded Data