TreeMatch Documentation

V1.0 June 14 2017

- Emmanuel Jeannot (
- François Tessier (


TreeMatch computes a permutation of process ranks according to the hardware
topology and the processes affinity.

For technical details and evaluation see the following papers:

Emmanuel Jeannot and Guillaume Mercier. Near-optimal placement of MPI
processes on hierarchical NUMA architecture. In Euro-Par, Ischia, Italy, August

2) Guillaume Mercier and Emmanuel Jeannot. Improving MPI Applications
Performance on Multicore Clusters with Rank Reordering. In Proceedings of the
16th International EuroMPI Conference, Santorini, Greece, September 2011, LNCS
6960, pp 39-49.

3) Emmanuel Jeannot, Guillaume Mercier, and Francois Tessier. Process Placement
in Multicore Clusters: Algorithmic Issues and Practical Techniques. IEEE
Transactions on Parallel and Distributed Systems, 25(4):993–1002, Avril 2014.

The `mapping' executable

The distribution provides the mapping executable

mapping -t|x <Architecture file[tgt|xml]> -c <Communication pattern file> [-b <binding constraint file>] [-d disable topology optimization] [-v verbose_level] [-m evaluation metric (1: SUM_COM (default), 2: MAX_COM, 3: HOP_BYTE)] [-o oversubscribing factor] [-e force exhaustive search] [-p maximum number of threads used in parallel sections (default : number of cores)] [-g force greedy k-partionniong] [-h display this help]


The architecture is a tree that can be described as :

  • a tgt file (tleaf Scotch format) or,
  • an xml file (HWLOC format)

+ A tleaf describes a tree using the following syntax:

tleaf n a_1 s_1 ... an s_n
where n+1 is the number of levels of the tree, a_i is the arity of level i and s_i is
the communication speed between levels i and i+1

Example of tleaf can be found in: treematch/examples/topologies

+ An xml file can be generated by hwloc through the lstopo command using the '--no-io --merge --of xml' options.

Example of xml files can be found in: treematch/examples/topologies

+ What about the core numbering?
We take into account 2 types of core numbering. The physical one depends on the manufacturer and is very different from one architecture to an other. The XML file generated by hwloc and taken as input by our mapping tool uses this numbering. The logical one numbers each core lineraly with an incremental way. Formally, it follows a depth-first search algorithm carried out on the topology tree and number the leaves.

Communication pattern

The communication pattern file is a matrix that describes the affinity between the processes. The higher the value, the higher the affinity. Processes are  numbered from the top left of the matrix.

Example of communication patterns can be found in: treematch/examples/com_patterns

Binding constraints

The binding constraints is an optional file that describes on which resources the algorithm can map the processes.

3 4 5 7 8 9 10 11
means that only the above resources can be used for the mapping.

Example of binding constraints can be found in: treematch/examples/binding_constraints

Without binding constraints, all the available resources can be used for mapping processes.

Topology optimization

The topology optimization is used to decompose levels of the tree with arity that is not prime into different levels of prime numbers. This is used to speed-up the computation. You can disable it in case you see problems of solution quality.  


The verbose level is an optional argument that describes the level of severity of the displayed messages. We have 6 levels:

NONE     0
/* output in stderr*/
ERROR    2
/* output in stdout*/
INFO     5
DEBUG    6

The default level is 2

Evalutaion metrics

We have three evaluation metrics that is computed by TreeMatch:

  • The sum of all the communications once the perutaion is applied.
  • The maximum of all the communications
  • The hobyte: the product of the sze of all the communcations time their respective number of hops once mapped


We can define an oversubscribing factor X in case you want at most X processes to be mapped on the leaves of the tree. Without setting tsi factor teh default value is 1 (no oversubscribing). This is useful if the leaves represnt core with hypertherading or nodes with several cores.

Optimal search

The internal algorithm can perform an exhistive search to find the best group decompistion when matching processes to subtrees. This can be very long for standard case and this is by default disabled.


Some parts of the program are multithread. By default teh number of threads used in these parts is the number of cores of the node. You can change this value with the -p option


If Scotch is available, the internal k-way partitionning is done using the recursive coarsenning/uncoarsenning tecnhique. You can force the internal greedy partitionner to unsure similar solutions between platfrom where Scotch is not install everywhere.


The output of mapping is process permutation:

TreeMatch: 0,1,8,9,11,10,4,5,6,13,15,2,12,14,7,3
Means that proccess 0 is mapped to core 0, 1 to 1 , 2 to 8, ... and 15 to 3

An example on how to use the mapping executable is in treematch/examples/
mapping -t topologies/128.tgt -c com_patterns/128.mat
mapping -t topologies/128.tgt -c com_patterns/16.mat -b binding_constraints/16.bind
mapping -x topologies/64.xml -c com_patterns/64.mat
mapping -x topologies/64.xml  -c com_patterns/16.mat -b binding_constraints/16.bind

Testing TreeMatch

In the examples directory run the perl script to test the installed
version of TreeMatch

You can pass the path to the mapping executable in the command line in case an
older version is install in your default path.   

The TreeMatch Library

A C library is created that can be linked using -ltreematch

The main functions of the API are:

/************ TreeMatch Public API ************/
/* construct topology from local one using hwloc */
tm_topology_t* tm_get_local_topology_with_hwloc(void);

/* Aletrnatively, load XML or TGT topology */
tm_topology_t *tm_load_topology(char *arch_filename, tm_file_type_t arch_file_type);
   Alternatively, build a synthetic balanced topology.

   nb_levels : number of levels of the topology +1 (the last level must be of cost 0 and arity 0).
   arity : array of arity of the first nb_level (of size nb_levels)
   cost : array of costs between the levels (of size nb_levels)
   core_numbering: numbering of the core by the system. Array of size nb_core_per_node

   nb_core_per_nodes: number of cores of a given node. Size of the array core_numbering

   both arity and cost are copied inside tm_build_synthetic_topology

   The numbering of the cores is done in round robin fashion after a width traversal of the topology.
   for example:
       {0,1,2,3} becomes 0,1,2,3,4,5,6,7...
       {0,2,1,3} becomes 0,2,1,3,4,6,5,7,...

   Example of call to build the 128.tgt file: tleaf 4 16 500 2 100 2 50 2 10

   double cost[5] = {500,100,50,10,0};
   int arity[5] = {16,2,2,2,0};
   int cn[2]={0,1};

   topology = tm_build_synthetic_topology(arity,cost,5,cn,2);

tm_topology_t  *tm_build_synthetic_topology(int *arity, double *cost, int nb_levels, int *core_numbering, int nb_core_per_nodes);
/* load affinity matrix */
tm_affinity_mat_t *tm_load_aff_mat(char *com_filename);
   Alternativelly, build the affinity matrix from a array of array of matrix of size order by order
   For performance reason mat is not copied.
tm_affinity_mat_t * tm_build_affinity_mat(double **mat, int order);
/* Add constraints to toplogy
   Return 1 on success and 0  if the constari,ts id are not compatible withe nodes id */
int tm_topology_add_binding_constraints(char *bind_filename, tm_topology_t *topology);
/* Alternatively, set the constraints from an array.
   Return 1 on success and 0  if the constari,ts id are not compatible withe nodes id

   The array constraints is copied inside tm_topology_set_binding_constraints

int tm_topology_set_binding_constraints(int *constraints, int nb_constraints, tm_topology_t *topology);
/* display arity of the topology */
void  tm_display_arity(tm_topology_t *topology);
/* display the full topology */
void  tm_display_topology(tm_topology_t *topology);
/* Optimize the topology by decomposing arities */
void tm_optimize_topology(tm_topology_t **topology);
/* Manage oversubscribing */
void tm_enable_oversubscribing(tm_topology_t *topology, unsigned int oversub_fact);
/* core of the treematch: compute the solution tree */
tm_tree_t *tm_build_tree_from_topology(tm_topology_t *topology, tm_affinity_mat_t *aff_mat, double *obj_weight, double *com_speed);
/* compute the mapping according to the tree and the core numbering*/
tm_solution_t *tm_compute_mapping(tm_topology_t *topology, tm_tree_t *comm_tree);
/* display the solution*/
double tm_display_solution(tm_topology_t *topology, tm_affinity_mat_t *aff_mat, tm_solution_t *sol, tm_metric_t metric);
/* display RR, packed, MPIPP*/
void tm_display_other_heuristics(tm_topology_t *topology, tm_affinity_mat_t *aff_mat, tm_metric_t metric);
/* free TM strutures*/
void tm_free_topology(tm_topology_t *topology);
void tm_free_tree(tm_tree_t *comm_tree);
void tm_free_solution(tm_solution_t *sol);
void tm_free_affinity_mat(tm_affinity_mat_t *aff_mat);
/* manage verbosity of TM*/
void tm_set_verbose_level(unsigned int level);
unsigned int  tm_get_verbose_level(void);
/* finalize treematch :check memory if necessary, and free internal variables (thread pool)*/
void tm_finalize();

Ask for exhaustive search: may be very long
   new_val == 0 : no exhuative search
   new_val != 0 : exhuative search
void tm_set_exhaustive_search_flag(int new_val);
int tm_get_exhaustive_search_flag();

Ask for greedy k-partitionning even if scotch is available
   new_val == 0 : no greedy k-partitionning
   new_val != 0 : greedy k-partitionning
void tm_set_greedy_flag(int new_val);
int tm_get_greedy_flag();

/* Setting the maximum number of threads you want to use in parallel parts of TreeMatch */
void tm_set_max_nb_threads(unsigned int val);


/* core of the treematch: compute the solution tree */
The core of the algorithm is:
tm_tree_t *tm_build_tree_from_topology(tm_topology_t *topology, tm_affinity_mat_t *aff_mat, double *obj_weight, double *com_speed);
   - topology is the is the topology/architecture treematch deals with
   - aff_matis the communication matrix
   - obj_weight can be set to NULL if not used, or is of length the N
   - comm_speed can be set to NULL if not used, or is of length the number of levels of the topology tree
This last two variables are used to scale the communication matrix for each step
of the algorithm:
new_comm[i][j] = 1e-4*comm[i][j]/speed-fabs(avg-(obj_weight[i]+obj_weight[j])/2);
For all i and j in [0,N[, where avg is the average of obj_weight and speed is the value of comm_speed[l]
where l is the level of the considered step.

Once the solution is build, treematch has to compute the permutation:
tm_solution_t *tm_compute_mapping(tm_topology_t *topology, tm_tree_t *comm_tree);

tm_solution_t is a strucure that contains mainly two fields:
- int *sigma
- int **k

sigma[i] is such that  process i is mapped on core sigma[i]
k[i] is such that core i executes process k[i][j] (0<=j<<=oversubscribing factor - 1)

size of sigma is the number of processes (nb_objs)
size of k is the number of cores/nodes   (nb_compute_units)
size of k[i] is the number of process we can execute per nodes (1 if no oversubscribing)

We must have the number of process<=number of cores

k[i] == NULL if no process is mapped on core i

An example on how to use most these functions can be found  in: