**Predictor. Detailed design**

September, 2000

*Last edited 27.12.00*

**Contents**

**1 Introduction**

**2 Predictor
implementation**

2.1 Representation of the program as a hierarchy of intervals

2.2 Program execution characteristics on each processor

2.3 Main characteristics and their components

2.4 Source data for Predictor

3.1 General principles of simulation

3.2 Ordinary functions

3.3 Interval delimiter functions

3.4 Input/output functions

3.5 Create/delete object functions

3.6 Data and resources distribution functions

3.7 Functions of collective operation initialization

3.8 Functions of collective operation execution

3.9 Parallel loop functions

3.10 Unknown functions

**4
Estimation of remote data access overhead**

4.1 Main terms and definitions

4.2 Estimation of array redistribution exchanges

4.3 Estimation of distributed array edge exchanges

4.4 Estimation of reduction operation exchanges

4.5 Estimation of exchanges during remote array buffer loading

**Appendix
1. Link from field name in output HTML-file to field name in
structure
_IntervalResult**

**1 Introduction**

The **Predictor** is intended for performance analysis and performance debugging of DVM-programs without using a real parallel computer (access to which is usually limited or complicated).

The **Predictor** is a system for processing the trace information gathered by the Run-Time library (Lib-DVM) during the program execution on a workstation; it consists of two major components.

The performance of parallel programs on multiprocessor computers with distributed memory is determined by the following major factors:

- Program parallelism degree: the share of parallel calculations in the total volume of calculations.
- Balance of processor load during parallel calculations.
- Time needed for execution of interprocessor communications.
- Degree of overlapping of interprocessor communications with calculations.

The **Predictor** allows the user to obtain the above and other characteristics and to estimate the quality of the parallel program.

There are three major stages of the **Predictor** work:

- read the trace: process information about the interval structure, the support system function call sequence and nesting, the input and output function parameters used for simulation, and the execution time of each call;
- simulate the program execution on the basis of the program execution structure obtained at the previous stage, and calculate for each interval the program execution characteristics described in 2.2;
- write the characteristics into an HTML file.

**2.1 Representation of the program as a hierarchy of intervals**

For detailed program efficiency analysis the user can split the program into intervals and obtain performance characteristics for each interval.

The execution of the program can be represented as the execution of a sequence of alternating sequential and parallel intervals (groups of statements). By default, the whole program is considered as one interval. The user can also define intervals by means of the C-DVM and Fortran DVM languages. There is also a compilation mode in which the following loops are declared as intervals:

- all parallel loops;
- all sequential loops, containing parallel loops;
- all sequential loops.

(Constraint: when the **Predictor** is used, the integer expression associated with an interval must not appear inside a parallel loop.)

The user can also split any interval into smaller intervals or unite neighboring intervals (in execution order) into a new one, i.e. present the program as a hierarchy of intervals of several levels (the whole program is the interval of the highest level, level 0).

The mechanism of splitting the program into intervals serves for a more detailed analysis of the program behavior during its execution. When looking through the results produced by the **Predictor**, the user can set the depth of detail, i.e. ignore intervals of prescribed levels.

To simplify the further description we introduce the following notions. An interval is called **simple** if it does not contain other (**nested**) intervals. Intervals including nested intervals are called **composite**. Thus, the whole program is an interval tree: the interval of the highest level (the whole program) is the root of the tree, and the intervals of the lowest level are its leaves.

While the trace information is processed, one of the intervals is the **active** interval (the interval of the lowest level containing the program statement executed at the moment). For the **active** interval the following information is accumulated:

- type of the interval (**IntervalType**);
- number of the interval entries (**EXE_count**);
- source file name (**source_file**);
- line number corresponding to the beginning of the interval (**source_line**).

At this stage the number of communication operations within the interval is calculated:

- the number of input/output operations (**num_op_io**);
- the number of reduction operations (**num_op_reduct**);
- the number of edge exchange operations (**num_op_shadow**);
- the number of remote access operations (**num_op_remote**);
- the number of array redistribution operations (**num_op_redist**).

In the **Predictor** each interval corresponds to an object of “Interval” type which contains all necessary interval characteristics; the interval tree is built during trace processing. Each interval contains a vector of sub-objects of “Processor” type; each element of the vector keeps the characteristics of one processor involved in the interval execution. Thus, after the trace processing stage we have an interval tree with nodes of “Interval” type containing the saved characteristics.

**2.2 Program execution characteristics on each
processor**

“Processor” object contains the following characteristics of the processor involved in the interval execution:

- **Execution_time** – the interval execution time.
- **IO_time** – input/output time.
- **SYS_time** – productive system time.
- **CPU_time** – productive user program time (user program calculations taking into account loop “slicing”).
- **Lost_time** – sum of the times lost due to insufficient parallelism (**Insuff_parallelism**), communications (**Communication**) and processor idle time (**Idle**).
- **Communication** – total communication time. The time of every communication type is extrapolated by the *time extrapolator (RATER)*.
- **Insuff_parallelism** = **Insuff_parallelism_USR** + **Insuff_parallelism_SYS**.
- **Insuff_parallelism_USR** – user program losses because of insufficient parallelism.
- **Insuff_parallelism_SYS** – system losses because of insufficient parallelism.
- **Synchronization** – time lost because of dissynchronization.
- **Time_variation** – time variation of collective operation completion.
- **Idle** – processor idle time: the difference between the maximal interval execution time (over all processors) and the interval execution time on the current processor.
- **Load_imbalance** – processor load imbalance time: the difference between the maximal processor time (**CPU+SYS**) and the corresponding time on the current processor.
- **Overlap** – total overlap time of asynchronous communication operations: the sum of the overlap times of asynchronous input/output operations (**IO_overlap**), reductions (**Reduction_overlap**), edge exchanges (**Shadow_overlap**), remote accesses (**Remote_overlap**) and array redistributions (**Redistribution_overlap**).
- **IO_comm** – total communication time for input/output operations.
- **IO_synch** – losses because of dissynchronization for input/output operations.
- **IO_overlap** – overlap time of asynchronous input/output operations.
- **Wait_reduction** – total communication time for reduction operations.
- **Reduction_synch** – losses because of dissynchronization for reduction operations.
- **Reduction_overlap** – overlap time of asynchronous reduction operations.
- **Wait_shadow** – total communication time for edge exchange operations.
- **Shadow_synch** – losses because of dissynchronization for edge exchange operations.
- **Shadow_overlap** – overlap time of asynchronous edge exchange operations.
- **Remote_access** – total communication time for remote access operations.
- **Remote_synch** – losses because of dissynchronization for remote access operations.
- **Remote_overlap** – overlap time of asynchronous remote access operations.
- **Redistribution** – total communication time for array redistribution operations (redistribute, realign).
- **Redistribution_synch** – losses because of dissynchronization for array redistribution operations.
- **Redistribution_overlap** – overlap time of asynchronous array redistribution operations.

As mentioned above accumulation of these characteristics is performed at the first stage of simulation – at the stage of trace processing; the characteristics are saved in “Processor” object for each interval and for each processor used for the interval execution.

While the trace is processed, the tree of intervals (tree of “Interval” objects) is built; each interval contains a vector of “Processor” objects. The vector size is equal to the number of processors in the root processor topology, but only the processors involved in the interval execution are significant.

**2.3 Main characteristics and their components.**

Processing the trace accumulated by Lib-DVM during the simulated execution of the program allows the **Predictor** to give the user the following characteristics.

**execution time (Execution_time)** – maximal parallel program execution time (over all processors).

**efficiency coefficient (Efficiency)** – the main characteristic of parallel program execution. Two amounts of time must be calculated:

The productive time (**Productive_time**) required for the parallel program execution on a serial computer consists of three parts:

- productive user program time (**Productive_CPU_time**) – time the program spends on user calculations;
- productive system time (**Productive_SYS_time**) – time the program spends in system calls (Lib-DVM calls);
- input/output time (**IO_time**) – time the program spends on input/output.

The total processor time (**Total_time**) is calculated as the product of the execution time and the number of processors.

The efficiency coefficient is the ratio of the productive time to the total processor time.

**lost time (Lost_time)** – the total processor time minus the productive time. The **lost time** has the following components:

**synchronization (Synchronization)** – used for estimating the total potential losses that can arise because of non-simultaneous start of collective operations on different processors. The main reason for these losses is processor load imbalance during execution of parallel loops. To estimate the potential losses caused by the imbalance, a generalized characteristic is given to the user: the imbalance.

**imbalance (Load_imbalance)**. Dissynchronization can occur not only because of imbalance, but also because of differences in the completion times of a collective operation on different processors, connected with its implementation on the specific parallel machine.

**time variation (Time_variation)** of collective operation completion is used by the programmer to evaluate this potential dissynchronization.

**time of overlapping (Overlap)** is an important characteristic showing the degree of overlapping of interprocessor communications with calculations. It is the sum of basic characteristics: the overlap time of reduction operations and the overlap time of shadow edge exchanges.

For more detailed information about the above characteristics see [1].

After the tree of “Interval” objects is built, the method “Integrate” is called for each object. This method calculates the interval characteristics using the corresponding characteristics of the processor vector and the characteristics of the nested intervals. Finally, the HTML file is created according to the interval tree: the characteristics of every interval are written into a special “template” (see the **IntervalTemplate.cpp** file). The correspondence between “Interval” object fields and HTML file fields can be found in Appendix 1.

**2.4 Source data for Predictor**

The following data serve for simulation: the trace of the program execution on one processor, configuration information saved in a file (for example, “**Predictor.par**”), and command line options. To obtain the necessary trace information, the following parameters should be edited in the configuration file **usr.par**:

Is_DVM_TRACE=1; | - trace on; |
FileTrace=1; | - accumulate trace in files; |
MaxIntervalLevel=3; | - maximum level of nested intervals; |
PreUnderLine=0; | - do not underline “call” in trace file; |
PostUnderLine=0; | - do not underline “ret” in trace file; |
MaxTraceLevel=0; | - maximum trace depth for nested functions. |

The parameters **PreUnderLine, PostUnderLine** and **MaxTraceLevel** tell Lib-DVM that it is not necessary to accumulate lines of underscores in the trace and not necessary to trace nested Lib-DVM calls, which gives a much smaller trace file.

**Note**. To run the parallel program on one processor with an explicitly defined processor configuration, or with dynamic set up of allocated processors, it is necessary to define the corresponding “virtual” processor system.

For example, to define 2*2 “virtual” processor system use the following parameter values:

IsUserPS=1; | - use “virtual” processor system definition; |
UserPS=2,2; | - “virtual” processor system topology. |

Let us consider the following trace fragment:

1. call_getlen_ TIME=0.000005 LINE=31 FILE=GAUSS_C.CDV
2. ArrayHandlePtr=951cd0;

…

3. ret_getlen_ TIME=0.000002 LINE=31 FILE=GAUSS_C.CDV
4. Res=4;

Line 1 identifies a Lib-DVM function (**getlen_**). It contains:

- the execution time of the user part of the program (the time interval between the previous Lib-DVM return and the given function call): TIME=0.000005; this time (**call_time**) is summed into the **CPU_time** field;
- the source file line number (LINE=31) containing the Lib-DVM function call; it is transferred to the **source_line** field of the “Interval” structure;
- the source file name (FILE=GAUSS_C.CDV); it is transferred to the **source_file** field of the “Interval” structure.

Line 2 (and possibly lines after it) contains the names and values of the actual function parameters. They are transformed into input numerical values, packed into structures and used for the simulation of every Lib-DVM function.

Line 3 traces information about the Lib-DVM function return. The only value used by the **Predictor** is the time (**ret_time**).

Line 4 contains the function return value; it is used as described for line 2.

Every Lib-DVM function is processed by the **Predictor** in accordance with its own algorithm, though many functions (“unknown” functions and “ordinary” functions) are processed by the same algorithm.

The **Predictor** configuration information is defined in the special file (“**Predictor.par**”), for example:

**// System type = network | transputer
type = network;
// Communication characteristics (mks)
start time = 75;
send byte time = 0.2;
// Comparative processors performance
power = 1.00;
// Topology - optional
topology = {2, 2};**

Lines beginning with “//” are comments. The **topology** parameter defines the virtual processor system topology, i.e. its rank and the size of each dimension. The value **network** of the **type** parameter means that the processor system is a network of workstations, and the value **transputer** means that the dedicated processor system uses a transputer system as the communication network. A linear approximation is used to calculate the time of transmission of **n** bytes:

T = (start time) + n * (send byte time),

where:

**start time** – start time of the data transmission;

**send byte time** - time of 1 byte transmission.

The **power** parameter defines the ratio of the performance of the processor where the **Predictor** is running to the performance of the processor where the parallel program will be executed.

The order of parameters is arbitrary.

The structure of the output HTML file is the same as the interval structure of the program. Every HTML file fragment corresponds to some interval and contains the data characterizing the interval, the integral characteristics of the program execution on the interval, and links to the fragments with information about the nested intervals. The HTML file is thus a tree of intervals with special buttons to traverse the tree in any direction.

**3.1 General principles of simulation**

At the first stage of simulation the call graph of Lib-DVM calls with actual parameters is built. Such a graph is a linear sequence of calls, as nested calls are not accumulated in the trace for the **Predictor**. The call graph is built in three steps:

Trace file -> vector of TraceLine objects -> vector of FuncCall objects.

At the second stage actual simulation of the program execution on the given multiprocessor system is performed.

To calculate the program execution characteristics, Lib-DVM calls are simulated in the same order as they appear in the trace. During the simulation the following auxiliary structures and structure arrays are used:

- structure with information about the processor system;
- array of structures with information about existing abstract machine representations;
- array of structures with information about existing distributed arrays;
- structure with information about the last created parallel loop;
- array of structures with information about existing reduction variables;
- array of structures with information about existing reduction groups;
- array of structures with information about existing shadow edge renewal groups;
- array of structures with information about reduction operations already started but not finished;
- array of structures with information about shadow edge renewal operations already started but not finished;
- stack of virtual machines (with the current virtual machine on top) corresponding to dynamically created virtual machines and their entries.

During the call simulation array elements are created and deleted in the same order as they were during the program execution.

From the point of view of simulation, all support system functions can be divided into the following classes:

- ordinary functions;
- interval delimiter functions;
- input/output functions;
- data transfer functions;
- create/delete object functions;
- data and resources distribution functions;
- functions of collective operation initialization;
- functions of collective operation execution;
- parallel loop functions;
- unknown functions.

The principles of the simulation of the above functions will be considered in the next chapters.

**3.2 Ordinary functions**

This group contains functions executed on every processor of the simulated processor system (or of its subsystem the current task is mapped on); the time of their execution does not depend on the configuration parameters and is equal to the execution time on a serial computer (taking into account the performance of the processors). Unless specified explicitly, a Lib-DVM function is simulated according to the following algorithm.

*Simulation:* the times **call_time** and **ret_time** are added to **Execution_time**; the time **call_time** is added to **CPU_time**, and **ret_time** is added to **SYS_time** of each processor. Besides, **Insuff_parallelism_USR** of each processor is increased by the time calculated by the formula

T = T_call_time * (N_proc - 1) / N_proc

and **Insuff_parallelism_SYS** is increased by the time

T = T_ret_time * (N_proc - 1) / N_proc

This is a base algorithm for time simulation. Further on we will suppose that execution time is simulated according to this algorithm, unless another algorithm is indicated explicitly.

**3.3
Interval delimiter functions**

In this group there are functions used as delimiters of intervals of the simulated program.

Functions:

binter_ | Create user interval. |
bploop_ | Create parallel interval. |
bsloop_ | Create sequential interval. |

** Simulation algorithm:** according to the information from the trace (source file name and line number), the interval is looked for among the intervals nested in the current one. The interval type is defined by the delimiter function:

binter_ | User (USER) |
bsloop_ | Sequential (SEQ) |
bploop_ | Parallel (PAR) |

If the interval is found, its **EXE_count** value is increased by 1; otherwise a new element with **EXE_count**=1 is added to the array of intervals nested in the current interval. The found or new interval becomes the current one.

Functions:

einter_ | - end of user interval; |
eloop_ | - end of parallel and sequential interval. |

** Simulation algorithm:** the interval external to the current one becomes the current interval. The times are corrected according to the base algorithm.

**3.4 Input/output functions**

This group contains functions used for data input/output. These functions are executed on the input/output processor.

** Simulation algorithm:** the time simulation algorithm differs from the base one: the function execution time is added to **IO_time**.

**3.5 Create/delete object functions**

This group contains functions used for creating and deleting various objects of the Lib-DVM support system. When one of these functions is found in the trace, it is necessary either to create and initialize, or to delete, the corresponding object of the simulated multiprocessor system. As a rule, creating and deleting objects is performed by the corresponding constructor or destructor of the *time extrapolator* *(RATER)*.

Function:

crtps_ | - create subsystem of the given multiprocessor system. |

A subtask is run by:

**long runam_ (AMRef *AMRefPtr)**

where:

***AMRefPtr** – reference to the abstract machine of the subtask,

and the current subtask is terminated by:

**stopam_ ()**

The **Predictor** maintains a stack of (AM, PS) pairs: **runam_** pushes the pair (AM, PS) onto the stack, and **stopam_** pops it. The top of the stack defines the current state: the current AM and the current PS. The bottom of the stack defines the root AM and PS (rootAM, rootPS).

Pair (AM, PS) is created by simulation of call:

**mapam_ (AMRef *AMRefPtr, PSRef *PSRefPtr ).**

The topology of the PS the corresponding AM is mapped on is read from the **Predictor** configuration file (**Predictor.par**).

The **getamr_** call has a parameter: **IndexArray** – an array whose **i**-th element contains the index value of the corresponding element (i.e. abstract machine) along the (**i+1**)-th dimension. The only thing that matters for the **Predictor** is that a new AM is created and then mapped on a new PS.

** Simulation algorithm:** on the base of the information from the parameter file, an object representing the processor subsystem is created.

Function:

crtamv_ | - create representation of an abstract machine. |

** Simulation algorithm:** on the base of the parameters from the trace, an object of **AMView** class is created.

Function:

crtda_ | - create distributed array. |

** Simulation algorithm:** on the base of the parameters from the trace, an object of **DArray** class is created.

Function:

crtrg_ | - create reduction group. |

** Simulation algorithm:** an object representing the reduction group is created.

Function:

crtred_ | - create reduction variable. |

*Simulation algorithm:* an object of **RedVar** class is created (constructor **RedVar::RedVar(long ARedElmSize, long ARedArrLength, long ALocElmSize)**). The parameters **ARedArrLength** and **ALocElmSize** are actual function parameters; the parameter **ARedElmSize** is calculated from the **RedArrayType** parameter according to the following table:

RedArrayType | C type | ARedElmSize |
1 (rt_INT) | int | sizeof(int) |
2 (rt_LONG) | long | sizeof(long) |
3 (rt_FLOAT) | float | sizeof(float) |
4 (rt_DOUBLE) | double | sizeof(double) |

The reference to the created object and the returned parameter **RedRef** are saved in a new element of the array of reduction variables.

Function:

crtshg_ | - create edge group. |

** Simulation algorithm:** an object representing the edge group is created.

Function:

delamv_ | - delete an abstract machine representation. |

** Simulation algorithm:** the element with the corresponding key is removed from the array of abstract machine representations.

Function:

delda_ | - delete distributed array. |

** Simulation algorithm:** the element with the corresponding key is removed from the array of distributed arrays.

Functions:

delred_ | - delete reduction variable; |
delrg_ | - delete reduction group. |

** Simulation algorithm:** the element with the corresponding key is removed from the array of reduction variables or reduction groups.

Function:

delshg_ | - delete edge group. |

** Simulation algorithm:** the element with the corresponding key is removed from the array of edge groups.

**3.6 Data and resources distribution functions**

In this group there are functions used for initial distribution and redistribution of data and resources.

Function:

distr_ | - define mapping of an abstract machine representation on the multiprocessor system (resource distribution). |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of abstract machine representations; the distribution is performed by the **AMView::DisAM** method.

Function:

align_ | - define location of distributed array (alignment). |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of distributed arrays; the alignment is performed by the **DArray::AlnDA** method.

Function:

redis_ | - redefine mapping of an abstract machine representation on the multiprocessor system (resource distribution). |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of abstract machine representations; the redistribution is performed by the **AMView::RDisAM** method.

The time simulation algorithm differs from the base one. First, the time counters of all processors are set to the same value, equal to the maximal counter value at the moment of the beginning of the **redis_** execution. The time added to each counter is summed into **Execution_time** and **Synchronization** of the given processor. The time returned by the **AMView::RDisAM(…)** method is added to **Execution_time** and **Redistribution** of each processor.

Function:

realn_ | - redefine location of distributed array. |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of distributed arrays; the realignment is performed by the **DArray::RAlnDA** method.

The time simulation algorithm differs from the base one. At first, the time counters of all processors are set to the same value, equal to the maximal counter value at the moment of the beginning of the **realn_** execution. The time added to each counter is summed into **Execution_time** and **Synchronization** of the given processor. The time returned by the **DArray::RAlnDA(…)** method is added to **Execution_time** and **Redistribution** of each processor.

**3.7 Functions of collective operation
initialization**

This group contains functions used for including a reduction variable in a reduction group and including a distributed array edge in an edge group.

Function:

insred_ | - include reduction variable in reduction group. |

** Simulation algorithm:** the elements with the corresponding keys are looked for in the arrays of reduction groups and reduction variables, and the variable is included in the group.

Function:

inssh_ | - include distributed array edge in edge group. |

** Simulation algorithm:** the elements with the corresponding keys are looked for in the arrays of edge groups and distributed arrays, and the array edge is included in the group.

**3.8 Functions of collective operation execution**

In this group there are functions used for collective operation execution.

Function:

arrcpy_ | - copy distributed arrays. |

*Simulation algorithm:* the elements with keys **FromArrayHandlePtr** and **ToArrayHandlePtr** are looked for in the array of distributed arrays. The function **ArrayCopy(DArray* AFromArray, long* AFromInitIndexArray, long* AFromLastIndexArray, long* AFromStepArray, DArray* AToArray, long* AToInitIndexArray, long* AToLastIndexArray, long* AToStepArray, long ACopyRegim)** is executed. Its **DArray** parameters are references to the objects from the found records; the other parameters are the actual parameters **FromInitIndexArray[], FromLastIndexArray[], FromStepArray[], ToInitIndexArray[], ToLastIndexArray[], ToStepArray[], CopyRegim** of the **arrcpy_** function.

The time simulation algorithm differs from the base one. At first, the time counters of all processors are set to the same value, equal to the maximal counter value at the moment of the beginning of the **arrcpy_** execution. The time added to each counter is summed into **Execution_time** and **Synchronization** of the given processor. The time returned by the **ArrayCopy(…)** function is added to **Execution_time** and **Remote_access** of each processor.

Function:

strtrd_ | - start reduction of the given group. |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of reduction groups; the reduction completion time is fixed for each processor.

The time simulation algorithm differs from the base one. At first, the time counters of all processors are set to the same value, equal to the maximal counter value at the moment of the beginning of the **strtrd_** execution. The time added to each counter is summed into **Execution_time** and **Synchronization** of the given processor.

Function:

strtsh_ | - start edge exchange of the given group. |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of edge groups; the exchange completion time is fixed for each processor.

The time simulation algorithm differs from the base one. At first, the time counters of all processors are set to the same value, equal to the maximal counter value at the moment of the beginning of the **strtsh_** execution. The time added to each counter is summed into **Execution_time** and **Synchronization** of the given processor.

Function:

waitrd_ | - wait for reduction completion. |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of started but not finished reduction operations.

The time simulation algorithm differs from the base one. For each processor the current value of the time counter is compared with the time of reduction completion fixed during the simulation of the **strtrd_** function. The difference between these times is added to **Reduction_overlap**. If the value of the processor time counter is less than the time of reduction completion, the counter value is set to the time of reduction completion. The time added to the counter is summed into **Execution_time** and **Wait_reduction** of the given processor.

Function:

waitsh_ | - wait for completion of edge exchange of the given group. |

** Simulation algorithm:** the element with the corresponding key is looked for in the array of started but not finished edge exchange operations.

The time simulation algorithm differs from the base one. For each processor the current value of the time counter is compared with the time of exchange completion fixed during the simulation of the **strtsh_** function. The difference between these times is added to **Shadow_overlap**. If the value of the processor time counter is less than the time of exchange completion, the counter value is set to the time of exchange completion. The time added to the counter is summed into **Execution_time** and **Wait_shadow** of the given processor.

**3.9 Parallel loop functions**

This group contains functions used for parallel loop initialization, distribution of its iterations between processors, and the parallel loop execution.

Function:

crtpl_ | - create parallel loop. |

** Simulation algorithm:** the parameters of the loop being created are saved in the structure with information about the last created parallel loop.

Function:

mappl_ | - map parallel loop. |

** Simulation algorithm:** an object representing the parallel loop is created and mapped according to the trace parameters.

Function:

dopl_ | - define whether it is necessary to continue the parallel loop execution. |

** Simulation algorithm:** the time simulation algorithm differs from the base one. The total number of loop iterations (N_i) is determined, and for each call the time

T = (T_func / N_i) * ((N_proc - 1) / N_proc)

is calculated and added to **Insufficient_parallelism_Par**.

**3.10 Unknown functions**

** Simulation algorithm:** unknown functions are simulated according to the base algorithm. A warning about the unknown function in the trace is output.

**4 Estimation of remote data access
overhead**

**4.1
Main terms and definitions**

Before describing the algorithms of estimating remote data access overhead, let us consider the method of mapping calculations on a distributed system. The diagram of array distribution on the processor topology in the DVM model is shown in Fig. 1.

Fig. 1. DVM model of data distribution on processors.

Let us introduce the notion of an *abstract machine representation* (for short, **Template** or **AMView**): a special array intended to provide two-stage mapping of arrays and calculations on the *processor system*. First, arrays (**DArrays**) are mapped on the **AMView**, which is then mapped on the processor system (**VM**). The first mapping is entirely determined by the interrelation between data and calculations characteristic of the algorithm and does not depend on the architecture or configuration of the distributed computational system. The second mapping serves for tuning the parallel program according to the available system resources.

While aligning (realigning) a distributed array, an ordinary distributed array that has already been mapped on some **AMView** can itself act as an **AMView**.

Information about aligning arrays on a template can be obtained using the **DArray::AlnDA** function call. Distribution of the template on the processor system is performed using **AMView::DisAM**.

**4.2 Estimation of array redistribution exchanges**

Let us consider the algorithm of computing the overhead of realigning an array over a template (**realign**). This algorithm is implemented in the **DArray::RAlnDA** function, which returns the amount of time spent in exchanges between processors during the given operation. Besides, the function changes, according to its parameters, the information of the given array about the template it is mapped on and about the mapping rule. The lists of aligned arrays of the corresponding templates are corrected.

At the first stage of the algorithm, the value of the **ANewSign** input parameter is checked. If it is non-zero, the contents of the realigned array will be updated anyway, so there is no need to transfer the now-useless elements and the function returns zero. Otherwise the algorithm continues.

Having saved the information about the array position before the realignment, we obtain the information about its new position using the **DArray::AlnDA** function.

Then, using **CommCost::Update(DArray *, DArray *)**, the array **CommCost::transfer** (initially filled with zeros) is changed according to the array positions obtained above. This array (**CommCost::transfer**) holds the number of bytes transferred between every pair of processors.

Finally, the amount of time is computed from the data in the *transfer* array using the **CommCost::GetCost** function.

The algorithm implemented in the **Update** function takes as input the information about the array distribution before and after the operation that changes the array position (which we call array redistribution), and proceeds in stages:

1. First we check whether the array was fully replicated (via the template) before the redistribution (this property is computed in the **DArray::AlnDA** function); if not, go to stage 2. If the set of processors (M1) on which template elements were situated before the redistribution contains the set of processors (M2) on which template elements are located after it, there is no exchange, so the *transfer* array is unchanged and the algorithm ends. Otherwise, for every processor (P2) that belongs to M2\M1 we look for the nearest processor (P1); it is situated on the edge of the rectangular section made up by the processors belonging to M1. Then, for every such pair of processors, we perform the following actions. We find the array section that belongs to P2, using the constructor **Block::Block(DArray *, long)**. The array section that belonged to P1 before the operation equals the whole array, since it was replicated according to the template. Taking the intersection of these two sections with **operator ^ (Block &, Block &)**, we get the section to be transferred from P1 to P2. Having found its size in bytes with **Block::GetBlockSize**, we add it to the corresponding element of the *transfer* array. The algorithm ends.

2. We find all processor system dimensions on which the array was distributed or partly distributed before the redistribution; if there are no such dimensions, go to stage 3. For the processor system dimensions on which the array was partly distributed, we find the sets of index values for which array elements were present on the corresponding processors before the redistribution. We walk over all processors (the current one is P1) whose indices are zero in the dimensions the array is distributed on, and equal to the minimum of the corresponding sets in the dimensions the array is partly distributed on. For these processors we find the array section that was distributed on each of them before the redistribution. For all processors (the current one is P2) we find the array sections located on them after the redistribution. We take the intersection of these sections for the current processors P1 and P2. For the processor P2, we find the nearest processor whose array section coincides with the section on P1: its indices coincide with the indices of P1 in all processor system dimensions, except the dimensions by which the array was multiplicated or is replicated; in those dimensions the indices nearest to the indices of P2 are taken (defined using the obtained sets). If the id of this processor is not equal to the id of P2, the number of bytes in the intersection is added to the corresponding element of the *transfer* array. After processing all (P1, P2) pairs, the algorithm stops.

3. For every processor, the array sections that belonged to it before and after the redistribution are determined. For every pair of non-coinciding processors, we find the intersection of these sections. The number of bytes in the intersection is added to the element of the *transfer* array corresponding to the given processor pair. After processing all processor pairs, the algorithm stops.

To find the time spent by the processors, one of two algorithms is used in the **GetCost** function; the choice depends on the type of the distributed processor system. In the case of a network with bus organization, the sought time is computed by the formula

time = Σ_{i ≠ j, transfer[i][j] > 0} (T_S + T_B*transfer[i][j]),  i, j = 1..N,

where *N* is the number of processors, *T_S* is the start time of an exchange operation, and *T_B* is the time to send one byte. This formula is a consequence of the fact that several messages cannot be sent over the network simultaneously; consequently, the total time equals the sum of all exchange times between any two non-coinciding processors.

The total time for a transputer grid generally depends on the shortest path between the two most distant processors that exchange data with each other. Also, while evaluating this time, it is necessary to consider the possibility of message pipelining. From these remarks we can construct an algorithm and a formula to determine the overhead:

- While passing over the *transfer* array, for every non-zero element we determine the distance between the corresponding processors using **VM::GetDistance(long, long)**. We also keep track of the current maximum distance and of the current maximum number of bytes sent over this distance. In this way we pass the whole *transfer* array. As a result we have: *l* – the distance between the two processors most distant from each other, and *L_B* – the length of the biggest message sent over distance *l*. If *l* = 0, the result is zero and the algorithm stops. Otherwise, move on to the next stage.
- If *l* = 1, the sought time is calculated using the formula:

  time = T_S + T_B*L_B

If *l* > 1, pipelining of the message is possible. So at first we examine for extrema the following function, which describes the dependence of the time on *S* – the size of the message sent during one pipeline phase:

T(S) = (L_B/S + l - 1)*(T_S + T_B*S)

We get that the minimum is reached when:

S = sqrt((L_B*T_S)/((l - 1)*T_B))

Considering that the message size and the number of phases are integers, we come to this expression:

time = ([L_B/S'] + c + l - 1)*(T_S + T_B*S'),

where S' = [S] (S' = 1 if [S] = 0), and c ∈ {0, 1} indicates that S' does not divide L_B evenly. In order to find the exact value, we compute the time by the same formula at the integer points near S'. If they are bigger, the value at the point S' is the sought value. Otherwise we perform a decreasing search until the value at stage *k* is bigger than the value at stage *k−1*; then the sought time is the value at stage *k−1*. The algorithm stops.

In conclusion, let us consider the algorithm for finding the overhead of redistributing a template. These charges arise because all the distributed arrays aligned with the template via **DArray::AlnDA** (both directly and indirectly) will be mapped onto the template again after it changes its position in the processor system. Consequently, they will change their positions, which leads to exchanges. The given algorithm is implemented in **AMView::RdisAM**, which returns the time spent in interprocessor exchanges while performing this operation. Also, the function replaces the information the given template keeps about the rule by which it was mapped onto the processor system (set by the **AMView::DisAM** function) with the information about its new position, according to the new parameters given in the function call.

In the first stage of the algorithm, the value of the **ANewSign** input parameter is checked. If it is non-zero, the contents of all the distributed arrays aligned with the given template will be updated anyway, so there is no need to transfer the now-useless elements and the function returns zero. Otherwise the algorithm continues.

Having saved the information about the template position before the redistribution, we obtain the information about its new position after the given operation using the **AMView::DisAM** function. This is necessary in order to know how the arrays aligned with this template are positioned before and after the redistribution.

Then, for every array aligned with the given template, the array **CommCost::transfer** (initially filled with zeros) is changed using **CommCost::Update(DArray *, DArray *)**, according to the array positions before and after the redistribution obtained above.

Finally, the sought amount of time is computed from the data in the *transfer* array using the **CommCost::GetCost** function.

**4.3 Estimation of distributed array edge
exchanges**

Let us consider the algorithm for computing the time needed for the exchange of the bounds of a given group. It consists of two parts.

- Inclusion of all needed bounds of distributed arrays into the given group is performed using the **BoundGroup::AddBound** function for every such array. At the same time, the *transfer* array is changed accordingly.
- The sought time is determined from the resulting *transfer* array using the **CommCost::GetCost** function (this occurs after the **BoundGroup::StartB** function call).

The algorithm implemented in **CommCost::GetCost** was described in the previous section, so we consider the algorithm of the **BoundGroup::AddBound** function:

1. If the array is fully replicated, the function exits without altering the *transfer* array (filled with zeros when the group is created); otherwise the algorithm continues. The possibility of including the bound of the given array into the group with the given function parameters is then checked. If it is possible, we determine, using the **DArray::GetMapDim** function, the dimensions over which the real bounds exchange will be conducted (the dimensions by which the array was not replicated), the corresponding array dimensions, and the directions of their partitioning. This information is put into the **dimInfo** array (its elements are objects of the **DimBound** class). Then we determine the criterion of inclusion of "corner" elements into the bounds: it holds if the **ACornerSign** parameter of the given function equals 1 and the number of processor system dimensions over which the exchange will be conducted is bigger than 1. The **CommCost::BoundUpdate** function is then called with the information determined above as parameters.

2. In the **BoundUpdate** function, the following actions are performed for every processor. The array section on the given processor is determined. If there are no array elements on this processor, move to the next one. Otherwise, we determine the transfers needed for the right and left bounds of the array section, positioned on processors adjacent in the dimensions contained in the **dimInfo** array. To do this, we pass over the **dimInfo** array and determine, using **Block::IsLeft (IsRight)**, whether there are elements to the right (left) of the given section according to the current position of the distributed array. If there are such elements, and the right (left) bound in the given dimension was set by the **AddBound** function call, we determine the size of the transferred boundary and, using the **VM::GetSpecLI** function, find the number of the adjacent processor to which the given bound is transferred (using the information from the **dimInfo** element corresponding to the given array dimension). The size of the transferred bound, in bytes, is defined as the size of the current section in which the number of elements in the current array dimension is replaced by the width of the right (left) boundary in this dimension. This value is added to the corresponding element of the *transfer* matrix. Having finished with the left and right boundaries, we move on to the "corner" elements, if their criterion of inclusion into the array boundaries holds (we consider only the case when the "corner" elements lie in the intersection of real bounds in two dimensions, so that there are two elements in the **dimInfo** array). Similarly to the above, but over all the dimensions listed in **dimInfo**, we determine whether there exist (for the given array section) elements adjacent to the corner elements of the section along the directions of all the diagonals in this set of dimensions (using the **IsLeft** and **IsRight** functions). If there are such elements and the corresponding bounds in the given dimensions are set, we determine the size of the transferred "corner" section and, using the **VM::GetSpecLI** function (called as many times as there are dimensions in the set), find the number of the adjacent processor to which the given boundary is transferred (using the information corresponding to the given array dimensions from the **dimInfo** array). The size of the transferred section, in bytes, is defined as the size of the current section in which the number of elements in every array dimension from this set is replaced by the width of the corresponding boundary participating in the current "corner" section. This value is added to the corresponding element of the *transfer* matrix. The algorithm stops.

**4.4 Estimation of reduction operation exchanges **

The algorithm for determining the time spent in exchanges during reduction operations is implemented in the **RedGroup::StartR** function. Before describing the algorithm, we show what preparatory work is performed.

When another reduction variable is added to the reduction group by the **RedGroup::AddRV** function, the counter of the number of sent bytes (**TotalSize**) is increased by the size, in bytes, of the reduction variable and of its auxiliary information (given for some of the reduction operations). Information about the reduction variable type is not taken into account when evaluating the exchange time.

Description of **StartR** function:

- As input we have: the array on which the parallel loop (in which the reduction operations of the given group are performed) is mapped; the number of dimensions of the parallel loop; and the array that describes how the loop dimensions are mapped onto the array dimensions. If the array on which the parallel loop is mapped is fully replicated, the function exits returning zero time. Otherwise, using the **DArray::GetMapDim** function, we determine for every loop dimension the processor system dimension the loop is mapped on; we put into the **loopAlign** array either the corresponding dimension number, or zero if the loop dimension is replicated over all the processor system dimensions. Then, using the obtained information, we determine the sought time, which depends on the distributed system type.

- If we have a network with bus organization, the main idea is, while collecting the information, not to send it along the dimensions the loop is replicated by. Hence we can collect it over a single section made up of processors with fixed indices in the given dimensions, and then broadcast the result to all the processors. This leads us to the following formula:

  time = (T_S + T_B*TotalSize)*(N_{i1}*...*N_{ik} + N - 2),

  where *T_S* is the start time of an exchange operation, *T_B* is the time to send one byte, *N* is the number of processors, and *N_{ik}* is the number of processors along the processor system dimension with number **loopAlign[ik]** on which loop iterations are present. The function returns the calculated value.

- If it is a transputer grid, we can collect the information over all the sections described in the previous point in parallel, and then broadcast the result of the reduction operation over the same sections. The time to collect the values and to broadcast the result depends on the distance from the geometrical center of such a section to its most distant corner. Therefore, at first we determine this value (called **Distance**) using the **loopAlign** array and the information about the sizes of the corresponding processor system dimensions. Then the sought time is calculated using the following formula:

  time = (T_S + T_B*TotalSize)*(2*Distance + CornerDistance),

  where *CornerDistance* is the distance from the given sections to the most distant corner processor. The resulting value is returned.

**4.5 Estimation of exchanges during remote array
buffer loading**

If aligning the arrays does not eliminate the remote data, and no shadow edges can be used to access them, the data are buffered through a separate buffer array on every processor. The estimated time spent on the buffer loading is calculated in the **ArrayCopy** function. The interface of this function allows further expansion of the output language, therefore some parameters are not used at this stage. Specifically, this function evaluates the exchanges during the replication of the distributed array (the array that is read) over all the processors of the processor system. Below is a description of the algorithm implemented in the **ArrayCopy** function:

- We check that the input parameters determining the size of the array section do not exceed the array bounds, using the **DArray::CheckIndex** function. If the check succeeds, the algorithm continues; otherwise an error message is generated and the algorithm stops.
- The *transfer* array (filled with zeros when the function is called) is altered using the **CommCost::CopyUpdate** function, according to the parameters that determine the section to replicate and to the information about the distribution of the given array.
- The sought time is determined using the **CommCost::GetCost** function (see 4.2) and the data in the *transfer* array.

In the **CopyUpdate** function, the following actions are performed:

1. Check whether the criterion of full replication via the template (computed in the **DArray::AlnDA** function) holds for the given array; otherwise move to stage 2. If the template elements are present on all the processors of the system (determined from the information about how the template is mapped onto the processor system), there are no transfers, the *transfer* array is unchanged, and the function exits. Otherwise, for every processor that does not contain template elements, we find the nearest processor that does. The size in bytes of the given section of the array that is read (previously determined by the **Block::GetBlockSize** function) is put into the corresponding element of the *transfer* array for every such processor pair.

2. All processor system dimensions by which the array is replicated (or partly replicated) are determined. If there are no such dimensions, move on to stage 3. For the dimensions by which the array is partly replicated, we determine the sets of index values such that the processors with these indices hold array elements. We pass over all the processors (let the current one be P1) with zero indices in the dimensions the array is replicated by, and with indices equal to the minimum of the corresponding sets in the dimensions the array is partly replicated by. For every such processor we find the intersection (let the current one be C1) of the array section on it with the replicated section given in the function call, and determine the number of bytes in C1. For every processor (let the current one be P2) we determine the array section (let the current one be C2) on it. If the intersection of C1 and C2 is not empty, move on to the next processor pair. Otherwise, determine the processor nearest to P2 on which the array section coincides with the P1 section. If its number is not equal to the number of P2, add the number of bytes determined before to the corresponding element of the *transfer* array. After handling all the (P1, P2) processor pairs, the algorithm stops.

3. For every processor, determine the section that is the intersection of the array section on the processor with the section to be replicated. If it is empty, move on to the next processor; otherwise determine the size in bytes of this section. Also, for every processor, determine the array section on it. For every pair of non-coinciding processors, find the intersection of such sections. If it is not empty, move on to the next processor pair. Otherwise, the number of bytes obtained before is added to the element of the *transfer* array corresponding to the given processor pair. After handling all such processor pairs, the algorithm stops.

**Appendix 1. Link from field name in output HTML-file to field name in structure _IntervalResult**