Predictor of DVM-program performance. User's guide

Processors – dedicated processor system topology;
Efficiency – efficiency coefficient, equal to the ratio of Productive time to Total time;
Execution time – execution time, equal to the maximum of the interval execution times over all processors;
Total time – total time of processor usage, i.e. the product of Execution time and the number of processors;
Productive time – productive time, equal to Productive time: CPU plus Productive time: SYS plus Productive time: I/O;

* June, 2000 *

- last edited 19.10.00 -

Contents

1 Introduction
2 Trace file generation
3 Representation of the program as a hierarchy of intervals
4 Compilation and execution of the program
5 Preparation of the processor system configuration file
6 Predictor command line options
7 Results of predictor’s work
8 Protocol of predictor’s work

1 Introduction

The predictor is intended for performance analysis and performance debugging of DVM-programs without the use of a real parallel computer (access to which is usually limited or complicated). With the predictor, a user can obtain predicted time characteristics of the program's execution on an MPP or a workstation cluster at various levels of detail.

The predictor is a system for processing the trace information gathered by the DVM-program Run-Time system (Lib-DVM library) during program execution on a workstation. It consists of two major components: the trace interpreter (PRESAGE) and the time estimator (RATER). Using the trace information and user-defined parameters, the trace interpreter calculates and displays extrapolated time characteristics of the program execution on the target computer (MPP or workstation cluster). To do so it calls functions of the time estimator, which simulates parallel DVM-program execution. In effect, the time estimator is a model of the Lib-DVM library, of the low-level message passing system (MPI) used by the library, and of the hardware.

The performance of parallel programs on multiprocessor computers with distributed memory is determined by the following major factors:

degree of program parallelism - the share of parallel calculations in the total volume of calculations;
balance of processor load during parallel calculations;
time needed for execution of interprocessor communications;
degree of overlapping of interprocessor communications with calculations.

The predictor allows a user to obtain numerical estimates of the influence of each of these factors. The precision of the prediction is limited by the following implementation restrictions.

First, the predictor assumes that the ratio of the execution times of any program part on the target computer and on the workstation is constant; this ratio is a predictor startup parameter. Second, when the communication time of a group operation (reduction, shadow edge renewal, remote access buffer loading and so on) is predicted, only the characteristics of the communication network and the number and size of the messages sent are taken into account. The mutual influence of communications belonging to different group operations is not taken into account.

These restrictions are insignificant if the programmer's goal is to make the program run efficiently on computers of different architecture and performance. If the goal is to obtain time characteristics for a particular computer, the data given by the predictor should be interpreted in the light of program performance characteristics actually measured on that target computer.

2 Trace file generation

Before running the predictor, it is necessary to execute the analyzed DVM-program on a workstation in order to gather the trace information (trace file) that the predictor uses as input data. Gathering the trace information consists of the following stages:

Splitting the program into intervals (marking the program parts that are of particular interest to the user).
Program compilation.
Program execution.

3 Representation of the program as a hierarchy of intervals

The whole program is an interval of the highest (zero) level. This interval can contain several intervals of the next (first) level. A parallel loop, a sequential loop, or any statement sequence marked by the programmer whose execution always starts at its first statement and completes at its last statement can be considered an interval. First level intervals can contain second level intervals, and so on.

Time characteristics are calculated not only for the whole program but also for each interval in it. Repeated execution of an interval can be viewed as the execution, on the same processors, of a separate program that contains the interval's statement sequence repeated as many times as it was executed in the real parallel program. In practice, the characteristics of an interval that is executed many times are accumulated over all its executions. Intervals that belong to the same higher level interval are identified by the source file name and the line number at which the interval begins, and optionally by an integer number assigned to the interval by the programmer.
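The accumulation of characteristics over repeated executions can be pictured with a short sketch. It is only an illustration of the idea described above, not the predictor's actual data structures: the class and field names, the source file name and the timings are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IntervalKey:
    """Identifies an interval inside its parent interval (hypothetical names)."""
    source_file: str
    line: int                    # line where the interval begins
    expr: Optional[int] = None   # optional integer assigned by the programmer


class IntervalStats:
    """Characteristics accumulated over all executions of one interval."""
    def __init__(self):
        self.count = 0    # number of entries into the interval
        self.time = 0.0   # accumulated execution time


def accumulate(executions):
    """Sum the per-execution times for every distinct interval key."""
    stats = defaultdict(IntervalStats)
    for key, elapsed in executions:
        stats[key].count += 1
        stats[key].time += elapsed
    return stats


# Made-up data: the interval starting at line 40 is entered twice.
runs = [
    (IntervalKey("jac.fdv", 40), 1.5),
    (IntervalKey("jac.fdv", 40), 2.5),
    (IntervalKey("jac.fdv", 75, expr=1), 1.0),
]
totals = accumulate(runs)
```

Two executions of the same interval end up in a single record, while an interval with a different line number or a different programmer-assigned integer gets a record of its own.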

The splitting of a program into intervals is controlled at the compilation stage. The programmer can choose compilation modes in which the intervals are parallel loops, or sequential loops containing parallel loops, or all sequential loops in the program, or marked statement sequences. The compiler options that control the splitting of a program into intervals are explained in the corresponding user manuals (see “C-DVM compiler user guide”, “Fortran-DVM compiler user guide”).

Note that to debug program performance the user does not have to run the program with the full volume of computations needed to solve real tasks. For example, the number of regularly repeated outer iterations can be limited to one or two. In that case the program efficiency coefficient may decrease significantly, because it depends strongly on the losses in the program parts executed before the first iteration or after the last one. However, the user can declare the execution of the outer iterations as a separate interval and debug the performance of that interval only.

4 Compilation and execution of the program

After splitting the program into intervals it is necessary to perform the following steps:

program conversion;
program compilation and linking;
program execution.

For convenience, these steps are wrapped in dvm commands.

To convert, compile and link a DVM-program, use one of the following commands:

dvm c <C-DVM-program name>
dvm f <F-DVM-program name>

To run the program and accumulate the trace for the predictor, use the following command:

dvm runpred <DVM-program name>

Processing result: the file <DVM-program name>.ptr is created in the current directory.

Note 1. The following parameters are set automatically during the program execution:

*Is_DVM_TRACE=1;*	- trace on;
*FileTrace=1;*	- accumulate trace in files;
*MaxIntervalLevel=3;*	- maximum level of nested intervals;
*PreUnderLine=0;*	- do not underline “call” in trace file;
*PostUnderLine=0;*	- do not underline “ret” in trace file;
*MaxTraceLevel=0;*	- maximum trace depth for nested functions.

The parameters PreUnderLine, PostUnderLine and MaxTraceLevel tell LIB_DVM not to write lines of underscores to the trace and not to trace nested LIB_DVM calls, which makes the trace file much smaller.

Note 2. To run the parallel program with an explicitly defined processor configuration, or with dynamic adjustment to the allocated processors, it is necessary to define the corresponding “virtual” processor system with the IsUserPS and UserPS parameters.

For example, to define a 2*2 “virtual” processor system, use the following parameter values:

*IsUserPS=1;*	- use “virtual” processor system definition,
*UserPS=2,2;*	- “virtual” processor system topology.

5 Preparation of the processor system configuration file

The input data for simulating the time characteristics of the program execution consist not only of the trace of the program execution on one processor but also of configuration information that describes the target computer.

Conceptually, the work of the predictor consists of three basic stages:

reading the trace and extracting from it information about the interval structure, the sequence of Run-Time system function calls, the input and output parameters of the calls that are needed to simulate them, and the execution time of each call;
simulating the program execution on the dedicated computer, based on the information obtained at the previous stage, and calculating the time characteristics of the program execution for each interval;
writing the calculated characteristics into HTML-files.

The configuration information is stored in the “predictor.par” file. It defines the characteristics of the dedicated (simulated) multiprocessor system and has the following structure:

// System type = network | transputer
type = network;

// Communication characteristics (mks)
start time = 75;
send byte time = 0.2;

// Comparative processors performance
power = 1.00;

// topologies

//root topology
topology = {4 , 2} ;

Lines beginning with “//” are comments. The topology parameter defines the topology of the dedicated processor system, i.e. its rank and its size in each dimension. For example, topology = {4, 2} defines a 4*2 processor matrix. If a “virtual” processor system was defined by the IsUserPS and UserPS parameters, the dedicated processor system topology must coincide with the “virtual” processor system topology.

The type parameter selects the kind of system: the value network means that the dedicated processor system is a network of workstations with bus architecture, and the value transputer means that the dedicated processor system uses a transputer system as its communication network.
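To make the file layout concrete, here is a minimal Python sketch that reads entries of the form "name = value;" and skips "//" comment lines. It is inferred only from the sample file shown above; the exact grammar accepted by the predictor is not specified in this guide, and the function names are hypothetical.

```python
import re

SAMPLE = """\
// System type = network | transputer
type = network;

// Communication characteristics (mks)
start time = 75;
send byte time = 0.2;

// Comparative processors performance
power = 1.00;

//root topology
topology = {4 , 2} ;
"""


def parse_predictor_par(text):
    """Collect 'name = value;' entries; blank lines and '//' comments are skipped."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue
        match = re.fullmatch(r"([^=]+?)\s*=\s*(.+?)\s*;", line)
        if match is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        name, value = match.groups()
        params[name] = value
    return params


def parse_topology(value):
    """Turn a string such as '{4 , 2}' into the tuple (4, 2)."""
    inner = value.strip().strip("{}")
    return tuple(int(part) for part in inner.split(","))


config = parse_predictor_par(SAMPLE)
topology = parse_topology(config["topology"])
```

Values are kept as strings on purpose, so that the caller decides how to interpret each parameter (number, keyword or topology).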

The start time and send byte time parameters are characteristics of the communication hardware and are used to calculate message transmission times. For example, for a workstation network a linear approximation is used to calculate the time of transmission of n bytes:

T = (start time) + n * (send byte time),

where:

start time	- start-up time of the data transmission;
send byte time	- transmission time of 1 byte.
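As a worked example of the linear approximation, the following sketch evaluates the formula with the values from the sample configuration file (start time = 75, send byte time = 0.2). The function name is hypothetical. The result is in the same units as the parameters; the sample file labels them "mks", presumably microseconds.

```python
def transmission_time(n_bytes, start_time=75.0, send_byte_time=0.2):
    """Linear model from the guide: T = start_time + n * send_byte_time."""
    if n_bytes < 0:
        raise ValueError("message size must be non-negative")
    return start_time + n_bytes * send_byte_time


# A 1000-byte message: 75 + 1000 * 0.2, i.e. about 275 time units.
t_1000 = transmission_time(1000)
# An empty message still pays the start-up cost.
t_empty = transmission_time(0)
```

With these values, the start-up cost dominates for messages shorter than 375 bytes (75 / 0.2), which is why the number of messages matters as well as their total size.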

The power parameter defines the ratio of the performance of a processor of the dedicated system to the performance of the computer on which the trace for the predictor was gathered.
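One natural reading of this ratio is sketched below. It assumes that a computation time measured on the tracing workstation is divided by power to estimate the time on the target processor. The guide does not spell out the exact arithmetic used by the predictor, so this is an assumption, and the function name is hypothetical.

```python
def scale_to_target(traced_time, power):
    """Estimate target-processor time from a time measured during tracing.

    Assumption: power = target performance / workstation performance,
    so a target with power 2.0 needs half the traced time.
    """
    if power <= 0:
        raise ValueError("power must be positive")
    return traced_time / power


same_speed = scale_to_target(3.0, 1.0)    # power = 1.00, as in the sample file
twice_as_fast = scale_to_target(10.0, 2.0)
```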

6 Predictor command line options

To start the predictor the following command line may be used:

predictor <par_file> <trace_file> <html_file> [<processor>]

where:

<par_file>	- parameter file name, containing configuration information of target computer;
<trace_file>	- name of the file with trace information;
<html_file>	- name of the file where HTML-pages with the program results are placed.
<processor>	- processor topology that is a number of processors for each dimension of processor grid. This parameter is similar to topology parameter in the configuration file and overrides it.

All command line parameters except <processor> are required.

For convenience, the following command is provided to start the predictor:

dvm pred [processor matrix] <DVM-program name>

Execution of the command is controlled by the following environment variables (see the dvm.bat file):

Pred_sys - name of the configuration file that describes the target computer;
Pred_vis - name of the Web browser.

The predictor requires the trace file <DVM-program name>.ptr, which can be obtained with the dvm runpred command.

Processing result:
The predictor writes its results to the HTML file <DVM-program name><processor matrix>.html. If a Web browser name (for example, Netscape) is set, the browser is invoked as soon as the predictor terminates. In that case the command completes only after the browser is closed.

7 Results of predictor’s work

The structure of the output HTML-file mirrors the interval structure of the program. Every HTML-file fragment corresponds to one interval and contains the data characterizing the interval, the integral characteristics of the program execution on the interval, and links to the fragments with information about nested intervals. The HTML-file is thus a tree of intervals with special buttons for traversing the tree in any direction.

Any DVM-program can be divided into intervals of 4 types. The execution of the whole program is considered an interval of the highest (zero) level and forms a separate interval type. This interval can include several intervals of the next (first) level. These are parallel loops, called parallel intervals (PAR), sequential loops, called sequential intervals (SEQ), and arbitrary statement sequences marked by the programmer, called user intervals (USER). First level intervals can include any set of second level intervals, and so on.

Every HTML-file fragment containing interval characteristics begins with a “yellow field” that holds the interval key: File (name of the DVM-program source file), Line (line number of the first statement of the interval), Type (interval type) and, optionally, Expr (an integer expression specified by the programmer). The “yellow field” also contains two interval characteristics: Count (the number of entries into the interval) and Level (the nesting level of the interval).

At the end of the simulation, the integral characteristics of the program execution on every interval are calculated and written to the HTML-file. The “blue field” contains the following characteristics:

Productive time: CPU – productive processor time (without system overhead);
Productive time: SYS – productive system overhead;
Productive time: I/O – sum of input/output operation times inside the interval, without communication overhead;
Lost time – total lost time inside the interval;
Lost time: Insufficient parallelism – total time lost due to insufficient parallelism (duplicated computations on some processors);
Lost time: Insufficient parallelism: URS – time lost because of insufficient parallelism (without system losses);
Lost time: Insufficient parallelism: SYS – system time lost because of insufficient parallelism;
Lost time: Communication – total communication time (the sum of the communication times of all collective operations);
Lost time: Communication: SYN – time lost because of desynchronization of communication operations;
Lost time: Idle – processor idle time;
Load_Imbalance – time of processor load imbalance during execution of parallel calculations;
Synchronization – time lost because of desynchronization of collective operations;
Overlap – total time of overlapping of communications and calculations;
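The relations that this guide gives between Execution time, Total time, Productive time and Efficiency can be checked with a small calculation. The sketch below uses made-up numbers for a 4-processor run; the function and variable names are hypothetical and only mirror the definitions in this guide.

```python
def interval_summary(per_processor_times, cpu, sys_time, io):
    """Derive the integral characteristics from per-processor data.

    Execution time  = maximum of the interval execution times over all processors
    Total time      = Execution time * number of processors
    Productive time = CPU + SYS + I/O
    Efficiency      = Productive time / Total time
    """
    execution_time = max(per_processor_times)
    total_time = execution_time * len(per_processor_times)
    productive_time = cpu + sys_time + io
    return {
        "execution_time": execution_time,
        "total_time": total_time,
        "productive_time": productive_time,
        "efficiency": productive_time / total_time,
    }


# Made-up numbers: four processors, the slowest one takes 10 time units.
summary = interval_summary([10.0, 8.0, 9.0, 10.0], cpu=28.0, sys_time=1.0, io=1.0)
```

Here Total time is 40 and Productive time is 30, so the efficiency coefficient is 0.75. The remaining 10 time units are the share of Total time that is not productive.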

The “green field” contains characteristics of collective operations (I/O, Reduction, Shadow, Remote access and Redistribution). The following characteristics are given for every operation: number of operations, communication time, time lost because of desynchronization of communication operations, and time of overlapping of communications and calculations.

8 Protocol of predictor’s work

While the predictor is running, the file logfile.txt in the current directory accumulates information about the predictor stages, about functions not processed by the predictor, and about errors encountered during the predictor’s work.

The file can be viewed with any text editor.