Predictor of DVM-program performance
- last edited 19.10.00 -
Contents
1 Introduction
2 Trace file generation
3 Representation of the program as a hierarchy of intervals
4 Compilation and execution of the program
5 Preparation of the processor system configuration file
6 Predictor command line options
7 Results of predictor's work
8 Protocol of predictor's work
1 Introduction
The predictor is intended for performance analysis and performance debugging of DVM-programs without the use of a real parallel computer (access to which is usually limited or complicated). With the predictor, a user can obtain the predicted time characteristics of the execution of a program on an MPP or a workstation cluster at various levels of detail.
The predictor is a system for processing the trace information gathered by the DVM-program run-time system (Lib-DVM library) during program execution on a workstation. It consists of two major components: the trace interpreter (PRESAGE) and the time estimator (RATER). The trace interpreter, using the trace information and user-defined parameters, calculates and displays extrapolated time characteristics for program execution on the target computer (MPP or workstation cluster) by calling functions of the time estimator, which simulates parallel DVM-program execution. In effect, the time estimator is a model of the Lib-DVM library, of the low-level message passing system (MPI) used by the library, and of the hardware.
The performance of parallel programs on multiprocessor computers with distributed memory is determined by the following major factors:
The predictor allows a user to obtain numerical estimates of the influence of each of these factors. The precision of the prediction is limited by the following implementation restrictions.
Firstly, the predictor assumes that the ratio of the execution times of any program part on the target computer and on the workstation is constant; this ratio is a predictor startup parameter. Secondly, when the communication time of a group operation (reduction, shadow edge renewal, remote access buffer loading and so on) is predicted, only the characteristics of the communication network and the number and size of the messages sent are taken into account. The mutual influence of communications belonging to different group operations is not taken into account.
These restrictions are insignificant if the programmer's goal is to achieve efficient execution of the program on computers of different architectures and performance. If the goal is to obtain time characteristics for a particular computer, the data given by the predictor should be interpreted in the light of the program performance characteristics actually obtained on that target computer.
2 Trace file generation
Before running the predictor, it is necessary to execute the analyzed DVM-program on a workstation in order to gather the trace information (trace file) that the predictor uses as input data. Gathering the trace information consists of the following stages:
3 Representation of the program as a hierarchy of intervals
The whole program is an interval of the highest (zero) level. This interval can contain several intervals of the next (first) level. A parallel loop, a sequential loop, or any statement sequence that is marked by the programmer and whose execution always starts at its first statement and completes at its last statement can be considered an interval. First level intervals can contain second level intervals, and so on.
Time characteristics are calculated not only for the whole program but also for each interval in it. Repeated execution of an interval can be considered as the execution, on the same processors, of a separate program consisting of the interval's statement sequence repeated as many times as it was executed during the real parallel program run. In practice, the characteristics of intervals that are executed many times are accumulated over all executions. Intervals that belong to the same higher level interval are identified by the source file name and the line number at which the interval begins, and optionally by an integer number assigned to the interval by the programmer.
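The interval hierarchy described above can be pictured as a tree whose nodes accumulate their characteristics on every entry. The following Python sketch is only an illustration of that idea, not the predictor's actual data structure: the class name Interval, its fields, and the file name jacobi.c are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# Key that distinguishes sibling intervals:
# (source file name, first line of the interval, optional user-assigned number).
IntervalKey = Tuple[str, int, Optional[int]]


@dataclass
class Interval:
    key: IntervalKey
    level: int                 # 0 for the whole program
    count: int = 0             # number of times the interval was entered
    exec_time: float = 0.0     # time accumulated over all entries
    children: Dict[IntervalKey, "Interval"] = field(default_factory=dict)

    def child(self, file: str, line: int, expr: Optional[int] = None) -> "Interval":
        """Return the nested interval with this key, creating it on first use."""
        key = (file, line, expr)
        if key not in self.children:
            self.children[key] = Interval(key, self.level + 1)
        return self.children[key]

    def record(self, elapsed: float) -> None:
        """Accumulate the characteristics of one execution of the interval."""
        self.count += 1
        self.exec_time += elapsed


# Hypothetical program: one first level interval entered three times.
program = Interval(("jacobi.c", 1, None), level=0)
program.record(10.0)
for elapsed in (1.5, 1.5, 2.0):
    program.child("jacobi.c", 42).record(elapsed)
loop = program.child("jacobi.c", 42)
```

Because the key contains the file name, the line number and the optional number, repeated entries into the same loop land in the same node, which mirrors the accumulation described above.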
Splitting the program into intervals is controlled at the compilation stage. The programmer can choose compilation modes in which the intervals are parallel loops, or sequential loops containing parallel loops, or all sequential loops in the program, or marked statement sequences. Compiler options that control splitting the program into intervals are explained in the corresponding user manuals (see C-DVM compiler user guide, Fortran-DVM compiler user guide).
Note that when debugging program performance, the user does not have to run the program with the full volume of computations required for solving real tasks. For example, the number of regularly repeated outer iterations can be limited to one or two. In that case the program efficiency coefficient, which then depends heavily on losses in the program parts executed before the first iteration or after the last one, can decrease significantly. However, the user can declare the execution of an outer iteration as a separate interval and debug the performance of that interval only.
4 Compilation and execution of the program
After splitting the program into intervals it is necessary to perform the following steps:
For convenience, these steps can be performed with the following commands:
dvm c <C-DVM-program name>
dvm f <F-DVM-program name>
dvm runpred <DVM-program name>
Processing result: File <DVM-program name>.ptr is created in the current directory.
Note 1. The following parameters are set automatically during the program execution:
Is_DVM_TRACE=1; | - trace on; |
FileTrace=1; | - accumulate trace in files; |
MaxIntervalLevel=3; | - maximum level of nested intervals; |
PreUnderLine=0; | - do not underline call in trace file; |
PostUnderLine=0; | - do not underline ret in trace file; |
MaxTraceLevel=0; | - maximum trace depth for nested functions. |
The parameters PreUnderLine, PostUnderLine and MaxTraceLevel tell Lib-DVM not to accumulate lines of underscores in the trace and not to trace nested Lib-DVM calls, which makes the trace file much smaller.
Note 2. To run the parallel program with an explicitly defined processor configuration, or with dynamic adjustment to the allocated processors, it is necessary to define the corresponding virtual processor system with the IsUserPS and UserPS parameters.
For example, to define a 2*2 virtual processor system, use the following parameter values:
IsUserPS=1; | - use virtual processor system definition, |
UserPS=2,2; | - virtual processor system topology. |
5 Preparation of the processor system configuration file
The input data for simulating the time characteristics of program execution include not only the trace of the program execution on one processor but also configuration information that describes the target computer.
Conceptually, the work of the predictor consists of three basic stages:
Configuration information saved in predictor.par file defines the characteristics of the dedicated (simulated) multiprocessor system and has the following structure:
// System type = network | transputer
type = network;
// Communication characteristics (mks)
start time = 75;
send byte time = 0.2;
// Comparative processors performance
power = 1.00;
// topologies
//root topology
topology = {4 , 2} ;
Lines beginning with // are comments. The topology parameter defines the topology of the dedicated processor system, i.e. its rank and its size in each dimension. For example, topology = {4, 2} defines a 4*2 processor matrix. If a virtual processor system was defined with the IsUserPS and UserPS parameters, the dedicated processor system topology must coincide with the virtual processor system topology.
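To make the file layout concrete, here is a minimal Python sketch that reads such a configuration and derives the number of processors from the topology. It is not the predictor's own parser: the grammar ("name = value;" statements with // comments) is inferred from the sample above only, and the function names parse_par and parse_topology are hypothetical.

```python
import math
import re
from typing import Dict, List


def parse_par(text: str) -> Dict[str, str]:
    """Parse 'name = value;' statements, ignoring // comments (assumed grammar)."""
    params: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.split("//", 1)[0]        # drop a trailing or whole-line comment
        for statement in line.split(";"):
            if "=" in statement:
                name, value = statement.split("=", 1)
                params[name.strip()] = value.strip()
    return params


def parse_topology(value: str) -> List[int]:
    """Turn '{4 , 2}' into [4, 2]: one size per dimension."""
    return [int(n) for n in re.findall(r"\d+", value)]


SAMPLE = """\
// System type = network | transputer
type = network;
start time = 75;
send byte time = 0.2;
power = 1.00;
topology = {4 , 2} ;
"""

params = parse_par(SAMPLE)
topology = parse_topology(params["topology"])
processors = math.prod(topology)             # 4 * 2 processors in the matrix
```

The same parse_topology helper could be applied to a UserPS value such as 2,2 to check that the two topologies coincide before starting a prediction.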
If the value of the type parameter is network, the dedicated processor system is a network of workstations with bus architecture; the value transputer means that the dedicated processor system uses a transputer system as its communication network.
The start time and send byte time parameters are characteristics of the communication hardware and are used to calculate message transmission times. For example, for a workstation network a linear approximation is used to calculate the time of transmission of n bytes:
T = (start time) + n * (send byte time),
where:
start time | - start time of the data transmission; |
send byte time | - time of 1 byte transmission; |
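As a worked example of this linear approximation, the Python sketch below evaluates the formula with the values from the sample configuration file. The function name transmission_time is hypothetical, and the result is in the same time units as the two parameters (the sample file labels them "mks").

```python
def transmission_time(n_bytes: int, start_time: float, send_byte_time: float) -> float:
    """Linear estimate T = (start time) + n * (send byte time)."""
    return start_time + n_bytes * send_byte_time


# Values taken from the sample configuration: start time = 75, send byte time = 0.2.
t_large = transmission_time(1000, start_time=75, send_byte_time=0.2)  # 75 + 200
t_small = transmission_time(8, start_time=75, send_byte_time=0.2)     # 75 + 1.6
```

With these parameters the start time dominates short messages, so under this model many small messages cost noticeably more than one message carrying the same total number of bytes.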
The power parameter defines the ratio of the performance of a processor of the dedicated system to the performance of the computer on which the trace for the predictor was produced.
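One plausible reading of this parameter, sketched in Python below, is that a computation time measured in the trace is divided by power to obtain the time on the dedicated system. This is an assumption made for illustration (a power of 2 would mean a processor twice as fast), not a description of the predictor's internal code, and the function name scale_cpu_time is hypothetical.

```python
def scale_cpu_time(workstation_time: float, power: float) -> float:
    """Estimate the time on the target processor, assuming time scales as 1 / power."""
    if power <= 0:
        raise ValueError("power must be positive")
    return workstation_time / power


same_speed = scale_cpu_time(12.0, power=1.0)   # target as fast as the workstation
twice_fast = scale_cpu_time(12.0, power=2.0)   # target twice as fast
```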
6 Predictor command line options
To start the predictor the following command line may be used:
predictor <par_file> <trace_file> <html_file> [<processor>]
where:
<par_file> | - parameter file name, containing configuration information of target computer; |
<trace_file> | - name of the file with trace information; |
<html_file> | - name of the file where HTML-pages with the program results are placed. |
<processor> | - processor topology that is a number of processors for each dimension of processor grid. This parameter is similar to topology parameter in the configuration file and overrides it. |
All command line parameters except <processor> are required.
For convenience, the following command is provided to start the predictor:
dvm pred [processor matrix] <DVM-program name>
Program execution is controlled by the following environment variables (see the dvm.bat file):
For normal operation the predictor needs the trace file <DVM-program name>.ptr, which can be obtained with the dvm runpred command.
Processing result:
The result of the predictor's work is the HTML file <DVM-program name><processor matrix>.html. If a Web browser name (for example, Netscape) is set, the browser is invoked immediately after the program terminates. In that case the command completes only after the browser is closed.
7 Results of predictor's work
The structure of the output HTML file mirrors the interval structure of the program. Every fragment of the HTML file corresponds to an interval and contains data characterizing the interval, the integral characteristics of program execution on the interval, and links to the fragments describing nested intervals. The HTML file is thus a tree of intervals with special buttons for traversing the tree in any direction.
Any DVM-program can be divided into intervals of 4 types. The execution of the whole program is considered an interval of the highest (zero) level and forms a separate interval type. This interval can include several intervals of the next (first) level. Such intervals are parallel loops, called parallel intervals (PAR); sequential loops, called sequential intervals (SEQ); and statement sequences marked by the programmer, called user intervals (USER). First level intervals can include any set of second level intervals, and so on.
Every HTML-file fragment containing interval characteristics begins with a yellow field that holds the interval key: File, the name of the DVM-program source file; Line, the line number of the first statement of the interval; Type, the interval type; and, possibly, Expr, an integer expression specified by the programmer. The yellow field also contains two further interval characteristics: Count, the number of entries into the interval, and Level, the nesting level of the interval.
At the end of the simulation, the integral characteristics of program execution on every interval are calculated and written to the HTML file. The blue field contains the following characteristics:
Processors | topology of the dedicated processor system; |
Efficiency | efficiency coefficient, equal to the ratio of Productive time to Total time; |
Execution time | execution time, equal to the maximum of the interval execution times over all processors; |
Total time | total time of processor usage, i.e. the product of Execution time and the number of processors; |
Productive time | productive time, equal to the sum of Productive time: CPU, Productive time: SYS and Productive time: I/O; |
Productive time: CPU | productive processor time (without system overhead); |
Productive time: SYS | productive system overhead; |
Productive time: I/O | sum of the input/output operation times inside the interval, without communication overhead; |
Lost time | total lost time inside the interval; |
Lost time: Insufficient parallelism | total time lost due to insufficient parallelism (duplicated computations on some processors); |
Lost time: Insufficient parallelism: URS | time lost because of insufficient parallelism (without system losses); |
Lost time: Insufficient parallelism: SYS | system time lost because of insufficient parallelism; |
Lost time: Communication | total communication time (the sum of the communication times of all collective operations); |
Lost time: Communication: SYN | time lost because of desynchronization of communication operations; |
Lost time: Idle | processor idle time; |
Load_Imbalance | time of processor load imbalance during the execution of parallel calculations; |
Synchronization | time lost because of desynchronization of collective operations; |
Overlap | total time of overlap of communications and calculations. |
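The relations between the first rows of this table can be checked with a short calculation. The Python sketch below follows only the definitions given above (Execution time as a maximum, Total time as a product, Efficiency as a ratio). The function name, the sample numbers, and the assumption that the productive time components are already summed over all processors are illustrative and not taken from the predictor.

```python
from typing import Dict, List


def integral_characteristics(exec_times: List[float],
                             productive_cpu: float,
                             productive_sys: float,
                             productive_io: float) -> Dict[str, float]:
    """Combine per-processor interval times into the summary values of the blue field."""
    processors = len(exec_times)
    execution_time = max(exec_times)              # slowest processor defines the interval
    total_time = execution_time * processors      # processor usage over all processors
    productive_time = productive_cpu + productive_sys + productive_io
    return {
        "Execution time": execution_time,
        "Total time": total_time,
        "Productive time": productive_time,
        "Efficiency": productive_time / total_time,
    }


# Hypothetical interval executed on 4 processors.
summary = integral_characteristics([10.0, 8.0, 9.0, 10.0],
                                   productive_cpu=28.0,
                                   productive_sys=2.0,
                                   productive_io=2.0)
```

In this made-up case 32 of the 40 processor time units are productive, so the efficiency coefficient is 0.8.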
The green field contains characteristics of collective operations (I/O, Reduction, Shadow, Remote access and Redistribution). The following characteristics are given for every operation type: the number of operations, the communication time, the time lost because of desynchronization of communication operations, and the time of overlap of communications and calculations.
8 Protocol of predictor's work
While the predictor is running, the file logfile.txt in the current directory accumulates information about the predictor stages, about functions not processed by the predictor, and about errors encountered during the predictor's work.
The file can be viewed in any text editor.