Prefictor of DVM-program performance. Preliminary design

Predictor of DVM-program performance
Preliminary design
* October, 2000 *

- last edited 20.10.00 -

Contents

1 Functions of Predictor
2 The Content of Predictor
3 Approach and Principle of Predictor Implementation

3.1 Representation of the program as hierarchy of intervals
3.2 Processing trace information

3.2.1 Accumulation of the information about intervals
3.2.2 Simulation of parallel execution of DVM-program
3.2.3 Calculation of the main performance characteristics

1 Functions of Predictor

The predictor is intended for performance analysis and debugging of DVM-programs without usage of a real parallel computer (access to which is usually limited or complicated). With the predictor user can get the predicted temporal characteristics of execution of his program on MPP or workstation cluster in more or less detail.

The performance of parallel programs on multiprocessor computers with distributed memory is determined by the following major factors:

program parallelism - a part of parallel calculations in the total volume of calculations;
balance of processor load during parallel calculations;
time needed for execution of interprocessor communications;
degree of overlapping of interprocessor communications with calculations.

The DVM-system has an information whether sequential or parallel part of the program is executed on any processor at any moment. This approach is an essential advantage in comparison with the approach based on explicit use of communication libraries (MPI, PVM). Besides, all synchronization operations of the program are known. Therefore there is an opportunity to quantify the influence of four above factors on the program execution efficiency. It is impossible to distinguish sequential calculations from parallel ones and to determine a degree of parallelism of the program if the parallel program is submitted as a system of cooperating processes using communication libraries explicitly.

The opportunity to distinguish sequential and parallel parts of the program during its execution on the multiprocessor computer allows the predictor to give a user the following basic parameters of the parallel program execution:

execution time;
efficiency coefficient;
lost time.

Execution time is the maximum from the times of the program execution on each processor.

To calculate the main characteristic of parallel execution (efficiency coefficient) it is necessary to calculate two amount of time. First, a productive time required for the program execution on serial computer. Second, a total processor time, calculated as a product of execution time by number of processors. Efficiency coefficient is a ratio of the productive time to the total processor time.

The lost time is the total processor time of parallel execution subtracted by the productive time. If the programmer is not satisfied with efficiency coefficient he should analyze components of the lost time and their origin.

There are following components of the lost time:

Losses because of insufficient parallelism that causes replication of execution on several processors (insufficient parallelism);
Losses because of execution of interprocessor communication (communication);
Losses because of idle time of the processors on which the program execution has been completed earlier than on other (idle).

To estimate the total potential losses, which can arise because of non-simultaneous start of collective operations on different processors, the special characteristic - synchronization should be calculated and given the user. A main reason of these losses is imbalance of processors loading during execution of parallel loops. For estimating the potential losses because of imbalance, the generalized characteristic imbalance should be given the user. Dissynchronization can occur not only due to imbalance, but also due to differences in completion times of collective operations on different processors. To evaluate the potential dissynchronization the programmer is provided with a special characteristic – time variation of collective operation completion.

An important characteristic, showing the degree of overlapping of interprocessor communications with calculations, is the time of overlapping.

For more detail analysis of the program efficiency user can apply special language features to split program into intervals and to get performance characteristics for each interval.

2 The Content of Predictor

The predictor is the system for processing the trace information gathered by LIB-DVM system during DVM-program execution on a workstation (in case of several workstations there is special utility to unite several traces in one). This system using trace information and parameters given by the user calculates and produces for him the predicted temporal characteristics of the program execution on MPP or on workstation cluster, using the library, which simulates a parallel execution of DVM-programs.

3 Approach and Principle of Predictor Implementation

3.1 Representation of the program as hierarchy of intervals

The execution of the program can be represented as a sequence of intervals. By default, the program is considered as one interval. Also user can define intervals by means of C-DVM and Fortran DVM languages (Constraint: when predictor is used, the integer expression with an interval must not appear inside parallel loop).

There is also opportunity to set a mode of compilation, when all parallel loops, or all sequential loops, containing parallel loops or all sequential loops are declared as intervals.

The user can also split any interval into smaller intervals or unite neighbor intervals (in order of execution) in a new one, i.e. to present the program as hierarchy of intervals of several levels (the whole program is an interval of highest level).

The mechanism of splitting the program into intervals serves for more detail analysis of behavior of the program during its execution. Looking through results with the help of the predictor, user can set a depth of details to leave out of consideration intervals of prescribed levels.

3.2 Processing trace information

For simplification of the further description we shall enter the following notions. An interval will be named simple, if it does not contain other intervals (nested intervals). We refer to intervals including nested intervals as composite. While processing trace information some intervals are active intervals (entered but not exited). The active interval of the lowest level, we shall name a current interval.

The trace information saved in file during the DVM-program execution is processed then on a workstation as follows.

3.2.1 Accumulation of the information about intervals

During processing the trace, the information on each interval is gathered (type of the interval, the number of the interval, the number of the level, number of the interval entries, source file name and a line number corresponding to the beginning of the interval). Also for current intervals the following times describing its execution on a serial computer are calculated (the processor performance is supposed to be equal to processors performance of a target system):

Productive processor time (Productive_CPU_time_usr) (It is determined on user regions of the intervals.)
Productive system time of system calculations (Productive_CPU_time_sys) (It is determined on system regions of the intervals.)
Productive processor time (Productive_CPU_time) (It is calculated as the sum of the productive times of user and system calculations.)
Time of input/output (I/O_time).
When these times are collected the excessive overheads needed for output of the trace information into the file during program execution on the workstation also is taken into account.

3.2.2 Simulation of parallel execution of DVM-program

The simulation of basic LIB-DVM functions is performed, using the trace information accumulated in the file and parameters of the multiprocessor system given by the user. It allows to determine times necessary for execution of interprocessor communication, and also information about distribution of calculations between processors. Thus, the following times describing parallel execution of intervals are calculated (these characteristics are generalized for all processors):

Processor time (CPU_time)
Synchronization (Synchronization) for all types of collective operations (Start_reduction, Wait_reduction, Start_shadow, Wait_shadow, Remote_access, Redistribution and I/O)
Time variation (Time_variation) of collective operation completion
Communication times for all types of collective operations (Start_reduction, Wait_reduction, Start_shadow, Wait_shadow, Remote_access, Redistribution and I/O).
Total communication time (Communication) (Is the sum of communication times of all collective operations.)
Time of losses because of insufficient parallelism during system calculations (Insufficient_parallelism_sys).
Time of losses because of insufficient parallelism during user calculations (Insufficient_parallelism_usr).
Total time of losses because of insufficient parallelism (Insufficient_parallelism) (For any interval equal to the sum of times of losses because of insufficient parallelism during user and system calculations.)
Time of reduction overlapping (Reduction_overlap).
Time of shadow edges renewing overlap (Shadow_overlap).
Time of overlapping of communications by calculations (Overlap) (Is calculated as the sum of two base characteristics – time of reduction overlapping and time of shadow edges renewing overlap.)
Idle (Idle).

At the exit from composite intervals the correction of all characteristics described in sections 3.2.1. and 3.2.2. (except the characteristic Idle) is performed. Also each characteristic of any interval of the i-th level is calculated by means of adding to it the same characteristics of all nested intervals of the (i+1)-th level.

After correction and also at the exit from simple intervals the characteristic imbalance (Load_Imbalance) is calculated.

3.2.3 Calculation of the main performance characteristics

These characteristics concern all parallel program and its intervals. According to the degree of details, given by the user (level of intervals), the following information on predicted performance execution of the program can be given:

Time of input/output (I/O_time).
Imbalance (Load_Imbalance).
Synchronization (Synchronization) and all its components.
Time variation (Time_variation).
Losses because of insufficient parallelism (Insufficient_parallelism) with all its components.
Communication (Communication) and its components for all types of collective operations.
Time of overlapping (Overlap) with components.
Productive processor time (Productive_CPU_time).
Productive time (Productive_time) is sum of two components: productive processor time (Productive_CPU_time) and time of input/output (I/O_time).
Lost time (Lost_time) are calculated as the sum of its components: insufficient parallelism (Insufficient_parallelism), communication (Communication) and idle (Idle).
Total processor time (Total_time) is sum of productive time (Productive_time) and lost time (Lost_time).
Time of execution (Execution_time) is a ratio of total processor time (Total_time) to number of processors.
Efficiency coefficient (Efficiency) is a ratio of productive time (Productive_time) to total processor time (Total_time).