Performance Analyzer. Preliminary design

Parallel Program Performance Analyzer (PPPA)
Preliminary design
* 21 June 2000 *

Contents

1 Functions of Performance Analyzer
2 The Content of PPPA
3 Approach and Principle of Performance Analyzer

3.1 Splitting program into intervals
3.2 Accumulation of the information
3.3 Processing the information

3.3.1 Calculation of the characteristics of execution on each processors
3.3.2 Calculation of dissynchronization characteristic and time variation of collective operations
3.3.3 Calculation of overlap time
3.3.4 Calculation of the comparative characteristics of execution on different processors
3.3.5 Calculation of the main characteristics

1 Functions of Performance Analyzer

The performance analyzer is intended for the analysis and debugging of DVM-programs execution efficiency. With the performance analyzer the user can get the execution time characteristics of his program in more or less detail.

The efficiency of execution of the parallel programs on multiprocessor computers with the distributed memory is determined by the following major factors:

program parallelism - a part of parallel calculations in the total volume of calculations;
balance of processor load during parallel calculations;
time of interprocessor communications;
degree of overlapping of interprocessor communications and calculations.

The DVM-system has an information whether sequential or parallel part of the program is executed on any processor at any moment. This approach is an essential advantage in comparison with the approaches based on explicit use of communication libraries (MPI, PVM). Besides, all synchronization operations of the program are known. Therefore there is an opportunity to quantify the influence of four above factors on the program execution efficiency. It is impossible to distinguish sequential calculations from parallel and to determine a degree of parallelism of the program, if the parallel program is submitted as a system of cooperating processes using communication libraries explicitly.

The opportunity to distinguish sequential and parallel parts of the program during its execution on the multiprocessor computer allows the performance analyzer to give the user the following basic parameters of the parallel program execution:

execution time;
parallelism efficiency coefficient;
lost time.

Execution time is the maximum from the times of the program execution on each processor.

To calculate the main characteristic of parallel execution (parallelism efficiency coefficient) it is necessary to calculate two amount of time. First, an effective time required for the program execution on serial computer. Second, a total time of parallel execution, calculated as a product of execution time by number of processors. It is important to calculate total time of the parallel execution in this way (instead of the sum of times of execution on all processors) because all processors are allocated to the program at the moment of program start and are released after the end of program execution. Parallelism efficiency coefficient is a ratio of the effective time to the total time of parallel execution.

The lost time is the total time of parallel execution subtracted by the effective time. If the programmer is not satisfied with parallelism efficiency coefficient he should analyze which components the lost time consists of.

Thus, there are following independent components of the lost time:

Losses because of parallelism insufficiency that causes replication of execution on several processors;
Losses because of lost time of the processors, where the program execution is completed earlier then on others;
Losses because of interprocessor communication.

These components are calculated using total time of parallel execution. Therefore the lost time includes also the times, when the processors spent idling, i.e. variations between the execution times on different processors results in increasing the lost time. These variations can occur due to different time of execution of common actions (reduction operations, execution of sequential or parallel parts) on different processors. Besides such imbalance of processors loading at different stages of the parallel program execution can result in growing processor synchronization time. Therefore imbalance and dissynchronization characteristics should be calculated and given the user. Besides for collective operations requiring processors synchronization (reduction operations and loading of buffers), potential dissynchronization time is determined, as the time, which may be lost because of non-simultaneous start of collective operation execution on different processors. Dissynchronization can occur not only due to imbalance, but also due to distinctions of completion times of collective operations on different processors. To evaluate the dissynchronization a programmer is provided by a special characteristic - time variation of collective operation completion.

For more detailed analysis of the program efficiency user should be able to get characteristics of participation of each processor in execution of the parallel program. Besides the user can apply special language features to split program into intervals and to get performance characteristics for each interval.

2 The Content of PPPA

The performance analyzer consists of two subsystems - accumulation subsystem and subsystem of information processing.

The first subsystem provides accumulation of execution characteristics of parallel program on each processor. This subsystem is called from Lib-DVM during the parallel program execution. Besides C-DVM and Fortran DVM languages have the features for description of intervals of the program execution, for which the user would like to get the performance characteristics. The compiler inserts accumulation subsystem calls at the beginning and the end of each interval. The information from every processor outputs into a file upon the termination of the program.

The second subsystem running on a workstation, processes the information gathered on parallel computer and outputs performance characteristics requested by user.

3 Approach and Principle of Performance Analyzer

3.1 Splitting program into intervals

The execution of the program can be represented as a sequence of intervals. By default, the program is one interval. Also user can define intervals by means of C-DVM and Fortran DVM languages. There is also opportunity to set a mode of compilation, when all sequential loops, containing parallel loops, or all sequential loops or marked sequences of program statements may be declared as intervals.

The user can also split any interval into smaller intervals or to unite neighbor intervals (in order of execution) in a new one, i.e. to present the program as hierarchy of intervals of several levels (the whole program is an interval of highest level).

The mechanism of splitting the program into intervals serves for more detailed analysis of behavior of the program during its execution. Looking through results with the help of the performance analyzer, user can set the depth of details to leave out of consideration intervals of the requested.

3.2 Accumulation of the information

The following information is stored in memory and written into a file on each processor:

The information on runtime system and hardware-software environment (dimensions of multiprocessor system, number of processors, type of communication system).
The information on each interval (type of the interval, the number of the interval, the number of the level, number of entries to the interval, the name of a source file and the number of a line in it, corresponding to the beginning of the interval) and the amounts of time concerning this interval:

Time of execution Execution_Time

Processor time of program calculations User_CPU_time

Time of system calculations Sys_CPU_time

Time of losses because of insufficient parallelism Insufficient_par

Times of the message exchange during the communication function execution Communication

Time of real synchronization Real_synchronization – a part of message passing time, lost, because a receive operation has started on one processor before the corresponding send operation on other processor

Time of exchanges with files I/O time.

The information on times of operations requiring synchronization (reduction, updating of shadow edges, loading of buffers).

3.3 Processing the information

The information saved in file during parallel execution of the program is processed then on a workstation as follows.

3.3.1 Calculation of the characteristics of execution on each processors

At this stage idle, load imbalance and lost times are calculated at every processor.

Processor idle time (Idle_time) is calculated as a difference between maximal interval execution time at any processor and interval execution time at the current processor. This characteristic is determined if the interval has been always executed at the same processors.
Load imbalance (Load_imbalance) is calculated as a difference between maximal sum of two processor times (user program processor time and system processor time) and sum of the same two processor times at the current processor.
Lost time (Lost_time) is calculated as a sum of it components - looses because of insuficient parallelizm, communications and idles.

3.3.2 Calculation of dissynchronization characteristic and time variation of collective operations

For collective operations requiring synchronization of processors (reduction, updating of shadow edges), the appropriate dissynchronization times are determined for each processor - times which may be lost during message passing due to non-simultaneous start of collective operation execution on different processors. The variation of completion times is calculated for the collective operations. This characteristic shows possible dissynchronization losses arising when collective operations complete not simultaneously on different processors.

When collective operation dissynchronization is calculated, at first the time Tmax of the operation start on the last of processors is calculated. Then for each processor, started the collective operation at time Ti, dissynchronization is calculated. It is equal to a difference Tmax- Ti. Time variation of collective operations is calculated similarly, but the times of the operation completion are used. These characteristics are calculated for all collective operations excepting times of asynchronous operation starts.

3.3.3 Calculation of overlap time

Time of reduction overlap (Reduction_overlap) is the sum of time intervals which begin from the last completion of a start reduction operation (Start_reduction) on any processor and finish when other processors start corresponding wait_reduction operation (Wait_reduction). Time of shadow renewing overlap (Shadow_renewing_overlap) and times of overlap of other collective operations which have of start and wait parts are calculated in the same way.

3.3.4 Calculation of the comparative characteristics of execution on different processors

For each of the program execution characteristics, accumulated on different processors, two processors are determined, which parameters are maximal and minimal. For each characteristic the numbers of these two processors and three values of the characteristic - minimal, maximal and average are given.

3.3.5 Calculation of the main characteristics

These characteristics concern all parallel program and its intervals and are calculated using the characteristics received on each processor. According to the degree of details, given by the user, (level of intervals) the following information on performance execution of the program can be given.

Efficiency coefficient (Efficiency) is a ratio of productive time to total processor time;
Time of execution (Execution_time) is the maximal time from the times of interval execution on each processor;
A number of processors provided (Processors) is established equal to the value of the same characteristic, calculated for each processor. If it is differ for some processors, then the characteristic is considered to be undetermined;
Total processor time (Total_time) is production of the time of execution (Execution_time) by a number of processors provided (Processors);
Productive time (Productive_time) is sum of three components: processor time of user program (CPU), system processor time (Sys) and time of input/output (I/O);
Lost time (Lost_time) and its components (Insufficient_parallelism, Communication, Idle) are calculated as the sum of the corresponding characteristics for all processors.

Time of imbalance (Load_Imbalance), real synchronization (Real_synchronization), potential synchronization (Synchronization), time variation (Time_variation), time of overlapping (Overlap) and all their components are calculated similarly.