|created: april 2001||- last edited 06.10.02 -|
1.1 Parallel programming models
Three parallel programming models are now prevalent in large, scalable systems (see Fig. 1.1): message-passing model (MPM), shared-memory model (SMM) and data parallel model (DPM).
Message-passing model. In message passing model each process has its own local address space. Common data processing and synchronization are performed by message passing. Summarizing and standardization of different message-passing libraries resulted in MPI  standard development.
Shared-memory model. In shared memory model processes share common address space. Because there are no limitations to common data usage a programmer must explicitly specify common data and regulate an access to the data using synchronization tools. In high level languages logically independent threads are defined at the level of functional tasks or at level of loop iterations. Summarizing and standardization of shared memory models resulted in OpenMP  standard development.
Data parallel model. In data parallel model a process notion is absent and, as result, explicit message passing or explicit synchronization are absent. In the model data are distributed on nodes (processors) of computing system. Sequential program is translated by a compiler either in message passing model program or shared memory model program (Fig. 1.1). The computations are distributed according to the owner-computes rule: each processor performs only the computations of own data, that is the data, allocated on the processor.
In comparison with two previous models DPM has obvious advantages. A programmer is freed from tedious efforts of distributing the global array onto local arrays, explicit management of sending and receiving message, or explicit synchronization of parallel processes. But application area of the model is the object of research. The research results show that a performance of many algorithms of scientific computing in DPM model is comparable with the performance of the realization in MPM and SMM models.
HPF1  development was the first attempt of DPM standardization. MPM and SMM model standardization is done on the base of large experience of implementations and practical applications summarizing. HPF1 standard was developed on the base of theoretical researches and 2-3 experimental implementations. Moreover, the standard was based on automatic parallelization of calculation and automatic synchronization of shared data access. First HPF1 implementations showed inefficiency of the standard for modern methods of calculations (in particular, for irregular calculations). In the next version of the standard HPF2  a step was done to "manual" control of parallel execution performance. In particular, the distribution of computations and common reduction variable specification were defined.
Fig. 1.1. Three models of parallel programming
1.2 DVM-approach to parallel program development
DVM-system provides unified toolkit to develop parallel programs of scientific-technical calculations in C and Fortran 77.
DVM parallel model. DVM parallel model is based on data parallel model. The DVM name reflects two names of the model - Distributed Virtual Memory and Distributed Virtual Machine. These two names show that DVM model is adopted both for shared memory systems and for distributed memory systems. DVM high level model allows not only to decrease cost of parallel program development but provides unified formalized base for supporting Run-Time System, debugging, performance analyzing and prediction.
Languages and compilers. As distinct from HPF the goal of full automation of computation parallelization and the common data access synchronization is not assumed in DVM system. Using high level specifications, a programmer has full control of parallel program performance. On the other hand, in the process of Fortran DVM design and development the compatibility with the subsets of HPF1 and HPF2 standards was kept.
Unified parallel model is built in C and Fortran 77 language on the base of the constructions, that are "transparent" for standard compilers, that allows to have single version of the program for sequential and parallel execution. C-DVM and Fortran DVM compilers translates DVM-program in C or Fortran 77 program correspondingly, including parallel execution Run-Time Support system calls. So only requirement to a parallel system is availability of C and Fortran 77 compilers.
Execution and debugging technique. Unified parallel model allows to have unified Run-Time Support system for both languages and, as result, unified system of debugging, performance analyzing and prediction. There are following modes of DVM-program execution and debugging:
Following debugging modes are provided in pseudo-parallel and parallel execution modes:
2 Language overview
2.1 Programming model and model of parallelism
Fortran DVM language is the extension of Fortran 77 language . The extension is implemented via special comments, named directives. FDVM directives may be conditionally divided on three subsets:
FDVM parallel model is based on specific form of data parallelism called SPMD (Single Program, Multiple Data). In this model the same program is executed by all the processors concerned, but each processor performs its own subset of statements in accordance with the data distribution.
First, in FDVM model a user defines multidimensional arrangement of virtual processors, which sections data and computations will be mapped on. The section can be varied from the whole processor arrangement up to a single processor.
Then the arrays to be distributed over processors (distributed data) are determined. These arrays are specified by data mapping directives (section 4). The other variables (distributed by defaults) are mapped by one copy per each processor (replicated data). A value of replicated variable must be the same one on all the processors concerned. Single exception is the variables in parallel constructions (see sections 5.1.3, 5.1.4 and 7.5).
FDVM model defines two parallelism levels:
Data parallelism is implemented by distribution of tightly nested loops over the processors (section 5). Each iteration of such loop is executed on one processor entirely. The statements located outside parallel loop are executed according own computation rules (section 5.2).
Task parallelism is implemented by distribution of data and independent computations over sections of processor arrangement (section 7).
When calculating the value of own variable, the processor may need in values of as own as other (remote) variables. All remote variables must be specified in remote data access directives (section 6).
2.2 Syntax of FDVM directives
The syntax of FDVM directives is described using a Backus-Naur form and the following notations:
|is||is by definition|
|or||an alternative construct|
|[ ]||encloses optional construct|
|[ ]||encloses an optionally repeated construct which may occur zero or more times|
|x-list||x [ , x ]|
Syntax of the directive.
|directive-line||is CDVM$ dvm-directive|
|or *DVM$ dvm-directive|
No statements may be interspersed within a continued directive. A directive line must not appear within a continued statement. An example of a directive continuation follows. Note that column 6 must be blank, except when signifying continuation.
CDVM$ ALIGN SPACE1(I,J,K)
CDVM$* WITH SPACE(J,K,I)
3 Virtual processor arrangements. PROCESSORS directive
The PROCESSORS directive declares one or more rectangular virtual processor arrangements.
|processors-directive||is PROCESSORS processors-decl-list|
|processors-decl||is processors-name ( explicit-shape-spec-list )|
|explicit-shape-spec||is [ lower-bound : ] upper-bound|
The intrinsic function NUMBER_OF_PROCESSORS( ) can be used to determine the number of real processors, provided to a program.
It is possible to use several virtual processor arrangements of different shape if the number of processors in every arrangement is equal to the value of function NUMBER_OF_PROCESSORS( ). If two virtual processor arrangements have the same shape, then corresponding elements of the arrangements are referred to the same virtual processor.
Example 3.1. Declaration of virtual processor arrangements.
CDVM$ PROCESSORS P( N ) CDVM$ PROCESSORS Q( NUMBER_OF_PROCESSORS( ) ), CDVM$* R(2, NUMBER_OF_PROCESSORS( )/2)
The value N has to be equal to the value of the function NUMBER_OF_PROCESSORS ( ).
The processor arrangements are local objects of the procedure. Data arrays with COMMON and SAVE attributes can be mapped on the local processor arrangements, if whenever the procedure is called, the local processor arrangement has the same shape.
4 Data mapping
FDVM supports distribution by blocks (equal and non-equal), inherited distribution, dynamic array distribution and distribution via alignment.
4.1 DISTRIBUTE and REDISTRIBUTE directives
|distribute-directive||is dist-action distributee dist-directive-stuff|
|or dist-action [ dist-directive-stuff ] :: distributee-list|
|dist-directive-stuff||is dist-format-list [ dist-onto-clause ]|
|or GEN_BLOCK ( block-size-array )|
|or WGT_BLOCK ( block-weight-array , nblock )|
|dist-onto-clause||is ONTO dist-target|
[( processors-section-subscript-list )]
|processors-section-subscript||is [ subscript ] : [ subscript ]|
The ONTO clause specifies the virtual processor arrangement or its section. If ONTO clause is omitted, than array distribution is performed by base virtual processor arrangement, that is a parameter of program startup. When REDISTRIBUTE directive without ONTO clause is executed in ON-block, the array is distributed on the section of processor arrangement of this ON-block (see section 7).
Several arrays (A1, A2, ) can be distributed at the same mode by the single directive of the form:
CDVM$ DISTRIBUTE dist-directive-stuff :: A1, A2,
In that case the arrays must have the same rank, but can have different sizes of dimensions.
Let us consider distribution formats for one dimension of the array (one-dimensional array A(N)) and for one dimension of the processor arrangement (one-dimensional array R(P)). Multi-dimensional distributions are considered in section 4.1.5.
4.1.1 BLOCK format
A block of [(N-1)/P] +1 elements are allocated on each processor. It is possible for some ratio between N and P that several last processors do not contain any the array elements.
Example 4.1. Distribution by BLOCK format.
|CDVM$ PROCESSORS R( 4 )||3||3||3|
|REAL A (12), B(13), C(11)||R(2)||4||5||4|
|CDVM$ DISTRIBUTE A (BLOCK) ONTO R||8|
|CDVM$ DISTRIBUTE (BLOCK) ONTO R :: B||8||10||8|
|CDVM$ DISTRIBUTE C (BLOCK)|
4.1.2 GEN_BLOCK format
Distribution by blocks of different sizes allows affecting on processor loading balance for algorithms performing different volume of computations for different parts of arrays.
Let NB( 1:P ) be an integer array. The following directive
CDVM$ DISTRIBUTE A( GEN_BLOCK(NB)) ONTO R
splits array A on P blocks. The block i of size NB( i ) is mapped on processor R( i ).
Example 4.2. Distribution by blocks of different size.
|CDVM$ PROCESSORS R( 4 )||2|
|CDVM$ DISTRIBUTE A ( GEN_BLOCK( BS ) ) ONTO R||9|
|DATA BS / 2, 4, 4, 2 /||R(4)||11|
4.1.3 WGT_BLOCK format
The WGT_BLOCK format specifies distribution by blocks according to their relative "weights".
Let WGT_BLOCK(WB, NBL) format is specified.
WB(i) defines weight of i -th block for 1£ i £ NBL. The blocks are distributed on P processors with balancing of sums of block weights on every processor. The condition
P £ NBL
must be satisfied.
The processor weight is defined as a sum of weights of all the blocks distributed on it . The array dimension is distributed proportionally to processor weights.
BLOCK format is special space of WGT_BLOCK(WB,P) format, where WB(i) = 1 for 1£ i £ P and NBL = P.
GEN_BLOCK format is special case of WGT_BLOCK format with some precision.
The example 4.2 can be rewritten using WGT_BLOCK format in the following way.
Example 4.3. Distribution by blocks according to weights.
CDVM$ PROCESSORS R( 4 )
DOUBLE PRECISION WB(12)
CDVM$ DISTRIBUTE A ( WGT_BLOCK( WB, 12 ) ) ONTO R
DATA WB / 2., 2., 1., 1., 1., 1., 1., 1., 1., 1., 2., 2. /
In the example 4.3 P = 4 and distribution is identical to the example 4.2.
As distinct from distribution by non-equal blocks, distribution by WGT_BLOCK format can be performed for any number of processors from range 1 £ P £ NBL. For given example the size of processor array R can be varied from 1 to 12.
4.1.4 Format of *
Format of * means, that a dimension is localized on each processor (non-distributed or local dimension).
4.1.5 Multidimensional distributions
For multidimensional distributions mapping format is specified for each dimension. The following correspondence is established between dimensions of the array to be distributed and the processor arrangement.
Let the processor arrangement has n dimensions. Number the dimensions not formatted as * from left to right d1, ..., dk. Then dimension di will be mapped on i-th dimension of processor arrangement. The condition k£ n must be satisfied.
Example 4.4. One-dimensional distribution.
|CDVM$ PROCESSORS R1( 2 )||Blocks A||Processors|
|CDVM$ DISTRIBUTE A (BLOCK, *) ONTO R1||1||A( 1: 50,1:100)||1||1|
Example 4.5. Two-dimensional distribution.
|CDVM$ PROCESSORS R2( 2, 2 )||Blocks A||Processors|
|CDVM$ DISTRIBUTE A (BLOCK,BLOCK) ONTO R2||1||2||1||1||2|
4.2 Distribution of dynamic arrays
4.2.1 Dynamic arrays in Fortran 77 program
There are no features to deal with dynamic arrays in Fortran 77. So a user is forced to simulate dynamic memory by so called working arrays. The dynamic memory is declared as a one-dimensional array of large size. The dynamic arrays of different shape are defined as continuous segments in the working array.
Example 4.6. Working array usage.
REAL W(100000) READ (6 , *) N, M C the arrays of size N*N and M*M are required in the program CALL SUB1(W(1), N, W(1+N*N), M) END SUBROUTINE SUB1(A, N, B, M) DIMENSION A(N,N), B(M,M) . . . END
An analysis of available programs showed that there is no certain discipline when dealing with simulated dynamic arrays. In particular, array allocation in memory is not fixed explicitly. An access to dynamic array is implemented by reference to the working array. Therefore the compiler can't to determine the array shape.
4.2.2 Dynamic arrays in FDVM model. POINTER directive
Suggested model is a subset of dynamic array model in Fortran 90. It allows to execute this model without modifications in three program environments:
For dynamic arrays, distributed by default, FDVM allows to use any methods of dynamic memory simulation. For the dynamic arrays, distributed by DISTRIBUTE and ALIGN directives, FDVM defines the following discipline of allocation and usage of dynamic arrays.
* All explicitly distributed arrays are allocated in dynamic memory pool with the name HEAP
REAL HEAP (MAXM)
where MAXM is a number of dynamic memory words.
* Data type and the rank of the dynamic array are fixed by the following FDVM directive
|pointer-directive||is type , POINTER ( dimension-list ) :: pointer-name-list|
The specification type defines data type of the dynamic array. The following constraints exist for the variables specified in POINTER directive.
* The sizes of each dimension and dynamic array allocation in HEAP are fixed by the following statement
pointer = ALLOCATE ( sdim ,... )
|pointer||- is a pointer to integer variable (scalar or array element) with POINTER attribute|
|sdim||- is a integer one-dimensional array of the size ndim. ndim is a rank of multidimensional array, allocated in dynamic memory HEAP. The sdim( i ) value defines the size of i-th dimension. The size of allocated segment is equal to sdim( 1 ) ΄ sdim( 2 ) ΄ ....΄ sdim( ndim ).|
Integer function ALLOCATE returns a number of initial element of allocated segment of dynamic memory HEAP. The ALLOCATE function is programmed by a user, therefore it can have additional arguments besides obligatory sdim one.
* Only the following kind of a dynamic array reference is allowed in a procedure, where dynamic array allocation is performed
Moreover, this reference can be used only as the actual argument of a function or subroutine call.
Let several pools with identifiers ID1 , , IDn. are used in a program. There is no necessity to rewrite the program with single pool HEAP of dynamic memory. It is enough to write following specification
CDVM$ HEAP ID1 , , IDn
But every pool IDi must satisfy requirements above for HEAP pool. The pool can keep only the arrays, distributed by DISTRIBUTE and ALIGN directives.
4.2.3 DISTRIBUTE and REDISTRIBUTE directives for dynamic arrays
Distribution of dynamic arrays is performed by DISTRIBUTE and REDISTRIBUTE directives. The syntax of directives is extended in the following way:
|distributee||is . . .|
If a scalar or the array with POINTER attribute are specified as distributee, then distribution is postponed up to execution of ALLOCATE function, assigning a value to the pointer. Instead of ALLOCATE function, the array creation and distribution is done according to DISTRIBUTE directive formats.
The REDISTRIBUTE directive for dynamic array can be performed only after execution of the ALLOCATE function, assigning a value to correspondent POINTER variable.
If dynamic array pointer is an element of a pointer array, the dynamic array can be distributed by REDISTRIBUTE directive only. As only a reference to pointer name is allowed in REDISTRIBUTE directive, the element of the pointer array should be assigned previously to scalar variable-pointer. The array with PT(I) pointer can be distributed by the following statements sequence:
P1 = PT( I ) CDVM$ REDISTRIBUTE P1( BLOCK, BLOCK )
It is necessary to transform the program from example 4.5 in the following FDVM program.
Example 4.7. Mapping of FDVM dynamic arrays.
REAL HEAP(100000) INTEGER ALLOCATE CDVM$ REAL, POINTER (:,:) :: PA, PB INTEGER PA, PB C descriptors of dynamic arrays INTEGER DESCA(2), DESCB(2) CDVM$ DISTRIBUTE (BLOCK,BLOCK) :: PA, PB C arrays of size N*N and M*M are required in the program READ (6 , *) N, M C shape of the first array DESCA(1) = N DESCA(2) = N C allocation and distribution of the first array PA = ALLOCATE(DESCA,1) C shape of the second array DESCB(1) = M DESCB(2) = M C allocation and distribution of the second array PB = ALLOCATE(DESCB, N*N+1) CALL SUB1(HEAP(PA), N, HEAP(PB), M) END SUBROUTINE SUB1(A, N, B, M) DIMENSION A(N,N), B(M,M) CDVM$ DISTRIBUTE (BLOCK,BLOCK) :: A, B . . . END FUNCTION ALLOCATE(DESC, P) INTEGER DESC(2), P ALLOCATE = P END
Other examples of dynamic array distribution see in section 7.7.
4.3 Distributing by aligning
Aligning array A with distributed array B brings in accordance to each element of array A an element or a section of array B. When array B is distributed array A will be distributed simultaneously. If element of B is mapped on the processor, the element of A, corresponding to element B via alignment, will be also mapped on the same processor.
Method of mapping via alignment performs the following two functions.
4.3.1 ALIGN and REALIGN directives
The following directives describe array aligning:
|align-directive||is align-action alignee align-directive-stuff|
|or align-action [ align-directive-stuff ] :: alignee-list|
|align-directive-stuff||is ( align-source-list ) align-with-clause|
|align-with-clause||is WITH align-spec|
|align-spec||is align-target ( align-subscript-list )|
|align-dummy-use||is [ primary-expr
* ] align-dummy
[ add-op primary-expr ]
|or ( int-expr )|
Let the alignment of two arrays is specified by the directive
CDVM$ ALIGN A(d1, ,dn) WITH B(ard1, ,ardm)
where di is specification of i-th
dimension of aligned array A,
ardj is specification of j-th dimension of base array B.
If di is specified by integer variable I, then there must be at most one dimension of array B, specified by linear function ardj = a*I + b.
Let i-th dimension of array A has bounds LAi : HAi , and j-th dimension of array B, specified by linear function a*I + b , has the bounds LBj : HBj. Since the parameter I is defined on the value set LAi : HAi, then the following conditions must be satisfied:
a*LAi + b ³ LBj , a* HAi + b £ HBj
If di is * , the i-th dimension of array A will be local on each processor independently from array B distribution (it is analogue of local dimension in DISTRIBUTE directive).
If ardi is * , then the array A will be replicated along j-th dimension of the array B.
If ardi is int-expr, then array A is aligned with the section of the array B.
Example 4.8. Aligning arrays
REAL A(10),B(10,10),C(22,22),D(20),E(20),F(10),G(20),H(10,10) CDVM$ DISTRIBUTE B (BLOCK,BLOCK) CDVM$ DISTRIBUTE D (BLOCK) C aligning with the array section C (the vector is aligned with the first line of A) CDVM$ ALIGN A( I ) WITH B(1,I) C replication of the vector aligning it with each line CDVM$ ALIGN F( I ) WITH B(*,I) C the matrix is collapsed; C each matrix column corresponds to the vector element CDVM$ ALIGN C(*,I) WITH D( I ) C alignment of vector with vector using stretching CDVM$ ALIGN E( I ) WITH D(2*I) C alignment vector with vector using reverse ordering CDVM$ ALIGN G( I ) WITH D(-I+21) C alignment matrix with matrix using rotation and stretching CDVM$ ALIGN H(I,J) WITH C(2*J,2*I)
Several arrays (A1, A2, ) can be aligned with one B array in the same manner by the directive of the form:
CDVM$ ALIGN (d1, ,dn) WITH B(ard1, ,ardm) :: A1, A2,
At that A1, A2 arrays must have the same rank (n), but can have different sizes of dimensions.
Let the sequence of alignments A f1 B f2 C, be specified; f2 is alignment of the array B with the array C, and f1 is alignment of the array A with the array B. The arrays A and B are considered as aligned with array C by definition. The array B is aligned by function f2 directly and array A is aligned by composite function f1(f2) indirectly. Therefore applying REALIGN directive to the array B doesn't cause redistribution of array A.
Generally a set of ALIGN specifications is a set of trees. At that every tree root must be distributed by DISTRIBUTE or REDISTRIBUTE directives. When REDISTRIBUTE directive is executed, the whole alignment tree is redistributed.
4.3.2 TEMPLATE directive
If values of linear function a*I + b are beyond base array dimension, it is necessary to define a dummy array - referred to as an alignment template using the following directive.
|template-directive||is TEMPLATE template-decl-list|
|template-decl||is template-name [ ( explicit-shape-spec-list ) ]|
Then it is necessary to align both arrays with the template. The template is distributed by DISTRIBUTE and REDISTRIBUTE directives. The template elements are mapped among processors without real memory allocation. They specify a processor on which elements of aligned arrays must be mapped.
Consider the following example.
Example 4.9. Aligning with template.
REAL A(100), B(100), C(100) CDVM$ TEMPLATE TABC(102) CDVM$ ALIGN B( I ) WITH TABC( I ) CDVM$ ALIGN A( I ) WITH TABC( I+1 ) CDVM$ ALIGN C( I ) WITH TABC( I+2 ) CDVM$ DISTRIBUTE TABC ( BLOCK ) . . . DO 10 I = 2, 98 A(I) = C(I-1) + B(I+1) 10 CONTINUE
To avoid exchange between processors, it is necessary to allocate the elements A(I), C(I-1) and B(I+1) on the same processor. It is impossible to align arrays C and B with array A, because alignment functions I-1 and I+1 cause bounds violation of array A. Therefore the template TABS is declared. The elements of arrays A, B and C, which must be allocated on the same processor, are aligned with the same element of the template.
4.3.3 Aligning dynamic arrays
To specify dynamic arrays alignment the syntax of ALIGN and REALIGN directives is extended in the following way.
|alignee||is . . .|
|align-target||is . . .|
If a variable with POINTER attribute is specified as aligned array (alignee) in ALIGN directive, then directive executing is postponed up to execution of ALLOCATE function, defining the variable value. The REALIGN directive may be executed only after execution of the ALLOCATE function.
Example 4.10. Aligning dynamic arrays.
REAL HEAP(100000) CDVM$ REAL, POINTER (:,:) :: PX, PY INTEGER PX, PY, DESC(2) CDVM$ ALIGN PY(I,J ) WITH PX(I,J) CDVM$ DISTRIBUTE PX (BLOCK,BLOCK) . . . PX = ALLOCATE(DESC, ...) PY = ALLOCATE(DESC, ...) . . . CDVM$ REDISTRIBUTE PX (BLOCK,*)
Let sequence of alignments by ALIGN directives is specified
P1 f1 P2 f2 . . . fN-1 PN
fi is aligning function,
Pi is a pointer to dynamic array.
Then the order of dynamic array allocation (ALLOCATE function execution) must be reverse, i.e.:
PN = ALLOCATE(...)
. . .
P2 = ALLOCATE(...)
P1 = ALLOCATE(...)
If dynamic array pointer is an element of a pointer array, the dynamic array can be aligned by REALIGN directive only. As only a reference to pointer name is allowed in REALIGN directive, the element of pointer array should be assigned previously to scalar variable-pointer. The array with pointer PT(I) can be aligned with the array with pointer PT(J) by the following statements sequence:
P1 = PT(I) P2 = PT(J) CDVM$ REALIGN P1(I,J) WITH P2(I+1,J)
4.4 DYNAMIC and NEW_VALUE directives
The arrays, redistributed by REDISTRIBUTE and REALIGN directives, should be specified in DYNAMIC directive.
|dynamic-directive||is DYNAMIC alignee-or-distributee-list|
If after REDISTRIBUTE and REALIGN directive execution new values will be assigned to the arrays, additional (optimizing) directive NEW_VALUE must precede these directives.
The directive cancels reassigning the redistributed array values.
If the array is specified in DYNAMIC directive and there is no DISTRIBUTE or ALIGN specification for it, its distribution is postponed up to the first REDISTRIBUTE or REALIGN statement. It is neccessary in two cases.
4.5 Default distribution
If the data are not specified in DISTRIBUTE or ALIGN directive , they are distributed on each processor (full replication). The same distribution can be defined by DISTRIBUTE directive with format of * for each dimension. But in that case the access to the data will be less effective.