|created: april 2001||- last edited 03.05.01 -|
7 Task parallelism
DVM parallel model joins data parallelism and task parallelism.
Data parallelism is implemented by distribution of arrays and loop iterations over virtual processor subsystem. Virtual processor subsystem can include as whole processor arrangement as its section.
Task parallelism is implemented by independent computations on sections of processor arrangement.
Let us define a set of virtual processors, where a procedure is executed, as current virtual processor system. For main program the current system consists of whole set of virtual processors.
The separate task group is defined by the following directives.
Several tasks can be described in a procedure. Nested tasks are not allowed.
7.1 Declaration of task array
A task array is described by the following directive:
|task-directive||is TASK task-list|
|task||is task-name ( max-task )|
TASK directive declares one-dimensional task array, which then will be mapped on the processor arrangement sections.
7.2 Mapping tasks on processors. MAP directive
The task mapping on processor arrangement section is performed by MAP directive
|map-directive||is MAP task-name (index-task)|
Several tasks can be mapped on the same section.
7.3 Array distribution on tasks
Array distribution on tasks is performed by REDISTRIBUTE directive with the following extension:
|dist-target||is . . .|
|or task-name ( task-index)|
The array is distributed on processor arrangement section, provided to the specified task.
7.4 Distribution of computations. TASK_REGION directive
Distribution of statement blocks on the tasks is described by TASK_REGION construct:
|[ on-block ]...|
|task-region-directive||is TASK_REGION task-name [ , reduction-clause ]|
|end-task-region-directive||is END TASK_REGION|
|on-directive||is ON task-name ( task-index ) [ , new-clause ]|
|end-on-directive||is END ON|
Task region and each on-block are sequences of statements with single entry (a first statement) and single exit (after last statement). TASK_REGION construct is semantically equivalent to parallel section construction for common memory model. The difference is that statement block in task region can be executed on several processors in data parallelism model.
Distribution of the parallel loop iterations on tasks is performed by the following construct:
( do-variable )
ON task-name ( do-variable ) [ , new-clause ]
Distributed computation unit is an iteration of one-dimensional parallel loop. The difference from usual parallel loop is the distribution of the iteration on processor arrangement section, the section being defined by reference to the element of the task array.
Specifications reduction-clause and new-clause have the same semantics as for parallel loop. Reduction variable value must be calculated in each task. After task completion (END TASK_REGION) in the case of synchronous specification the reduction over all values of reduction variable on all the tasks are automatically performed. In the case of asynchronous specification the reduction is started by REDUCTION_START directive.
7.5 Data localization in tasks
A task is on-block or loop iteration. The tasks of the same group have the following constraints on data
After the task completion each array must have same distribution as before the task startup. If the array distribution is changed in the task, it must be restored after the task completion.
7.6 Fragment of static multi-block problem
The program fragment, describing realization of three-block problem (fig.6.6) is presented below.
CDVM$ PROCESSORS P(NUMBER_OF_PROCESSORS( )) C arrays A1,A2,A3 - the function values on the previous iteration C arrays B1,B2,B3 - the function values on the current iteration REAL A1(M,N1+1), B1(M,N1+1) REAL A2(M1+1,N2+1), B2(M1+1,N2+1) REAL A3(M2+1,N2+1), B3(M2+1,N2+1) C declaration of task array CDVM$ TASK MB( 3 ) C aligning arrays of each block CDVM$ ALIGN B1(I,J) WITH A1(I,J) CDVM$ ALIGN B2(I,J) WITH A2(I,J) CDVM$ ALIGN B3(I,J) WITH A3(I,J) C CDVM$ DISTRIBUTE :: A1, A2, A3 CDVM$ REMOTE_GROUP RS C distribution of tasks on processor arrangement sections and C distribution of arrays on tasks C ( each section contains third of all the processors) NP = NUMBER_OF_PROCESSORS( ) / 3 CDVM$ MAP MB( 1 ) ONTO P(1:NP) CDVM$ REDISTRIBUTE (*,BLOCK) ONTO MB( 1 ) :: A1 CDVM$ MAP MB( 2 ) ONTO P(NP+1:2*NP) CDVM$ REDISTRIBUTE (*,BLOCK) ONTO MB( 2 ) :: A2 CDVM$ MAP MB( 3 ) ONTO P(2*NP+1:3*NP) CDVM$ REDISTRIBUTE (*,BLOCK) ONTO MB( 3 ) :: A3 . . . DO 10 IT = 1, MAXIT . . . CDVM$ PREFETCH RS C exchanging edges of adjacent blocks . . . C distribution of computations (statement blocks) on tasks CDVM$ TASK_REGION MB CDVM$ ON MB( 1 ) CALL JACOBY( A1, B1, M, N1+1 ) CDVM$ END ON CDVM$ ON MB( 2 ) CALL JACOBY( A2, B2, M1+1, N2+1 ) CDVM$ END ON CDVM$ ON MB( 3 ) CALL JACOBY( A3, B3, M2+1, N2+1 ) CDVM$ END ON CDVM$ END TASK_REGION 10 CONTINUE
7.7 Fragment of dynamic multi-block problem
Let us consider the fragment of the program, which is dynamically tuned on the number of blocks and the size of each block.
C NA - maximal number of blocks PARAMETER ( NA=20 ) CDVM$ PROCESSORS R(NUMBER_OF_PROCESSORS( )) C memory for dynamic arrays REAL HEAP(100000) C sizes of dynamic arrays INTEGER SIZE(2,NA) C arrays of pointers for A and B CDVM$ REAL, POINTER (:,:) :: PA, PB, P1, P2 INTEGER P1, P2, PA(NA), PB(NA) CDVM$ TASK PT( NA ) CDVM$ ALIGN :: PB, P2 CDVM$ DISTRIBUTE :: PA, P1 . . . NP = NUMBER_OF_PROCESSORS( ) C distribution of arrays on tasks C dynamic allocation of the arrays and execution of postponed C DISTRIBUTE and ALIGN directives IP = 1 DO 20 I = 1, NA CDVM$ MAP PT( I ) ONTO R(IP:IP+1) PA(I) = ALLOCATE ( SIZE(1,I), HEAP ) P1 = PA(I) CDVM$ REDISTRIBUTE (*,BLOCK) ONTO PT( I ) :: P1 PB(I) = ALLOCATE ( SIZE(1,I), HEAP ) P2 = PB(I) CDVM$ REALIGN P2(I,J) WITH P1(I,J) IP = IP + 2 IF( IP .GT. NP ) THEN IP = 1 20 CONTINUE . . . C distribution of computations on tasks CDVM$ TASK_REGION PT CDVM$ PARALLEL ( I ) ON PT( I ) DO 50 I = 1,NA CALL JACOBY(HEAP(PA(I)), HEAP(PB(I)), SIZE(1, I), SIZE(2, I)) 50 CONTINUE CDVM$ END TASK_REGION
The arrays (blocks) are cyclically distributed on two processor sections. If NA > NP/2, then several arrays will be distributed on some sections. The loop iterations, distributed on the same section, will be executed sequentially in data parallelism model.
8 COMMON and EQUIVALENCE
The arrays, distributed by default, can be used in COMMON blocks and EQUIVALENCE statements without restrictions.
The arrays, distributed by DISTRIBUTE or ALIGN directive, can't be used in EQUIVALENCE statements. Moreover, these arrays can't be associated with other data objects. Explicitly distributed arrays can be the components of COMMON block under the following conditions:
Example 8.1. Explicitly distributed array in COMMON block.
Declaration in main program.
PROGRAM MAIN CDVM$ DISTRIBUTE B (*,BLOCK) COMMON /COM1/ X, Y(12), B(12,30)
Declaration in subroutine. The error is another number of components.
SUBROUTINE SUB1 CDVM$ DISTRIBUTE B1 (*,BLOCK) COMMON /COM1/ X, Y(12), Z, B1(12,30)
Declaration in subroutine. The error is another distribution of the array.
SUBROUTINE SUB2 CDVM$ DISTRIBUTE B2 (BLOCK,BLOCK) COMMON /COM1/ X, Y(12), B2(12,30)
Declaration in subroutine. The error is another configuration of the array.
SUBROUTINE SUB3 CDVM$ DISTRIBUTE B3 (*,BLOCK) COMMON /COM1/ X, Y(12), B(30,12)
Declaration in subroutine. There is no error.
SUBROUTINE SUB4 CDVM$ DISTRIBUTE B4 (*,BLOCK) COMMON /COM1/ X, Y(12), B(12,30)
A procedure, called inside parallel loop, must not have side effects and contains processor exchanges (purest procedure). As a consequence, the purest procedure doesn't contain:
If the actual argument is explicitly mapped array (distributed by DISTRIBUTE or ALIGN), it should be passed without shape changing. It means, that actual argument is the reference to the array beginning, and configurations of actual and corresponding formal arguments are the same.
If the actual argument is a distributed array, then corresponding formal argument must have explicit or inherited distribution.
Explicit distribution is described by DISTRIBUTE and ALIGN directives with the following restriction: a formal argument can be aligned only with other formal argument. The explicit distribution of formal argument means, that before the procedure call a user must provide actual argument distribution in exact correspondence with formal argument distribution.
Inherited distribution is described by the directive
|inherit-directive||is INHERIT dummy-array-name-list|
Inherited distribution means that formal argument inherits a distribution of actual argument for each procedure call. Inherited distribution doesn't require from a user to distribute actual argument in correspondence with the formal argument.
REDISTRIBUTE and REALIGN directives can be applied to formal arguments, if actual and formal arguments have DYNAMIC attribute.
In the procedure local arrays can be distributed by DISTRIBUTE and ALIGN directives. A local array can be aligned with formal argument. The DISTRIBUTE directive distributes the local array on the processor subsystem, on which the procedure was called (current subsystem). If a processor arrangement section is specified in DISTRIBUTE directive, then the number of the processors must be equal to the number of processors of the current subsystem. The number of current subsystem processors is defined by intrinsic function ACTIVE_NUM_PROCS( ).
The special case is distributed local array with SAVE attribute. The following conditions must be satisfied for the array:
Example 9.1. Distribution of the local arrays and formal arguments.
SUBROUTINE DIST( A, B, C, N ) CDVM$ PROCESSORS PA(ACTIVE_NUM_PROCS( )) DIMENSION A(N,N), B(N,N), C(N,N), X(N,N), Y(N,N) C explicit distribution of formal argument CDVM$ DISTRIBUTE A (*,BLOCK) C aligned formal argument CDVM$ ALIGN B(I,J) WITH A(I,J) C inherited distribution of the formal argument CDVM$ INHERIT C C aligning local array with formal argument CDVM$ ALIGN X(I,J) WITH C(I,J) C distribution of the local array CDVM$ DISTRIBUTE Y (*,BLOCK) ONTO PA . . . END
The statements of standard Fortran77 are used for data input/output in FDVM.
FDVM allows only restricted form of input/output statements for distributed arrays:
The statements of distributed array input/output cannot be used in parallel loop and in task region.
Input/output statements for replicated variables have the following restrictions:
Input statement, INQUIRE statement, and any other input/output statement with controlled parameter IOSTAT may not be used in a parallel loop.
Usage of input/output statements in tasks. If each task (on-block) uses its own files, for input/output the restrictions above is in active. If several tasks use the same file additional differences from Fortran 77 standard occur. When dealing with sequential access file all tasks input data from the file beginning but output records from different tasks are placed to the file in arbitrary order.
Note that FDVM program performing unformatted I/O of distributed arrays is not compatible with serial Fortran 77 program in common case. Data written by one program may not be read by the other one because of difference in record length.
11 Compatibility with HPF
FDVM data mapping directives are based on the following HPF2 directives: DISTRIBUTE, REDISTRIBUTE, ALIGN, REALIGN, PROCESSORS, TEMPLATE, DYNAMIC, SHADOW and INHERIT. The following restrictions exist for a syntax and semantics of the directives.
The directives of computation distribution are semantically a subset of the corresponding HPF2 directives. In particular, the PARALLEL directive is a subset of INDEPENDENT directive in HPF2.
The directives of remote data specifications have no analogues in HPF2, as it is assumed, that HPF2 compiler defines access to remote data automatically.
Thus the set of FDVM directives can be transformed to the set of HPF2 directives automatically.
12 The difference between FDVM1.0 and FDVM2.0 versions
The FDVM 1.0 version is a subset of the FDVM 2.0 version. The following new possibilities are provided: