Fortran DVM language description. Part 3 (7-12)

Part 3 (7-12)

created: april 2001

- last edited 03.05.01 -

DVM parallel model joins data parallelism and task parallelism.

Data parallelism is implemented by distribution of arrays and loop iterations over virtual processor subsystem. Virtual processor subsystem can include as whole processor arrangement as its section.

Task parallelism is implemented by independent computations on sections of processor arrangement.

Let us define a set of virtual processors, where a procedure is executed, as current virtual processor system. For main program the current system consists of whole set of virtual processors.

The separate task group is defined by the following directives.

Declaration of task array (TASK directive).
Mapping task array on the sections of the processor arrangement (MAP directive).
Distribution of arrays over tasks (REDISTRIBUTE directive)
Distribution of computations (blocks of statements or iterations of parallel loop) over tasks (TASK_REGION construct).

Several tasks can be described in a procedure. Nested tasks are not allowed.

7.1 Declaration of task array

A task array is described by the following directive:

task-directive	is TASK task-list

task	is task-name ( max-task )

TASK directive declares one-dimensional task array, which then will be mapped on the processor arrangement sections.

7.2 Mapping tasks on processors. MAP directive

The task mapping on processor arrangement section is performed by MAP directive

map-directive	is MAP task-name (index-task)
	ONTO processors-name(processors-section-subscript-list)

Several tasks can be mapped on the same section.

7.3 Array distribution on tasks

Array distribution on tasks is performed by REDISTRIBUTE directive with the following extension:

dist-target	is . . .
	or task-name ( task-index)

The array is distributed on processor arrangement section, provided to the specified task.

7.4 Distribution of computations. TASK_REGION directive

Distribution of statement blocks on the tasks is described by TASK_REGION construct:

block-task-region	is task-region-directive
	on-block
	[ on-block ]...
	end-task-region-directive

task-region-directive	is TASK_REGION task-name [ , reduction-clause ]

end-task-region-directive	is END TASK_REGION

on-block	is on-directive
	block
	end-on-directive

on-directive	is ON task-name ( task-index ) [ , new-clause ]

end-on-directive	is END ON

Task region and each on-block are sequences of statements with single entry (a first statement) and single exit (after last statement). TASK_REGION construct is semantically equivalent to parallel section construction for common memory model. The difference is that statement block in task region can be executed on several processors in data parallelism model.

Distribution of the parallel loop iterations on tasks is performed by the following construct:

loop-task-region	is task-region-directive
	parallel-task-loop
	end-task-region-directive

parallel-task-loop	is parallel-task-loop-directive
	do-loop

parallel-task-loop-directive	is PARALLEL ( do-variable ) ON task-name ( do-variable ) [ , new-clause ]

Distributed computation unit is an iteration of one-dimensional parallel loop. The difference from usual parallel loop is the distribution of the iteration on processor arrangement section, the section being defined by reference to the element of the task array.

Specifications reduction-clause and new-clause have the same semantics as for parallel loop. Reduction variable value must be calculated in each task. After task completion (END TASK_REGION) in the case of synchronous specification the reduction over all values of reduction variable on all the tasks are automatically performed. In the case of asynchronous specification the reduction is started by REDUCTION_START directive.

Constraint:

If the reduction operation is performed between tasks, these tasks must be distributed on disjoined sections of the processor arrangement.

7.5 Data localization in tasks

A task is on-block or loop iteration. The tasks of the same group have the following constraints on data

there are no data dependencies;
all used and computed data are allocated (localized) on processor arrangement section of the given task;
task can update only the values of arrays, distributed on the section, reduction variable values and the NEW-variable values.

After the task completion each array must have same distribution as before the task startup. If the array distribution is changed in the task, it must be restored after the task completion.

7.6 Fragment of static multi-block problem

The program fragment, describing realization of three-block problem (fig.6.6) is presented below.

CDVM$	PROCESSORS  P(NUMBER_OF_PROCESSORS( ))
C	arrays A1,A2,A3 - the function values on the previous iteration 
C	arrays B1,B2,B3 - the function values on the current iteration
REAL  A1(M,N1+1),    B1(M,N1+1)
	REAL  A2(M1+1,N2+1), B2(M1+1,N2+1)
	REAL  A3(M2+1,N2+1), B3(M2+1,N2+1)
C	declaration of task array
CDVM$	TASK  MB( 3 ) 
C	aligning arrays of each block
CDVM$	ALIGN  B1(I,J)  WITH  A1(I,J)
CDVM$	ALIGN  B2(I,J)  WITH  A2(I,J)
CDVM$	ALIGN  B3(I,J)  WITH  A3(I,J)
C	
CDVM$	DISTRIBUTE :: A1, A2, A3
CDVM$	REMOTE_GROUP  RS
C	distribution of tasks on processor arrangement sections and 
C	distribution of arrays on tasks
C	( each section contains third of all the processors)
	NP = NUMBER_OF_PROCESSORS( ) / 3
CDVM$	MAP  MB( 1 )  ONTO  P(1:NP)
CDVM$	REDISTRIBUTE  (*,BLOCK)  ONTO  MB( 1 ) :: A1
CDVM$	MAP  MB( 2 )  ONTO  P(NP+1:2*NP)
CDVM$	REDISTRIBUTE  (*,BLOCK)  ONTO  MB( 2 ) :: A2
CDVM$	MAP  MB( 3 )  ONTO  P(2*NP+1:3*NP)
CDVM$	REDISTRIBUTE  (*,BLOCK)  ONTO  MB( 3 ) :: A3
	.   .   .
	DO  10  IT = 1, MAXIT
	.   .   .
CDVM$	PREFETCH  RS
C	exchanging edges of adjacent blocks
	.   .   .
C	distribution of computations (statement blocks) on tasks
CDVM$	TASK_REGION  MB
CDVM$	ON  MB( 1 )
		CALL JACOBY( A1, B1, M, N1+1 )
CDVM$	END ON 
CDVM$	ON  MB( 2 )
		CALL JACOBY( A2, B2, M1+1, N2+1 )
CDVM$	END ON 
CDVM$	ON  MB( 3 )
		CALL JACOBY( A3, B3, M2+1, N2+1 )
CDVM$	END ON 
CDVM$	END TASK_REGION
10	CONTINUE

7.7 Fragment of dynamic multi-block problem

Let us consider the fragment of the program, which is dynamically tuned on the number of blocks and the size of each block.

C	NA - maximal number of blocks
	PARAMETER ( NA=20 )
CDVM$	PROCESSORS  R(NUMBER_OF_PROCESSORS( ))
C	memory for dynamic arrays
	REAL  HEAP(100000)
C	sizes of dynamic arrays
	INTEGER  SIZE(2,NA)
C	arrays of pointers for A and B
CDVM$	REAL, POINTER (:,:) :: PA, PB, P1, P2
	INTEGER  P1, P2, PA(NA), PB(NA)
CDVM$	TASK  PT( NA )
CDVM$	ALIGN :: PB, P2
CDVM$	DISTRIBUTE :: PA, P1
	.   .   .
	NP = NUMBER_OF_PROCESSORS( )
C	distribution of arrays on tasks
C	dynamic allocation of the arrays and execution of postponed
C	DISTRIBUTE and ALIGN directives
	IP = 1
	DO  20  I = 1, NA
CDVM$	MAP  PT( I )  ONTO  R(IP:IP+1)
	PA(I) = ALLOCATE ( SIZE(1,I), HEAP )
	P1 = PA(I)
CDVM$	REDISTRIBUTE  (*,BLOCK)  ONTO  PT( I ) :: P1
	PB(I) = ALLOCATE ( SIZE(1,I), HEAP )
	P2 = PB(I)
CDVM$	REALIGN  P2(I,J)  WITH  P1(I,J)
	IP = IP + 2
	IF( IP .GT. NP )  THEN  IP = 1
20	CONTINUE
	.   .   .
C	distribution of computations on tasks
CDVM$	TASK_REGION  PT
CDVM$	PARALLEL  ( I )  ON  PT( I )
	DO 50  I = 1,NA
	   CALL JACOBY(HEAP(PA(I)), HEAP(PB(I)), SIZE(1, I), SIZE(2, I))
50	CONTINUE
CDVM$	END  TASK_REGION

The arrays (blocks) are cyclically distributed on two processor sections. If NA > NP/2, then several arrays will be distributed on some sections. The loop iterations, distributed on the same section, will be executed sequentially in data parallelism model.

8 COMMON and EQUIVALENCE

The arrays, distributed by default, can be used in COMMON blocks and EQUIVALENCE statements without restrictions.

The arrays, distributed by DISTRIBUTE or ALIGN directive, can't be used in EQUIVALENCE statements. Moreover, these arrays can't be associated with other data objects. Explicitly distributed arrays can be the components of COMMON block under the following conditions:

COMMON block must be described in main program unit.
Every occurrence of the COMMON block must have the same number of components and each corresponding component must have a storage sequences of the same size;
If explicitly mapped array is the component of the COMMON block, then the array declarations in different program units must specify the same data type and shape. The DISRIBUTE and ALIGN directives, applied to the array, must have the identical parameters.

Example 8.1. Explicitly distributed array in COMMON block.

Declaration in main program.

	PROGRAM  MAIN
CDVM$	DISTRIBUTE  B (*,BLOCK)
	COMMON /COM1/  X, Y(12), B(12,30)

Declaration in subroutine. The error is another number of components.

	SUBROUTINE  SUB1
CDVM$	DISTRIBUTE  B1 (*,BLOCK)
	COMMON /COM1/  X, Y(12), Z, B1(12,30)

Declaration in subroutine. The error is another distribution of the array.

	SUBROUTINE  SUB2
CDVM$	DISTRIBUTE  B2 (BLOCK,BLOCK)
	COMMON /COM1/  X, Y(12), B2(12,30)

Declaration in subroutine. The error is another configuration of the array.

	SUBROUTINE  SUB3
CDVM$	DISTRIBUTE  B3 (*,BLOCK)
	COMMON /COM1/  X, Y(12), B(30,12)

Declaration in subroutine. There is no error.

	SUBROUTINE  SUB4
CDVM$	DISTRIBUTE  B4 (*,BLOCK)
	COMMON /COM1/  X, Y(12), B(12,30)

9 Procedures

Procedure call inside parallel loop.

A procedure, called inside parallel loop, must not have side effects and contains processor exchanges (purest procedure). As a consequence, the purest procedure doesn't contain:

input/output statements;
FDVM directives;
assignment of values of COMMON block variables.

Procedure call outside parallel loop.

If the actual argument is explicitly mapped array (distributed by DISTRIBUTE or ALIGN), it should be passed without shape changing. It means, that actual argument is the reference to the array beginning, and configurations of actual and corresponding formal arguments are the same.

Formal arguments.

If the actual argument is a distributed array, then corresponding formal argument must have explicit or inherited distribution.

Explicit distribution is described by DISTRIBUTE and ALIGN directives with the following restriction: a formal argument can be aligned only with other formal argument. The explicit distribution of formal argument means, that before the procedure call a user must provide actual argument distribution in exact correspondence with formal argument distribution.

Inherited distribution is described by the directive

inherit-directive

is INHERIT dummy-array-name-list

Inherited distribution means that formal argument inherits a distribution of actual argument for each procedure call. Inherited distribution doesn't require from a user to distribute actual argument in correspondence with the formal argument.

REDISTRIBUTE and REALIGN directives can be applied to formal arguments, if actual and formal arguments have DYNAMIC attribute.

Local arrays.

In the procedure local arrays can be distributed by DISTRIBUTE and ALIGN directives. A local array can be aligned with formal argument. The DISTRIBUTE directive distributes the local array on the processor subsystem, on which the procedure was called (current subsystem). If a processor arrangement section is specified in DISTRIBUTE directive, then the number of the processors must be equal to the number of processors of the current subsystem. The number of current subsystem processors is defined by intrinsic function ACTIVE_NUM_PROCS( ).

The special case is distributed local array with SAVE attribute. The following conditions must be satisfied for the array:

DISTRIBUTE and ALIGN directives have the identical parameters in each procedure call.
The array cannot be used in REDISTRIBUTE and REALIGN directives.

Example 9.1. Distribution of the local arrays and formal arguments.

	SUBROUTINE  DIST( A, B, C, N )
CDVM$	PROCESSORS  PA(ACTIVE_NUM_PROCS( ))
	DIMENSION  A(N,N), B(N,N), C(N,N), X(N,N), Y(N,N)
C	explicit distribution of formal argument
CDVM$	DISTRIBUTE  A (*,BLOCK)
C	aligned formal argument
CDVM$	ALIGN  B(I,J)  WITH  A(I,J)
C	inherited distribution of the formal argument
CDVM$	INHERIT  C
C	aligning local array with formal argument
CDVM$	ALIGN  X(I,J)  WITH  C(I,J)
C	distribution of the local array
CDVM$	DISTRIBUTE  Y (*,BLOCK)  ONTO  PA
	.   .   .
	END

10 Input/Output

The statements of standard Fortran77 are used for data input/output in FDVM.

FDVM allows only restricted form of input/output statements for distributed arrays:

An input/output list may contain only one name of distributed array and may not contain other input/output items.
Only “*” format specifier may be used in formatted I/O statement.
A control information list may not contain ERR, END and IOSTAT specifiers.
Any variable specified in a control information list must be replicated variable.

The statements of distributed array input/output cannot be used in parallel loop and in task region.

Input/output statements for replicated variables have the following restrictions:

A control information list may not contain ERR and END specifiers.
Only the following simplified form of the implicit loop is allowed
(A(i1,i2,...,I), I = n1,n2)
when inputting replicated assumed-size array.

Input statement, INQUIRE statement, and any other input/output statement with controlled parameter IOSTAT may not be used in a parallel loop.

Usage of input/output statements in tasks. If each task (on-block) uses its own files, for input/output the restrictions above is in active. If several tasks use the same file additional differences from Fortran 77 standard occur. When dealing with sequential access file all tasks input data from the file beginning but output records from different tasks are placed to the file in arbitrary order.

Note that FDVM program performing unformatted I/O of distributed arrays is not compatible with serial Fortran 77 program in common case. Data written by one program may not be read by the other one because of difference in record length.

11 Compatibility with HPF

FDVM data mapping directives are based on the following HPF2 directives: DISTRIBUTE, REDISTRIBUTE, ALIGN, REALIGN, PROCESSORS, TEMPLATE, DYNAMIC, SHADOW and INHERIT. The following restrictions exist for a syntax and semantics of the directives.

DISTRIBUTE and REDISTRIBUTE directives have the formats BLOCK, GEN_BLOCK and “*” only.
There is no aligning by triplet in ALIGN and REALIGN directives.
There are permitted only two forms of formal argument specification : explicit distribution and inherited distribution.
EQUIVALENCE statement can't be applied to distributed arrays.

The directives of computation distribution are semantically a subset of the corresponding HPF2 directives. In particular, the PARALLEL directive is a subset of INDEPENDENT directive in HPF2.

The directives of remote data specifications have no analogues in HPF2, as it is assumed, that HPF2 compiler defines access to remote data automatically.

Thus the set of FDVM directives can be transformed to the set of HPF2 directives automatically.

12 The difference between FDVM1.0 and FDVM2.0 versions

The FDVM 1.0 version is a subset of the FDVM 2.0 version. The following new possibilities are provided:

A model of dynamic arrays. This model can be automatically transformed for execution in Fortran 77, FDVM and HPF2 languages (section 4.2).
Task parallelism (section 7).
GEN_BLOCK and WGT_BLOCK format for array distribution (section 4.1.2, 4.1.3).
Distribution of a loop with regular data dependence (section 6.2.4).
Extended REMOTE_ACCESS specification(section 6.3).
Overlapping computations and data exchange (section 6.2.5, 6.3.3, 6.4.2)

References

Message-Passing Interface Forum, Document for a Standard Message-Passing Interface, 1993. Version 1.0.
OpenMP Consortium: OpenMP Fortran Application Program Interface, Version 1.0, October 1997.
High Performance Fortran Forum. High Performance Fortran Language Specification. Version 1.0
High Performance Fortran Forum. High Performance Fortran Language Specification. Version 2.0, January 31, 1997.
ANSI X3.9-1978 Programming Language Fotran. New York 1978.

Fortran DVM - contents

Part 1 (1-4)

Part 2 (5-6)

Part 3 (7-12)

Part 4 (Annexes)