
# 'acc' Dialect

The `acc` dialect is an MLIR dialect for representing the OpenACC
programming model. OpenACC is a standardized directive-based model which
is used with C, C++, and Fortran to enable programmers to expose
parallelism in their code. The descriptive approach used by OpenACC
allows targeting of parallel multicore and accelerator targets like GPUs
by giving the compiler the freedom to decide how to parallelize for
specific architectures. OpenACC also provides the ability to optimize
the parallelism through increasingly prescriptive clauses.

This dialect models the constructs from the
[OpenACC 3.3 specification](https://www.openacc.org/sites/default/files/inline-images/Specification/OpenACC-3.3-final.pdf).

This document describes the design of the OpenACC dialect in MLIR. It
lists and explains design goals and design choices along with their
rationale. It also describes specifics with regard to `acc` dialect
operations, types, and attributes.

[TOC]

## Dialect Design Goals

* Have a complete representation of the OpenACC language.
	- A frontend requires this in order to properly generate a
	representation of possible `acc` pragmas in MLIR. Additionally,
	this dialect is expected to be further lowered when materializing
	its semantics. Without a complete representation, a frontend might
	choose a lower abstraction (such as a direct runtime call), but
	this would impact the ability to do analysis and optimizations on
	the dialect.
* Allow representation at the same semantic level as the OpenACC
language while being able to represent nuances of the source
language semantics (such as Fortran descriptors) in an agnostic manner.
	- Using abstractions that closely model the OpenACC language
	simplifies frontend implementation. It also allows for easier
	debugging of the IR. However, sometimes source language specific
	behavior is needed when materializing OpenACC. In these cases, such
	as privatization of C++ objects with a default constructor, the
	frontend fills in the `recipe` along with the `private` operation,
	which can be packaged neatly with the `acc` dialect operations.
* Be able to regenerate the semantic equivalent of the user pragmas from
the dialect (including bounds, names, clauses, modifiers, etc).
	- This is a strong measure of making sure that the dialect is not
	lossy in semantics. It also makes it possible to generate
	appropriate and useful debug information outside of the frontend.
* Be dialect agnostic so that it can be used and coexist with other
dialects including but not limited to `hlfir`, `fir`, `llvm`, `cir`.
	- Directive-based models such as OpenACC are always used with a
	source language, so the `acc` dialect coexisting with other
	dialect(s) is necessary by construction. Through proper
	abstractions, neither the `acc` dialect nor the source language
	dialect should have dependencies on each other; where needed,
	interfaces should be used to ensure `acc` dialect can verify
	expected properties.
* The dialect must allow dataflow to be modeled accurately and
performantly using MLIR's existing facilities.
	- Appropriate dataflow modeling is important for analyses and IR
	reasoning - even something as simple as walking the uses. Therefore
	operations, like data operations, are expected to generate results
	which can be used in modeling behavior. For example, consider an
	`acc copyin` clause. After the `acc.copyin` operation, a pointer
	which lives on the device should be distinguishable from one that
	lives in host memory.
* Be friendly to MLIR optimization passes by implementing common
interfaces.
	- Interfaces, such as `MemoryEffects`, are the key way MLIR
	transformations and analyses are designed to interact with the IR.
	In order for the operations in the `acc` dialect to be optimizable
	(either directly or even indirectly by not blocking optimizations
	of nested IR), implementing relevant common interfaces is needed.

These design goals define the design philosophy of the `acc` dialect:
current and planned operations, attributes, and types must adhere to
them.

## Operation Categories

The OpenACC dialect includes three kinds of operations: high-level
operations (which retain the same semantic meaning as their OpenACC
language equivalent), intermediate-level operations (which are used to
decompose clauses from constructs), and low-level operations (which
encode specifics associated with the source language in a generic way).

The high-level operations list contains the following OpenACC language
constructs and their corresponding operations:
* `acc parallel` &rarr; `acc.parallel`
* `acc kernels` &rarr; `acc.kernels`
* `acc serial` &rarr; `acc.serial`
* `acc data` &rarr; `acc.data`
* `acc loop` &rarr; `acc.loop`
* `acc enter data` &rarr; `acc.enter_data`
* `acc exit data` &rarr; `acc.exit_data`
* `acc host_data` &rarr; `acc.host_data`
* `acc init` &rarr; `acc.init`
* `acc shutdown` &rarr; `acc.shutdown`
* `acc update` &rarr; `acc.update`
* `acc set` &rarr; `acc.set`
* `acc wait` &rarr; `acc.wait`
* `acc atomic read` &rarr; `acc.atomic.read`
* `acc atomic write` &rarr; `acc.atomic.write`
* `acc atomic update` &rarr; `acc.atomic.update`
* `acc atomic capture` &rarr; `acc.atomic.capture`

The second group contains the intermediate-level operations, which are
used to represent either decomposed constructs or clauses for more
accurate modeling:
* `acc routine` &rarr; `acc.routine` + `acc.routine_info` attribute
* `acc declare` &rarr; `acc.declare_enter` + `acc.declare_exit` or
`acc.declare`
* `acc {construct} copyin` &rarr; `acc.copyin` (before region) +
`acc.delete` (after region)
* `acc {construct} copy` &rarr; `acc.copyin` (before region) +
`acc.copyout` (after region)
* `acc {construct} copyout` &rarr; `acc.create` (before region) +
`acc.copyout` (after region)
* `acc {construct} attach` &rarr; `acc.attach` (before region) +
`acc.detach` (after region)
* `acc {construct} create` &rarr; `acc.create` (before region) +
`acc.delete` (after region)
* `acc {construct} present` &rarr; `acc.present` (before region) +
`acc.delete` (after region)
* `acc {construct} no_create` &rarr; `acc.nocreate` (before region) +
`acc.delete` (after region)
* `acc {construct} deviceptr` &rarr; `acc.deviceptr`
* `acc {construct} private` &rarr; `acc.private`
* `acc {construct} firstprivate` &rarr; `acc.firstprivate`
* `acc {construct} reduction` &rarr; `acc.reduction`
* `acc cache` &rarr; `acc.cache`
* `acc update device` &rarr; `acc.update_device`
* `acc update host` &rarr; `acc.update_host`
* `acc host_data use_device` &rarr; `acc.use_device`
* `acc declare device_resident` &rarr; `acc.declare_device_resident`
* `acc declare link` &rarr; `acc.declare_link`
* `acc exit data delete` &rarr; `acc.delete` (with `structured` flag as
false)
* `acc exit data detach` &rarr; `acc.detach` (with `structured` flag as
false)
* `acc {construct} {data_clause}(var[lb:ub])` &rarr; `acc.bounds`

The low-level operations are:
* `acc.private.recipe`
* `acc.reduction.recipe`
* `acc.firstprivate.recipe`
* `acc.global_ctor`
* `acc.global_dtor`
* `acc.yield`
* `acc.terminator`

The semantics of the low-level operations and the reasoning behind them
are explained further in the sections below.

### Data Operations

#### Data Clause Decomposition
The data clauses are decomposed from their constructs for better
dataflow modeling in MLIR. There are multiple reasons for this which
are consistent with the dialect goals:
* Correctly represents dataflow. Data clauses have different effects
at entry to region and at exit from region.
* Easier to attach semantics such as `MemoryEffects` to a single
operation. This can better reflect semantics (like the fact that an
`acc.copyin` operation only reads host memory).
* Operations can be moved or optimized individually (e.g. by `CSE`).
* Easier to keep track of debug information. Line location can point to
the text representing the data clause instead of the construct.
Additionally, attributes can be used to keep track of variable names in
clauses without having to walk the IR tree in an attempt to recover the
information (this makes the `acc` dialect more agnostic with regard to
which other dialect it is used with).
* Clear operation ordering since all data operations are on the same
list.

Each of the `acc` dialect data operations represents either the
entry or the exit portion of the data action specification. Thus,
`acc.copyin` represents the semantics defined in section
`2.7.7 copyin clause` whose wording starts with
`At entry to a region`. The decomposed exit operation `acc.delete`
represents the second part of that section, whose wording starts with
`At exit from the region`. The `delete` action may be performed
after checking and updating of the relevant reference counters noted.

The `acc` data operations, even when decomposed, retain their original
data clause in the `dataClause` attribute of the operation so that this
information can be recovered during debugging. For example, `acc copy`
does not translate to an `acc.copy` operation, but instead to
`acc.copyin` for entry and `acc.copyout` for exit. Both of the
decomposed operations hold a `dataClause` field that specifies this was
an `acc copy`.

The link between the decomposed entry and exit operations is the SSA
value produced by the entry operation. Namely, it is the `accPtr` result
which is used both in the `dataOperands` of the operation used for the
construct and in the `accPtr` operand of the exit operation.
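
The decomposition and the entry/exit link can be illustrated with a small
toy model. This is only a sketch in Python, not the MLIR API: the `Op`
class and the `decompose` helper are hypothetical names, the table is
transcribed from the list in the Operation Categories section, and exit
operations are simplified to take a single operand.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count
from typing import Optional

# Entry/exit operation per data clause, transcribed from the
# decomposition list in "Operation Categories".
DECOMPOSITION = {
    "copyin": ("acc.copyin", "acc.delete"),
    "copy": ("acc.copyin", "acc.copyout"),
    "copyout": ("acc.create", "acc.copyout"),
    "attach": ("acc.attach", "acc.detach"),
    "create": ("acc.create", "acc.delete"),
    "present": ("acc.present", "acc.delete"),
    "no_create": ("acc.nocreate", "acc.delete"),
}


@dataclass(frozen=True)
class Op:
    """Hypothetical stand-in for an MLIR operation."""
    name: str               # e.g. "acc.copyin"
    operand: str            # value consumed (host pointer or entry result)
    result: Optional[str]   # value produced; None for exit operations
    data_clause: str        # original user clause, kept for debugging


_ids = count()


def decompose(clause: str, var_ptr: str) -> tuple[Op, Op]:
    """Split one structured data clause into its entry and exit operations."""
    entry_name, exit_name = DECOMPOSITION[clause]
    acc_ptr = f"%acc{next(_ids)}"  # models the `accPtr` result of the entry op
    entry = Op(entry_name, var_ptr, acc_ptr, clause)
    exit_op = Op(exit_name, acc_ptr, None, clause)
    return entry, exit_op


entry, exit_op = decompose("copy", "%a")
print(entry)
print(exit_op)
```

In this model, both halves of `acc copy` remember the original clause,
and the exit operation consumes exactly the value produced by the entry
operation.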

#### Bounds

OpenACC data clauses allow the use of bounds specifiers as per
`2.7.1 Data Specification in Data Clauses`. However, array dimensions
for the data are not always required in the clause if the source
language's type system captures this information - the user can just
specify the variable name in the data clause. So the `acc.bounds`
operation is an important piece to ensure uniform representation of both
explicit user set dimensions and implicit type-based dimensions. It
contains several key features to allow properly encoding sizes in a
manner flexible and agnostic to the source language's dialect:
* Multi-dimensional arrays can be represented by using multiple ordered
`acc.bounds` operations.
* Bounds are required to be zero-normalized. This works well with the
`PointerLikeType` requirement in data clauses - since a lowerbound of 0
means looking at data at the zero offset from pointer. This requirement
also works well in ensuring the `acc` dialect is agnostic to source
language dialect since it prevents ambiguity such as the case of Fortran
arrays where the lower bound is not a fixed value.
* If the source dialect does not encode the dimensions in the type (e.g.
`!fir.array<?x?xi32>`) but instead encodes it in some other way (such as
through descriptors), then the frontend must fill in the `acc.bounds`
operands with appropriate information (such as loads from descriptor).
The `acc.bounds` operation also accommodates a lossy source dialect,
such as when the frontend uses aggressive pointer decay and cannot
represent the dimensions in the type system (e.g. using `!llvm.ptr` for
arrays). Both of these aspects show the flexibility of the `acc.bounds`
operation, which allows the representation to be agnostic since the
`acc` dialect is not expected to be able to understand how to extract
dimension information from the types of the source dialect.
* The OpenACC specification allows either extent or upperbound in the
data clause depending on whether it is Fortran or C and C++. The
`acc.bounds` operation is rich enough to accept either or both - for
convenience in lowering to the dialect and for ability to precisely
capture the meaning from the clause.
* The stride, either in units or bytes, can also be captured in the
`acc.bounds` operation. This is also important for accepting a source
language's arrays without forcing the frontend to normalize them in
some way. For example, consider a case where a whole array is mapped to
the device in a parent function. Then only a view with a non-1 stride is
passed to a child function (e.g. a Fortran array slice with non-1
stride). A `copy` operation of this data in the child should be able to
avoid remapping this array. If instead the operation required
normalizing the array (such as making it contiguous), the same host
data would end up with multiple disjoint mappings on the device, which
is error-prone.
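
As a worked example of zero-normalization, the sketch below converts a
Fortran-style section (lower and upper bound) and a C-style section
(start and length) into the same normalized values. It is a toy model in
Python rather than the `acc.bounds` operation itself: the `Bounds` class
and both helper functions are hypothetical, and treating the normalized
upper bound as inclusive is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bounds:
    """Hypothetical model of one dimension of zero-normalized bounds."""
    lowerbound: int  # offset of the first element from the base pointer
    upperbound: int  # offset of the last element (inclusive, by assumption)
    extent: int      # number of elements covered
    stride: int = 1  # distance between elements, in element units


def normalize_fortran_section(declared_lb: int, lb: int, ub: int) -> Bounds:
    """Normalize var(lb:ub) for an array whose declared lower bound is declared_lb."""
    zero_lb = lb - declared_lb
    zero_ub = ub - declared_lb
    return Bounds(zero_lb, zero_ub, zero_ub - zero_lb + 1)


def normalize_c_section(start: int, length: int) -> Bounds:
    """Normalize var[start:length]; C arrays are already zero-based."""
    return Bounds(start, start + length - 1, length)


# Fortran: real :: a(10:20), clause copyin(a(12:15))
fortran = normalize_fortran_section(declared_lb=10, lb=12, ub=15)
# C: float a[11], clause copyin(a[2:4])
c_like = normalize_c_section(start=2, length=4)
print(fortran)
print(c_like)
```

Both sections describe four elements starting at offset 2 from the base
pointer, so they normalize to the same bounds even though the source
languages spell them differently.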

#### Counters

The data operations also maintain semantics described in the OpenACC
specification related to runtime counters. More specifically, consider
the specification of the entry portion of `acc copyin` in section 2.7.7:
```
At entry to a region, the structured reference counter is used. On an
enter data directive, the dynamic reference counter is used.
- If var is present and is not a null pointer, a present increment
action with the appropriate reference counter is performed.
- If var is not present, a copyin action with the appropriate reference
counter is performed.
- If var is a pointer reference, an attach action is performed.
```
The `acc.copyin` operation includes these semantics, including those
related to attach, which is specified through the `varPtrPtr` operand.
The `structured` flag on the operation is important since the
`structured reference counter` should be used when the flag is true; and
the `dynamic reference counter` should be used when it is false.

At exit from structured regions (`acc data`, `acc kernels`), the
`acc copyin` clause is decomposed to `acc.delete` (with the
`structured` flag as true). The semantics of the `acc.delete` are
also consistent with the OpenACC specification noted for the exit
portion of the `acc copyin` clause:
```
At exit from the region:
- If the structured reference counter for var is zero, no action is
taken.
- Otherwise, a detach action is performed if var is a pointer reference,
and a present decrement action with the structured reference counter is
performed if var is not a null pointer. If both structured and dynamic
reference counters are zero, a delete action is performed.
```
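
The counter behavior quoted above can be sketched as a small runtime
model. This is an illustration only, with hypothetical names
(`DeviceTable`, `Counters`); it ignores null pointers and attach/detach,
and it assumes that the unstructured (dynamic counter) exit path mirrors
the structured one.

```python
from dataclasses import dataclass, field


@dataclass
class Counters:
    structured: int = 0
    dynamic: int = 0


@dataclass
class DeviceTable:
    """Hypothetical model of the runtime's present table."""
    present: dict = field(default_factory=dict)  # variable name -> Counters
    log: list = field(default_factory=list)      # data actions actually performed

    def copyin(self, var: str, structured: bool) -> None:
        """Entry semantics: copyin action if absent, present increment otherwise."""
        entry = self.present.get(var)
        if entry is None:
            entry = self.present[var] = Counters()
            self.log.append(f"copyin {var}")  # allocate and transfer to device
        if structured:
            entry.structured += 1
        else:
            entry.dynamic += 1

    def delete(self, var: str, structured: bool) -> None:
        """Exit semantics: present decrement, then delete if both counters are zero."""
        entry = self.present.get(var)
        if entry is None:
            return
        if structured:
            if entry.structured == 0:
                return  # counter already zero: no action is taken
            entry.structured -= 1
        else:
            if entry.dynamic == 0:
                return
            entry.dynamic -= 1
        if entry.structured == 0 and entry.dynamic == 0:
            del self.present[var]
            self.log.append(f"delete {var}")  # deallocate device memory


table = DeviceTable()
table.copyin("a", structured=False)  # models `acc enter data copyin(a)`
table.copyin("a", structured=True)   # models entry to `acc data copyin(a)`
table.delete("a", structured=True)   # models exit from that data region
table.delete("a", structured=False)  # models `acc exit data delete(a)`
print(table.log)
```

In this model the data is transferred once and deallocated only after
both counters have returned to zero, which is why the `structured` flag
must be carried on each operation.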

### Types

There are a few acc dialect type categories to describe:
* type of acc data clause operation input `varPtr`
	- The type of `varPtr` must be pointer-like. This is done by
	attaching the `PointerLikeType` interface to the appropriate MLIR
	type. Although the memory/storage concept is a lower level
	abstraction, it is useful because the OpenACC model distinguishes
	between host and device memory explicitly - and the mapping between
	the two is done through pointers. Thus, by explicitly requiring it
	in the dialect, the appropriate language frontend must create
	storage or use a type that satisfies the mapping constraint.
* type of result of acc data clause operations
	- The type of the acc data clause operation is exactly the same as
	`varPtr`. This was done intentionally instead of introducing an
	`acc.ref/ptr` type so that IR compatibility and the dialect's
	existing strong type checking can be maintained. This is needed
	since the `acc` dialect must live within another dialect whose type
	system is unknown to it. The only constraint is that the appropriate
	dialect type must use the `PointerLikeType` interface.
* type of decomposed clauses
	- Decomposed clauses, such as `acc.bounds` and `acc.declare_enter`,
	produce types to allow their results to be used only in specific
	operations.

### Recipes

Recipes are a generic way to express source language specific semantics.

There are currently two categories of recipes, but the recipe concept
can be extended for any additional low-level information that needs
to be captured for successful lowering of OpenACC. The two categories
are:
* recipes used in the context of privatization associated with a
construct
* recipes used in the context of additional specification of data
semantics

The intention of the recipes is to specify how materialization of an
action, such as privatization, should be done when the semantics
of the action need to be interpreted and lowered, such as before
generating LLVM dialect.

The recipes used for privatization provide a source-language independent
way of specifying the creation of a local variable of that type. This
means using the appropriate `alloca` instruction and being able to
specify default initialization or default constructor.

### Routine

The routine directive is used to note that a procedure should be made
available for the accelerator in a way that is consistent with its
modifiers, such as those that describe the parallelism. In the acc
dialect, an acc routine is represented through two joint pieces - an
attribute and an operation:
* The `acc.routine` operation is simply a specifier which notes which
symbol (or string) the acc routine is needed for, along with the
associated parallelism. This defines a symbol that can be referenced in
the attribute.
* The `acc.routine_info` attribute is an attribute used on the source
dialect specific operation which specifies one or multiple `acc.routine`
symbols. Typically, this is attached to `func.func` which either
provides the declaration (in case of externals) or provides the
actual body of the acc routine in the dialect that the source language
was translated to.

### Declare

OpenACC `declare` is a mechanism which declares a definition of a global
or a local to be accessible to the accelerator with an implicit lifetime
matching that of the scope in which it was declared. Thus, `declare`
semantics are represented through multiple operations and attributes:
* `acc.declare` - This is a structured operation which contains an
MLIR region and can be used in a similar manner as `acc.data` to specify
an implicit data region with specific procedure lifetime. This is
typically used inside `func.func` after variable declarations.
* `acc.declare_enter` - This is an unstructured operation which is
used as a decomposed form of `acc declare`. It effectively allows the
entry operation to exist in a scope different than the exit operation.
It can also be used along with `acc.declare_exit`, which consumes its
token, to define a scoped region without using an MLIR region. This
operation is also used in `acc.global_ctor`.
* `acc.declare_exit` - The matching equivalent of `acc.declare_enter`
except that it specifies exit semantics. This operation is typically
used inside a `func.func` at the exit points or with `acc.global_dtor`.
* `acc.global_ctor` - Lives at the same level as source dialect globals
and is used to specify data actions to be done at program entry. This
is used in conjunction with source dialect globals whose lifetime is
not just a single procedure.
* `acc.global_dtor` - Defines the exit data actions that should be done
at program exit. Typically used to revert the actions of
`acc.global_ctor`.

The attributes:
* `acc.declare` - This is a facility for easier determination of
variables which are `acc declare`'d. This attribute is used on
operations producing globals and on operations producing locals such as
dialect specific `alloca`'s. A variable must have this attribute in
order to appear in a data mapping operation associated with any of the
`acc.declare*` operations.
* `acc.declare_action` - Since the OpenACC specification allows
declaration of variables that have yet to be allocated, this attribute
is used at the allocation and deallocation points. More specifically,
this attribute captures symbols of functions to be called to perform
an action either pre-allocate, post-allocate, pre-deallocate, or
post-deallocate. Calls to these functions should be materialized when
lowering OpenACC semantics to ensure proper data actions are done
after the allocation/deallocation.

## OpenACC Transforms and Analyses

The design goal for the `acc` dialect is to be friendly to MLIR
optimization passes including CSE and LICM. Additionally, since it is
designed to recover original clauses, it makes late verification and
analysis possible in the MLIR framework outside of the frontend.

This section describes a few MLIR-level passes to which the `acc`
dialect design should be friendly. This section currently only
outlines the possibilities intended by the design and not necessarily
existing passes.

### Verification

Since the OpenACC dialect is not lossy with regards to its
representation, it is possible to do OpenACC language semantic checking
at the MLIR-level. What follows is a list of various semantic checks
needed.

This first list is required to be done in the frontend because the `acc`
dialect operations must be valid when constructed:
* Ensure that only listed clauses are allowed for each directive.
* Ensure that only listed modifiers are allowed for each clause.

However, the following are semantic checks that can be done at the
MLIR-level (either in a separate pass or as part of the operation
verifier):
* Specify the validity checks that each modifier needs (e.g. `num_gangs`
may need a positive integer).
* Ensure valid clause nesting.
* Validate restrictions on clauses which cannot appear together.
* Validate that no conflicting clauses are used on variables.

Note that some of these checks can be even more precise when done at the
MLIR level because optimizations like inlining and constant propagation
expose detail that wouldn't have been visible in the frontend.

### Implicit Data Attributes

The OpenACC specification includes a section on `2.6.2 Variables with
Implicitly Determined Data Attributes`. This section describes the data
actions that should be applied to a variable for which the user did not
specify a data action. The action depends on the construct being used
and also on the default clause. However, the point to note here is that
variables which are live-in into the acc region must employ some data
mapping so the data can be passed to the accelerator.

One possible optimization that affects the data attributes needed is
`Scalar Replacement of Aggregates (SROA)`. The `acc` dialect should
not prevent this from happening on the source dialect.

Because it is intended to be possible to apply optimizations across an
`acc` region, the analysis/transformation pass that applies the implicit
data attributes should be run as late as possible - ideally right before
any outlining process which uses the `acc` region body to create an
accelerator procedure. It is expected that existing MLIR facilities,
such as `mlir::Liveness`, will work for the `acc` region and thus can be
used to perform this analysis.
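
The core of such a pass can be sketched as a set computation. The
following is a toy model in Python, not an MLIR pass: the function name
and its inputs are hypothetical, plain sets of variable names stand in
for a real liveness analysis, and choosing which data action to apply
(which depends on the construct and the default clause) is left out.

```python
def needs_implicit_attribute(region_uses, region_defs, explicit_vars):
    """Return the live-in variables that have no explicit data clause.

    region_uses:   names read or written inside the acc region
    region_defs:   names defined inside the region (not live-in)
    explicit_vars: names already covered by a user data clause
    """
    live_in = set(region_uses) - set(region_defs)
    return sorted(live_in - set(explicit_vars))


# Region uses a, b, n and a temporary defined inside the region;
# the user wrote an explicit clause only for `a`.
candidates = needs_implicit_attribute(
    region_uses={"a", "b", "n", "tmp"},
    region_defs={"tmp"},
    explicit_vars={"a"},
)
print(candidates)
```

Running this late matters because earlier optimizations change the
inputs: if a variable is removed from the region or split into scalars
beforehand, the set of variables needing a mapping changes with it.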

### Redundant Clause Elimination

The data operations are modeled in a way where data entry operations
look like loads and data exit operations look like stores. Thus these
operations are intended to be optimized in the following ways:
* Be able to eliminate redundant operations such as when an `acc.copyin`
dominates another.
* Be able to hoist/sink such operations out of loops.

## Operations TOC

[include "Dialects/OpenACCDialectOps.md"]