============================================
Implementation plans for ``-fbounds-safety``
============================================
.. contents::
:local:
Gradual updates with experimental flag
======================================
The feature will be implemented as a series of smaller PRs and we will guard our
implementation with an experimental flag ``-fexperimental-bounds-safety`` until
the usable model is fully available. Once the model is ready for use, we will
expose the flag ``-fbounds-safety``.
Possible patch sets
-------------------
* External bounds annotations and the (late) parsing logic.
* Internal bounds annotations (wide pointers) and their parsing logic.
* Clang code generation for wide pointers with debug information.
* Pointer cast semantics involving bounds annotations (this could be divided
into multiple sub-PRs).
* CFG analysis for pairs of related pointer and count assignments and the likes.
* Bounds check expressions in AST and the Clang code generation (this could also
be divided into multiple sub-PRs).
Proposed implementation
=======================
External bounds annotations
---------------------------
The bounds annotations are C type attributes appertaining to pointer types. If
an attribute is added to the position of a declaration attribute, e.g., ``int
*ptr __counted_by(size)``, the attribute appertains to the outermost pointer
type of the declaration (``int *``).
New sugar types
---------------
An external bounds annotation creates a type sugar of the underlying pointer
types. We will introduce a new sugar type, ``DynamicBoundsPointerType`` to
represent ``__counted_by`` or ``__sized_by``. Using ``AttributedType`` would not
be sufficient because the type needs to hold the count or size expression as
well as some metadata necessary for analysis, while this type may be implemented
through inheritance from ``AttributedType``. Treating the annotations as type
sugars means two types with incompatible external bounds annotations may be
considered canonically the same types. This is sometimes necessary, for example,
to make the ``__counted_by`` and friends not participate in function
overloading. However, this design requires a separate logic to walk through the
entire type hierarchy to check type compatibility of bounds annotations.
Late parsing for C
------------------
A bounds annotation such as ``__counted_by(count)`` can be added to type of a
struct field declaration where count is another field of the same struct
declared later. Similarly, the annotation may apply to type of a function
parameter declaration which precedes the parameter count in the same function.
This means parsing the argument of bounds annotations must be done after the
parser has the whole context of a struct or a function declaration. Clang has
late parsing logic for C++ declaration attributes that require late parsing,
while the C declaration attributes and C/C++ type attributes do not have the
same logic. This requires introducing late parsing logic for C/C++ type
attributes.
Internal bounds annotations
---------------------------
``__indexable`` and ``__bidi_indexable`` alter pointer representations to be
equivalent to a struct with the pointer and the corresponding bounds fields.
Despite this difference in their representations, they are still pointers in
terms of types of operations that are allowed and their semantics. For instance,
a pointer dereference on a ``__bidi_indexable`` pointer will return the
dereferenced value same as plain C pointers, modulo the extra bounds checks
being performed before dereferencing the wide pointer. This means mapping the
wide pointers to struct types with equivalent layout won’t be sufficient. To
represent the wide pointers in Clang AST, we add an extra field in the
PointerType class to indicate the internal bounds of the pointer. This ensures
pointers of different representations are mapped to different canonical types
while they are still treated as pointers.
In LLVM IR, wide pointers will be emitted as structs of equivalent
representations. Clang CodeGen will handle them as Aggregate in
``TypeEvaluationKind (TEK)``. ``AggExprEmitter`` was extended to handle pointer
operations returning wide pointers. Alternatively, a new ``TEK`` and an
expression emitter dedicated to wide pointers could be introduced.
Default bounds annotations
--------------------------
The model may implicitly add ``__bidi_indexable`` or ``__single`` depending on
the context of the declaration that has the pointer type. ``__bidi_indexable``
implicitly adds to local variables, while ``__single`` implicitly adds to
pointer types specifying struct fields, function parameters, or global
variables. This means the parser may first create the pointer type without any
default pointer attribute and then recreate the type once the parser has the
declaration context and determined the default attribute accordingly.
This also requires the parser to reset the type of the declaration with the
newly created type with the right default attribute.
Promotion expression
--------------------
A new expression will be introduced to represent the conversion from a pointer
with an external bounds annotation, such as ``__counted_by``, to
``__bidi_indexable``. This type of conversion cannot be handled by normal
CastExprs because it requires an extra subexpression(s) to provide the bounds
information necessary to create a wide pointer.
Bounds check expression
-----------------------
Bounds checks are part of semantics defined in the ``-fbounds-safety`` language
model. Hence, exposing the bounds checks and other semantic actions in the AST
is desirable. A new expression for bounds checks has been added to the AST. The
bounds check expression has a ``BoundsCheckKind`` to indicate the kind of checks
and has the additional sub-expressions that are necessary to perform the check
according to the kind.
Paired assignment check
-----------------------
``-fbounds-safety`` enforces that variables or fields related with the same
external bounds annotation (e.g., ``buf`` and ``count`` related with
``__counted_by`` in the example below) must be updated side by side within the
same basic block and without side effect in between.
.. code-block:: c
typedef struct {
int *__counted_by(count) buf; size_t count;
} sized_buf_t;
void alloc_buf(sized_buf_t *sbuf, sized_t nelems) {
sbuf->buf = (int *)malloc(sizeof(int) * nelems);
sbuf->count = nelems;
}
To implement this rule, the compiler requires a linear representation of
statements to understand the ordering and the adjacency between the two or more
assignments. The Clang CFG is used to implement this analysis as Clang CFG
provides a linear view of statements within each ``CFGBlock`` (Clang
``CFGBlock`` represents a single basic block in a source-level CFG).
Bounds check optimizations
--------------------------
In ``-fbounds-safety``, the Clang frontend emits run-time checks for every
memory dereference if the type system or analyses in the frontend couldn’t
verify its bounds safety. The implementation relies on LLVM optimizations to
remove redundant run-time checks. Using this optimization strategy, if the
original source code already has bounds checks, the fewer additional checks
``-fbounds-safety`` will introduce. The LLVM ``ConstraintElimination`` pass is
design to remove provable redundant checks (please check Florian Hahn’s
presentation in 2021 LLVM Dev Meeting and the implementation to learn more). In
the following example, ``-fbounds-safety`` implicitly adds the redundant bounds
checks that the optimizer can remove:
.. code-block:: c
void fill_array_with_indices(int *__counted_by(count) p, size_t count) {
for (size_t i = 0; i < count; ++i) {
// implicit bounds checks:
// if (p + i < p || p + i + 1 > p + count) trap();
p[i] = i;
}
}
``ConstraintElimination`` collects the following facts and determines if the
bounds checks can be safely removed:
* Inside the for-loop, ``0 <= i < count``, hence ``1 <= i + 1 <= count``.
* Pointer arithmetic ``p + count`` in the if-condition doesn’t wrap.
* ``-fbounds-safety`` treats pointer arithmetic overflow as deterministically
two’s complement computation, not an undefined behavior. Therefore,
getelementptr does not typically have inbounds keyword. However, the compiler
does emit inbounds for ``p + count`` in this case because
``__counted_by(count)`` has the invariant that p has at least as many as
elements as count. Using this information, ``ConstraintElimination`` is able
to determine ``p + count`` doesn’t wrap.
* Accordingly, ``p + i`` and ``p + i + 1`` also don’t wrap.
* Therefore, ``p <= p + i`` and ``p + i + 1 <= p + count``.
* The if-condition simplifies to false and becomes dead code that the subsequent
optimization passes can remove.
``OptRemarks`` can be utilized to provide insights into performance tuning. It
has the capability to report on checks that it cannot eliminate, possibly with
reasons, allowing programmers to adjust their code to unlock further
optimizations.
Debugging
=========
Internal bounds annotations
---------------------------
Internal bounds annotations change a pointer into a wide pointer. The debugger
needs to understand that wide pointers are essentially pointers with a struct
layout. To handle this, a wide pointer is described as a record type in the
debug info. The type name has a special name prefix (e.g.,
``__bounds_safety$bidi_indexable``) which can be recognized by a debug info
consumer to provide support that goes beyond showing the internal structure of
the wide pointer. There are no DWARF extensions needed to support wide pointers.
In our implementation, LLDB recognizes wide pointer types by name and
reconstructs them as wide pointer Clang AST types for use in the expression
evaluator.
External bounds annotations
---------------------------
Similar to internal bounds annotations, external bound annotations are described
as a typedef to their underlying pointer type in the debug info, and the bounds
are encoded as strings in the typedef’s name (e.g.,
``__bounds_safety$counted_by:N``).
Recognizing ``-fbounds-safety`` traps
-------------------------------------
Clang emits debug info for ``-fbounds-safety`` traps as inlined functions, where
the function name encodes the error message. LLDB implements a frame recognizer
to surface a human-readable error cause to the end user. A debug info consumer
that is unaware of this sees an inlined function whose name encodes an error
message (e.g., : ``__bounds_safety$Bounds check failed``).
Expression Parsing
------------------
In our implementation, LLDB’s expression evaluator does not enable the
``-fbounds-safety`` language option because it’s currently unable to fully
reconstruct the pointers with external bounds annotations, and also because the
evaluator operates in C++ mode, utilizing C++ reference types, while
``-fbounds-safety`` does not currently support C++. This means LLDB’s expression
evaluator can only evaluate a subset of the ``-fbounds-safety`` language model.
Specifically, it’s capable of evaluating the wide pointers that already exist in
the source code. All other expressions are evaluated according to C/C++
semantics.
C++ support
===========
C++ has multiple options to write code in a bounds-safe manner, such as
following the bounds-safety core guidelines and/or using hardened libc++ along
with the `C++ Safe Buffer model
<https://discourse.llvm.org/t/rfc-c-buffer-hardening/65734>`_. However, these
techniques may require ABI changes and may not be applicable to code
interoperating with C. When the ABI of an existing program needs to be preserved
and for headers shared between C and C++, ``-fbounds-safety`` offers a potential
solution.
``-fbounds-safety`` is not currently supported in C++, but we believe the
general approach would be applicable for future efforts.