# How Chrome Accessibility Works, Part 3

This document explains the technical details behind Chrome accessibility
code by starting at a high level and progressively adding more levels of
detail.

See [Part 1](how_a11y_works.md) and
[Part 2](how_a11y_works_2.md) first.

[TOC]

## Abstracting platform-specific APIs

In [Part 1](how_a11y_works.md) we talked about how each platform has its own
accessibility API. Chromium originally had the platform-specific accessibility
APIs scattered throughout the code, but today a large fraction of the APIs for
Windows (including IAccessible, IAccessible2, and UI Automation), Linux, and
macOS have all been isolated and abstracted in one place that makes it
relatively easy to write cross-platform accessibility code.

These abstractions are all in the
[ui/accessibility/platform](
https://source.chromium.org/chromium/chromium/src/+/main:ui/accessibility/platform/)
directory.

First, gfx::NativeViewAccessible is a typedef used throughout Chromium to
represent an instance of the platform-specific accessible object on the current
platform. It's defined alongside gfx::NativeView, gfx::NativeEvent, and other
similar types that have equivalents on each platform. Note that these are not
wrappers or abstractions; they're just typedefs enabling you to write a function
that returns an instance of the appropriate type on each platform. For
accessibility, gfx::NativeViewAccessible is defined to be IAccessible* on
Windows, id on Mac (where 'id' is the type for a generic Objective-C object,
which has to implement the informal NSAccessibility protocol), and AtkObject* on
Linux.

The main class in ui/accessibility/platform is AXPlatformNode. When you call
AXPlatformNode::Create, you'll get back an object that implements the
correct interfaces for the platform you're running on - currently Windows,
macOS, and desktop Linux are supported.

For each AXPlatformNode, you need to provide an AXPlatformNodeDelegate - an
instance of a class that you implement in order to provide all of the
accessibility details about that node, in a cross-platform way.

While AXPlatformNodeDelegate is pure virtual, a base class is provided,
AXPlatformNodeDelegateBase, with default implementations of nearly all
of the virtual functions. You can inherit from AXPlatformNodeDelegateBase
and override only a few functions in order to easily get a working object.

As a brief sketch, if you had a custom-drawn button and you wanted to
make it accessible, you could define a subclass like this:

```C++
class MyButtonAXPlatformNodeDelegate
    : public AXPlatformNodeDelegateBase {
 public:
    // Members must be public so that callers can construct the delegate.
    MyButtonAXPlatformNodeDelegate()
        : AXPlatformNodeDelegateBase() {
        ...
    }

    const AXNodeData& GetData() const override {
        ...
    }

    int GetChildCount() const override {
        ...
    }

    gfx::NativeViewAccessible ChildAtIndex(int index) const override {
        ...
    }

    gfx::NativeViewAccessible GetParent() const override {
        ...
    }
};
```

Then to construct the accessible object, you could write this:

```C++
MyButtonAXPlatformNodeDelegate delegate;
AXPlatformNode* accessible = AXPlatformNode::Create(&delegate);
```

## Events

In the Chromium codebase, accessibility events are notifications sent from
the browser to assistive technology that something has happened. This is
the mechanism by which assistive technology can provide real-time feedback
as the user is interacting with the browser. Some common events found on
nearly all platforms include:

* Focus changed
* Control value changed
* Bounding box changed
* Children changed (a node added, removed, or reordered one or more children)
* Load complete (a web page finished loading)

While many platforms share the same types of events, they're not standardized
at all, and platforms have very different names for events and different
semantics around which events are fired when, and where. As a few examples:

* On macOS, there are separate events for expanding and collapsing a row
  in a table or tree, vs expanding or collapsing a pop-up menu
* On Android there's a separate event for the checked state changing, while
  on other platforms there's just a generic state changed event
* On Windows there are SHOW and HIDE events that need to be fired when
  a node or subtree is created or destroyed

In many cases, assistive technology is co-developed with initial accessibility
support in a platform's native widget toolkit. For example, TalkBack was
co-developed with the accessibility support in Android Views, and VoiceOver
was co-developed with AppKit's initial accessibility support (NSAccessibility).
One thing that invariably seems to happen is that event notifications get added
to let assistive technology know about changes to the state of the app.

Then when a new app comes along and needs to do some custom drawing or otherwise
implement some custom accessibility code, implementing those events ends up
being tricky. If the right events aren't fired in exactly the right order, the
assistive technology gets confused, since it was only built and tested with one
event sequence.

This forms an implicit contract between the server and client, but it's one
that's rarely properly documented.

For a cross-platform product like Chromium, which needs to fire the right set
of events on every platform it supports, this gets very tricky. In
the early days we tried to have Blink fire the superset of all events needed
on any platform, but this often resulted in duplicate events or subtle
bugs, and a tendency for an event-related fix for one platform to accidentally
break another platform.

Chrome's solution to this now is what we call "implicit" events. Blink, and
other parts of the codebase that build an accessibility tree, simply notify that
an accessibility node is dirty, or that an entire subtree is dirty. The
infrastructure crawls the dirty nodes, creates a tree mutation, and propagates
it to all client interfaces.

At the level of the client interface, we generate implicit events based on
changes to the accessibility tree as observed from that client's perspective,
using a class called AXEventGenerator.

This allows us to keep the code that implements a particular contract in one
place and eliminate subtle differences between different types of content.

### AXEventGenerator

AXEventGenerator is based on the idea of applying atomic updates to an
accessibility tree. As described in
[Data structures used by the accessibility cache](
how_a11y_works_2.md#Data-structures-used-by-the-accessibility-cache)
in part 2, an AXTree is a "live tree" that's currently being served,
and an AXTreeUpdate is a serializable data structure that represents
either a snapshot of a tree or an atomic update to apply to an existing
tree. When AXTree applies an atomic AXTreeUpdate, it allows listeners to
get callbacks for any changes that happened to the tree. In particular,
it keeps both the old and new data for each changing node temporarily
so that listeners can trigger actions based on changes.

AXEventGenerator is thus an AXTree listener. It considers every node
that changed in the tree and figures out what events to fire. It builds
up the set of events and continues modifying it until the atomic update
is finished, enabling it to consolidate and remove duplication.

As one example of that, a live region is a portion of a web page that
may trigger assistive technology to notify whenever an update occurs.
On some platforms, Chromium needs to fire a "live region changed"
event on the root of the live region whenever it changes.
AXEventGenerator keeps track of any changes that happen within a live
region and ensures that exactly one "live region changed" event is
fired on the live region root.
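
As a rough illustration of this consolidation, here is a minimal sketch. It is
not Chromium's actual AXEventGenerator: `NodeData`, `Event`, and
`EventGeneratorSketch` are hypothetical stand-ins invented for this example,
and only two kinds of node change are considered. The idea it shows is that
events are collected in a set while one atomic update is applied, so repeated
changes inside a live region collapse into a single event on its root.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins; not the real Chromium types.
enum class Event { kNameChanged, kValueChanged, kLiveRegionChanged };

struct NodeData {
  int id = 0;
  std::string name;
  std::string value;
  int live_region_root_id = 0;  // 0 means "not inside a live region".
};

// Collects (node id, event) pairs while one atomic update is applied.
class EventGeneratorSketch {
 public:
  // Called once per changed node, with both the old and the new data.
  void OnNodeDataChanged(const NodeData& old_data, const NodeData& new_data) {
    assert(old_data.id == new_data.id);
    bool changed = false;
    if (old_data.name != new_data.name) {
      events_.emplace(new_data.id, Event::kNameChanged);
      changed = true;
    }
    if (old_data.value != new_data.value) {
      events_.emplace(new_data.id, Event::kValueChanged);
      changed = true;
    }
    // Any change inside a live region is reported on the live region root.
    // Because events_ is a set, duplicates are dropped automatically.
    if (changed && new_data.live_region_root_id != 0) {
      events_.emplace(new_data.live_region_root_id,
                      Event::kLiveRegionChanged);
    }
  }

  // Called when the atomic update is finished; returns the consolidated
  // events and leaves the generator empty for the next update.
  std::set<std::pair<int, Event>> TakeEvents() {
    return std::exchange(events_, {});
  }

 private:
  std::set<std::pair<int, Event>> events_;
};
```

If two nodes inside the same live region change during one update, the set
returned by `TakeEvents()` contains one event per node change plus a single
`kLiveRegionChanged` entry for the shared root.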

There are a small number of exceptions - events that can't be fired
via AXEventGenerator. These are things that can't be inferred just
from tree changes. One such example is the "autocorrection occurred"
event. When the browser performs an autocorrection while the user is
typing, the state change just looks like any other edit. The event
ensures assistive technology can announce the autocorrection.

### Focus events

Focus events are one of the most important types of events: a focus change
is often the most important thing for assistive technology to announce,
and the focused node is the one that will be the target of any input
events.

However, one of the challenges with focus events is that there's only
one element on the entire desktop that has focus at any one time, but
individual windows or iframes might not always be aware of the global
state of the entire desktop at the time they experience a focus change
within their scope. This can lead to a race condition.

As an example, suppose that a user clicks a button in a web page, which after a
couple of seconds pops up a dialog and brings focus to an OK button. At the
same time, the user clicks on a different window to activate it, moving
focus to that window's active element.

Because the windows come from different processes, the two focus events
(from the first window's dialog, and from the second window's active element)
could arrive at the browser process in either order. Here's an illustration
of this race condition:

![This diagram illustrates a race condition where the user clicks a button
to open a dialog in one window, then before it opens activates another
window that focuses a text field. The focus events could arrive to the
browser process in either order.](
figures/focus_race.png)

From the standpoint of the browser, there's always only one node that has
focus. What's important here is that accessibility is completely consistent
with the browser in terms of reporting the correct node that has focus.

The solution here is that only the browser is the source of truth when it
comes to which window has focus. Once we know which window has focus, each
accessibility tree tells us which node has focus within that tree.

As a result, when a focus change happens in an AXTree, we can't just fire
a platform-specific focus event directly. Instead, we use that as a cue to
compute global focus and fire an update if needed. Here's an outline of the
algorithm:

* Anytime focus changes in any accessibility tree, OR when the focused
  window or iframe changes, recompute the focus.
* To compute focus, start with the focused window (or active window,
  depending on the platform). If focus is in web content,
  see what node is focused there. If that node is an iframe, recursively
  jump into that iframe to see what's focused.
* Take the resulting deepest focused node and compare it to the last focused
  node we computed. If it's different, fire a platform-specific accessibility
  focus event.

This ensures that accessibility focus events are always reliable and in sync.
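
The outline above can be sketched as follows. This is a simplified model under
stated assumptions, not the code Chromium uses: `TreeId`, `TreeFocusInfo`, and
`FocusTrackerSketch` are hypothetical names, tree IDs are plain strings, and
the "focused window" is reduced to the ID of that window's top-level tree,
which the browser is assumed to supply.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified model; all names are illustrative.
using TreeId = std::string;

struct FocusedNode {
  TreeId tree_id;
  int node_id = 0;
  bool operator==(const FocusedNode& other) const {
    return tree_id == other.tree_id && node_id == other.node_id;
  }
};

struct TreeFocusInfo {
  int focused_node_id = 0;  // Node focused within this tree; 0 if none.
  // Maps an iframe host node in this tree to the tree of its child frame.
  std::map<int, TreeId> child_trees;
};

class FocusTrackerSketch {
 public:
  std::map<TreeId, TreeFocusInfo> trees;
  TreeId focused_window_tree;  // The browser is the source of truth for this.

  // Starts at the focused window's tree and descends through focused iframes.
  std::optional<FocusedNode> ComputeFocus() const {
    std::optional<FocusedNode> result;
    TreeId current = focused_window_tree;
    // The depth bound guards against accidental cycles between trees.
    for (std::size_t depth = 0; depth <= trees.size(); ++depth) {
      auto it = trees.find(current);
      if (it == trees.end() || it->second.focused_node_id == 0)
        break;
      result = FocusedNode{current, it->second.focused_node_id};
      auto child = it->second.child_trees.find(it->second.focused_node_id);
      if (child == it->second.child_trees.end())
        break;  // The focused node is not an iframe; this is the deepest node.
      current = child->second;
    }
    return result;
  }

  // Call whenever focus changes in any tree or the focused window changes.
  // Returns true if a platform focus event should be fired.
  bool UpdateFocus() {
    std::optional<FocusedNode> focus = ComputeFocus();
    if (focus == last_focus_)
      return false;
    last_focus_ = focus;
    return focus.has_value();
  }

 private:
  std::optional<FocusedNode> last_focus_;
};
```

In this model, a focus change inside a background window's tree triggers a
recomputation but produces no event, because the deepest focused node under
the focused window is unchanged.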

No other accessibility events have the same issue. Events like value changed,
selection changed, etc., are safe to fire even if a window is in the
background. Some assistive technology may be paying attention to background
windows.

## Actions

In Chromium accessibility terminology, Actions flow the opposite direction from
events. Actions are when assistive technology wants to modify or interact with
the app on behalf of the user, such as clicking a button, selecting text, or
changing a control value.

Note that screen readers rely very heavily on events, and partially on
actions. Users often use a combination of accessibility actions along with the
keyboard to directly drive an application, or have the screen reader warp the
mouse cursor directly to an element and simulate a click on that element.

In contrast, assistive technology such as voice control makes heavy use of
actions and relies much less on events. Voice control relies heavily on
actions that enable directly changing control values, entering text,
activating buttons and links, and scrolling the page.

Other assistive technology, such as magnifiers, falls in between - it may
follow focus events closely but also make heavy use of scroll actions.

For the most part implementing actions is relatively straightforward.
The action is received by the part of the code that implements the
platform-specific accessibility APIs. It forwards the action to the
corresponding accessibility wrapper node in Blink, and that node
calls the appropriate internal APIs to directly manipulate the underlying
element, such as clicking a button or changing the value of a control.

One minor complication is that on many platforms, actions are supposed to
return a success/failure code. Because actions are forwarded to Blink and
carried out asynchronously, Chromium can't know for sure whether an action
succeeded at the time it has to answer, so it returns success if the action
seems valid, even though there's a chance it might not actually succeed.

## Hit testing

One specific special case of an action is a hit test. This is an API where
the assistive technology gives the x, y coordinates of a location on the
screen and asks the application (Chromium) to return which accessible
object is at that location.

Applications of hit testing include:

* Touch exploration on a touch-screen, or features to describe the
  element as you're hovering over it with the mouse
* Using accessibility debuggers where you can click on an element and
  get its accessibility properties

Unfortunately on some platforms a hit test is a synchronous API. This is
a challenge because it's difficult to properly compute the correct element
at a location given just the accessibility tree, but blocking to wait for
a proper hit test in the render process can lead to deadlock and jankiness.
So Chromium employs the following approach:

* The first time a hit test is received, it does an approximate hit test
  based on the bounding boxes in the accessibility tree. This often returns
  the correct result, but could fail in cases of complex layering or
  non-rectangular objects.
* Subsequently, it makes an async call to the render process to do a proper
  hit test and get the correct resulting element, and also the visible
  bounding box of that element.
* The next hit test that's received, if the coordinates are within the
  bounding box of the most recent proper hit test result, it returns that
  result, which is correct. If the coordinates are outside of that bounding
  box, go back to the first step.

This algorithm works very well in practice when the user is moving the mouse
or dragging their finger across the screen, because we get dozens of hit
tests per second. At the edges of objects, the wrong result may be returned
for a few milliseconds, but as soon as the async result comes back, the
correct result is then returned.

So for interactive use, it's quite seamless and reliable for users, while
still providing reasonable behavior in the less common circumstances where
a single hit test is called.
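
The caching approach described above can be sketched like this. It is a
minimal model, not Chromium's implementation: `Rect`, `Box`, and
`HitTestCacheSketch` are hypothetical, the approximate hit test simply picks
the last listed box containing the point, and the async call to the render
process is represented only by a flag.

```cpp
#include <cassert>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Rect {
  int x, y, width, height;
  bool Contains(int px, int py) const {
    return px >= x && px < x + width && py >= y && py < y + height;
  }
};

struct Box {
  int node_id;
  Rect bounds;
};

class HitTestCacheSketch {
 public:
  // Boxes are assumed to be listed so that later entries are drawn on top.
  explicit HitTestCacheSketch(std::vector<Box> boxes)
      : boxes_(std::move(boxes)) {}

  // Synchronous hit test. Returns a node id, or 0 if nothing was hit.
  int HitTest(int x, int y) {
    // If the point is inside the last proper result, that result is correct.
    if (cached_ && cached_->bounds.Contains(x, y))
      return cached_->node_id;
    // Otherwise request a proper hit test (a real implementation would make
    // an async call here) and answer with an approximation right away.
    pending_request_ = true;
    return ApproximateHitTest(x, y);
  }

  // Called when the proper hit test result arrives from the render process.
  void OnAsyncHitTestResult(int node_id, Rect visible_bounds) {
    assert(visible_bounds.width >= 0 && visible_bounds.height >= 0);
    cached_ = Box{node_id, visible_bounds};
    pending_request_ = false;
  }

  bool has_pending_request() const { return pending_request_; }

 private:
  // Approximation based only on bounding boxes: the topmost box wins.
  int ApproximateHitTest(int x, int y) const {
    for (auto it = boxes_.rbegin(); it != boxes_.rend(); ++it) {
      if (it->bounds.Contains(x, y))
        return it->node_id;
    }
    return 0;
  }

  std::vector<Box> boxes_;
  std::optional<Box> cached_;
  bool pending_request_ = false;
};
```

The design choice here is that the synchronous entry point never blocks: it
always answers from data already in the browser process, and the async result
only improves later answers.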

## Relative coordinates

Up until now we've hinted at the fact that every node in the accessibility
tree stores a bounding box, but we haven't gone into much detail as to
how that bounding box is stored.

If we always stored the bounding box in screen coordinates, then every time
a window is dragged or scrolled, or any time any part of the page moves or
scrolls, all of the affected bounding boxes would need to be recomputed,
which would involve a lot of recomputation and sending information from
render processes to the browser process.

To minimize that work, in Chromium accessibility nodes store relative
coordinates.

In particular, every node stores the following fields in a struct
called AXRelativeBounds:

```C++
struct AX_BASE_EXPORT AXRelativeBounds final {
  int offset_container_id;
  Rect bounds;
  Optional<Transform> transform;
};
```

The first field is the ID of the node's container, which can be any ancestor of
a node. That's the node that the bounds are relative to.

The next field is the local bounding rect, relative to that container.

The last field is an optional 4x4 transformation matrix, which can be
used to encode things like scale factors or even 3-D rotations. If this
concept is unfamiliar to you, search for tutorials on 4x4 transformation
matrices in the context of 3-D computer graphics.

Computing the global bounding rect of a node is meant to be straightforward.
Start with the local rect. As long as the node isn't the root, keep walking
to the container node, applying the transformation matrix and adding the
bounds origin as you go.
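
A minimal sketch of that walk follows. Several things here are assumptions
made for illustration: the 4x4 matrix is replaced by a hypothetical
`Transform2D` (uniform scale plus translation), the root is marked with a
container ID of -1, nodes live in a plain `std::map`, and the exact order in
which offsets and transforms are combined is a guess rather than a statement
of what Chromium does. Scroll offsets and clipping are ignored.

```cpp
#include <cassert>
#include <map>
#include <optional>

// Hypothetical, simplified types for illustration only.
struct RectF {
  float x, y, width, height;
};

// Stand-in for a 4x4 matrix: a uniform scale followed by a translation.
struct Transform2D {
  float scale = 1.0f;
  float dx = 0.0f;
  float dy = 0.0f;
  RectF MapRect(const RectF& r) const {
    return {r.x * scale + dx, r.y * scale + dy, r.width * scale,
            r.height * scale};
  }
};

constexpr int kNoContainer = -1;  // Assumed marker for the root node.

struct RelativeBounds {
  int offset_container_id;
  RectF bounds;
  std::optional<Transform2D> transform;
};

// Converts a node's local rect to global coordinates by walking containers.
RectF ComputeGlobalBounds(const std::map<int, RelativeBounds>& nodes,
                          int node_id) {
  assert(nodes.count(node_id) == 1);
  const RelativeBounds* node = &nodes.at(node_id);
  RectF rect = node->bounds;
  if (node->transform)
    rect = node->transform->MapRect(rect);
  while (node->offset_container_id != kNoContainer) {
    const RelativeBounds& container = nodes.at(node->offset_container_id);
    // Move into the container's coordinate space, then apply its transform.
    rect.x += container.bounds.x;
    rect.y += container.bounds.y;
    if (container.transform)
      rect = container.transform->MapRect(rect);
    node = &container;
  }
  return rect;
}
```

With this representation, scrolling or scaling a container only changes that
container's own entry; descendants keep their stored bounds and pick up the
change when their global bounds are next computed.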

In addition, there are a couple of other fields relevant to the bounds
computation that are stored as sparse attributes in AXNodeData. These also
affect the bounds computation.

* bool clips_children;
* int x_scroll_offset;
* int y_scroll_offset;

For more information on bounding boxes, clipping, and offscreen, see
[Offscreen, Invisible and Size](offscreen.md).

## Text bounding boxes

Most platform-specific accessibility APIs have a number of features
specifically to deal with text. Some of those APIs allow querying the
bounding box of an arbitrary range of text - often the text caret or
selection, but not necessarily. Applications include:

* Highlighting text as it's read aloud
* Scrolling one particular text range into view
* Drawing highlights around the caret or selection to make it easier
  for users to see them

Because these APIs are synchronous, they must be served directly out of the
accessibility cache. That means that the accessibility cache needs to have
enough information to be able to retrieve the bounding box of any arbitrary
range of text on-screen.

It would require quite a bit of memory to store the bounding box of
every individual character. To save memory, the following representation
is used:

In the accessibility tree, we keep track of text nodes called "inline text
boxes". This corresponds to a similar concept in Blink, which is also
sometimes called a "text run". The idea is that given a single text node,
the text can be broken down into a sequence of text runs that each have
the following properties:

* Each text run is on a single line
* Text runs in a single direction (left-to-right, for example)
* The characters in that text run are all contiguous

In the most common scenario, a single text node contains multiple
lines of text (potentially due to automatic wrapping with soft
line breaks). In the accessibility tree that node would have multiple
inline text box children, one for each line.

Imagine we have the following paragraph, in a container narrow enough that
it wraps as follows:

```
The quick brown fox
jumps over the
lazy dog.
```

In the accessibility tree, it might be represented like this:

```
Paragraph
    Static Text "The quick brown fox jumps over the lazy dog."
        Inline text box "The quick brown fox "
        Inline text box "jumps over the "
        Inline text box "lazy dog."
```

Each inline text box comes with its own bounding box and text direction.
Then, to store the bounding box of every character, all we need to
do is store the width of each character. Since we know all of the
characters are written continuously in a line going the same direction,
we can use the bounds of the inline text box and the width of each
character to compute the bounding box of any individual character.

The AXPosition class abstracts most of this computation.
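
To make the per-character computation concrete, here is a minimal sketch. It
is not how AXPosition is implemented: `InlineTextBox`, `TextDirection`, and
`GetCharacterBounds` are hypothetical names for this example, only horizontal
text directions are handled, and each character is assumed to have exactly one
width entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified types for illustration only.
struct Rect {
  int x, y, width, height;
};

enum class TextDirection { kLeftToRight, kRightToLeft };

struct InlineTextBox {
  std::string text;
  Rect bounds;  // Bounding box of the whole text run.
  TextDirection direction;
  std::vector<int> character_widths;  // One entry per character.
};

// Derives one character's bounding box from the run's box and the widths.
Rect GetCharacterBounds(const InlineTextBox& box, std::size_t index) {
  assert(index < box.character_widths.size());
  // Sum the widths of all characters that come before this one in the run.
  int offset = 0;
  for (std::size_t i = 0; i < index; ++i)
    offset += box.character_widths[i];
  int width = box.character_widths[index];
  // Left-to-right runs grow from the left edge; right-to-left runs grow
  // from the right edge of the run's bounding box.
  int x = box.direction == TextDirection::kLeftToRight
              ? box.bounds.x + offset
              : box.bounds.x + box.bounds.width - offset - width;
  return {x, box.bounds.y, width, box.bounds.height};
}
```

The bounding box of a range of text within one run could then be computed by
taking the union of the boxes for the first and last characters in the range.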

## Iframes

The last piece of complexity to address is that up until now we've
assumed that a single web page corresponds to a single frame, and
therefore that a web page lives in a single process.

In Chromium, for security reasons iframes can also be running in
separate processes. This isn't always the case - for one thing, if
system resources are low, Chromium won't keep creating new processes,
and also, frames from the same origin (i.e. from the same website)
need to be in the same process so they can communicate synchronously
via JavaScript. But, frames from different sites can be in different
processes so accessibility code needs to deal with that.

The essential challenge is that each frame, which may be in its own
process, needs to maintain an accessibility tree - but the end result
needs to be stitched together into a final resulting accessibility
tree in the browser process. Iframes are mostly just an implementation
detail; users and assistive technology are rarely concerned with this
detail.

In order to stitch frames together:

* Each accessibility tree gets a globally unique ID, which we call an
  AXTreeID. For security reasons this is an UnguessableToken.
* An iframe element in an accessibility tree contains the AXTreeID of its
  child frame.
* In the browser process, we keep a hash map of all of the trees, and
  also cache the reverse direction (i.e. the map from the root of a
  tree to its parent node).
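
The bookkeeping in the list above can be sketched as follows. This is only an
illustration: `TreeId` is a plain string standing in for AXTreeID, and `Tree`
and `TreeRegistrySketch` are hypothetical names. For brevity the sketch never
removes stale reverse entries when a tree is updated or destroyed, which a
real implementation would need to handle.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, simplified types for illustration only.
using TreeId = std::string;  // Stand-in for an unguessable AXTreeID.

struct Tree {
  TreeId id;
  // Maps an iframe host node in this tree to the ID of its child tree.
  std::map<int, TreeId> child_tree_ids;
};

class TreeRegistrySketch {
 public:
  void AddOrUpdateTree(const Tree& tree) {
    assert(!tree.id.empty());
    trees_[tree.id] = tree;
    // Cache the reverse direction: child tree -> (parent tree, host node).
    for (const auto& [host_node_id, child_id] : tree.child_tree_ids)
      parent_of_[child_id] = std::make_pair(tree.id, host_node_id);
  }

  // Returns the tree with the given ID, or nullptr if it is unknown.
  const Tree* GetTree(const TreeId& id) const {
    auto it = trees_.find(id);
    return it == trees_.end() ? nullptr : &it->second;
  }

  // Returns (parent tree ID, host node ID) for the root of a child tree,
  // or nothing if the tree is not embedded in another tree.
  std::optional<std::pair<TreeId, int>> GetParentOfTreeRoot(
      const TreeId& id) const {
    auto it = parent_of_.find(id);
    if (it == parent_of_.end())
      return std::nullopt;
    return it->second;
  }

 private:
  std::map<TreeId, Tree> trees_;
  std::map<TreeId, std::pair<TreeId, int>> parent_of_;
};
```

Walking down uses the host node's child tree ID plus `GetTree`, and walking up
from a tree's root uses `GetParentOfTreeRoot`, so the stitched tree can be
traversed in both directions without copying nodes between trees.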

In order to reduce complexity, Chromium accessibility is built around the
concept that every frame is its own accessibility tree, no matter whether
the frame is in a different process or not. The advantage of this approach
is that the same codepath is used whether iframes are in the same process
or a remote process. If iframes break, they all break - that simplifies
testing and reduces the number of cases to consider.

The concept of embedding one accessibility tree in another using an
AXTreeID is also exploited even more in Chrome OS accessibility, where
it's used to embed Android applications and more.