
# GPU Expectation Files

This page covers the details of the expectation files, which are critical for
ensuring that GPU tests only run where they should and that flakes are
suppressed to avoid red bots.

[TOC]

## Overview

The GPU Telemetry-based integration tests (tests that use the
`telemetry_gpu_integration_test` target)
[utilize expectation files][gpu_expectations] in order to define when certain
tests should not be run or are expected to fail. The core expectation format is
defined by [typ][typ_expectations], although there are some Chromium-specific
extensions as well. Each expectation consists of the following fields, separated
by a space:

1. An optional bug identifier. While optional, it is heavily encouraged that GPU
   expectations have this field filled.
1. A set of tags that the expectation applies to. This is technically optional,
   as omitting tags will cause the expectation to be applied everywhere, but
   there are very few, if any, instances where tags will not be specified for
   GPU expectations.
1. The name of the test that the expectation applies to. A single wildcard (`*`)
   character is allowed at the end of the string, but use of a wildcard anywhere
   but the end of the string is an error.
1. A set of expected results for the test. This technically supports multiple
   values, but for GPU purposes, it will always be a single value.

Additionally, comments are supported, which begin with `#`.

Thus, a sample expectation entry might look like:

```
# Flakes regularly but infrequently.
crbug.com/1234 [ win amd ] foo/test [ RetryOnFailure ]
```

[gpu_expectations]: https://chromium.googlesource.com/chromium/src/+/main/content/test/gpu/gpu_tests/test_expectations
[typ_expectations]: https://chromium.googlesource.com/catapult.git/+/main/third_party/typ/typ/expectations_parser.py

## Core Format

The following are further details on each of the parts of an expectation that
are part of the core expectation file format.

### Bug Identifier

One or more optional strings pointing to the bug(s) tracking the reason why the
expectation exists. For GPU uses, this is usually a single bug, but multiple
space-separated bug strings are supported.
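
For example, an expectation tracked by two bugs (illustrative bug numbers,
tags, and test name) might look like:

```
crbug.com/1234 crbug.com/5678 [ linux nvidia ] foo/test [ Failure ]
```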

The format of the string is enforced by [these][bug_regexes] regular
expressions, so CLs that introduce malformed bugs will not be submittable.

[bug_regexes]: https://chromium.googlesource.com/chromium/src/+/e26d89a52627f8910b79a95668dfa48e5fe8fa06/content/test/gpu/gpu_tests/test_expectations_unittest.py#66

### Tags

One or more tags are used to specify which configuration(s) an expectation
applies to. For GPU tests, this is often things such as the OS, the GPU vendor,
or the specific GPU model.

Tag sets are defined at the top of the expectation file using `# tags:`
comments. Each comment defines a different set of mutually exclusive tags, e.g.
all of the OS tags are in a single set. An expectation is only allowed to use
one tag from each set, but can use tags from an arbitrary number of sets. For
example, `[ win win10 ]` would be invalid since both are OS tags, but
`[ win amd release ]` would be valid since there is one tag each from the OS,
GPU, and browser type tag sets.
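
As an illustrative sketch (the headers in the real GPU expectation files
contain many more tags and tag sets), tag set definitions look like:

```
# tags: [ android linux mac win ]
# tags: [ amd intel nvidia ]
# tags: [ debug release ]
```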

Additionally, the tags used by different expectations for the same test must be
unambiguous, so that the same test cannot have multiple expectations applied to
it at once.
Take the following expectations as an example:

```
[ mac intel ] foo/test [ Failure ]
[ mac debug ] foo/test [ RetryOnFailure ]
```

These expectations would be considered to be conflicting since `[ mac intel ]`
does not make any distinctions about the browser type, and `[ mac debug ]` does
not make any distinctions about the GPU type. As written, `foo/test` running
on a configuration that produced the `mac`, `intel`, and `debug` tags would try
to use both expectations.

This can be fixed by adding a tag from the same tag set but with a different
value so that the configurations are no longer ambiguous.
`[ mac intel release ]` would work since a configuration cannot be both
`release` and `debug` at the same time. Similarly, `[ mac amd debug ]` would
work since a configuration cannot be both `intel` and `amd` at the same time.
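
For example, the conflicting pair above could be disambiguated like so (one
possible fix):

```
[ mac intel release ] foo/test [ Failure ]
[ mac debug ] foo/test [ RetryOnFailure ]
```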

Such conflicts will be caught and reported by presubmit tests, so you should not
have to worry about accidentally landing bad expectations, but you will need to
fix any conflicts that are found before you can submit your CL.

#### Adding/Modifying Tags

Actually updating the test harness to generate new tags is out of scope for this
documentation. However, if a new tag needs to be added to an expectation file
or an existing one modified (e.g. renamed), it is important to note that the
tag header should not be manually modified in the expectation file itself.

Instead, modify the header in [validate_tag_consistency.py] and run
`validate_tag_consistency.py apply` to apply the new header to all expectation
files. This ensures that all files remain in sync.
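
For example, from the root of a Chromium checkout (the exact Python invocation
may vary depending on your setup):

```
python3 content/test/gpu/validate_tag_consistency.py apply
```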

Tag consistency is checked as part of presubmit, so it will be apparent if you
accidentally modify the tag header in a file directly.

[validate_tag_consistency.py]: https://chromium.googlesource.com/chromium/src/+/main/content/test/gpu/validate_tag_consistency.py

### Test Name

A single string with either a test name or part of a test name suffixed with a
wildcard character. Note that the test name is just the test case as reported
by the test harness, not the fully qualified name that is sometimes reported in
places such as the "Test Results" tab on bots.

As an example,
`gpu_tests.webgl1_conformance_integration_test.WebGL1ConformanceIntegrationTest.WebglExtension_EXT_blend_minmax`
is a fully qualified name, while `WebglExtension_EXT_blend_minmax` is what would
actually be used in the expectation file for the `webgl1_conformance` suite.
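
For example, a trailing wildcard can be used to apply an expectation to a whole
family of tests (illustrative bug and tags):

```
crbug.com/1234 [ win ] WebglExtension_* [ Failure ]
```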

### Expected Results

Usually one, but potentially multiple, results that are expected on the
configuration that the expectation is for. Like tags, expected results are
defined at the top of each expectation file and have the same caveat about
addition/modification via the helper script. However, unlike tags, there is
only one set of values, and it is not expected to be added to or changed on any
sort of regular basis. The following expected results are used by GPU tests:

#### Skip

Skips the test entirely. The benefit of this is that no time is wasted on a bad
test. However, it also means that it is impossible to check if the test is still
failing or not by just looking at historical results. This is problematic for
humans, but even more problematic for the scripts we use to automatically
remove expectations that are no longer needed.

As such, it is heavily discouraged to add new Skip expectations except under the
following circumstances:

1. The test is invalid on a configuration for some reason, e.g. a feature is not
   and will not be supported on a certain OS, and so should never be run. These
   sorts of expectations are expected to be permanent.
1. The act of running the test is significantly detrimental to other tests, e.g.
   running the test kills the test device. These are expected to be temporary,
   so the root cause should be fixed relatively quickly.
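
As an example, a permanent Skip of the first kind might look like the following
(illustrative bug, tags, and test name):

```
# Feature is not and will not be supported on Android.
crbug.com/1234 [ android ] foo/unsupported_feature_test [ Skip ]
```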

If presubmit thinks you are adding new Skip expectations, it will warn you, but
the warning can be ignored if the addition falls into one of the above
categories or it is a false positive, such as due to modifying tags on an
existing expectation.

#### Failure

Lets the test run normally, but hides the fact that it failed during result
reporting. This is the preferred way to suppress frequent failures on bots, as
it keeps the bots green while still reporting results that can be used later.

#### RetryOnFailure

Allows the test to be retried up to two additional times before being marked as
failing, since by default GPU tests do not retry on failure. This is preferred
if the test fails occasionally, but not often enough to warrant a `Failure`
expectation.

#### Slow

Only has an effect in a subset of test suites. Currently, those are suites that
use a heartbeat mechanism instead of a fixed timeout:

* `webgpu_cts`
* `webgl1_conformance`
* `webgl2_conformance`

Since these tests use a relatively short timeout that gets refreshed as long as
the test does not hang, they are more susceptible to timeouts if the test does a
lot of work or other parallel tests are using a large number of resources. In
these cases, the `Slow` expectation can be used to increase the heartbeat
timeout for a test, reducing the chance that one of these timeouts is hit.

If the reported failure for a test is along the lines of "Timed out waiting for
websocket message", prefer to use a `Slow` expectation first over a `Failure` or
`RetryOnFailure` one.
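
An illustrative `Slow` expectation (hypothetical bug, tags, and test name):

```
crbug.com/1234 [ mac intel ] foo/heavy_test [ Slow ]
```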

## Extensions

In addition to the normal expectation functionality, Chromium has several
extensions to the expectation file format.

### Unexpected Pass Finder Annotations

Chromium has several unexpected pass finder scripts (sometimes called stale
expectation removers) to automatically reclaim test coverage by modifying
expectation files. These mostly work as intended, but can occasionally make
changes that don't align with what we actually want. Thus, there are several
annotations that can be inserted into expectation files to adjust the behavior
of these scripts.

#### Disable

There are several annotations that can be used to prevent the scripts from
automatically removing expectations. All of these start with `finder:disable`
with some suffix.

`finder:disable-general` prevents the expectation from being removed under any
circumstances.

`finder:disable-stale` prevents the expectation from being removed if it is
still applicable to at least one bot, but all queried results point to the
expectation no longer being needed. This is most likely to be used for
expectations for very infrequent flakes, where the flake might not occur within
the data range that we query.

`finder:disable-unused` prevents the expectation from being removed if it is
found to not be used on any bots, i.e. the specified configuration does not
appear to actually be tested. This is most likely to be used for expectations
for failures reported by third parties with their own testing configurations.

`finder:disable-narrowing` prevents the expectation from having its scope
automatically narrowed to only apply to configurations that are found to need
it. This is most likely to be used for expectations that are intentionally
broad in order to suppress failures that there are no plans to fix.

All of these annotations can either be used inline for a single expectation:

```
[ mac intel ] foo/test [ Failure ]  # finder:disable-general
```

or with their `finder:enable` equivalent for blocks:

```
# finder:disable-general
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:enable-general
```

Nested blocks are not allowed. A `finder:disable` annotation can be followed
by a description of why the disable is necessary, which the script will output
when it encounters a case where one of the disabled expectations would have
been removed if the annotation were not present:

```
# finder:disable-stale Very low flake rate
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:enable-stale
```

#### Group Start/End

There may be cases where groups of expectations should only be removed together,
e.g. if a flake affects a large number of tests but the chance of any individual
test hitting the flake is low. In these cases, the expectations can be grouped
together, so that one expectation is only removed if all of them would be
removed.

```
# finder:group-start Some group description or name
[ mac intel ] foo/test [ Failure ]
[ mac intel ] bar/test [ Failure ]
# finder:group-end
```

The group name/description is required and is used to uniquely identify each
group. This means that groups with the same name string in different parts of
the file will be treated as the same group, as if they were all in a single
group block together.

```
# finder:group-start group_name
[ mac ] foo/test [ Failure ]
[ mac ] bar/test [ Failure ]
# finder:group-end

...

# finder:group-start group_name
[ android ] foo/test [ Failure ]
[ android ] bar/test [ Failure ]
# finder:group-end
```

is equivalent to

```
# finder:group-start group_name
[ mac ] foo/test [ Failure ]
[ mac ] bar/test [ Failure ]
[ android ] foo/test [ Failure ]
[ android ] bar/test [ Failure ]
# finder:group-end
```