# Addressing Flaky GTests
## Understanding builder results
[LUCI Analysis](https://luci-analysis.appspot.com/p/chromium/clusters) lists the
top clusters of flaky tests, along with any associated bugs and failure counts
in different contexts.
## Reproducing the flaky test
If debugging via bot is too slow or you otherwise need to drill further into the
cause of the flake, you can try to reproduce the flake locally. Reproducing the
flake can be difficult, so it can help to try to replicate the test environment
as closely as possible.
Copy the GN args from one of the bots where the flake occurs, and try to choose
a bot close to your system, e.g. linux-rel if you're building on Linux. To get
the GN args, you can again click on the timestamp in the flake portal to view
the bot run details, and search for the "lookup GN args" build step to copy the
args.
![bot_gn_args]
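The exact args vary per builder, but a release Linux configuration will
typically look something like the following. This is a hypothetical sketch for
orientation only; always copy the real args from the build step above:

```
gn gen out/linux-rel --args='
  is_debug=false
  is_component_build=false
  symbol_level=1
  dcheck_always_on=true
'
```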
Build and run the test locally. Depending on the frequency of the flake, it may
take some time to reproduce. Some helpful flags (see the example command below):
- `--gtest_repeat=100`
- `--gtest_also_run_disabled_tests` (if the flaky test(s) you're looking at have
been disabled)
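For example, assuming the flaky test lives in the `browser_tests` target and
was built into the `out/linux-rel` directory from above:

```
out/linux-rel/browser_tests --gtest_filter="*<YOUR_TEST_NAME_HERE>*" --gtest_repeat=100 --gtest_also_run_disabled_tests
```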
If you're unable to reproduce the flake locally, you can also try uploading a
patch with extra debug logging and the flaky test enabled, then run tryjobs on
it to reproduce the flake with more information.
Another good option is *Swarming*, which lets you mimic bot conditions to
better reproduce flakes that occur on CQ bots.
### Swarming
For a more detailed dive into Swarming, see the
[debugging with swarming](https://chromium.googlesource.com/chromium/src/+/master/docs/workflow/debugging-with-swarming.md#authenticating)
guide.
As an example, suppose we have built Chrome with the GN args from above into
the directory `out/linux-rel`. We can then run this command from within the
`chromium/src` directory:
```
tools/run-swarmed.py out/linux-rel browser_tests -- --gtest_filter="*<YOUR_TEST_NAME_HERE>*" --gtest_repeat=20 --gtest_also_run_disabled_tests
```
This allows us to quickly iterate, using the logs from each run to reproduce
flakes and even fix them!
>TODO: Add more tips for reproducing flaky tests
## Debugging the flaky test
If the test is flakily timing out, consider any asynchronous code that may
cause race conditions, where the test subject may exit early and miss a
callback, or complete faster than the test can start waiting for it (i.e. make
sure event listeners are registered before the event is triggered; see the
sketch below). Also make sure event listeners wait for the proper event instead
of a proxy (e.g. [Wait for the correct event in
test](https://chromium.googlesource.com/chromium/src/+/6da09f7510e94d2aebbbed13b038d71c511d6cbc)).
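A minimal sketch of the safe ordering using `base::RunLoop`, where `widget`,
`SetDoneCallback()`, and `DoAsyncWork()` are hypothetical stand-ins for the
code under test:

```
#include "base/run_loop.h"

// Register the listener *before* triggering the event. If these two
// lines were swapped, a fast completion could fire before the callback
// is bound, and run_loop.Run() would hang until the test times out.
base::RunLoop run_loop;
widget->SetDoneCallback(run_loop.QuitClosure());  // hypothetical API
widget->DoAsyncWork();                            // hypothetical API

// Returns once the done callback runs the quit closure.
run_loop.Run();
```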
Consider possible bugs in the system or test infrastructure (e.g. [races in
glibc](https://bugs.chromium.org/p/chromium/issues/detail?id=1010318)).
For browsertest flakes, consider possible inter-process issues, such as the
renderer taking too long or returning something unexpected (e.g. [flaky
RenderFrameHostImplBrowserTest](https://bugs.chromium.org/p/chromium/issues/detail?id=1120305)).
For browsertest flakes that check EvalJs results, make sure test objects are not
destroyed before JS may read their values (e.g. [flaky
PaymentAppBrowserTest](https://chromium.googlesource.com/chromium/src/+/6089f3480c5036c73464661b3b1b6b82807b56a3)).
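For instance, a minimal sketch of the hazard, assuming a `web_contents` in
scope and hypothetical `DestroyPaymentApp()` / `getResult()` helpers
(`content::EvalJs` itself is the real test utility):

```
// BAD: tears down state the script still needs, so the EvalJs result
// flakily depends on what the renderer sees after destruction.
DestroyPaymentApp();
EXPECT_EQ("ok", content::EvalJs(web_contents, "getResult()"));

// GOOD: read the value while the object is still alive, then tear down.
EXPECT_EQ("ok", content::EvalJs(web_contents, "getResult()"));
DestroyPaymentApp();
```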
For browsertest flakes that involve dialogs or widgets, make sure that test
objects are not destroyed because focus is lost on the dialog (e.g. [flaky
AccessCodeCastHandlerBrowserTest](https://chromium-review.googlesource.com/c/chromium/src/+/3951132)).
## Preventing similar flakes
Once you understand the problem and have a fix for the test, think about how
the fix may apply to other tests, or whether documentation can be improved,
either in the relevant code or in this flaky test documentation.
[bot_gn_args]: images/bot_gn_args.png