chromium/testing/perf/confidence/ratio_bootstrap_estimator.h

// Copyright 2024 The Chromium Authors
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

#ifndef TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_
#define TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_

// This class provides construction of ratios between two measured stochastic
// quantities, including confidence intervals. The intended use is performance
// benchmarking (so that one can say with reasonable certainty that “binary A
// is 5% faster than binary B”), but it could in theory be used for other
// things as well.
//
// This is not a simple problem. The naïve way is to assume normality
// and use the standard rule-of-thumb that 95% CI is +/- 1.96 sigma,
// but benchmarks are frequently very non-normal (and we frequently
// do not have enough samples for the central limit theorem to come
// into play); in particular, measurements tend to have a long tail
// into slowness. The division makes it even worse, of course (the ratio
// of two normally distributed variables is not itself normally distributed,
// even in the limit). Pinpoint tries to sidestep this using
// the Mann-Whitney U test, which has hardly any assumptions at all
// (its inputs do not even need to be numbers, just anything that
// can be ordered), but it has very low power, cannot produce confidence
// intervals, and fundamentally measures something different from
// what people typically expect (the question “which one would most often
// win a race” is distinct from “which one has the lowest average”,
// giving a potentially problematic advantage to distributions with
// longer tails).
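//
// To make the contrast concrete, the naïve approach amounts to something
// like this sketch (illustrative only; “v” is a std::vector<double> of
// samples, and <numeric>/<cmath> are assumed):
//
//   double mean = std::accumulate(v.begin(), v.end(), 0.0) / v.size();
//   double var = 0.0;
//   for (double x : v) var += (x - mean) * (x - mean);
//   var /= v.size() - 1;  // Unbiased sample variance.
//   double half = 1.96 * std::sqrt(var / v.size());  // The 1.96-sigma rule.
//   // Naïve 95% CI for the mean: [mean - half, mean + half]. Misleading
//   // for skewed benchmark data, and says nothing about the ratio.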
//
// This class uses the nonparametric bootstrap, a randomized method.
// A full introduction to the bootstrap is out of scope, but the basic
// idea is that instead of just doing r = sum(A) / sum(B), we resample the
// input data lots of times and compute r over those resamples.
// The distribution of r then becomes more amenable to normal statistical
// methods. (This feels like cheating by just drawing more numbers out of
// thin air from the data we have, but it has a sound statistical basis.)
// We use the “bias-corrected and accelerated” (BCa) method for computing
// the confidence intervals, which is the generally recommended method
// these days; it automatically corrects for second-order bias and skew
// effects, under only fairly mild assumptions about the underlying
// distribution. We follow the exposition in this 2020 paper (which is
// basically an explanation of the original 1987 paper):
//
//   Efron, Narasimhan: “The automatic construction of bootstrap confidence
//   intervals”
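//
// To give a flavor of the resampling, here is a minimal sketch of a plain
// percentile bootstrap for the ratio (illustrative only; the actual class
// applies the BCa corrections to the same resampled distribution, and
// <algorithm> is assumed in addition to the includes below):
//
//   std::mt19937_64 gen(seed);
//   std::uniform_int_distribution<size_t> pick_a(0, a.size() - 1);
//   std::uniform_int_distribution<size_t> pick_b(0, b.size() - 1);
//   std::vector<double> ratios;
//   for (int i = 0; i < kNumResamples; ++i) {
//     double sum_a = 0.0, sum_b = 0.0;
//     for (size_t j = 0; j < a.size(); ++j) sum_a += a[pick_a(gen)];
//     for (size_t j = 0; j < b.size(); ++j) sum_b += b[pick_b(gen)];
//     ratios.push_back(sum_a / sum_b);  // r over this resample.
//   }
//   std::sort(ratios.begin(), ratios.end());
//   // The plain percentile 95% CI reads off the 2.5% and 97.5% quantiles
//   // of “ratios”; BCa instead shifts and scales those quantile positions
//   // to correct for bias and skew.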
//
// Even though the bootstrap is a very flexible method, it still makes
// some assumptions, in particular that the samples are independent and
// identically distributed. In practice, this may be violated because e.g.:
//
//   a) Something happened during one or more of the runs that impacted
//      their times (not identically distributed), e.g. a heavy cron job
//      or thermal throttling after the first N samples. This can cause
//      outliers, which bootstrapping is not inherently immune to.
//      If you have an uncontrolled measurement environment, consider
//      filtering outliers or using warm-up runs to reduce the impact.
//   b) One or both of the sides had very lucky or unlucky code or data
//      placement, unrelated to the patch itself, causing a bias
//      (samples are not independent). This is generally hard to guard
//      against, but one thing to keep in mind is to always run with
//      binary names of the same length (i.e., don't call one of your binaries
//      _old, since argv[0] is put on the stack and this can perturb the
//      layout if you are unlucky).
//
// It also does not solve the problem of multiple comparisons (you could
// consider e.g. Bonferroni correction), and it does not do well with
// extremely high confidence levels unless you have lots of samples (e.g., a
// 99.9999% CI with only 100 samples will quickly run into the edge of the
// empirical distribution, giving endpoints that cannot be trusted).
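//
// As a worked example of the former: comparing m = 10 benchmarks while
// keeping a family-wise 95% error budget, Bonferroni would have you request
// each individual interval at 1 - 0.05/10 = 99.5% confidence.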
//
// I've spot-checked the results against the bcajack R package,
// which is the authors' own implementation of the methods in the paper;
// of course, in a system based on randomness, it's impossible to get
// exactly the same numbers. We don't do the block-based jackknife
// optimization for computing the acceleration (which takes it down from
// O(n²) to O(n)), as we generally only have a couple hundred samples.
// We also don't implement the calculation of error bars stemming from
// the resampling randomness (so-called internal error).
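//
// (For reference, the acceleration is the standard jackknife estimate from
// the paper: with theta_(i) denoting the statistic recomputed with sample i
// left out, and theta_(.) the mean of those leave-one-out estimates,
//
//   a = sum_i (theta_(.) - theta_(i))^3
//       / (6 * (sum_i (theta_(.) - theta_(i))^2)^(3/2))
//
// i.e., n leave-one-out estimates at O(n) each, hence O(n²) in total.)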
//
// This class is generally written to be independent of Blink, for slightly
// more general reuse (though the test is part of blink_unittests). It is
// not intended to be part of the main Chromium build, only used in
// auxiliary utilities.

#include <stdint.h>

#include <random>
#include <vector>

class RatioBootstrapEstimator {};

#endif  // TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_