// Copyright 2024 The Chromium Authors // Use of this source code is governed by a BSD-style license that can be // found in the LICENSE file. #ifndef TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_ #define TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_ // This class provides construction of ratios between two measured stochastic // quantities, including confidence intervals. The intended use is performance // benchmarking (so that one can say with reasonable certainty that “binary A // is 5% faster then binary B”), but it could in theory be used for other things // as well. // // This is not a simple problem. The naïve way is to assume normality // and use the standard rule-of-thumb that 95% CI is +/- 1.96 sigma, // but benchmarks are frequently very non-normal (and we frequently // do not have enough samples for the central limit theorem to come // into play); in particular, measurements tend to have a long tail // into slowness. The division makes it even worse, of course (dividing // two normal distributions by each other does not become another normal // distribution, even in the limit). Pinpoint tries to sidestep this using // the Mann-Whitney U test, which has hardly any assumptions at all // (its inputs do not even need to be numbers, just anything that // can be ordered) but it has very low power, cannot produce confidence // intervals and fundamentally measures something different from // what people typically expect (the question “which one would most often // win a race” is distinct from “which one has the lowest average”, // giving a potentially problematic advantage to distributions with // longer tails). // // This class uses the nonparametric bootstrap, a randomized method. // A full introduction to the bootstrap is out of scope, but the basic // idea is that instead of just doing r = sum(A) / sum(B), we resample the // input data lots of times, and computing r over those resamples. // The distribution of r then becomes more amenable to normal statistical // methods. (This feels like cheating by just drawing more numbers out of // thin air from the data we have, but it has a sound statistical basis.) // We use the “bias-corrected and accelerated” (BCa) method for computing // the confidence intervals, which is the generally recommended method // these days; it automatically corrects for second-order bias and skew // effects, with only fairly reasonable assumptions about the underlying // distribution. We follow the exposition in this 2020 paper (which is // basically an explanation of the original 1987 paper): // // Efron, Narasimhan: “The automatic construction of bootstrap confidence // intervals” // // Even though the bootstrap is a very flexible method, it still makes // some assumptions, in particular that the samples are independent and // identically distributed. In practice, this may be violated because e.g.: // // a) Something happened during one or more of the runs that impacted // their times (not identically distributed), e.g. a heavy cron job // or thermal throttling after the first N samples. This can cause // outliers, which bootstrapping is not inherently immune against. // If you have an uncontrolled measurement environment, consider // filtering outliers or warm-up runs to reduce the impact. // b) One or both of the sides had very lucky or unlucky code or data // placement, unrelated to the patch itself, causing a bias // (samples are not independent). This is generally hard to do something // about, but at least one thing to keep in mind is to always run with // binary names of the same length (i.e., don't call one of your binaries // _old, since argv[0] is put on the stack and this can perturb the // layout if you are unlucky). // // It also does not solve the problem of multiple comparisons (you could // consider e.g. Bonferroni correction), and it does not do well with extremely // high confidence levels unless you have lots of samples (e.g., CI=99.9999% // with 100 samples will quickly hit the edge of the empirical distribution, // which is not correct). // // I've spot-checked the results against the bcajack R package, // which is the authors' own implementation of the methods in the paper; // of course, in a system based on randomness, it's impossible to get // exactly the same numbers. We don't the optimization of block-based // jackknife for computing the acceleration (which takes it down from // O(n²) to O(n)), as we generally only have a couple hundred samples. // We also don't implement the calculation of error bars stemming from // the resampling randomness (so-called internal error). // // This class is generally written to be independent of Blink, for slightly // more general reuse (though the test is part of blink_unittests). It is // not intended to be part of the main Chromium build, only used in // auxiliary utilities. #include <stdint.h> #include <random> #include <vector> #include "third_party/blink/renderer/core/core_export.h" class RatioBootstrapEstimator { … }; #endif // TESTING_PERF_CONFIDENCE_RATIO_BOOTSTRAP_ESTIMATOR_H_