# Debugging Memory Issues

This page is designed to help Chromium developers debug memory issues.

When in doubt, reach out to [email protected].

[TOC]

## Investigating Reproducible Memory Regression

Let's say that there's a CL or feature that reproducibly increases memory usage
when it's landed/enabled, given a particular set of repro steps.

* Take a look at [the documentation](/docs/memory/README.md) for both
  taking and navigating memory-infra traces.
* Take two memory-infra traces. One with the reproducible memory regression, and
  one without.
* Load the memory-infra traces into two tabs.
* Compare the memory dump providers and look for the one that shows the
  regression. Follow the relevant link.
    * [The regression is in the Malloc MemoryDumpProvider.](#Regression-in-Malloc-MemoryDumpProvider)
    * [The regression is in a non-Malloc
      MemoryDumpProvider.](#Regression-in-Non-Malloc-MemoryDumpProvider)
    * [The regression is only observed in **private
      footprint**.](#Regression-only-in-Private-Footprint)
    * [No regression is observed.](#No-observed-regression)

### Regression in Malloc MemoryDumpProvider

Repeat the above steps, but this time also [take a heap
dump](#Taking-a-Heap-Dump). Confirm that the regression is also visible in the
heap dump, and then compare the two heap dumps to find the difference. You can
also use
[diff_heap_profiler.py](https://cs.chromium.org/chromium/src/third_party/catapult/experimental/tracing/bin/diff_heap_profiler.py)
to perform the diff.
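
To illustrate what "comparing the two heap dumps" amounts to, here is a minimal sketch of a diff over two simplified profiles. The `Profile` type (call-site label mapped to live bytes) and the `DiffProfiles` helper are hypothetical and exist only for this illustration; they are not the format or logic used by `diff_heap_profiler.py`.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified profile: call-site label -> live bytes attributed
// to that call site. Real heap dumps carry full stack traces.
using Profile = std::map<std::string, int64_t>;
using Delta = std::pair<std::string, int64_t>;

// Returns (call site, after - before) pairs, largest growth first.
// A call site missing from one profile counts as 0 bytes in that profile.
std::vector<Delta> DiffProfiles(const Profile& before, const Profile& after) {
  std::map<std::string, int64_t> delta;
  for (const auto& entry : after) delta[entry.first] += entry.second;
  for (const auto& entry : before) delta[entry.first] -= entry.second;

  std::vector<Delta> result(delta.begin(), delta.end());
  std::sort(result.begin(), result.end(),
            [](const Delta& a, const Delta& b) { return a.second > b.second; });
  return result;
}
```

The entries at the front of the returned vector are the call sites whose memory grew the most between the two dumps, which is usually where to start looking.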

### Regression in Non-Malloc MemoryDumpProvider

Hopefully the MemoryDumpProvider has sufficient information to help diagnose the
leak. If the leaked object is allocated via malloc or new (it usually should
be), you can also use the steps for debugging a Malloc MemoryDumpProvider
regression.

### Regression only in Private Footprint

* Repeat the repro steps, but instead of taking a memory-infra trace, use
  the following tools to map the process's virtual space:
    * On macOS, use `vmmap`.
    * On Windows, use Sysinternals VMMap.
    * On other OSes, use `/proc/<pid>/smaps`.
* The results should help diagnose what's happening. Contact the
  [email protected] mailing list for more help.
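
On systems that expose `/proc/<pid>/smaps`, the per-mapping counters can be summed to get a quick overview before reading individual mappings. The sketch below is only an illustration: `SumSmapsFields` is a hypothetical helper, and which fields matter for a given regression (for example `Private_Dirty` or `Swap`) is a judgment call, not something this page defines.

```cpp
#include <cstdint>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Sums every "<Field>: <N> kB" line of smaps-formatted text, keyed by field
// name. The result values are in kB.
std::map<std::string, uint64_t> SumSmapsFields(std::istream& in) {
  std::map<std::string, uint64_t> totals_kb;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string name, unit;
    uint64_t value = 0;
    // Mapping header lines and lines without a kB value (e.g. VmFlags) do
    // not match this shape and are skipped.
    if (fields >> name >> value >> unit && unit == "kB" &&
        name.back() == ':') {
      name.pop_back();  // Drop the trailing ':'.
      totals_kb[name] += value;
    }
  }
  return totals_kb;
}
```

To use it on a live process, open `/proc/<pid>/smaps` with `std::ifstream` and pass the stream to `SumSmapsFields`. Running it before and after the repro steps shows which counters moved.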

### No observed regression

* If there isn't a regression in PrivateMemoryFootprint, then this might become
  a question of semantics for what constitutes a memory regression. Common
  problems include:
    * Shared Memory, which is hard to attribute, but is mostly accounted for in
      the memory-infra trace.
    * Binary size, which is currently not accounted for anywhere.

## Investigating Heap Dumps From the Wild

For a small set of Chrome users in the wild, Chrome will record and upload
anonymized heap dumps. This has the benefit of wider coverage for real code
paths, at the expense of reproducibility.

These heap dumps can take some time to grok, but frequently yield valuable
insight. At the time of this writing, heap dumps from the wild have resulted in
real, high impact bugs being found in Chrome code ~90% of the time.

For an example investigation of a real heap dump, see [this
link](/docs/memory/investigating_heap_dump_example.md).

* Raw heap dumps can be viewed in the trace viewer. [See detailed
  instructions](/docs/memory-infra/heap_profiler.md#how-to-manually-browse-a-heap-dump).
  This interface surfaces all available information, but can be overwhelming and
  is usually unnecessary for investigating heap dumps.
    * Important note: Heap profiling in the field uses
      [Poisson process sampling](https://bugs.chromium.org/p/chromium/issues/detail?id=810748)
      with a rate parameter of 10000. This means that for large/frequent allocations
      [e.g. >100 MB], the noise will be quite small [much less than 1%]. But
      there is noise so counts will not be exact.
* The heap dump summary typically contains all information necessary to diagnose
  a memory issue.
  * The stack trace of the potential memory leak is almost always sufficient to
    tell the type of object being leaked, since most functions in Chrome
    have a limited number of calls to new and malloc.
* The next thing to do is to determine whether the memory usage is intentional.
  Very rarely, components in Chrome legitimately need to use many 100s of MBs of
  memory. In this case, it's important to create a
  [MemoryDumpProvider](https://cs.chromium.org/chromium/src/base/trace_event/memory_dump_provider.h)
  to report this memory usage, so that we have a better understanding of which
  components are using a lot of memory. For an example, see
  [Issue 813046](https://bugs.chromium.org/p/chromium/issues/detail?id=813046).
* Assuming the memory usage is not intentional, the next thing to do is to
  figure out what is causing the memory leak.
    * The most common cause is adding elements to a container with no limit.
      Usually the code makes assumptions about how frequently it will be called
      in the wild, and something breaks those assumptions. Or sometimes the code
      to clear the container is not called as frequently as expected [or at
      all]. [Example
      1](https://bugs.chromium.org/p/chromium/issues/detail?id=798012). [Example
      2](https://bugs.chromium.org/p/chromium/issues/detail?id=804440).
    * Retain cycles for ref-counted objects.
      [Example](https://bugs.chromium.org/p/chromium/issues/detail?id=814334#c23)
    * Straight up leaks resulting from incorrect use of APIs. [Example
      1](https://bugs.chromium.org/p/chromium/issues/detail?id=801702#c31).
      [Example
      2](https://bugs.chromium.org/p/chromium/issues/detail?id=814444#c17).
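
As an illustration of the most common cause above (a container that grows without limit), the following sketch shows one way to cap growth. `BoundedHistory` is a hypothetical class written for this example, not Chromium code, and evicting the oldest entry is only one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <utility>

// Hypothetical history buffer that keeps only the newest `max_size` entries.
class BoundedHistory {
 public:
  explicit BoundedHistory(std::size_t max_size) : max_size_(max_size) {}

  void Add(std::string entry) {
    entries_.push_back(std::move(entry));
    // Evict the oldest entries so memory use stays capped no matter how
    // often Add() is called.
    while (entries_.size() > max_size_) entries_.pop_front();
  }

  std::size_t size() const { return entries_.size(); }

  // Requires size() > 0.
  const std::string& oldest() const {
    assert(!entries_.empty());
    return entries_.front();
  }

 private:
  const std::size_t max_size_;
  std::deque<std::string> entries_;
};
```

With an explicit limit, the container's worst-case memory use no longer depends on assumptions about how often the code is called in the wild.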

## Taking a Heap Dump

Navigate to chrome://flags and search for **memlog**. There are several options
that can be used to configure heap dumps. All of these options are also
available as command line flags, for automated test runs [e.g. telemetry].

* `#memlog` controls which processes are profiled. It's also possible to
  manually specify the process via the interface at `chrome://memory-internals`.
* `#memlog-in-process` runs the profiling service within the Chrome browser
  process. By default, the service runs as a separate dedicated process.
* `#memlog-sampling-rate` specifies the sampling interval in bytes. A lower
  interval yields a more precise profile, at the cost of performance. The
  default value is 100KB, which is enough to observe allocation sites that
  allocate more than 500KB in total, where the total is the size of a single
  allocation multiplied by the number of such allocations at the same call site.
* `#memlog-stack-mode` describes the type of metadata recorded for each
  allocation. `native` stacks provide the most utility. The only time the other
  options should be considered is for Android official builds, most of which do
  not support `native` stacks.
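
To build intuition for the sampling interval, the sketch below uses the textbook model of Poisson sampling over the stream of allocated bytes. Treating Chrome's profiler as following exactly this model is an assumption made for illustration, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Mean number of sample points that land inside `total_bytes` of allocations
// when sample points follow a Poisson process with mean spacing
// `interval_bytes`.
double ExpectedSamplePoints(double total_bytes, double interval_bytes) {
  assert(interval_bytes > 0);
  return total_bytes / interval_bytes;
}

// Probability that a single allocation of `size_bytes` contains at least one
// sample point, i.e. that the allocation shows up in the profile at all.
double SampleProbability(double size_bytes, double interval_bytes) {
  assert(interval_bytes > 0);
  return 1.0 - std::exp(-size_bytes / interval_bytes);
}
```

Under this model, a call site that allocates 500KB in total with a 100KB interval receives about 5 sample points on average, which is why sites near that size are only just observable. A single allocation much smaller than the interval is rarely sampled on its own.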

Once the flags have been set appropriately, restart Chrome and take a
memory-infra trace. The results will have a heap dump.

## Investigating Memory Corruption

If you can reproduce the corruption locally,
run sanitizers (e.g.
[ASan](https://chromium.googlesource.com/chromium/src/+/HEAD/docs/asan.md))
to locate and fix the undefined behavior (UB).

Otherwise, you can look into a
[minidump](https://sites.google.com/a/google.com/crash/users/how-to/manually-debug-a-minidump)
(Googlers-only link) if one is available.

### Known Memory Poisoning Patterns

A memory allocation goes through multiple states,
and its payload sometimes has a distinctive pattern in each of them.
You may also see some variance in the lower bits, introduced by
e.g. an offset within a `struct`.

#### Memory held by the OS

* All memory comes from the OS and is returned to the OS at some point.
* Accessing memory that has already been returned to the OS will likely crash.
* Large allocations (>= ~1 MiB) tend to go back to the OS quickly when
  freed, while smaller allocations are mostly reused.

#### Memory held by the allocator

* The allocator holds the memory region borrowed from the OS in a free-list.
* Payload and behavior are implementation-specific.
* In Chrome, we use
  [PartitionAlloc](/base/allocator/partition_allocator/PartitionAlloc.md) as the
  main allocator.
  * We embed some data in the payload, so the original payload from before
    `free()` may or may not be overwritten.
  * Writes to `free()`d memory may be caught as "free-list corruption".
* The following patterns can be written at this stage:
  * `0xCDCDCDCDCDCDCDCD`: when allocation gets returned to PartitionAlloc.
    * Shows up only in `PA_BUILDFLAG(EXPENSIVE_DCHECKS_ARE_ON)` builds.

#### Quarantined Memory

* Optionally, the allocator may keep `free()`d memory in quarantine
  for a while before returning it into a free-list to detect and mitigate
  UaF bugs.
* The following patterns can be written at this stage:
  * `0xCDCDCDCDCDCDCDCD`: PartitionAlloc's `FreeFlags::kZap`.
    * As of Aug. 2024 this is used by only [AMSC](https://docs.google.com/document/d/12OM0CSKgKv6NhM9YylSqAAXiV_f4uMgYgaH8KABUe-o/edit?usp=sharing).
  * `0xEFEFEFEFEFEFEFEF`: In [BRP](https://chromium.googlesource.com/chromium/src/+/HEAD/base/memory/raw_ptr.md) quarantine.
    * You are using a dangling pointer to access an invalidated memory region.
  * `0xEFED????????8000`: In [LUD](https://docs.google.com/document/d/1xfGa_IMtFZiQ3beOmkncEafODwn4U90ZyL4NfPaAtDY/edit?usp=sharing&resourcekey=0-89BZl1SVILB6ylOHula0IA) quarantine.
    * (Googlers-only) You may have access to the `free()` stack trace on crashpad.
  * `0xECEC????????8000`: In [E-LUD](https://docs.google.com/document/d/1_9TSOtQuPR3NjorLDjAkuloi8lYqblb6Ykt5nbVnh9I/edit?usp=sharing) quarantine.


#### Memory allocation you officially own

In principle, once initialized you should only see values written
by your code while your allocation is alive.
However, in rare cases, you may see values from Write-after-Free.

```txt
void YourFunc() {              | void TheirFunc() {
                               |   int* p1 = new int;
                               |   delete p1;
  // The allocator may         |
  // redistribute `p1` to `p2` |
  int* p2 = new int;           |
  *p2 = 123;                   |
                               |   // Write-after-Free
                               |   *p1 = 456;
  // 456 may show up           |
  printf("%d\n", *p2);         |
}                              | }
```

...or values from Double-Free.

```txt
void YourFunc() {              | void TheirFunc() {
                               |   int* p1 = new int;
                               |   delete p1;
  // The allocator may         |
  // redistribute `p1` to `p2` |
  int* p2 = new int;           |
  *p2 = 123;                   |
                               |   // Double-Free
                               |   delete p1;
                               |
                               |   // The allocator may
                               |   // redistribute `p2` to `p3`
                               |   int* p3 = new int;
                               |   *p3 = 456;
  // 456 may show up           |
  printf("%d\n", *p2);         |
}                              | }
```

* The following patterns can be written at this stage:
  * `0x0000000000000000`: [zero initialization](https://en.cppreference.com/w/cpp/language/zero_initialization).
  * `0x0000000000000000`: PartitionAlloc's `AllocFlags::kZeroFill`.
    * This payload is written as a part of memory allocation but requires
      explicit opt-in e.g. `calloc()`.
  * `0xABABABABABABABAB`: PartitionAlloc's newly allocated memory.
    * Shows up only in `PA_BUILDFLAG(EXPENSIVE_DCHECKS_ARE_ON)` builds.
    * MSan should be capable of catching these kinds of reads from uninitialized
      regions.
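
When triaging a crash, it can help to check a suspicious 64-bit value against the non-zero patterns listed on this page. The sketch below is a hypothetical helper written for illustration, not part of Chromium or PartitionAlloc. As an assumption, it ignores the low 16 bits of the repeated-byte patterns to tolerate small offsets, and it matches the LUD and E-LUD patterns on their top 16 bits only.

```cpp
#include <cstdint>

// Hypothetical labels for the patterns documented on this page.
enum class Poison {
  kNone,
  kFreedOrZapped,   // 0xCDCDCDCDCDCDCDCD
  kBrpQuarantine,   // 0xEFEFEFEFEFEFEFEF
  kLudQuarantine,   // 0xEFED????????8000
  kEludQuarantine,  // 0xECEC????????8000
  kUninitialized,   // 0xABABABABABABABAB
};

// Classifies `value` by its upper bits. The low 16 bits are ignored so that a
// small offset added to a poisoned pointer (e.g. a field offset within a
// struct) still matches.
Poison ClassifyPoison(uint64_t value) {
  switch (value >> 16) {
    case 0xCDCDCDCDCDCDull:
      return Poison::kFreedOrZapped;
    case 0xEFEFEFEFEFEFull:
      return Poison::kBrpQuarantine;
    case 0xABABABABABABull:
      return Poison::kUninitialized;
    default:
      break;
  }
  switch (value >> 48) {
    case 0xEFEDull:
      return Poison::kLudQuarantine;
    case 0xECECull:
      return Poison::kEludQuarantine;
    default:
      return Poison::kNone;
  }
}
```

A match is only a hint: ordinary data can contain the same bytes by coincidence, so confirm against the stack trace and build configuration before drawing conclusions.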


#### Memory allocation owned by someone else

You may see random values written by someone else
if you keep using pointers to a `free()`d region.

```txt
void YourFunc() {              | void TheirFunc() {
  int* p1 = new int;           |
  *p1 = 123;                   |
  delete p1;                   |
                               |   // The allocator may
                               |   // redistribute `p1` to `p2`
                               |   int* p2 = new int;
                               |   *p2 = 456;
  // Use-after-Free;           |
  // 456 may show up           |
  printf("%d\n", *p1);         |
}                              | }
```