# Investigating Out of Memory crashes

A large fraction of process crashes in Chromium are due to Out Of Memory (OOM)
conditions. This page is meant to help Chromium developers understand the
resulting stack traces and investigate them. Note that some of the
documentation here only applies to Google Chrome, as it is specific to the way
Google's crash reporting infrastructure aggregates and reports crashes.

Some of the following also assumes that the `malloc()` implementation is
PartitionAlloc, which, as of 2022, is the case on most platforms.

[TOC]

## Identifying OOM crashes

When a process crashes due to an Out Of Memory condition, this is usually
signaled by the presence of `base::internal::OnNoMemoryInternal()` on the stack.

**Google Chrome only:** the crash reporting infrastructure tags these as "[Out of
Memory]" based on this and other function names. The full list is determined in
the (internal) crash server's code.

Since Chromium configures its memory allocators to prefer crashing rather than
returning `nullptr`, an OOM crash can be triggered from anywhere in the code,
and most commonly from within the allocator, or higher-level functions such as
`operator new` in C++.

## Distinguishing between underlying causes

### Different causes

A process can reach an OOM condition for several reasons:

* **The OS is truly out of memory**, regardless of how much memory the *current*
  process is using
* **Some limit inside the OS is reached**. For instance, on Windows, there
  exists a global "commit limit", which is the amount of memory that the system
  can commit. Note that it is possible to commit more memory than what is
  actually in use. This may also happen on Linux systems configured with no or
  limited "overcommit", though the majority of systems don't have a limit.
* **Virtual address space exhaustion**. This is most likely to happen for relatively
  large allocations, on 32 bit systems, where total addressable space is
  typically 2GiB (most Windows systems), 3GiB (e.g. some Windows configurations,
  Linux) or 4GiB (e.g. WoW64). However, it may also happen on 64 bit systems,
  either due to:
    * Limited virtual addressable space in the CPU/OS. For instance most Android
      ARM64 systems have only 40 bits of address space as of 2022.
    * "Cage" exhaustion. This is most likely to happen with PartitionAlloc on 64
      bit systems, where all allocations are grouped into a single contiguous
      virtual address space "cage".
* **Sandbox per-process memory limit**. For some process types (e.g. Renderers)
  and on most platforms, the sandbox enforces a maximum per-process memory
  limit. Given that this limit is typically set at the OS level, it may not be
  distinguishable from e.g. commit limit exhaustion.
* **Excessive allocation size**. Some allocators (notably PartitionAlloc)
  purposely limit the maximum allocation size.
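
To give a sense of scale for the address space sizes mentioned above, the short sketch below converts a number of address bits into a size. The helper names are illustrative only and do not correspond to anything in Chromium.

```python
GIB = 1 << 30  # bytes in one GiB


def addressable_bytes(bits: int) -> int:
    """Size in bytes of an address space with the given number of address bits."""
    return 1 << bits


def format_gib(n_bytes: int) -> str:
    """Render a byte count in GiB."""
    return f"{n_bytes / GIB:g} GiB"


if __name__ == "__main__":
    # 31 bits matches the 2GiB case, 32 bits the 4GiB case, and 40 bits the
    # Android ARM64 case listed above.
    for bits in (31, 32, 40):
        print(f"{bits} bits: {format_gib(addressable_bytes(bits))}")
```

Even 40 bits (1024 GiB) can be exhausted when large contiguous ranges are reserved up front, which is why address space failures are not limited to 32 bit systems.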

### Identifying the cause

In the case of PartitionAlloc, it is possible to distinguish some of these cases:

* **Virtual address space exhaustion**. This is identified by the presence of
  `PartitionOutOfMemoryMappingFailure()` on the stack. It means that the
  allocator was unable to find enough address space, either for its internal
  memory allocation unit size, or the requested size. Since memory is *not*
  committed at this step, this signals an address space issue.
* **Commit failure**. This is identified by the presence of
  `PartitionOutOfMemoryCommitFailure()` on the stack. This signals that either
  the OS commit limit or the sandbox limit has been reached.
* **Excessive allocation size**. Shown by `PartitionExcessiveAllocationSize()`
  on the stack.
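
When triaging many reports, the mapping above can be applied mechanically. The following is a minimal sketch, assuming the stack is available as a list of symbolized frame strings. The function name `classify_oom` and the returned labels are hypothetical, and only the marker function names come from this page.

```python
# Marker functions named on this page, mapped to the cause they indicate.
CAUSE_MARKERS = [
    ("PartitionOutOfMemoryMappingFailure", "virtual address space exhaustion"),
    ("PartitionOutOfMemoryCommitFailure", "commit failure (OS or sandbox limit)"),
    ("PartitionExcessiveAllocationSize", "excessive allocation size"),
]
GENERIC_OOM_MARKER = "OnNoMemoryInternal"


def classify_oom(frames):
    """Return a best-guess cause for a list of symbolized stack frames."""
    for marker, cause in CAUSE_MARKERS:
        if any(marker in frame for frame in frames):
            return cause
    if any(GENERIC_OOM_MARKER in frame for frame in frames):
        return "OOM, cause not identified"
    return "no OOM marker found"


if __name__ == "__main__":
    stack = [
        "base::internal::OnNoMemoryInternal(unsigned long)",
        "PartitionOutOfMemoryCommitFailure(...)",
    ]
    print(classify_oom(stack))
```

Substring matching is used so that namespaces and argument lists in the symbolized names do not matter.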


## What to do?

### Commit Limit Reached

The process is "truly" out of memory, or the system is. Some amount of these
crashes is expected, and the crashing location is not necessarily the
culprit. Indeed, as a rough approximation, the failing allocation is more likely
to be from a component naturally allocating a lot of memory, e.g. V8 or
rendering.

However, if there is a spike, and many stack traces come from an unusual
location (e.g. newly added code), this may signal a memory leak in the component
on the stack, or excessive temporary allocations.

Also, if `PartitionAllocDirectMap()` is on the stack, the memory allocation was
large. It may come from a large buffer, potentially made worse by buffer
resizing. For instance, `std::vector` implementations often double their
capacity when they run out of room, in which case `reserve()`-ing the right
size ahead of time may help.
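
The cost of geometric growth can be estimated with a small model. The sketch below is not a measurement of any real `std::vector`: it assumes a growth factor of 2 (the actual factor is implementation-defined) and that the old and new buffers are both alive while elements are moved. The function name `growth_cost` is hypothetical.

```python
def growth_cost(n_elements, elem_size, reserved=False, factor=2):
    """Model appending n_elements one by one to a growable buffer.

    Returns (allocations, peak_bytes). peak_bytes counts the old and the new
    buffer together, since both exist while elements are being moved.
    """
    if reserved:
        # A single up-front allocation of exactly the right size.
        return 1, n_elements * elem_size
    capacity = 0
    allocations = 0
    peak = 0
    while capacity < n_elements:
        # Grow geometrically, but always by at least one element.
        new_capacity = max(capacity + 1, int(capacity * factor))
        peak = max(peak, (capacity + new_capacity) * elem_size)
        allocations += 1
        capacity = new_capacity
    return allocations, peak


if __name__ == "__main__":
    n = 1025  # just past a power of two: the worst case for doubling
    print("grown:   ", growth_cost(n, elem_size=1))
    print("reserved:", growth_cost(n, elem_size=1, reserved=True))
```

Under these assumptions, a buffer that just crossed a power-of-two boundary transiently needs roughly three times the memory of the final payload, which is the kind of spike that can push a large allocation over the limit.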

### Excessive allocation size

Is the calling code expected to allocate more than 2GiB? Or is it an underflow
somewhere in the calling code?
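
To illustrate how an underflow turns into a huge request, the sketch below emulates unsigned 64-bit `size_t` subtraction in Python. The helper names are hypothetical, and the 64-bit width is an assumption (the width of `size_t` is platform dependent).

```python
SIZE_T_BITS = 64  # assumption: 64-bit size_t
SIZE_T_MAX = (1 << SIZE_T_BITS) - 1
TWO_GIB = 2 << 30


def size_t_sub(a, b):
    """Emulate `a - b` on unsigned operands: the result wraps modulo 2**64."""
    return (a - b) & SIZE_T_MAX


def checked_sub(a, b):
    """Return a - b, or None when the unsigned subtraction would underflow."""
    return a - b if a >= b else None


if __name__ == "__main__":
    payload_len, header_len = 10, 16  # malformed input: header longer than payload
    wrapped = size_t_sub(payload_len, header_len)
    print(hex(wrapped), wrapped > TWO_GIB)
    print(checked_sub(payload_len, header_len))
```

A size computed this way is far above any sane request, so when the size seen in the crash is close to the maximum `size_t` value, an unchecked subtraction in the caller is a likely suspect.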

### Virtual address space

On 32 bit systems, this is most likely to occur when overall memory usage is
high, or when the allocation size request is large. Is the calling code
allocating a very large buffer?

## Debugging

### General

On Windows, the allocation size is added into the exception record. In Google
Chrome's crash dashboard, this is shown in "Parameter[0]" of the exception
info. On other operating systems, the allocation size is put on the stack before
crashing, and is thus visible in minidumps.

### PartitionAlloc and Google specific

1. Starting from a specific report, click on the bug icon to start a cloud lldb
   instance.
2. Locate the `PartitionRoot<true>::OutOfMemory()` frame on the stack and select
   it, e.g. with `f 5` when it is frame #5.
3. Locate the stack addresses by printing the registers with `re re` (short for
   `register read`).
4. Show the stack content with `x <stack_pointer> <frame_pointer>` (`x` is an
   alias for `memory read`).

Below is an example for a crash on x86_64:

```
( lizeb ) bt
* thread #1, stop reason = EXC_BREAKPOINT (code=EXC_I386_BPT, subcode=0x10c45912f)
  * frame #0: 0x000000010c45912f Google Chrome Framework`base::internal::OnNoMemoryInternal(unsigned long) at memory.cc:62
    frame #1: 0x000000010c459149 Google Chrome Framework`base::TerminateBecauseOutOfMemory(unsigned long) at memory.cc:69
    frame #2: 0x000000010c4f39c6 Google Chrome Framework`OnNoMemory(unsigned long) at oom.cc:17
    frame #3: 0x000000010d7e5794 Google Chrome Framework`WTF::PartitionsOutOfMemoryUsing2G(unsigned long) at partitions.cc:281
    frame #4: 0x000000010d7e4d2c Google Chrome Framework`WTF::Partitions::HandleOutOfMemory(unsigned long) at partitions.cc:415
    frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
[...]
( lizeb ) f 5
frame #5: 0x000000010c4f7474 Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) at partition_root.cc:521
( lizeb ) re re
General Purpose Registers:
       rbp = 0x00007ffee7012c50
       rsp = 0x00007ffee7012bf0
       rip = 0x000000010c4f7474  Google Chrome Framework`base::PartitionRoot<true>::OutOfMemory(unsigned long) + 196 at partition_root.cc:522
21 registers were unavailable.
( lizeb ) x 0x00007ffee7012bf0 0x00007ffee7012c50
0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00  va_size.........
0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00  alloc.  .--.....
0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00  commit. ........
0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00  size.   .. .....
0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00  ................
0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00  .. .....H"......
```

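The results here can help the PartitionAlloc team to identify issues, as
important metrics from PartitionAlloc are saved on the stack, as shown above.
For instance, the 8 bytes following `va_size` (`00 00 00 07 00 00 00 00`), read
as a little-endian value, give a virtual address space usage of 0x07000000.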
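
Decoding these values by hand is error-prone, so a small script can help. The sketch below parses the `x` output shown above and extracts the labeled values. It assumes each record is 16 bytes, made of an 8-byte NUL-terminated ASCII label followed by an 8-byte little-endian integer. That layout is inferred from this particular dump, not taken from the PartitionAlloc source, and the function names are hypothetical.

```python
DUMP = """\
0x7ffee7012bf0: 76 61 5f 73 69 7a 65 00 00 00 00 07 00 00 00 00  va_size.........
0x7ffee7012c00: 61 6c 6c 6f 63 00 20 20 00 2d 2d 01 00 00 00 00  alloc.  .--.....
0x7ffee7012c10: 63 6f 6d 6d 69 74 00 20 00 a0 9d 01 00 00 00 00  commit. ........
0x7ffee7012c20: 73 69 7a 65 00 20 20 20 00 00 20 00 00 00 00 00  size.   .. .....
0x7ffee7012c30: aa aa aa aa aa aa aa aa 00 18 b0 12 01 00 00 00  ................
0x7ffee7012c40: 00 00 20 00 00 00 00 00 48 22 b0 12 01 00 00 00  .. .....H"......
"""


def parse_hex_dump(text):
    """Concatenate the 16 data bytes of each dump line into one bytes object."""
    data = bytearray()
    for line in text.strip().splitlines():
        tokens = line.split()
        # tokens[0] is the address; the next 16 tokens are the hex bytes.
        data += bytes.fromhex("".join(tokens[1:17]))
    return bytes(data)


def decode_records(data, record_size=16, label_size=8):
    """Return {label: value} for records that start with an ASCII label."""
    values = {}
    for off in range(0, len(data) - record_size + 1, record_size):
        label = data[off:off + label_size].split(b"\x00", 1)[0]
        try:
            name = label.decode("ascii")
        except UnicodeDecodeError:
            continue  # not a label (e.g. 0xaa fill bytes)
        if not name.isidentifier():
            continue  # empty or non-textual, so not a labeled record
        raw_value = data[off + label_size:off + record_size]
        values[name] = int.from_bytes(raw_value, "little")
    return values


if __name__ == "__main__":
    for name, value in decode_records(parse_hex_dump(DUMP)).items():
        print(f"{name}: {value:#x}")
```

For the dump above, `size` decodes to 0x200000, i.e. 2MiB. Rows that do not start with a textual label are skipped rather than guessed at.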