# Native Symbols

This doc describes how SuperSize breaks down native binaries into symbols.

[TOC]

## Overview

Native symbols are those with a `section` of:

 * `.text` (executable code)
 * `.rodata` (read-only data)
 * `.data` (writable data)
 * `.data.rel.ro` (data that is read-only after ELF relocations are applied)
 * `.bss` (zero-initialized symbols). These consume no space in the
    binary, so they are generally ignored, although they are still collected.

There are 3 modes that SuperSize can use to break an ELF down into symbols:

 * `linker_map` - Uses linker map + build directory to create symbols.
 * `dwarf` - Uses debug information to create symbols.
 * `sections` - Creates one symbol for each ELF section.

## Mode: linker_map

This is the mode that produces the largest number of symbols, and thus is the
preferred mode. Information provided only by this mode:

 * Path information for symbols outside of .text
   * DWARF information is complete for `.text` symbols (possibly because stack
     symbolization is a primary use case), but incomplete or missing for
     symbols in other sections.
 * String literals (.rodata symbols that look like `"some string dat..."`).
   * Linker map files contain `** merge strings` entries, which tell us where
     string tables exist within `.rodata`.
 * `object_path`, which is useful for attributing STL usages to individual
   source files.
 * Path aliases - when an inline symbol is used by multiple source files, we
   attribute the symbol's cost equally among the files.
 * Linker-generated symbols (e.g. switch tables).
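
As a rough illustration of the path-alias attribution described above, the following sketch divides an inline symbol's size equally among the files that use it. The helper name `split_size_among_aliases` and the sample paths are hypothetical and are not taken from SuperSize.

```python
def split_size_among_aliases(size, source_paths):
    """Divide a symbol's size equally among the files that use it.

    Hypothetical helper for illustration only. Returns a dict mapping each
    source path to its (possibly fractional) share of the size.
    """
    share = size / len(source_paths)
    return {path: share for path in source_paths}


# A 12-byte inline symbol used by three source files.
attributed = split_size_among_aliases(12, ['a/foo.cc', 'b/bar.cc', 'c/baz.cc'])
print(attributed)
```

The shares can be fractional, which keeps the per-file totals summing to the symbol's real size.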

### Data Sources

 * `build.ninja` is parsed to get:
   * List of `.o` and `.a` files that were inputs to the linker.
   * Mapping of `.cc` -> `.o` files.
 * All `.o` (and `.a`) files are parsed:
   * with `nm` to get symbol list.
   * Non-ThinLTO: with `nm` to get list of string literals
   * ThinLTO: with `llvm-bcanalyzer` to get list of string literals
 * ELF file is parsed with `nm` to get the list of symbol names that were
   identical-code-folded to the same address.
 * Linker map (created via `-Wl,-Map=output.map`) parsed to get:
   * Full list of symbols that comprise the binary.
   * Location of string tables (`** merge strings` entries).
   * Non-ThinLTO: `object_path` (`.o` file) associated with each symbol
   * Note:
     * With ThinLTO, `object_path` points to a hashed filename within the thinlto
       cache (not useful).
     * When multiple symbols are folded together due to Identical Code Folding,
       the linker map file lists only one of them.
 * ELF file string tables are parsed by looking for `\0` bytes and creating
   string literal symbols for each string therein.
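
The last data source, splitting a string table on `\0` bytes, can be sketched as follows. This is a minimal illustration that assumes the raw bytes of one string table and its start address are already known (for example from a `** merge strings` entry). `split_string_table` is a hypothetical name, not a SuperSize function.

```python
def split_string_table(data, base_address):
    """Yield (address, size, text) for each NUL-terminated string in `data`.

    `data` holds the raw bytes of one string table and `base_address` is the
    address where that table starts. The reported size includes the NUL
    terminator, since it also takes up space in the binary.
    """
    offset = 0
    while offset < len(data):
        end = data.find(b'\0', offset)
        # An unterminated tail (no NUL found) runs to the end of the table.
        next_offset = len(data) if end == -1 else end + 1
        chunk = data[offset:next_offset].rstrip(b'\0')
        text = chunk.decode('utf-8', 'replace')
        yield base_address + offset, next_offset - offset, text
        offset = next_offset


literals = list(split_string_table(b'hello\0hi\0', 0x1000))
```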

### Algorithm

1. Create initial symbol list from linker map.
2. Assign object paths by seeing which `.o` files define each symbol (match up
   the names).
   * When multiple files define the same symbol, create symbol aliases.
3. Create string literal symbols from string tables, and assign them paths based
   on which `.o` files define the same string literal.
4. Assign `source_path` using the `.o` -> `.cc` mapping from `build.ninja`.
   * This means that `.h` files are never listed as sources. No information
     about inlined symbols is gathered (by design).
5. Create symbol aliases when `nm` reports multiple symbols mapping to the same
   address.
6. Normalize `source_path` by removing generated path prefix (and adding
   `FLAG_GENERATED`) when applicable.
7. Normalize symbol names.
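
Step 5 amounts to grouping symbol names by address. The sketch below assumes default `nm` output lines of the form `<hex address> <type> <name>`. The function name `aliases_by_address` and the sample lines are made up for illustration and do not reflect SuperSize's actual parsing code.

```python
import collections


def aliases_by_address(nm_lines):
    """Group symbol names by address, keeping only addresses with aliases.

    Each line is expected to look like "<hex address> <type> <name>". Lines
    without an address (e.g. undefined symbols) are skipped.
    """
    names_by_address = collections.defaultdict(list)
    for line in nm_lines:
        # Demangled names can contain spaces, so split at most twice.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        try:
            address = int(parts[0], 16)
        except ValueError:
            continue  # First column was not an address.
        names_by_address[address].append(parts[2].rstrip())
    return {a: n for a, n in names_by_address.items() if len(n) > 1}


sample = [
    '00001000 t Foo::Get() const',
    '00001000 t Bar::Get() const',
    '00002000 T main',
    '         U printf',
]
aliases = aliases_by_address(sample)
```

Here the two `Get()` methods share address `0x1000`, as would happen after identical code folding, so they are reported as aliases of one another.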

## Mode: dwarf

Creates symbols using only an ELF with debug information enabled. Requires
compiler flag `-gmlt` to enable full source paths (rather than just basename).

### Algorithm

1. Create initial symbol list with `nm --print-size`.
2. Add name aliases using output from `nm` (this could have been done at the
   same time as the previous step, but is done as a separate step in order to
   share logic with `linker_map` mode).
3. Use `dwarfdump` to find all `DW_TAG_compile_unit` and `DW_AT_ranges` entries
   and create a map of address range -> source path.
4. Assign source paths to `.text` symbols based on symbol address.
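
Steps 3 and 4 boil down to an interval lookup. The following sketch assumes the address ranges have already been extracted as non-overlapping `(start, end, source_path)` tuples with an exclusive `end`. `build_range_lookup` is a hypothetical helper, and binary search is just one reasonable way to do the lookup.

```python
import bisect


def build_range_lookup(ranges):
    """Return a function mapping an address to its source path (or None).

    `ranges` is an iterable of (start, end, source_path) tuples, where `end`
    is exclusive and ranges do not overlap.
    """
    sorted_ranges = sorted(ranges)
    starts = [r[0] for r in sorted_ranges]

    def lookup(address):
        # Find the last range whose start is <= address.
        idx = bisect.bisect_right(starts, address) - 1
        if idx >= 0:
            _start, end, path = sorted_ranges[idx]
            if address < end:
                return path
        return None

    return lookup


lookup = build_range_lookup([
    (0x2000, 0x3000, 'b.cc'),
    (0x1000, 0x1800, 'a.cc'),
])
```

Addresses that fall in a gap between ranges return `None`, which mirrors the fact that not every symbol gets a source path in this mode.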

### Why not use Bloaty?

[Bloaty](https://github.com/google/bloaty) is an excellent tool, and produces
size information with similar fidelity to "dwarf" mode, as it uses the same
data source. We did not use Bloaty since "dwarfdump" was already readily
available and gave similar results. It would be nice to also have a "bloaty"
mode so that we could more directly compare outputs.

## Mode: sections

This mode uses `readelf -S` to create one symbol for each ELF section. It is
used for native files where no debug information or linker map file is
available, and for native files whose ABI does not match the `--abi-filter`.

## Data Normalization

Some manipulation happens in order to make names and paths more human-readable.

 * `(anonymous::)` is removed from names (and stored as a symbol flag).
 * `[clone]` suffix is removed (and stored as a symbol flag).
 * `vtable for Foo` -> `Foo [vtable]`
 * Mangling done by linkers is undone (e.g. prefixing with "unlikely.").
 * Names are processed into:
   * `name`: Name without template and argument parameters.
   * `template_name`: Name without argument parameters.
   * `full_name`: Name with all parameters.
 * LLVM function outlining creates many `OUTLINED_FUNCTION_*` symbols. These are
   renamed to `** outlined functions` or `** outlined functions * (count)`,
   and are de-duped so an address can have at most one such symbol.
   * Update: Outlining was ARM64-only, and has been disabled in our build due
     to performance regressions.
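
To illustrate two of the normalizations above, here is a simplified sketch of the vtable renaming and of deriving `name` and `template_name` from `full_name`. All function names are hypothetical. The parsing tracks only `<...>` nesting, so it ignores real-world complications such as `operator<`, `operator()`, and names that begin with a parenthesized component.

```python
def normalize_vtable(name):
    """Rewrite "vtable for Foo" as "Foo [vtable]"."""
    prefix = 'vtable for '
    if name.startswith(prefix):
        return name[len(prefix):] + ' [vtable]'
    return name


def _find_top_level_paren(text):
    """Return the index of the first '(' outside of <...>, or -1."""
    depth = 0
    for i, ch in enumerate(text):
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        elif ch == '(' and depth == 0:
            return i
    return -1


def _strip_template_args(text):
    """Drop every <...> group, including nested ones."""
    out = []
    depth = 0
    for ch in text:
        if ch == '<':
            depth += 1
        elif ch == '>':
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return ''.join(out)


def split_names(full_name):
    """Return (name, template_name, full_name) for a demangled symbol."""
    paren = _find_top_level_paren(full_name)
    template_name = full_name if paren == -1 else full_name[:paren]
    return _strip_template_args(template_name), template_name, full_name


names = split_names('base::Foo<int>::Bar(int, char)')
```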