This is a description of the profile.proto format.
# Overview
Profile.proto is a data representation for profile data. It is independent of
the type of data being collected and the sampling process used to collect that
data. On disk, it is represented as a gzip-compressed protocol buffer, described
at src/proto/profile.proto
A profile in this context refers to a collection of samples, each one
representing measurements performed at a certain point in the life of a job. A
sample associates a set of measurement values with a list of locations, commonly
representing the program call stack when the sample was taken.
Tools such as pprof analyze these samples and display this information in
multiple forms, such as identifying hottest locations, building graphical call
graphs or trees, etc.
# General structure of a profile
A profile is represented on a Profile message, which contain the following
fields:
* *sample*: A profile sample, with the values measured and the associated call
stack as a list of location ids. Samples with identical call stacks can be
merged by adding their respective values, element by element.
* *location*: A unique place in the program, commonly mapped to a single
instruction address. It has a unique nonzero id, to be referenced from the
samples. It contains source information in the form of lines, and a mapping id
that points to a binary.
* *function*: A program function as defined in the program source. It has a
unique nonzero id, referenced from the location lines. It contains a
human-readable name for the function (eg a C++ demangled name), a system name
(eg a C++ mangled name), the name of the corresponding source file, and other
function attributes.
* *mapping*: A binary that is part of the program during the profile
collection. It has a unique nonzero id, referenced from the locations. It
includes details on how the binary was mapped during program execution. By
convention the main program binary is the first mapping, followed by any
shared libraries.
* *string_table*: All strings in the profile are represented as indices into
this repeating field. The first string is empty, so index == 0 always
represents the empty string.
# Measurement values
Measurement values are represented as 64-bit integers. The profile contains an
explicit description of each value represented, using a ValueType message, with
two fields:
* *Type*: A human-readable description of the type semantics. For example “cpu”
to represent CPU time, “wall” or “time” for wallclock time, or “memory” for
bytes allocated.
* *Unit*: A human-readable name of the unit represented by the 64-bit integer
values. For example, it could be “nanoseconds” or “milliseconds” for a time
value, or “bytes” or “megabytes” for a memory size. If this is just
representing a number of events, the recommended unit name is “count”.
A profile can represent multiple measurements per sample, but all samples must
have the same number and type of measurements. The actual values are stored in
the Sample.value fields, each one described by the corresponding
Profile.sample_type field.
Some profiles have a uniform period that describe the granularity of the data
collection. For example, a CPU profile may have a period of 100ms, or a memory
allocation profile may have a period of 512kb. Profiles can optionally describe
such a value on the Profile.period and Profile.period_type fields. The profile
period is meant for human consumption and does not affect the interpretation of
the profiling data.
By convention, the first value on all profiles is the number of samples
collected at this call stack, with unit “count”. Because the profile does not
describe the sampling process beyond the optional period, it must include
unsampled values for all measurements. For example, a CPU profile could have
value[0] == samples, and value[1] == time in milliseconds.
## Locations, functions and mappings
Each sample lists the id of each location where the sample was collected, in
bottom-up order. Each location has an explicit unique nonzero integer id,
independent of its position in the profile, and holds additional information to
identify the corresponding source.
The profile source is expected to perform any adjustment required to the
locations in order to point to the calls in the stack. For example, if the
profile source extracts the call stack by walking back over the program stack,
it must adjust the instruction addresses to point to the actual call
instruction, instead of the instruction that each call will return to.
Sources usually generate profiles that fall into these two categories:
* *Unsymbolized profiles*: These only contain instruction addresses, and are to
be symbolized by a separate tool. It is critical for each location to point to
a valid mapping, which will provide the information required for
symbolization. These are used for profiles of compiled languages, such as C++
and Go.
* *Symbolized profiles*: These contain all the symbol information available for
the profile. Mappings and instruction addresses are optional for symbolized
locations. These are used for profiles of interpreted or jitted languages,
such as Java or Python. Also, the profile format allows the generation of
mixed profiles, with symbolized and unsymbolized locations.
The symbol information is represented in the repeating lines field of the
Location message. A location has multiple lines if it reflects multiple program
sources, for example if representing inlined call stacks. Lines reference
functions by their unique nonzero id, and the source line number within the
source file listed by the function. A function contains the source attributes
for a function, including its name, source file, etc. Functions include both a
user and a system form of the name, for example to include C++ demangled and
mangled names. For profiles where only a single name exists, both should be set
to the same string.
Mappings are also referenced from locations by their unique nonzero id, and
include all information needed to symbolize addresses within the mapping. It
includes similar information to the Linux /proc/self/maps file. Locations
associated to a mapping should have addresses that land between the mapping
start and limit. Also, if available, mappings should include a build id to
uniquely identify the version of the binary being used.
## Labels
Samples optionally contain labels, which are annotations to discriminate samples
with identical locations. For example, a label can be used on a malloc profile
to indicate allocation size, so two samples on the same call stack with sizes
2MB and 4MB do not get merged into a single sample with two allocations and a
size of 6MB.
Labels can be string-based or numeric. They are represented by the Label
message, with a key identifying the label and either a string or numeric
value. For numeric labels, the measurement unit can be specified in the profile.
If no unit is specified and the key is "request" or "alignment",
then the units are assumed to be "bytes". Otherwise when no unit is specified
the key will be used as the measurement unit of the numeric value. All tags with
the same key should have the same unit.
## Keep and drop expressions
Some profile sources may have knowledge of locations that are uninteresting or
irrelevant. However, if symbolization is needed in order to identify these
locations, the profile source may not be able to remove them when the profile is
generated. The profile format provides a mechanism to identify these frames by
name, through regular expressions.
These expressions must match the function name in its entirety. Frames that
match Profile.drop\_frames will be dropped from the profile, along with any
frames below it. Frames that match Profile.keep\_frames will be kept, even if
they match drop\_frames.