# F14 Hash Table
F14 is a 14-way probing hash table that resolves collisions by double
hashing. Up to 14 keys are stored in a chunk at a single hash table
position. Vector instructions (SSE2 on x86_64, NEON on aarch64)
are used to filter within a chunk; intra-chunk search takes only a
handful of instructions. **F14** refers to the fact that the algorithm
**F**ilters up to **14** keys at a time. This strategy allows the hash
table to be operated at a high maximum load factor (12/14) while still
keeping probe chains very short.
F14 provides compelling replacements for most of the hash tables we use in
production at Facebook. Switching to it can improve memory efficiency
and performance at the same time. The hash table implementations
widely deployed in C++ at Facebook exist along a spectrum of space/time
tradeoffs. The fastest is the least memory efficient, and the most
memory efficient (`google::sparse_hash_map`) is much slower than the rest.
F14 moves the curve, simultaneously improving memory efficiency and
performance when compared to most of the existing algorithms.
## F14 VARIANTS
The core hash table implementation has a pluggable storage strategy,
with three policies provided:
`F14NodeMap` stores values indirectly, calling malloc on each insert like
`std::unordered_map`. This implementation is the most memory efficient
for medium and large keys. It provides the same iterator and reference
stability guarantees as the standard map while being faster and more
memory efficient, so you can substitute `F14NodeMap` for `std::unordered_map`
safely in production code. F14's filtering substantially reduces
indirection (and cache misses) when compared to `std::unordered_map`.
`F14ValueMap` stores values inline, like `google::dense_hash_map`.
Inline storage is the most memory efficient for small values, but for
medium and large values it wastes space. Because it can tolerate a much
higher load factor, `F14ValueMap` is almost twice as memory efficient as
`dense_hash_map` while also faster for most workloads.
`F14VectorMap` keeps values packed in a contiguous array. The main hash
array stores 32-bit indexes into the value vector. Compared to the
existing internal implementations that use a similar strategy, F14 is
slower for simple keys and small or medium-sized tables (because of the
cost of bit mixing), faster for complex keys and large tables, and saves
about 16 bytes per entry on average.
We also provide:
`F14FastMap` inherits from either `F14ValueMap` or `F14VectorMap` depending
on entry size. When the key and `mapped_type` are less than 24 bytes, it
inherits from `F14ValueMap`. For medium and large entries, it inherits
from `F14VectorMap`. This strategy provides the best performance, while
also providing better memory efficiency than `dense_hash_map` or the other
hash tables in use at Facebook that don't individually allocate nodes.
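
All of the variants share the `std::unordered_map`-style interface, so
choosing between them is a one-line change. A minimal sketch (header path
per folly's current layout):

```cpp
#include <cstdint>
#include <string>
#include <folly/container/F14Map.h>

folly::F14NodeMap<std::string, std::string> node;  // per-entry allocation; reference stable
folly::F14ValueMap<std::int64_t, double> value;    // inline storage; best for small entries
folly::F14VectorMap<std::string, std::string> vec; // values packed in a contiguous array
folly::F14FastMap<int, int> fast;                  // picks Value or Vector by entry size
```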
## WHICH F14 VARIANT IS RIGHT FOR ME?
`F14FastMap` is a good default choice. If you care more about memory
efficiency than performance, `F14NodeMap` is better for medium and
large entries. `F14NodeMap` is the only F14 variant that doesn't move
its elements, so in the rare case that you need reference stability you
should use it.
## HETEROGENEOUS KEY TYPE WITH TRANSPARENT HASH AND EQUALITY
In some cases it makes sense to define hash and key equality across
types. For example, `StringPiece`'s hash and equality are capable of
accepting `std::string` (because `std::string` is implicitly convertible
to `StringPiece`). If you mark the hash functor and key equality functor
as _transparent_, then F14 will allow you to search the table directly
using any of the accepted key types without converting the key.
For example, using `H =
folly::transparent<folly::hasher<folly::StringPiece>>` and
`E = folly::transparent<std::equal_to<folly::StringPiece>>`, an
`F14FastSet<std::string, H, E>` will allow you to use a `StringPiece` key
without the need to construct a `std::string`.
Heterogeneous lookup and erase work for any key type that can be passed
to `operator()` on the hasher and `key_equal` functors. For operations
such as `operator[]` that might insert, there is an additional constraint:
the passed-in key must be explicitly convertible to the
table's `key_type`. F14 maps understand all possible forms that can be
used to construct the underlying `std::pair<key_type const, mapped_type>`,
so heterogeneous keys can be used even with `insert` and `emplace`.
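
Here is a minimal sketch of the `StringPiece` example above (header
locations may differ between folly versions):

```cpp
#include <functional>
#include <string>
#include <folly/Range.h>
#include <folly/container/F14Set.h>
#include <folly/hash/Hash.h>

using H = folly::transparent<folly::hasher<folly::StringPiece>>;
using E = folly::transparent<std::equal_to<folly::StringPiece>>;

bool contains(folly::F14FastSet<std::string, H, E> const& names,
              folly::StringPiece needle) {
  // No std::string is constructed for the lookup.
  return names.count(needle) != 0;
}
```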
## RANDOMIZED BEHAVIOR IN DEBUG BUILDS
F14 introduces randomness into its behavior in debug builds and when
the address sanitizer (ASAN) is in use. This randomness is designed to
expose bugs during test that might otherwise only occur in production.
Bugs are exposed probabilistically; they may appear only some of the time.
In debug builds `F14ValueMap` and `F14NodeMap` randomize the relationship
between insertion and iteration order. This means that adding the same
keys `k1` and `k2` to two empty maps (or the same map twice after clearing it)
can produce the iteration order `k1`,`k2` or `k2`,`k1`. Unit tests will
fail if they assume that the iteration order is the same between
identically constructed maps, even in the same process. This also
affects `folly::dynamic`'s object mode.
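
For example, this kind of assumption becomes flaky (sketch):

```cpp
#include <folly/container/F14Map.h>

void flakyAssumption() {
  folly::F14ValueMap<int, int> a{{1, 10}, {2, 20}};
  folly::F14ValueMap<int, int> b{{1, 10}, {2, 20}};
  // In debug builds each map independently picks an iteration order, so
  // a.begin()->first and b.begin()->first may differ from each other and
  // between runs.
}
```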
When the address sanitizer is enabled all of the F14 variants perform some
randomized extra rehashes on insert, which exposes iterator and reference
stability issues. If `reserve(size() + n)` (or a non-zero `initialCapacity`)
is called, then the following n insertions are exempt from the spurious
failures. Tracking is per-thread rather than per-table so this heuristic
could lead to false positives, although we haven't seen any yet.
(Please let us know if you encounter this problem.)
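
A sketch of how a test can declare a bounded insertion burst so that the
exemption applies:

```cpp
#include <folly/container/F14Map.h>

void bulkInsert(folly::F14FastMap<int, int>& m) {
  m.reserve(m.size() + 3); // the next 3 insertions are exempt from the
  m[10] = 1;               // randomized extra rehashes, so iterators and
  m[20] = 2;               // references obtained here stay valid across
  m[30] = 3;               // these inserts
}
```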
## WHY CHUNKS?
Assuming that you have a magic wand that lets you search all of the keys
in a chunk in a single step (our wand is called `_mm_cmpeq_epi8`), then
using chunks fundamentally improves the load factor/collision tradeoff.
The cost is proportional only to the number of chunks visited to find
the key.
It's kind of like the birthday paradox in reverse. In a room with 23
people there is a 50/50 chance that two of them have the same birthday
(overflowing a chunk with capacity 1), but the chance that 8 of them
were born in the same week (overflowing a chunk with capacity 7) is
very small. Even though the chance of any two people being born in
the same week is higher (1/52 instead of 1/365), the larger number of
coincidences required means that the final probability is much lower
(less than 1 in a million). It would require 160 people to reach a 50/50
chance that 8 of them were born in the same week.
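
The pairwise case is easy to check numerically (a sketch; the 8-in-a-week
probability needs more combinatorial machinery than this):

```cpp
// Probability that at least two of `people` share one of `days` equally
// likely birthdays: 1 - prod_{i=0}^{people-1} (days - i) / days.
double pAnyPairShares(int people, int days) {
  double pAllDistinct = 1.0;
  for (int i = 0; i < people; ++i) {
    pAllDistinct *= double(days - i) / double(days);
  }
  return 1.0 - pAllDistinct; // pAnyPairShares(23, 365) ≈ 0.507
}
```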
## WHY PROBING?
Chaining to a new chunk on collision is not very memory efficient,
because the new chunk is almost certain to be under-filled. We tried
chaining to individual entries, but that bloated the lookup code and
couldn't match the performance of a probing strategy.
At our max load factor of 12/14, the expected probe length when searching
for an existing key (find hit) is 1.04, and fewer than 1% of keys are
not found in one of the first 3 chunks. When searching for a key that is
not in the map (find miss) the expected probe length at max load factor
is 1.275 and the P99 probe length is 4.
## CHUNK OVERFLOW COUNTS: REFERENCE-COUNTED TOMBSTONES
Hash tables with a complex probing strategy (quadratic or double-hashing)
typically use a tombstone on erase, because it is very difficult to
find the keys that might have been displaced by a full bucket (i.e.,
chunk in F14). If the probing strategy allows only a small number of
potential destinations for a displaced key (linear probing, Robin Hood
hashing, or Cuckoo hashing), it is also an option to find a displaced key,
relocate it, and then recursively repair the new hole.
Tombstones must be eventually reclaimed to deal with workloads that
continuously insert and erase. `google::dense_hash_map` eventually triggers
a rehash in this case, for example. Unfortunately, to avoid quadratic
behavior this rehash may have to halve the max load factor of the table,
resulting in a huge decrease in memory efficiency.
Although most probing algorithms just keep probing until they find an
empty slot, probe lengths can be substantially reduced if you track
whether a bucket has actually rejected a key. This "overflow bit"
is set when an attempt is made to place a key into the bucket but the
bucket was full. (An especially unlucky key might have to try several
buckets, setting the overflow bit in each.) Amble and Knuth describe an
overflow bit in the "Further development" section of "Ordered hash tables"
(https://academic.oup.com/comjnl/article/17/2/135/525363).
The overflow bit subsumes the role of a tombstone, since a tombstone's
only effect is to cause a probe search to continue. Unlike a tombstone,
however, the overflow bit is a property of the keys that were displaced
rather than the key that was erased. It's only a small step to turn
this into a counter that records the number of displaced keys, and that
can be decremented on erase. Overflow counts give us both an earlier
exit from probing and the effect of a reference-counted tombstone.
They automatically clean themselves up in a steady-state insert and
erase workload, giving us the upsides of double-hashing without the
normal downsides of tombstones.
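
A sketch of the idea; the field names and layout here are illustrative
rather than folly's actual metadata encoding:

```cpp
#include <cstdint>

struct Chunk {
  std::uint8_t tags[14];      // one tag byte per slot; 0 means empty
  std::uint8_t control;       // other per-chunk metadata
  std::uint8_t overflowCount; // keys displaced past this full chunk
};

// Insert: every full chunk probed before the key's final home gets
// ++overflowCount.  Erase: walk the erased key's probe sequence again
// and --overflowCount on each chunk it had overflowed.  Find miss: stop
// probing early at any chunk whose overflowCount is 0.
```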
## HOW DOES VECTOR FILTERING WORK?
F14 computes a secondary hash value for each key, which we call the key's
tag. Tags are 1 byte: 7 bits of entropy with the top bit set. The 14
tags are joined with 2 additional bytes of metadata to form a 16-byte
aligned `__m128i` at the beginning of the chunk. When we're looking for a
key we can compare the needle's tag to all 14 tags in a chunk in parallel.
The result of the comparison is a bitmask that identifies only slots in
a chunk that might have a non-empty matching key. Failing searches are
unlikely to perform any key comparisons, successful searches are likely
to perform exactly 1 comparison, and all of the resulting branches are
pretty predictable.
The vector search is coded using SIMD intrinsics, SSE2 on x86_64 and
NEON on aarch64. These instructions are a non-optional part of those
platforms (unlike later SIMD instruction sets like AVX2 or SVE), so no
special compilation flags are required. The exact vector operations
performed differ between x86_64 and aarch64 because aarch64 lacks a
movemask instruction, but the F14 algorithm is the same.
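
A minimal sketch of the SSE2 filtering step, simplified from what folly
actually does and assuming the 14 tags occupy the first 14 bytes of the
metadata vector:

```cpp
#include <cstdint>
#include <emmintrin.h> // SSE2

// Bit i of the result is set when slot i's tag equals needleTag.  Tags
// always have their top bit set, so empty (zero) slots never match.
unsigned tagMatches(__m128i const* chunkMeta, std::uint8_t needleTag) {
  __m128i tags = _mm_load_si128(chunkMeta); // 16-byte aligned load
  __m128i needle = _mm_set1_epi8(static_cast<char>(needleTag));
  __m128i eq = _mm_cmpeq_epi8(tags, needle); // 16 byte-wise compares at once
  auto mask = static_cast<unsigned>(_mm_movemask_epi8(eq));
  return mask & ((1u << 14) - 1); // keep only the 14 tag lanes
}
```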
## WHAT ABOUT MEMORY OVERHEAD FOR SMALL TABLES?
The F14 algorithm works well for large tables, because the tags can
fit in cache even when the keys and values can't. Tiny hash tables are
by far the most numerous, however, so it's important that we minimize
the footprint when the table is empty or has only 1 or 2 elements.
Conveniently, tags cause keys to be densely packed into the bottom of
a chunk and filter all memory accesses to the portions of a chunk that
are not used. That means that we can also support capacities that are
a fraction of 1 chunk with no change to any of the search and insertion
algorithms. The only change required is in the check to see if a rehash
is required. F14's first three capacities all use one chunk and one
16-byte metadata vector, but allocate space for 2, 6, and then 14 keys.
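
For example (a sketch):

```cpp
#include <folly/container/F14Set.h>

void smallGrowth() {
  folly::F14ValueSet<int> s;
  for (int i = 0; i < 14; ++i) {
    s.insert(i); // stays in one chunk, growing through capacities 2, 6, 14
  }
}
```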
## MEMORY OVERHEAD WITH EXPLICIT CAPACITY REQUEST
When using the vector storage strategy F14 has the option of sizing the
main hash array and the value_type storage array independently. In the
case that an explicit capacity has been requested (`initialCapacity`
passed to a constructor or a call to `reserve` or `rehash`) we take
advantage of this, trying to exactly fit the capacity to the requested
quantity. We also try to exactly fit the capacity for the other storage
strategies if there is only a single chunk. To avoid pathologies from
code that calls `reserve` a lot, an increase of less than 1/8 in capacity
is rounded up to a doubling, and decreases are ignored, with the exception
of `reserve(n)` for `n <= size()`, which behaves like `shrink_to_fit`.
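
From the caller's perspective the policy looks like this (sketch):

```cpp
#include <folly/container/F14Map.h>

void reservePolicy() {
  folly::F14VectorMap<int, int> m;
  m.reserve(1000); // value storage sized close to the requested capacity
  for (int i = 0; i < 100; ++i) {
    m[i] = i;
  }
  m.reserve(500);      // a decrease with 500 > size(): ignored
  m.reserve(m.size()); // n <= size(): behaves like shrink_to_fit
}
```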
## IS F14NODEMAP FULLY STANDARDS-COMPLIANT?
No. F14 does provide full support for stateful allocators, fancy
pointers, and as many parts of the C++ standard for unordered associative
containers as it can, but it is not fully standards-compliant.
We don't know of a way to efficiently implement the full bucket API
in a table that uses double-hashed probing, in particular
`size_type bucket(key_type const&)`. This function must compute the bucket index
for any key, even before it is inserted into the table. That means
that a local_iterator range can't partition the key space by the chunk
that terminated probing during insert; the only partition choice with
reasonable locality would be the first-choice chunk. The probe sequence
for a key in double-hashing depends on the key, not the first-choice
chunk, however, so it is infeasible to search for all of the displaced
keys given only their first-choice location. We're unwilling to use an
inferior probing strategy or dedicate space to the required metadata just
to support the full bucket API. Implementing the rest of the bucket API,
such as `local_iterator begin(size_type)`, would not be difficult.
F14 does not allow max_load_factor to be adjusted. Probing tables
can't support load factors greater than 1, so the standards-required
ability to temporarily disable rehashing by temporarily setting a very
high max load factor just isn't possible. We have also measured that
there is no performance advantage to forcing a low load factor, so it's
better just to omit the field and save space in every F14 instance.
This is part of the way we get empty maps down to 32 bytes. The
`void max_load_factor(float)` method is still present, but does nothing. We use
the default max_load_factor of 1.0f all of the time, adjusting the value
returned from `size_type bucket_count()` so that the externally-visible
load factor reaches 1 just as the actual internal load factor reaches
our threshold of 12/14.
The standard requires that a hash table be iterable in O(size()) time
regardless of its load factor (rather than O(bucket_count())). That means
if you insert 1 million keys then erase all but 10, iteration should
be O(10). For `std::unordered_map` the cost of supporting this scenario
is an extra level of indirection in every read and every write, which is
part of why we can improve substantially on its performance. Low load
factor iteration occurs in practice when erasing keys during iteration
(for example by repeatedly calling map.erase(map.begin())), so we provide
the weaker guarantee that iteration is O(size()) after erasing any prefix
of the iteration order. `F14VectorMap` doesn't have this problem.
The standard requires that the order of the elements that are not erased be
preserved (since C++14). This is a stricter requirement than necessary to make
erasing individual elements while iterating through the container possible,
since all of the patterns we have seen for doing this don't care about order for
elements before the erased one. All F14 maps and sets support erase during
iteration, but F14Fast and F14Vector don't guarantee to preserve the relative
order of elements earlier in the iteration order than the erased element.
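
The supported erase-during-iteration pattern is the usual one (a sketch;
`shouldErase` stands in for any predicate):

```cpp
#include <folly/container/F14Map.h>

void eraseMatching(folly::F14FastMap<int, int>& m, bool (*shouldErase)(int)) {
  for (auto it = m.begin(); it != m.end();) {
    if (shouldErase(it->second)) {
      it = m.erase(it); // erase returns an iterator to the next element
    } else {
      ++it;
    }
  }
}
```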
The standard requires that clear() be O(size()), which has the practical
effect of prohibiting a change to bucket_count. F14 deallocates
all memory during clear() if it has space for more than 100 keys, to
avoid leaving a large table that will be expensive to iterate (see the
previous paragraph). `google::dense_hash_map` works around this tradeoff
by providing both clear() and clear_no_resize(); we could do something
similar.
As stated above, `F14NodeMap` and `F14NodeSet` are the only F14 variants
that provide reference stability. When running under ASAN the other
storage policies will probabilistically perform extra rehashes, which
makes it likely that reference stability problems will be found by the
address sanitizer.
An additional subtlety for hash tables that don't provide reference
stability is whether they rehash before evaluating the arguments passed
to `insert()`. F14 tables may rehash before evaluating the arguments
to a method that causes an insertion, so it's not safe to write
something like `map.emplace(k2, map[k1])` with `F14FastMap`, `F14ValueMap`,
or `F14VectorMap`. This behavior matches `google::dense_hash_map` and the
excellent `absl::flat_hash_map`.
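
A sketch of the hazard and a safe rewrite:

```cpp
#include <string>
#include <folly/container/F14Map.h>

void rehashHazard(folly::F14FastMap<std::string, int>& m) {
  // Unsafe: the table may rehash to make room for "b" before the new entry
  // is constructed from the reference produced by m["a"], which would leave
  // that reference dangling:
  //   m.emplace("b", m["a"]);

  // Safe: copy the mapped value out of the table first.
  int v = m["a"];
  m.emplace("b", v);
}
```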
`F14NodeMap` does not currently support the C++17 node API, but it could
be trivially added.