/*
 * Copyright (c) Meta Platforms, Inc. and affiliates.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#pragma once

#include <atomic>
#include <chrono>
#include <cstdint>

#include <folly/Optional.h>
#include <folly/functional/Invoke.h>

namespace folly {
namespace detail {
namespace distributed_mutex {

/**
 * DistributedMutex is a small, exclusive-only mutex that distributes the
 * bookkeeping required for mutual exclusion in the stacks of threads that are
 * contending for it.  It has a mode that can combine critical sections when
 * the mutex experiences contention; this allows the implementation to elide
 * several expensive coherence and synchronization operations to boost
 * throughput, surpassing even atomic instructions in some cases.  It has a
 * smaller memory footprint than std::mutex, a similar level of fairness
 * (better in some cases) and no dependencies on heap allocation.  It is the
 * same width as a single pointer (8 bytes on most platforms), whereas
 * std::mutex and pthread_mutex_t are both 40 bytes.  It is larger than some
 * of the other small locks, but the wide majority of cases using the small
 * locks waste the difference in alignment padding anyway
 *
 * Benchmark results are good - at the time of writing, in the contended case,
 * for lock/unlock based critical sections, it is about 4-5x faster than the
 * smaller locks and about 2x faster than std::mutex.  When used in
 * combinable mode, it is much faster than the alternatives, going more than
 * 10x faster than the small locks, about 6x faster than std::mutex, 2-3x
 * faster than flat combining and even faster than std::atomic<> in some
 * cases, allowing more work with higher throughput.  In the uncontended case,
 * it is a few cycles faster than folly::MicroLock but a bit slower than
 * std::mutex.  DistributedMutex is also resistant to tail latency
 * pathologies, unlike many of the other mutexes in use, which sleep for
 * large time quanta to reduce spin churn; this causes elevated latencies for
 * threads that enter the sleep cycle.  The tail latency of lock acquisition
 * can be up to 10x lower because of a more deterministic scheduling algorithm
 * that is
 * managed almost entirely in userspace.  Detailed results comparing the
 * throughput and latencies of different mutex implementations and atomics are
 * at the bottom of folly/synchronization/test/SmallLocksBenchmark.cpp
 *
 * Theoretically, exclusive locks promote concurrency when critical sections
 * are small, as most of the work is done outside the lock.  And indeed,
 * performant concurrent applications take great pains to limit the amount of
 * work they do while holding a lock.  In practice, however, the
 * synchronization and scheduling overhead of a lock in the critical path is
 * often so high that, past a certain point, making critical sections smaller
 * does not actually increase the concurrency of the application and
 * throughput plateaus.  DistributedMutex moves this breaking point to the
 * level of hardware atomic instructions, so applications keep getting
 * concurrency even under very high contention.  It does this by reducing
 * cache misses and contention in userspace and in the kernel by making each
 * thread wait on a thread-local node and futex.  When combined critical
 * sections are used, DistributedMutex leverages template metaprogramming to
 * allow the mutex to make better synchronization decisions based on the
 * layout of the input and output data.  This allows threads to keep working
 * only on their own cache lines without requiring cache coherence operations
 * when a mutex experiences heavy contention
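 *
 * In spirit (a hypothetical sketch with made-up types, not the actual
 * implementation), each waiter spins on a flag in its own stack-resident
 * node rather than on the shared mutex word, so contended waiting stays on
 * thread-private cache lines:
 *
 *    struct WaitNode {
 *      std::atomic<bool> signaled{false};
 *      WaitNode* next{nullptr};
 *    };
 *
 *    WaitNode node;  // lives on the waiting thread's stack
 *    while (!node.signaled.load(std::memory_order_acquire)) {
 *      // spin for a short quantum, then fall back to two-phase sleeping
 *    }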
 *
 * Non-timed mutex acquisitions are scheduled through intrusive LIFO
 * contention chains.  Each thread starts by spinning for a short quantum and
 * falls back to two-phase sleeping.  Enqueue operations are lock free and
 * are piggybacked off mutex acquisition attempts.  The LIFO behavior of a
 * contention chain is good in the case where the mutex is held for a short
 * amount of time, as the head of the chain is likely to not have slept on
 * futex() after exhausting its spin quantum.  This allows us to avoid
 * unnecessary traversal and syscalls in the fast path with a higher
 * probability.  Even though the contention chains are LIFO, the mutex itself
 * does not adhere to that scheduling policy globally.  During contention,
 * threads that fail to lock the mutex form a LIFO chain on the central mutex
 * state; this chain is broken when a wakeup is scheduled, and future enqueue
 * operations form a new chain.  This makes the chains themselves LIFO, but
 * preserves global fairness through a constant factor which is limited to the
 * number of concurrent failed mutex acquisition attempts.  This binds the
 * last in first out behavior to the number of contending threads and helps
 * prevent starvation and latency outliers
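 *
 * As an illustration (hypothetical names; the real enqueue packs extra state
 * bits into the word and piggybacks off the lock attempt itself), the chain
 * push resembles a lock-free Treiber-stack push on the central word:
 *
 *    std::atomic<WaitNode*> state;  // the central mutex word (sketch)
 *    WaitNode* expected = state.load(std::memory_order_relaxed);
 *    do {
 *      self->next = expected;
 *    } while (!state.compare_exchange_weak(expected, self));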
 *
 * This strategy of waking up waiters one by one in a queue does not scale
 * well when the number of threads goes past the number of cores, at which
 * point preemption causes elevated lock acquisition latencies.
 * DistributedMutex
 * implements a hardware timestamp publishing heuristic to detect and adapt to
 * preemption.
 *
 * DistributedMutex does not have the typical mutex API - it does not satisfy
 * the Lockable concept.  It requires the user to maintain ephemeral bookkeeping
 * and pass that bookkeeping around to unlock() calls.  The API overhead,
 * however, comes for free when you wrap this mutex for usage with
 * std::unique_lock, which is the recommended usage (std::lock_guard, in
 * optimized mode, has no performance benefit over std::unique_lock, so has been
 * omitted).  A benefit of this API is that it disallows the incorrect usage
 * where a thread unlocks a mutex that it does not own, treating the mutex as
 * functionally identical to a binary semaphore, which, unlike a mutex, is a
 * suitable primitive for that usage
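 *
 * A minimal sketch of the intended usage, where unlock() consumes the
 * ephemeral state returned by lock():
 *
 *    folly::DistributedMutex mutex;
 *    auto proxy = mutex.lock();
 *    // ... critical section ...
 *    mutex.unlock(std::move(proxy));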
 *
 * Combined critical sections allow the implementation to elide several
 * expensive operations during the lifetime of a critical section that cause
 * slowdowns with regular lock/unlock based usage.  DistributedMutex resolves
 * contention through combining up to a constant factor of 2 contention chains
 * to prevent issues with fairness and latency outliers, so we retain the
 * fairness benefits of the lock/unlock implementation with no noticeable
 * regression when switching between the lock methods.  Despite the efficiency
 * benefits, combined critical sections can only be used when the critical
 * section does not depend on thread local state and does not introduce new
 * dependencies between threads when the critical section gets combined.  For
 * example, locking or unlocking an unrelated mutex in a combined critical
 * section might lead to unexpected results or even undefined behavior.  This
 * can happen if, for example, a different thread unlocks a mutex locked by
 * the calling thread, leading to undefined behavior as the mutex might not
 * allow locking and unlocking from unrelated threads (the POSIX and C++
 * standards disallow this usage for their mutexes)
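 *
 * For example, a combined critical section can be passed to lock_combine()
 * as a callable; the callable may run on a different thread that already
 * holds the lock, and its result is returned to the caller (the lambda below
 * is only a sketch):
 *
 *    folly::DistributedMutex mutex;
 *    std::uint64_t counter = 0;
 *    auto value = mutex.lock_combine([&]() {
 *      counter += 1;  // must not depend on thread-local state
 *      return counter;
 *    });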
 *
 * Timed locking through DistributedMutex is implemented through a centralized
 * algorithm.  The underlying contention-chains framework used in
 * DistributedMutex is not abortable, so we build abortability on the side.
 * All waiters wait on the central mutex state by setting and resetting bits
 * within the pointer-length word.  Since pointer-length atomic integers are
 * incompatible with futex(FUTEX_WAIT) on most systems, a non-standard
 * implementation of futex() is used, where wait queues are managed in
 * user-space (see p1135r0 and folly::ParkingLot for more)
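 *
 * A sketch of timed acquisition: on success, try_lock_for() returns state
 * that tests as true and that unlock() then consumes
 *
 *    auto proxy = mutex.try_lock_for(std::chrono::milliseconds{10});
 *    if (proxy) {
 *      // ... critical section ...
 *      mutex.unlock(std::move(proxy));
 *    }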
 */
template <
    template <typename> class Atomic = std::atomic,
    bool TimePublishing = true>
class DistributedMutex {};

} // namespace distributed_mutex
} // namespace detail

/**
 * Bring the default instantiation of DistributedMutex into the folly
 * namespace without requiring any template arguments for public usage
 */
extern template class detail::distributed_mutex::DistributedMutex<>;
using DistributedMutex = detail::distributed_mutex::DistributedMutex<>;

} // namespace folly

#include <folly/synchronization/DistributedMutex-inl.h>