chromium/third_party/mediapipe/src/mediapipe/util/sequence/media_sequence.h

// Copyright 2019 The MediaPipe Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// This header defines a large number of getters and setters for storing
// multimedia, such as video or audio, and related machine learning data in
// tensorflow::SequenceExamples. These getters and setters simplify sharing
// data by enforcing common patterns for storing data in SequenceExample
// key-value pairs.
//
// The constants, macros, and functions are organized into groups: clip
// metadata, clip label related, segment related, bounding-box related, image
// related, text related, and feature list related. The following examples
// will walk through common task structures, but the relevant data to store can
// vary by task.
//
// The clip metadata group is generally data about the media and stored in the
// SequenceExample.context. Specifying the metadata enables media pipelines,
// such as MediaPipe, to retrieve that data. Typically, SetClipDataPath,
// SetClipStartTimestamp, and SetClipEndTimestamp define which data to use
// without storing the data itself. Example:
//   tensorflow::SequenceExample sequence;
//   SetClipDataPath("/relative/path/to/data.mp4", &sequence);
//   SetClipStartTimestamp(0, &sequence);
//   SetClipEndTimestamp(10000000, &sequence);  // 10 seconds in microseconds.
//
// The clip label group adds labels that apply to the entire media clip. To
// annotate that a video clip has a particular label, set the clip metadata
// above and also call SetClipLabelIndex and SetClipLabelString. Most
// training pipelines will only use the label index or string, but we recommend
// storing both to improve readability while maintaining ease of use.
// Example:
//   SetClipLabelString({"run", "jump"}, &sequence);
//   SetClipLabelIndex({35, 47}, &sequence);
//
// The segment group is generally data about time spans within the media clip
// and stored in the SequenceExample.context. In this code, continuous lengths
// of media are called clips, and each clip may have subregions of interest that
// are called segments. To annotate that a video clip has labeled time spans,
// set the clip metadata above and use the functions SetSegmentStartTimestamp,
// SetSegmentEndTimestamp, SetSegmentLabelIndex, and SetSegmentLabelString. Most
// training pipelines will only use the label index or string, but we recommend
// storing both to improve readability while maintaining ease of use. By listing
// segments as times, the frame rate or other properties can change without
// affecting the labels.
// Example:
//   SetSegmentStartTimestamp({500000, 1000000}, &sequence);  // in microseconds
//   SetSegmentEndTimestamp({2000000, 6000000}, &sequence);
//   SetSegmentLabelIndex({35, 47}, &sequence);
//   SetSegmentLabelString({"run", "jump"}, &sequence);
//
// The bounding box group is useful for identifying spatio-temporal annotations
// for detection, tracking, or action recognition. The exact keys that are
// needed can vary by task, but to annotate a video clip for detection, set the
// clip metadata above and repeatedly call AddBBox, AddBBoxTimestamp,
// AddBBoxLabelIndex, and AddBBoxLabelString. Most training pipelines will only
// use the label index or string, but we recommend storing both to improve
// readability while maintaining ease of use. Because bounding boxes are
// assigned to timepoints in a video, changing the image frame rate can
// change the alignment. The ReconcileMetadata function can align bounding boxes
// to the nearest image.
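// Example (a sketch; constructing the Location via
// Location::CreateRelativeBBoxLocation is an assumption about location.h):
//   std::vector<::mediapipe::Location> boxes = {
//       ::mediapipe::Location::CreateRelativeBBoxLocation(
//           0.1f /*xmin*/, 0.2f /*ymin*/, 0.3f /*width*/, 0.4f /*height*/)};
//   AddBBox(boxes, &sequence);
//   AddBBoxTimestamp(2000000, &sequence);  // in microseconds.
//   AddBBoxLabelIndex({35}, &sequence);
//   AddBBoxLabelString({"run"}, &sequence);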
//
// The image group is useful for storing data as sequential 2D arrays, typically
// encoded as bytes. Images can be RGB images stored as JPEG, discrete masks
// stored as PNG, or some other format. Parameters that are static over time are
// set in the context using SetImageWidth, SetImageHeight, SetImageFormat, etc.
// The series of frames and timestamps are then added with AddImageEncoded and
// AddImageTimestamp. For discrete masks, the class or instance indices can be
// mapped to labels or classes using
// SetClassSegmentationClassLabel{Index,String} and
// SetInstanceSegmentationObjectClassIndex.
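// Example (a sketch; jpeg_bytes is a hypothetical std::string holding one
// encoded frame):
//   SetImageFormat("JPEG", &sequence);
//   SetImageHeight(480, &sequence);
//   SetImageWidth(640, &sequence);
//   AddImageEncoded(jpeg_bytes, &sequence);
//   AddImageTimestamp(1000000, &sequence);  // 1 second in microseconds.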
//
// The feature list group is useful for storing audio and extracted features,
// such as per-frame embeddings. SequenceExamples only store lists of floats per
// timestep, so the dimensions are stored in the context to enable reshaping.
// For example, calling SetFeatureDimensions once and repeatedly calling
// AddFeatureFloats and AddFeatureTimestamp adds per-frame embeddings. To
// support audio features, additional getters and setters are provided that
// understand MediaPipe types.
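// Example (a sketch; "VGGISH" is an arbitrary feature prefix and embedding is
// a hypothetical std::vector<float> with 128 values):
//   SetFeatureDimensions("VGGISH", {128}, &sequence);
//   AddFeatureFloats("VGGISH", embedding, &sequence);
//   AddFeatureTimestamp("VGGISH", 1000000, &sequence);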
//
// Macros for common patterns are created in media_sequence_util.h and are used
// here extensively. Because these macros are formulaic, we only include a
// usage example here in the code rather than repeating documentation for every
// instance. This header defines additional functions to simplify working with
// MediaPipe types.
//
// Each {TYPE}_CONTEXT_FEATURE takes a NAME and a KEY. It provides setters and
// getters for SequenceExamples and stores a single value under KEY in the
// context field. The provided functions are Has${NAME}, Get${NAME}, Set${NAME},
// and Clear${NAME}.
// Eg.
//   tensorflow::SequenceExample example;
//   SetDataPath("data_path", &example);
//   if (HasDataPath(example)) {
//      std::string data_path = GetDataPath(example);
//      ClearDataPath(&example);
//   }
//
// Each VECTOR_{TYPE}_CONTEXT_FEATURE takes a NAME and a KEY. It provides
// setters and getters for SequenceExamples and stores a sequence of values
// under KEY in the context field. The provided functions are Has${NAME},
// Get${NAME}, Set${NAME}, Clear${NAME}, Get${NAME}At, and Add${NAME}.
// Eg.
//   tensorflow::SequenceExample example;
//   SetClipLabelString({"run", "jump"}, &example);
//   if (HasClipLabelString(example)) {
//      std::vector<std::string> values = GetClipLabelString(example);
//      ClearClipLabelString(&example);
//   }
//
// Each {TYPE}_FEATURE_LIST takes a NAME and a KEY. It provides setters and
// getters for SequenceExamples and stores a single value in each feature field
// under KEY of the feature_lists field. The provided functions are Has${NAME},
// Get${NAME}, Clear${NAME}, Get${NAME}Size, Get${NAME}At, and Add${NAME}.
// Eg.
//   tensorflow::SequenceExample example;
//   AddImageTimestamp(1000000, &example);
//   AddImageTimestamp(2000000, &example);
//   if (HasImageTimestamp(example)) {
//     for (int i = 0; i < GetImageTimestampSize(example); ++i) {
//       int64 timestamp = GetImageTimestampAt(example, i);
//     }
//     ClearImageTimestamp(&example);
//   }
//
// Each VECTOR_{TYPE}_FEATURE_LIST takes a NAME and a KEY. It provides setters
// and getters for SequenceExamples and stores a sequence of values in each
// feature field under KEY of the feature_lists field. The provided functions
// are Has${NAME}, Get${NAME}, Clear${NAME}, Get${NAME}Size, Get${NAME}At, and
// Add${NAME}.
// Eg.
//   tensorflow::SequenceExample example;
//   AddBBoxLabelString({"run", "jump"}, &example);
//   AddBBoxLabelString({"run", "fall"}, &example);
//   if (HasBBoxLabelString(example)) {
//     for (int i = 0; i < GetBBoxLabelStringSize(example); ++i) {
//       std::vector<std::string> labels = GetBBoxLabelStringAt(example, i);
//     }
//     ClearBBoxLabelString(&example);
//   }
//
// As described in media_sequence_util.h, each of these functions can take an
// additional string prefix argument as their first argument. The prefix can
// be fixed with a new NAME by calling a FIXED_PREFIX_... macro. Prefixes are
// used to identify common storage patterns (e.g. storing an image along with
// the height and width) under different names (e.g. storing a left and right
// image in a stereo pair.) An example creating functions such as
// AddLeftImageEncoded that adds a string under the key "LEFT/image/encoded":
//  FIXED_PREFIX_BYTES_FEATURE_LIST(LeftImageEncoded, "image/encoded", "LEFT");

#ifndef MEDIAPIPE_TENSORFLOW_SEQUENCE_MEDIA_SEQUENCE_H_
#define MEDIAPIPE_TENSORFLOW_SEQUENCE_MEDIA_SEQUENCE_H_

#include <string>
#include <vector>

#include "absl/memory/memory.h"
#include "mediapipe/framework/formats/location.h"
#include "mediapipe/framework/formats/matrix.h"
#include "mediapipe/framework/port/proto_ns.h"
#include "mediapipe/framework/port/status.h"
#include "mediapipe/util/sequence/media_sequence_util.h"
#include "tensorflow/core/example/example.pb.h"
#include "tensorflow/core/example/feature.pb.h"

namespace mediapipe {
namespace mediasequence {

// ***********************    METADATA    *************************************
// Context Keys:
// A unique identifier for each example.
const char kExampleIdKey[] = "example/id";
// The name of the data set, including the version.
const char kExampleDatasetNameKey[] = "example/dataset_name";
// String flags or attributes for this example within a data set.
const char kExampleDatasetFlagStringKey[] = "example/dataset/flag/string";

// The relative path to the data on disk from some root directory.
const char kClipDataPathKey[] = "clip/data_path";
// Any identifier for the media beyond the data path.
const char kClipMediaId[] = "clip/media_id";
// Yet another alternative identifier.
const char kClipAlternativeMediaId[] = "clip/alternative_media_id";
// The encoded bytes for storing media directly in the SequenceExample.
const char kClipEncodedMediaBytesKey[] = "clip/encoded_media_bytes";
// The start time for the encoded media if not preserved during encoding.
const char kClipEncodedMediaStartTimestampKey[] =
    "clip/encoded_media_start_timestamp";
// The start time, in microseconds, for the start of the clip in the media.
const char kClipStartTimestampKey[] = "clip/start/timestamp";
// The end time, in microseconds, for the end of the clip in the media.
const char kClipEndTimestampKey[] = "clip/end/timestamp";
// A list of label indices for this clip.
const char kClipLabelIndexKey[] = "clip/label/index";
// A list of label strings for this clip.
const char kClipLabelStringKey[] = "clip/label/string";
// A list of label confidences for this clip.
const char kClipLabelConfidenceKey[] = "clip/label/confidence";
// A list of label start timestamps for this clip.
const char kClipLabelStartTimestampKey[] = "clip/label/start/timestamp";
// A list of label end timestamps for this clip.
const char kClipLabelEndTimestampKey[] = "clip/label/end/timestamp";

BYTES_CONTEXT_FEATURE(ExampleId, kExampleIdKey);
BYTES_CONTEXT_FEATURE(ExampleDatasetName, kExampleDatasetNameKey);
VECTOR_BYTES_CONTEXT_FEATURE(ExampleDatasetFlagString,
                             kExampleDatasetFlagStringKey);

BYTES_CONTEXT_FEATURE(ClipDataPath, kClipDataPathKey);
BYTES_CONTEXT_FEATURE(ClipAlternativeMediaId, kClipAlternativeMediaId);
BYTES_CONTEXT_FEATURE(ClipMediaId, kClipMediaId);
BYTES_CONTEXT_FEATURE(ClipEncodedMediaBytes, kClipEncodedMediaBytesKey);
INT64_CONTEXT_FEATURE(ClipEncodedMediaStartTimestamp,
                      kClipEncodedMediaStartTimestampKey);
INT64_CONTEXT_FEATURE(ClipStartTimestamp, kClipStartTimestampKey);
INT64_CONTEXT_FEATURE(ClipEndTimestamp, kClipEndTimestampKey);
VECTOR_BYTES_CONTEXT_FEATURE(ClipLabelString, kClipLabelStringKey);
VECTOR_INT64_CONTEXT_FEATURE(ClipLabelIndex, kClipLabelIndexKey);
VECTOR_FLOAT_CONTEXT_FEATURE(ClipLabelConfidence, kClipLabelConfidenceKey);
VECTOR_INT64_CONTEXT_FEATURE(ClipLabelStartTimestamp,
                             kClipLabelStartTimestampKey);
VECTOR_INT64_CONTEXT_FEATURE(ClipLabelEndTimestamp, kClipLabelEndTimestampKey);

// ***********************    SEGMENTS    *************************************
// Context Keys:
// A list of segment start times in microseconds.
const char kSegmentStartTimestampKey[] = "segment/start/timestamp";
// A list of indices marking the first frame index >= the start time.
const char kSegmentStartIndexKey[] = "segment/start/index";
// A list of segment end times in microseconds.
const char kSegmentEndTimestampKey[] = "segment/end/timestamp";
// A list of indices marking the last frame index <= the end time.
const char kSegmentEndIndexKey[] = "segment/end/index";
// A list with the label index for each segment.
// Multiple labels for the same segment are encoded as repeated segments.
const char kSegmentLabelIndexKey[] = "segment/label/index";
// A list with the label string for each segment.
// Multiple labels for the same segment are encoded as repeated segments.
const char kSegmentLabelStringKey[] = "segment/label/string";
// A list with the label confidence for each segment.
// Multiple labels for the same segment are encoded as repeated segments.
const char kSegmentLabelConfidenceKey[] = "segment/label/confidence";

VECTOR_BYTES_CONTEXT_FEATURE(SegmentLabelString, kSegmentLabelStringKey);
VECTOR_INT64_CONTEXT_FEATURE(SegmentStartTimestamp, kSegmentStartTimestampKey);
VECTOR_INT64_CONTEXT_FEATURE(SegmentEndTimestamp, kSegmentEndTimestampKey);
VECTOR_INT64_CONTEXT_FEATURE(SegmentStartIndex, kSegmentStartIndexKey);
VECTOR_INT64_CONTEXT_FEATURE(SegmentEndIndex, kSegmentEndIndexKey);
VECTOR_INT64_CONTEXT_FEATURE(SegmentLabelIndex, kSegmentLabelIndexKey);
VECTOR_FLOAT_CONTEXT_FEATURE(SegmentLabelConfidence,
                             kSegmentLabelConfidenceKey);

// *****************    REGIONS / BOUNDING BOXES    ***************************
// Context keys:
// The dimensions of each embedding per region / bounding box.
const char kRegionEmbeddingDimensionsPerRegionKey[] =
    "region/embedding/dimensions_per_region";
// The format used to encode embeddings stored as strings.
const char kRegionEmbeddingFormatKey[] = "region/embedding/format";
// The list of region parts expected in this example.
const char kRegionPartsKey[] = "region/parts";

// Feature list keys:
// The normalized coordinates of the bounding boxes are provided in four lists
// to avoid order ambiguity, but we provide additional accessors for complete
// bounding boxes below.
const char kRegionBBoxYMinKey[] = "region/bbox/ymin";
const char kRegionBBoxXMinKey[] = "region/bbox/xmin";
const char kRegionBBoxYMaxKey[] = "region/bbox/ymax";
const char kRegionBBoxXMaxKey[] = "region/bbox/xmax";
// The point and radius can denote keypoints.
const char kRegionPointXKey[] = "region/point/x";
const char kRegionPointYKey[] = "region/point/y";
const char kRegionRadiusKey[] = "region/radius";
// The 3d point can denote keypoints.
const char kRegion3dPointXKey[] = "region/3d_point/x";
const char kRegion3dPointYKey[] = "region/3d_point/y";
const char kRegion3dPointZKey[] = "region/3d_point/z";
// The number of regions at that timestep.
const char kRegionNumRegionsKey[] = "region/num_regions";
// Whether that timestep is annotated for bounding regions.
// (Distinguishes between multiple meanings of num_regions = 0.)
const char kRegionIsAnnotatedKey[] = "region/is_annotated";
// A list indicating if each region is generated (1) or manually annotated (0).
const char kRegionIsGeneratedKey[] = "region/is_generated";
// A list indicating if each region is occluded (1) or visible (0).
const char kRegionIsOccludedKey[] = "region/is_occluded";
// Lists with a label for each region.
// Multiple labels for the same region require duplicating the region.
const char kRegionLabelIndexKey[] = "region/label/index";
const char kRegionLabelStringKey[] = "region/label/string";
const char kRegionLabelConfidenceKey[] = "region/label/confidence";
// Lists with a track identifier for each region.
const char kRegionTrackIndexKey[] = "region/track/index";
const char kRegionTrackStringKey[] = "region/track/string";
const char kRegionTrackConfidenceKey[] = "region/track/confidence";
// A list with a class for each region. In general, prefer to use the label
// fields. These class fields exist to distinguish tracks when different classes
// have overlapping track ids.
const char kRegionClassIndexKey[] = "region/class/index";
const char kRegionClassStringKey[] = "region/class/string";
const char kRegionClassConfidenceKey[] = "region/class/confidence";
// The timestamp of the region annotations in microseconds.
const char kRegionTimestampKey[] = "region/timestamp";
// An embedding for each region. The length of each list must be the product of
// the number of regions and the product of the embedding dimensions.
const char kRegionEmbeddingFloatKey[] = "region/embedding/float";
// A string encoded embedding for each region.
const char kRegionEmbeddingEncodedKey[] = "region/embedding/encoded";
// The confidence of the embedding.
const char kRegionEmbeddingConfidenceKey[] = "region/embedding/confidence";
// The original timestamp in microseconds for region annotations.
// ReconcileMetadata can align region annotations to image frames, and this
// field preserves the original timestamps.
const char kUnmodifiedRegionTimestampKey[] = "region/unmodified_timestamp";

// Functions:
// These functions get and set bounding boxes as mediapipe::Location to avoid
// needing to get and set each box coordinate separately.
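// Example (a sketch; GetBBoxSize/GetBBoxAt are the unprefixed accessors
// declared via PREFIXED_BBOX(BBox, "") below):
//   for (int i = 0; i < GetBBoxSize(sequence); ++i) {
//     std::vector<::mediapipe::Location> boxes = GetBBoxAt(sequence, i);
//   }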
int GetBBoxSize(const std::string& prefix,
                const tensorflow::SequenceExample& sequence);
std::vector<::mediapipe::Location> GetBBoxAt(
    const std::string& prefix, const tensorflow::SequenceExample& sequence,
    int index);
void AddBBox(const std::string& prefix,
             const std::vector<::mediapipe::Location>& bboxes,
             tensorflow::SequenceExample* sequence);
void ClearBBox(const std::string& prefix,
               tensorflow::SequenceExample* sequence);

// The input and output format is a pair of <y, x> coordinates to match the
// order of bounding box coordinates.
int GetPointSize(const std::string& prefix,
                 const tensorflow::SequenceExample& sequence);
std::vector<std::pair<float, float>> GetPointAt(
    const std::string& prefix, const tensorflow::SequenceExample& sequence,
    int index);
void AddPoint(const std::string& prefix,
              const std::vector<std::pair<float, float>>& points,
              tensorflow::SequenceExample* sequence);
void ClearPoint(const std::string& prefix,
                tensorflow::SequenceExample* sequence);

// The input and output format is a tuple of <x, y, z> coordinates.
int Get3dPointSize(const std::string& prefix,
                   const tensorflow::SequenceExample& sequence);
std::vector<std::tuple<float, float, float>> Get3dPointAt(
    const std::string& prefix, const tensorflow::SequenceExample& sequence,
    int index);
void Add3dPoint(const std::string& prefix,
                const std::vector<std::tuple<float, float, float>>& points,
                tensorflow::SequenceExample* sequence);
void Clear3dPoint(const std::string& prefix,
                  tensorflow::SequenceExample* sequence);
#define FIXED_PREFIX_BBOX_ACCESSORS(identifier, prefix)                        \
  inline int CONCAT_STR3(Get, identifier,                                      \
                         Size)(const tensorflow::SequenceExample& sequence) {  \
    return GetBBoxSize(prefix, sequence);                                      \
  }                                                                            \
  inline std::vector<::mediapipe::Location> CONCAT_STR3(Get, identifier, At)(  \
      const tensorflow::SequenceExample& sequence, int index) {                \
    return GetBBoxAt(prefix, sequence, index);                                 \
  }                                                                            \
  inline void CONCAT_STR2(Add, identifier)(                                    \
      const std::vector<::mediapipe::Location>& bboxes,                        \
      tensorflow::SequenceExample* sequence) {                                 \
    return AddBBox(prefix, bboxes, sequence);                                  \
  }                                                                            \
  inline void CONCAT_STR2(                                                     \
      Clear, identifier)(tensorflow::SequenceExample * sequence) {             \
    return ClearBBox(prefix, sequence);                                        \
  }                                                                            \
  inline int CONCAT_STR3(Get, identifier, PointSize)(                          \
      const tensorflow::SequenceExample& sequence) {                           \
    return GetPointSize(prefix, sequence);                                     \
  }                                                                            \
  inline int CONCAT_STR3(Get, identifier, PointSize)(                          \
      const std::string& name, const tensorflow::SequenceExample& sequence) {  \
    return GetPointSize(name, sequence);                                       \
  }                                                                            \
  inline std::vector<std::pair<float, float>> CONCAT_STR3(                     \
      Get, identifier, PointAt)(const tensorflow::SequenceExample& sequence,   \
                                int index) {                                   \
    return GetPointAt(prefix, sequence, index);                                \
  }                                                                            \
  inline std::vector<std::pair<float, float>> CONCAT_STR3(                     \
      Get, identifier, PointAt)(const std::string& name,                       \
                                const tensorflow::SequenceExample& sequence,   \
                                int index) {                                   \
    return GetPointAt(name, sequence, index);                                  \
  }                                                                            \
  inline void CONCAT_STR3(Add, identifier, Point)(                             \
      const std::vector<std::pair<float, float>>& points,                      \
      tensorflow::SequenceExample* sequence) {                                 \
    return AddPoint(prefix, points, sequence);                                 \
  }                                                                            \
  inline void CONCAT_STR3(Add, identifier, Point)(                             \
      const std::string& name,                                                 \
      const std::vector<std::pair<float, float>>& points,                      \
      tensorflow::SequenceExample* sequence) {                                 \
    return AddPoint(name, points, sequence);                                   \
  }                                                                            \
  inline void CONCAT_STR3(Clear, identifier,                                   \
                          Point)(tensorflow::SequenceExample * sequence) {     \
    return ClearPoint(prefix, sequence);                                       \
  }                                                                            \
  inline void CONCAT_STR3(Clear, identifier, Point)(                           \
      std::string name, tensorflow::SequenceExample * sequence) {              \
    return ClearPoint(name, sequence);                                         \
  }                                                                            \
  inline int CONCAT_STR3(Get, identifier, 3dPointSize)(                        \
      const tensorflow::SequenceExample& sequence) {                           \
    return Get3dPointSize(prefix, sequence);                                   \
  }                                                                            \
  inline int CONCAT_STR3(Get, identifier, 3dPointSize)(                        \
      const std::string& name, const tensorflow::SequenceExample& sequence) {  \
    return Get3dPointSize(name, sequence);                                     \
  }                                                                            \
  inline std::vector<std::tuple<float, float, float>> CONCAT_STR3(             \
      Get, identifier, 3dPointAt)(const tensorflow::SequenceExample& sequence, \
                                  int index) {                                 \
    return Get3dPointAt(prefix, sequence, index);                              \
  }                                                                            \
  inline std::vector<std::tuple<float, float, float>> CONCAT_STR3(             \
      Get, identifier, 3dPointAt)(const std::string& name,                     \
                                  const tensorflow::SequenceExample& sequence, \
                                  int index) {                                 \
    return Get3dPointAt(name, sequence, index);                                \
  }                                                                            \
  inline void CONCAT_STR3(Add, identifier, 3dPoint)(                           \
      const std::vector<std::tuple<float, float, float>>& points,              \
      tensorflow::SequenceExample* sequence) {                                 \
    return Add3dPoint(prefix, points, sequence);                               \
  }                                                                            \
  inline void CONCAT_STR3(Add, identifier, 3dPoint)(                           \
      const std::string& name,                                                 \
      const std::vector<std::tuple<float, float, float>>& points,              \
      tensorflow::SequenceExample* sequence) {                                 \
    return Add3dPoint(name, points, sequence);                                 \
  }                                                                            \
  inline void CONCAT_STR3(Clear, identifier,                                   \
                          3dPoint)(tensorflow::SequenceExample * sequence) {   \
    return Clear3dPoint(prefix, sequence);                                     \
  }                                                                            \
  inline void CONCAT_STR3(Clear, identifier, 3dPoint)(                         \
      std::string name, tensorflow::SequenceExample * sequence) {              \
    return Clear3dPoint(name, sequence);                                       \
  }

#define PREFIXED_BBOX(identifier, prefix)                                      \
  FIXED_PREFIX_BBOX_ACCESSORS(identifier, prefix)                              \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(CONCAT_STR2(identifier, LabelString), \
                                         kRegionLabelStringKey, prefix)        \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(CONCAT_STR2(identifier, ClassString), \
                                         kRegionClassStringKey, prefix)        \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(CONCAT_STR2(identifier, TrackString), \
                                         kRegionTrackStringKey, prefix)        \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, LabelIndex),  \
                                         kRegionLabelIndexKey, prefix)         \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, ClassIndex),  \
                                         kRegionClassIndexKey, prefix)         \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, TrackIndex),  \
                                         kRegionTrackIndexKey, prefix)         \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, LabelConfidence), kRegionLabelConfidenceKey,     \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, ClassConfidence), kRegionClassConfidenceKey,     \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, TrackConfidence), kRegionTrackConfidenceKey,     \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, IsGenerated), \
                                         kRegionIsGeneratedKey, prefix)        \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, IsOccluded),  \
                                         kRegionIsOccludedKey, prefix)         \
  FIXED_PREFIX_INT64_FEATURE_LIST(CONCAT_STR2(identifier, NumRegions),         \
                                  kRegionNumRegionsKey, prefix)                \
  FIXED_PREFIX_INT64_FEATURE_LIST(CONCAT_STR2(identifier, IsAnnotated),        \
                                  kRegionIsAnnotatedKey, prefix)               \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, YMin),        \
                                         kRegionBBoxYMinKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, XMin),        \
                                         kRegionBBoxXMinKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, YMax),        \
                                         kRegionBBoxYMaxKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, XMax),        \
                                         kRegionBBoxXMaxKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, PointX),      \
                                         kRegionPointXKey, prefix)             \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, PointY),      \
                                         kRegionPointYKey, prefix)             \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, Radius),      \
                                         kRegionRadiusKey, prefix)             \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, 3dPointX),    \
                                         kRegion3dPointXKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, 3dPointY),    \
                                         kRegion3dPointYKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(CONCAT_STR2(identifier, 3dPointZ),    \
                                         kRegion3dPointZKey, prefix)           \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, EmbeddingFloats), kRegionEmbeddingFloatKey,      \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, EmbeddingEncoded), kRegionEmbeddingEncodedKey,   \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, EmbeddingConfidence),                            \
      kRegionEmbeddingConfidenceKey, prefix)                                   \
  FIXED_PREFIX_VECTOR_INT64_CONTEXT_FEATURE(                                   \
      CONCAT_STR2(identifier, EmbeddingDimensionsPerRegion),                   \
      kRegionEmbeddingDimensionsPerRegionKey, prefix)                          \
  FIXED_PREFIX_BYTES_CONTEXT_FEATURE(CONCAT_STR2(identifier, EmbeddingFormat), \
                                     kRegionEmbeddingFormatKey, prefix)        \
  FIXED_PREFIX_VECTOR_BYTES_CONTEXT_FEATURE(CONCAT_STR2(identifier, Parts),    \
                                            kRegionPartsKey, prefix)           \
  FIXED_PREFIX_INT64_FEATURE_LIST(CONCAT_STR2(identifier, Timestamp),          \
                                  kRegionTimestampKey, prefix)                 \
  FIXED_PREFIX_INT64_FEATURE_LIST(                                             \
      CONCAT_STR3(Unmodified, identifier, Timestamp),                          \
      kUnmodifiedRegionTimestampKey, prefix)

// Provides suites of functions for working with bounding boxes and predicted
// bounding boxes such as
// GetBBoxSize, GetBBoxAt, AddBBox, GetBBoxLabelIndexAt, etc., and
// GetPredictedBBoxSize, GetPredictedBBoxAt, AddPredictedBBox, etc.
const char kPredictedPrefix[] = "PREDICTED";
PREFIXED_BBOX(BBox, "");
PREFIXED_BBOX(PredictedBBox, kPredictedPrefix);

// ************************    IMAGES    **************************************
// Context keys:
// The format the images are encoded as (e.g. "JPEG", "PNG").
const char kImageFormatKey[] = "image/format";
// The number of channels in the image.
const char kImageChannelsKey[] = "image/channels";
// The colorspace of the image.
const char kImageColorspaceKey[] = "image/colorspace";
// The height of the image in pixels.
const char kImageHeightKey[] = "image/height";
// The width of the image in pixels.
const char kImageWidthKey[] = "image/width";
// The frame rate in images/second of media.
const char kImageFrameRateKey[] = "image/frame_rate";
// The maximum value if the images were saturated and normalized for encoding.
const char kImageSaturationKey[] = "image/saturation";
// The listing from discrete image values (as indices) to class indices.
const char kImageClassLabelIndexKey[] = "image/class/label/index";
// The listing from discrete image values (as indices) to class strings.
const char kImageClassLabelStringKey[] = "image/class/label/string";
// The listing from discrete instance indices to class indices they embody.
const char kImageObjectClassIndexKey[] = "image/object/class/index";
// The path of the image file if it did not come from a media clip.
const char kImageDataPathKey[] = "image/data_path";

// Feature list keys:
// The encoded image frame.
const char kImageEncodedKey[] = "image/encoded";
// Multiple images for the same timestep (e.g. multiview video).
const char kImageMultiEncodedKey[] = "image/multi_encoded";
// The timestamp of the frame in microseconds.
const char kImageTimestampKey[] = "image/timestamp";
// A per image label if specific frames have labels.
// If time spans have labels, segments are preferred to allow changing rates.
const char kImageLabelIndexKey[] = "image/label/index";
const char kImageLabelStringKey[] = "image/label/string";
const char kImageLabelConfidenceKey[] = "image/label/confidence";

#define PREFIXED_IMAGE(identifier, prefix)                                     \
  FIXED_PREFIX_INT64_CONTEXT_FEATURE(CONCAT_STR2(identifier, Height),          \
                                     kImageHeightKey, prefix)                  \
  FIXED_PREFIX_INT64_CONTEXT_FEATURE(CONCAT_STR2(identifier, Width),           \
                                     kImageWidthKey, prefix)                   \
  FIXED_PREFIX_INT64_CONTEXT_FEATURE(CONCAT_STR2(identifier, Channels),        \
                                     kImageChannelsKey, prefix)                \
  FIXED_PREFIX_BYTES_CONTEXT_FEATURE(CONCAT_STR2(identifier, Format),          \
                                     kImageFormatKey, prefix)                  \
  FIXED_PREFIX_BYTES_CONTEXT_FEATURE(CONCAT_STR2(identifier, Colorspace),      \
                                     kImageColorspaceKey, prefix)              \
  FIXED_PREFIX_FLOAT_CONTEXT_FEATURE(CONCAT_STR2(identifier, FrameRate),       \
                                     kImageFrameRateKey, prefix)               \
  FIXED_PREFIX_FLOAT_CONTEXT_FEATURE(CONCAT_STR2(identifier, Saturation),      \
                                     kImageSaturationKey, prefix)              \
  FIXED_PREFIX_BYTES_CONTEXT_FEATURE(CONCAT_STR2(identifier, DataPath),        \
                                     kImageDataPathKey, prefix)                \
  FIXED_PREFIX_VECTOR_INT64_CONTEXT_FEATURE(                                   \
      CONCAT_STR2(identifier, ClassLabelIndex), kImageClassLabelIndexKey,      \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_BYTES_CONTEXT_FEATURE(                                   \
      CONCAT_STR2(identifier, ClassLabelString), kImageClassLabelStringKey,    \
      prefix)                                                                  \
  FIXED_PREFIX_VECTOR_INT64_CONTEXT_FEATURE(                                   \
      CONCAT_STR2(identifier, ObjectClassIndex), kImageObjectClassIndexKey,    \
      prefix)                                                                  \
  FIXED_PREFIX_BYTES_FEATURE_LIST(CONCAT_STR2(identifier, Encoded),            \
                                  kImageEncodedKey, prefix)                    \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, MultiEncoded), kImageMultiEncodedKey, prefix)    \
  FIXED_PREFIX_INT64_FEATURE_LIST(CONCAT_STR2(identifier, Timestamp),          \
                                  kImageTimestampKey, prefix)                  \
  FIXED_PREFIX_VECTOR_INT64_FEATURE_LIST(CONCAT_STR2(identifier, LabelIndex),  \
                                         kImageLabelIndexKey, prefix)          \
  FIXED_PREFIX_VECTOR_BYTES_FEATURE_LIST(CONCAT_STR2(identifier, LabelString), \
                                         kImageLabelStringKey, prefix)         \
  FIXED_PREFIX_VECTOR_FLOAT_FEATURE_LIST(                                      \
      CONCAT_STR2(identifier, LabelConfidence), kImageLabelConfidenceKey,      \
      prefix)

// Provides suites of functions for working with images and data encoded in
// images such as
// AddImageEncoded, GetImageEncodedAt, AddImageTimestamp, GetImageHeight, etc.,
// AddForwardFlowEncoded, GetForwardFlowEncodedAt, AddForwardFlowTimestamp, etc.
// AddClassSegmentationEncoded, GetClassSegmentationEncodedAt, etc., and
// AddInstanceSegmentationEncoded, GetInstanceSegmentationEncodedAt, etc.
const char kForwardFlowPrefix[] = "FORWARD_FLOW";
const char kClassSegmentationPrefix[] = "CLASS_SEGMENTATION";
const char kInstanceSegmentationPrefix[] = "INSTANCE_SEGMENTATION";
PREFIXED_IMAGE(Image, "");
PREFIXED_IMAGE(ForwardFlow, kForwardFlowPrefix);
PREFIXED_IMAGE(ClassSegmentation, kClassSegmentationPrefix);
PREFIXED_IMAGE(InstanceSegmentation, kInstanceSegmentationPrefix);
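
// Example of storing a per-frame class segmentation mask (a sketch; png_bytes
// is a hypothetical std::string holding one encoded mask):
//   SetClassSegmentationFormat("PNG", &sequence);
//   SetClassSegmentationClassLabelIndex({0, 1}, &sequence);
//   SetClassSegmentationClassLabelString({"background", "person"}, &sequence);
//   AddClassSegmentationEncoded(png_bytes, &sequence);
//   AddClassSegmentationTimestamp(1000000, &sequence);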

// **************************   TEXT   ****************************************
// Context keys:
// Which language text tokens are likely to be in.
const char kTextLanguageKey[] = "text/language";
// A large block of text that applies to the media.
const char kTextContextContentKey[] = "text/context/content";
// A large block of text that applies to the media as token ids.
const char kTextContextTokenIdKey[] = "text/context/token_id";
// A large block of text that applies to the media as embeddings.
const char kTextContextEmbeddingKey[] = "text/context/embedding";

// Feature list keys:
// The text contents for a given time.
const char kTextContentKey[] = "text/content";
// The start time for the text becoming relevant.
const char kTextTimestampKey[] = "text/timestamp";
// The duration where the text is relevant.
const char kTextDurationKey[] = "text/duration";
// The confidence that this is the correct text.
const char kTextConfidenceKey[] = "text/confidence";
// A floating point embedding corresponding to the text.
const char kTextEmbeddingKey[] = "text/embedding";
// An integer id corresponding to the text.
const char kTextTokenIdKey[] = "text/token/id";

BYTES_CONTEXT_FEATURE(TextLanguage, kTextLanguageKey);
BYTES_CONTEXT_FEATURE(TextContextContent, kTextContextContentKey);
VECTOR_INT64_CONTEXT_FEATURE(TextContextTokenId, kTextContextTokenIdKey);
VECTOR_FLOAT_CONTEXT_FEATURE(TextContextEmbedding, kTextContextEmbeddingKey);
BYTES_FEATURE_LIST(TextContent, kTextContentKey);
INT64_FEATURE_LIST(TextTimestamp, kTextTimestampKey);
INT64_FEATURE_LIST(TextDuration, kTextDurationKey);
FLOAT_FEATURE_LIST(TextConfidence, kTextConfidenceKey);
VECTOR_FLOAT_FEATURE_LIST(TextEmbedding, kTextEmbeddingKey);
INT64_FEATURE_LIST(TextTokenId, kTextTokenIdKey);
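
// Example of storing timed text (a sketch):
//   SetTextLanguage("en", &sequence);
//   AddTextContent("jump", &sequence);
//   AddTextTimestamp(3000000, &sequence);  // in microseconds.
//   AddTextDuration(1000000, &sequence);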

// ***********************    FEATURES    *************************************
// Context keys:
// The dimensions of the feature.
const char kFeatureDimensionsKey[] = "feature/dimensions";
// The rate the features are extracted per second of media.
const char kFeatureRateKey[] = "feature/rate";
// The encoding format if any for the feature.
const char kFeatureBytesFormatKey[] = "feature/bytes/format";
// For audio, the rate the samples are extracted per second of media.
const char kFeatureSampleRateKey[] = "feature/sample_rate";
// For audio, the number of channels per extracted feature.
const char kFeatureNumChannelsKey[] = "feature/num_channels";
// For audio, the number of samples per extracted feature.
const char kFeatureNumSamplesKey[] = "feature/num_samples";
// For audio, the rate the features are extracted per second of media.
const char kFeaturePacketRateKey[] = "feature/packet_rate";
// For audio, the original audio sampling rate the feature is derived from.
const char kFeatureAudioSampleRateKey[] = "feature/audio_sample_rate";
// The feature as a list of floats.
const char kContextFeatureFloatsKey[] = "context_feature/floats";
// The feature as a list of bytes.
const char kContextFeatureBytesKey[] = "context_feature/bytes";
// The feature as a list of ints.
const char kContextFeatureIntsKey[] = "context_feature/ints";

// Feature list keys:
// The feature as a list of floats.
const char kFeatureFloatsKey[] = "feature/floats";
// The feature as a list of bytes. May be encoded.
const char kFeatureBytesKey[] = "feature/bytes";
// The feature as a list of ints.
const char kFeatureIntsKey[] = "feature/ints";
// The timestamp, in microseconds, of the feature.
const char kFeatureTimestampKey[] = "feature/timestamp";

// It is occasionally useful to indicate that a feature applies to a given
// time range. This should be used for features only; annotations should be
// provided as segments.
const char kFeatureDurationKey[] = "feature/duration";
// Encodes an optional confidence score for generated features.
const char kFeatureConfidenceKey[] = "feature/confidence";

// Functions:

// Returns/sets a mediapipe::Matrix for the stream with that prefix.
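// Example (a sketch; "AUDIO" is an arbitrary prefix, and the Matrix layout of
// channels by samples is an assumption):
//   mediapipe::Matrix audio(2, 512);  // 2 channels x 512 samples (assumed).
//   AddAudioAsFeature("AUDIO", audio, &sequence);
//   std::unique_ptr<mediapipe::Matrix> decoded =
//       GetAudioFromFeatureAt("AUDIO", sequence, 0);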
std::unique_ptr<mediapipe::Matrix> GetAudioFromFeatureAt(
    const std::string& prefix, const tensorflow::SequenceExample& sequence,
    int index);
void AddAudioAsFeature(const std::string& prefix,
                       const mediapipe::Matrix& audio,
                       tensorflow::SequenceExample* sequence);

PREFIXED_VECTOR_INT64_CONTEXT_FEATURE(FeatureDimensions, kFeatureDimensionsKey);
PREFIXED_FLOAT_CONTEXT_FEATURE(FeatureRate, kFeatureRateKey);
PREFIXED_VECTOR_FLOAT_CONTEXT_FEATURE(ContextFeatureFloats,
                                      kContextFeatureFloatsKey);
PREFIXED_VECTOR_BYTES_CONTEXT_FEATURE(ContextFeatureBytes,
                                      kContextFeatureBytesKey);
PREFIXED_VECTOR_INT64_CONTEXT_FEATURE(ContextFeatureInts,
                                      kContextFeatureIntsKey);
PREFIXED_BYTES_CONTEXT_FEATURE(FeatureBytesFormat, kFeatureBytesFormatKey);
PREFIXED_VECTOR_FLOAT_FEATURE_LIST(FeatureFloats, kFeatureFloatsKey);
PREFIXED_VECTOR_BYTES_FEATURE_LIST(FeatureBytes, kFeatureBytesKey);
PREFIXED_VECTOR_INT64_FEATURE_LIST(FeatureInts, kFeatureIntsKey);
PREFIXED_INT64_FEATURE_LIST(FeatureTimestamp, kFeatureTimestampKey);
PREFIXED_VECTOR_INT64_FEATURE_LIST(FeatureDuration, kFeatureDurationKey);
PREFIXED_VECTOR_FLOAT_FEATURE_LIST(FeatureConfidence, kFeatureConfidenceKey);

PREFIXED_FLOAT_CONTEXT_FEATURE(FeatureSampleRate, kFeatureSampleRateKey);
PREFIXED_INT64_CONTEXT_FEATURE(FeatureNumChannels, kFeatureNumChannelsKey);
PREFIXED_INT64_CONTEXT_FEATURE(FeatureNumSamples, kFeatureNumSamplesKey);
PREFIXED_FLOAT_CONTEXT_FEATURE(FeaturePacketRate, kFeaturePacketRateKey);
PREFIXED_FLOAT_CONTEXT_FEATURE(FeatureAudioSampleRate,
                               kFeatureAudioSampleRateKey);

// Modifies the context features to match the metadata of the features in the
// sequence. Specifically, it sets the frame indices corresponding to the
// timestamps in the label metadata based on the image timestamps. For
// encoded images, encoded optical flow, and encoded human pose puppets the
// image format, height, width, channels, and frame rate are written as
// metadata. For float feature lists, the frame rate and dimensions are
// calculated. If the float feature dimensions are already present, then the
// code verifies the number of elements matches the dimensions.
// Reconciling bounding box annotations is optional because it will remove
// annotations if the sequence rate is lower than the annotation rate.
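// Example (a sketch of typical error handling):
//   absl::Status status = ReconcileMetadata(
//       /*reconcile_bbox_annotations=*/true,
//       /*reconcile_region_annotations=*/true, &sequence);
//   if (!status.ok()) { /* handle the error */ }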
absl::Status ReconcileMetadata(bool reconcile_bbox_annotations,
                               bool reconcile_region_annotations,
                               tensorflow::SequenceExample* sequence);
}  // namespace mediasequence
}  // namespace mediapipe

#endif  // MEDIAPIPE_TENSORFLOW_SEQUENCE_MEDIA_SEQUENCE_H_