chromium/third_party/coremltools/mlmodel/format/SoundAnalysisPreprocessing.proto

// Copyright (c) 2019, Apple Inc. All rights reserved.
//
// Use of this source code is governed by a BSD-3-clause license that can be
// found in LICENSE.txt or at https://opensource.org/licenses/BSD-3-Clause

syntax = "proto3";
option optimize_for = LITE_RUNTIME;

package CoreML.Specification.CoreMLModels;

/*
 * A model which takes audio signal samples as input and outputs an array of
 * preprocessed samples according to the specified preprocessing types
 */
message SoundAnalysisPreprocessing {
  // Specific preprocessing types for sound analysis

  /* Vggish preprocesses input audio samples and makes them ready to
     be fed to Vggish feature extractor.
     c.f. https://arxiv.org/pdf/1609.09430.pdf

     The preprocessing takes input a single channel (monophonic) audio samples
     975 miliseconds long, sampled at 16KHz, i.e., 15600 samples 1D multiarray
     and produces preprocessed samples in multiarray of shape [1, 96, 64]

   (1) Splits the input audio samples into overlapping frames, where each
       frame is 25 milliseconds long and hops forward by 10 milliseconds.
       Any partial frames at the end are dropped.

   (2) Hann window: apply a periodic Hann with a window_length of
       25 milliseconds, which translates to 400 samples in 16KHz sampling rate

       w(n) = 0.5 - 0.5 * cos(2*pi*n/window_length_sample),
       where 0 <= n <= window_lenth_samples - 1 and window_lenth_samples = 400

       Then, the Hann window is applied to each frame as below

       windowed_frame(n) = frame(n) * w(n)
       where 0 <= n <= window_lenth_samples - 1 and window_lenth_samples = 400

   (3) Power spectrum: calculate short-time Fourier transfor magnitude, with
       an FFT length of 512

   (4) Log Mel filter bank: calculates a log magnitude mel-frequency
       spectrogram minimum frequency of 125Hz and maximum frequency of 7500Hz,
       number of mel bins is 64, log_offset is 0.01, number of spectrum bins
       is 64.
  */

  message Vggish {
    // no specific parameter
  }

  // Vision feature print type
  oneof SoundAnalysisPreprocessingType {
    Vggish vggish = 20;
  }
}