Advanced openSMILE Configurations: Custom Features and Pipelines
Overview
openSMILE is a flexible toolkit for extracting audio features (low-level descriptors and functionals) from speech and other audio. Advanced configurations let you create custom feature sets, process streams in real time, and build multi-stage pipelines that combine preprocessing, feature extraction, selection, and export for machine learning.
Key Concepts
- Config files: openSMILE’s behavior is driven by INI-style config files (.conf) that declare components (frames, windows, feature calculators, aggregators) and their connections.
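- Component types: sources (e.g., wave file input), framers, windowers, spectral analyzers, LLD (low-level descriptor) extractors, functionals (statistical aggregators), and sinks (CSV, ARFF, network).
- Instances and names: each config section configures a named instance of a component class; instances are connected through named data memory levels, where the level one component writes is the level the next component reads.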
- Real-time vs. batch: real-time uses streaming components, ringbuffers, and non-blocking sinks; batch can use longer windows and global functionals.
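To make the idea of components connected through named levels concrete, here is a minimal Python sketch that parses a small config and checks that every level a component reads is written by some other component. The config text is modeled on openSMILE's simple energy demo, but treat the component options as assumptions to verify against your version's documentation. The helpers `parse_levels` and `dangling_reads` are hypothetical names, not part of openSMILE.

```python
# Sketch only: the config below is modeled on openSMILE's energy demo.
# Verify component names and options against your openSMILE version.
CONFIG = """
[componentInstances:cComponentManager]
instance[dataMemory].type = cDataMemory
instance[waveIn].type = cWaveSource
instance[framer].type = cFramer
instance[energy].type = cEnergy
instance[csvSink].type = cCsvSink

[waveIn:cWaveSource]
writer.dmLevel = wave
filename = input.wav

[framer:cFramer]
reader.dmLevel = wave
writer.dmLevel = frames
frameSize = 0.025
frameStep = 0.010

[energy:cEnergy]
reader.dmLevel = frames
writer.dmLevel = energy
rms = 1
log = 0

[csvSink:cCsvSink]
reader.dmLevel = energy
filename = output.csv
"""


def parse_levels(text):
    """Return (writers, readers): level -> instance, instance -> [levels]."""
    writers, readers = {}, {}
    instance = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks and lines that look like comments (assumed markers).
        if not line or line.startswith((";", "//", "#")):
            continue
        if line.startswith("[") and line.endswith("]"):
            # Section header has the form [instanceName:cComponentType].
            instance = line[1:-1].split(":", 1)[0]
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key == "writer.dmLevel":
            writers[value] = instance
        elif key == "reader.dmLevel":
            # A reader may list several levels separated by ';'.
            readers[instance] = [lvl for lvl in value.split(";") if lvl]
    return writers, readers


def dangling_reads(text):
    """List (instance, level) pairs where nothing writes the level being read."""
    writers, readers = parse_levels(text)
    return [(inst, lvl) for inst, lvls in readers.items()
            for lvl in lvls if lvl not in writers]


writers, readers = parse_levels(CONFIG)
for inst, lvls in readers.items():
    for lvl in lvls:
        print(f"{writers.get(lvl, '<nobody>')} --[{lvl}]--> {inst}")
print("dangling:", dangling_reads(CONFIG))
```

A check like this is handy before a long batch run, since a misspelled level name is one of the easier mistakes to make when editing configs by hand.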
Example goals (choose one for implementation)
- Custom LLD set focused on prosody and voice quality (F0, jitter, shimmer, HNR, RMS energy, spectral tilt).
- Multi-resolution pipeline: short-term LLDs (10–25 ms) + mid-term features (200–1000 ms) + long-term functionals per file.
- Real-time low-latency extractor sending features over network (OSC/TCP) to a downstream ML service.
- Feature fusion pipeline: audio + derived linguistic timestamps (ASR) merged into a single feature stream.
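The multi-resolution goal can be illustrated outside openSMILE with a small Python sketch: short-term RMS energy per 25 ms frame, mid-term means over 500 ms windows, and one long-term value per file. The 16 kHz sample rate, the synthetic signal, and the helper names are assumptions for illustration; a real pipeline would compute these stages inside openSMILE.

```python
import math

SR = 16000  # sample rate in Hz (assumed)


def frames(seq, size, step):
    """Yield fixed-size windows; a trailing partial window is dropped."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def rms(frame):
    """Root-mean-square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


# Two seconds of a synthetic 220 Hz tone whose amplitude rises linearly.
signal = [(n / (2 * SR)) * math.sin(2 * math.pi * 220 * n / SR)
          for n in range(2 * SR)]

# Short-term LLD: 25 ms window, 10 ms step.
lld = [rms(f) for f in frames(signal, round(0.025 * SR), round(0.010 * SR))]

# Mid-term: 500 ms window, 250 ms step, expressed in 10 ms LLD frames (50 / 25).
mid = [sum(w) / len(w) for w in frames(lld, 50, 25)]

# Long-term: one value for the whole file.
file_mean = sum(lld) / len(lld)

print(len(lld), "short-term frames,", len(mid), "mid-term windows")
print("file-level mean RMS:", round(file_mean, 4))
```

Each stage consumes the output of the previous one, which mirrors how a chained openSMILE config would pass data from level to level.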
Practical configuration steps
- Start from a base config: copy one of the configs shipped in openSMILE's config directory (e.g., IS09, emobase, or eGeMAPS) and use it as a template.
- Define framing settings: set frameSize and frameStep (in seconds) for short-term LLDs; add a second framing stage with a longer window for mid-term features.
- Select/remove feature calculators: enable calculators for desired LLDs (e.g., F0, energy, spectral moments); disable unused ones to reduce CPU/memory.
- Add custom calculators: implement new feature extractors by extending the C++ framework (deriving from the cSmileComponent base class or one of its subclasses), or use the existing cVectorProcessor/cFunctionals blocks to compute combinations.
- Configure functionals: set which statistics (mean, std, percentiles, regression slope) to compute per segment/file.
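- Set sinks and formats: enable CSV/ARFF sinks for batch work, or a network sink such as cDataSocket/cTcpClient for streaming if your build provides one (run `SMILExtract -L` to list the components available to you). Keep header options consistent across runs so downstream ML pipelines always see the same columns.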
- Optimize performance: reduce buffer sizes for latency-sensitive setups, compile with optimization flags, or use fewer features.
- Version control configs: keep configs in a repo and document parameter values for reproducibility.
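To show what the functionals step produces, the following Python sketch computes a few common statistics (mean, standard deviation, percentiles, regression slope) over one LLD contour. This is an illustration under assumptions, not openSMILE's implementation: percentile interpolation and normalization conventions in openSMILE may differ, and the F0 values are made up.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def slope(values):
    """Least-squares slope of the values against their frame index (needs >= 2 values)."""
    n = len(values)
    mean_x = (n - 1) / 2.0
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def functionals(values):
    """Map one contour to a fixed-length dictionary of statistics."""
    return {
        "mean": statistics.fmean(values),
        "stddev": statistics.pstdev(values),
        "p20": percentile(values, 20),
        "p50": percentile(values, 50),
        "p80": percentile(values, 80),
        "slope": slope(values),
    }


f0_contour = [110.0, 112.0, 114.0, 116.0, 118.0]  # made-up F0 values in Hz
print(functionals(f0_contour))
```

Whatever the segment length, the output has the same number of values, which is why functionals are convenient as fixed-size inputs to a classifier.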
Example snippets
- Short-term frame settings (conceptual; 25 ms window, 10 ms step, values in seconds):
frameSize = 0.025
frameStep = 0.010
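- Enabling a mid-term window (conceptual; midFrameSize and midFrameStep are placeholder names, and in an actual config these would be the frameSize and frameStep of a second framing or functionals component):
midFrameSize = 0.5
midFrameStep = 0.25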
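The arithmetic behind these settings is easy to check in Python. The sketch below counts how many full frames fit into a signal for a given window and step, assuming no padding of the final frame; openSMILE's handling of the last partial frame may differ. The function name `n_frames` and the 16 kHz sample rate are assumptions for illustration.

```python
def n_frames(n_samples, frame_size_s, frame_step_s, sample_rate):
    """Number of full frames in a signal, assuming the last partial frame is dropped."""
    size = round(frame_size_s * sample_rate)
    step = round(frame_step_s * sample_rate)
    if n_samples < size:
        return 0
    return (n_samples - size) // step + 1


SR = 16000          # assumed sample rate in Hz
N = 3 * SR          # a 3-second file

short = n_frames(N, 0.025, 0.010, SR)   # short-term: 25 ms / 10 ms
mid = n_frames(N, 0.5, 0.25, SR)        # mid-term: 500 ms / 250 ms

print("short-term frames:", short)
print("mid-term frames:", mid)
```

With a 10 ms step the short-term stream runs at roughly 100 vectors per second, while the mid-term stream runs at about 4 per second, which matters when sizing buffers or a downstream consumer.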
Real-time pipeline tips
- Use small frame steps and ringbuffers; ensure downstream ML can handle input rate.
- Optionally run endpointing/VAD to avoid processing silence.
- Send feature deltas to capture dynamics without full functionals.
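As an illustration of the delta tip, this Python sketch computes regression-based delta coefficients over a feature contour using the standard formula with a window of two frames on each side. Edge frames are handled by repeating the first and last value, which is an assumption; openSMILE's own delta component may treat edges differently.

```python
def deltas(values, win=2):
    """Regression-based deltas: sum_n n*(c[t+n] - c[t-n]) / (2 * sum_n n^2)."""
    denom = 2 * sum(n * n for n in range(1, win + 1))
    last = len(values) - 1
    out = []
    for t in range(len(values)):
        acc = 0.0
        for n in range(1, win + 1):
            ahead = values[min(t + n, last)]   # repeat last value at the end
            behind = values[max(t - n, 0)]     # repeat first value at the start
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


ramp = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]   # made-up contour rising by 1 per frame
print(deltas(ramp))
print(deltas([3.0, 3.0, 3.0, 3.0]))      # flat contour
```

Away from the edges, a contour rising by one unit per frame yields a delta of one, and a flat contour yields zeros, so the deltas carry the local trend without waiting for a full segment.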
Validation and debugging
- Use openSMILE’s verbose/logging options to trace component connections.
- Compare outputs to known configs (e.g., eGeMAPS) for sanity checks.
- Visualize time-series LLDs (e.g., in Python matplotlib) to inspect behavior.
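Before plotting, a quick numeric summary of the exported LLDs often catches obvious problems. The Python sketch below reads a semicolon-delimited CSV (assumed here to be the delimiter your CSV sink is configured to write) and reports count, min, max, and mean per column. The sample data and column names are made up for illustration.

```python
import csv
import io

# Made-up sample; column names and values are hypothetical.
SAMPLE = """frameTime;pcm_RMSenergy;F0
0.00;0.010;0.0
0.01;0.020;118.5
0.02;0.030;120.1
0.03;0.025;119.4
"""


def summarize(handle, delimiter=";"):
    """Return {column: {n, min, max, mean}} for every numeric column."""
    columns = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        for name, cell in row.items():
            try:
                columns.setdefault(name, []).append(float(cell))
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
    return {
        name: {"n": len(v), "min": min(v), "max": max(v), "mean": sum(v) / len(v)}
        for name, v in columns.items() if v
    }


summary = summarize(io.StringIO(SAMPLE))
for name, stats in summary.items():
    print(name, stats)
```

To run it on a real export, pass an open file instead of the in-memory sample, for example `with open("lld.csv", newline="") as fh: summarize(fh)`, where `lld.csv` is a placeholder path. In the sample, the F0 minimum of zero comes from a frame with no pitch value, which is the kind of detail worth noticing before computing means.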
Common pitfalls
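- Mismatched units (e.g., F0 in Hz vs. semitones): convert to a common scale before comparing or merging feature sets.
- Excessive functionals produce high-dimensional outputs: apply feature selection or PCA before training.
- Real-time network delays: monitor latency and packet loss between the extractor and the downstream service.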
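For the units pitfall, a small Python sketch shows the Hz-to-semitone conversion. The default reference of 27.5 Hz matches the semitone scale used for F0 in eGeMAPS; the function name and the choice to return `None` for unvoiced frames are assumptions for illustration.

```python
import math


def hz_to_semitones(f0_hz, ref_hz=27.5):
    """Convert F0 in Hz to semitones above ref_hz; None for unvoiced (<= 0) frames."""
    if f0_hz <= 0:
        return None
    return 12.0 * math.log2(f0_hz / ref_hz)


for f0 in (0.0, 55.0, 110.0, 440.0):
    print(f0, "Hz ->", hz_to_semitones(f0))
```

Each doubling of frequency adds 12 semitones, so 55 Hz maps to 12, 110 Hz to 24, and 440 Hz to 48 on this scale. Mixing a Hz-scaled F0 from one config with a semitone-scaled F0 from another would silently distort any model trained on the combined features.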
Further reading and resources
- openSMILE config examples (eGeMAPS, IS09, ComParE)
- openSMILE source docs and component reference
- Community recipes for emotion and speaker tasks