Audio and Video (new in v1.10)
Modern deep learning technologies offer much better performance on multimedia data than previous approaches, so there are lots of opportunities for cool new products and features. Prodigy lets you create training data for a variety of common tasks, such as transcription, classification and speaker diarization. You can also use Prodigy as a library of simple building blocks to construct a custom solution, even if you have to cross-reference audio, video, text and metadata.
Manual audio annotation
The audio.manual recipe lets you load in audio or video files and add labelled regions to them. Under the hood, Prodigy will save the start and end timestamps, as well as the label, for each region. You can click and drag to add a region, resize existing regions by dragging their start and end, and remove regions by clicking their × button. Annotated regions can also overlap, if needed.

The following command starts the Prodigy server, loads in audio files from a directory ./recordings and allows annotating regions on them for two labels, SPEAKER_1 and SPEAKER_2:
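A minimal invocation might look like this; the dataset name audio_regions is a placeholder for whatever dataset you want to save the annotations to:

```shell
prodigy audio.manual audio_regions ./recordings --loader audio --label SPEAKER_1,SPEAKER_2
```

The --label option takes a comma-separated list, so you can add as many region labels as you need.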
By default, the audio loader expects to load files from a directory. The files will be encoded as base64 and the encoded data will be removed before the annotations are placed in the database.
Manual video annotation
The audio and audio_manual interfaces also support video files out-of-the-box – all you need to do is load in data with a key "video" containing the URL or base64-encoded data. The easiest way is to use audio.manual with --loader video. The video is then displayed above the waveform and you can annotate regions referring to timestamps of the video. This is especially helpful when annotating who is speaking, as the video can hold a lot of clues.
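For example, the same speaker-labelling workflow from above can be run over a directory of video files by swapping in the video loader (the dataset name video_regions is again a placeholder):

```shell
prodigy audio.manual video_regions ./recordings --loader video --label SPEAKER_1,SPEAKER_2
```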
Audio or video transcription
Prodigy’s blocks interface lets you combine multiple different interfaces into one – for example, audio and text_input. The built-in audio.transcribe workflow uses this combination to provide a straightforward audio-transcription interface. The free-form text typed in by the user will be saved to the annotation task as the key "transcript". The following command starts the server with a directory of recordings and saves the annotations to a dataset:
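A sketch of the command, with transcripts standing in for your dataset name:

```shell
prodigy audio.transcribe transcripts ./recordings --loader audio
```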
To make it easier to toggle play and pause as you transcribe, and to prevent clashes with the text input field (where enter is the default shortcut), this recipe lets you customize the keyboard shortcuts. To toggle play/pause, you can press command/option/alt/ctrl+enter, or provide your own overrides via --playpause-key, for instance --playpause-key command+w.
Audio or video classification
Custom recipes also let you build your very own workflows for audio or video annotations. For instance, you might want to load in audio recordings and sort them into categories, e.g. to classify the type of noise and whether it’s produced by a car, a plane or something else.
The custom recipe for this workflow is pretty straightforward: using the Audio loader, you can load your files from a directory and add a list of "options" to each incoming example. The "text" value of each option is displayed to the user and its "id" is used under the hood. When you select options, their "id" values will be added to the task as "accept", e.g. "accept": ["PLANE"]. For more details on the available UI settings, check out the interface docs.
recipe.py
```python
import prodigy
from prodigy.components.stream import get_stream

@prodigy.recipe("classify-audio")
def classify_audio(dataset, source):
    def add_options(stream):
        # Load the directory of audio files and add options to each task
        for eg in stream:
            eg["options"] = [
                {"id": "CAR", "text": "🚗 Car"},
                {"id": "PLANE", "text": "✈️ Plane"},
                {"id": "OTHER", "text": "Other / Unclear"}
            ]
            yield eg

    stream = get_stream(source)
    stream.apply(add_options)
    return {
        "dataset": dataset,
        "stream": stream,
        "view_id": "choice",
        "config": {
            "choice_style": "single",  # or "multiple"
            "choice_auto_accept": True,
            "audio_loop": True,
            "show_audio_minimap": False
        }
    }
```