While Google's AudioSet provides a large-scale audio dataset covering a wide range of events, each event is labeled only at the level of a 10-second audio segment (weakly labeled), which makes it difficult to know when the event actually occurred within the segment. We used a crowdsourcing approach to curate AudioSet down to one-second label granularity (strongly labeled) for typical activities in a workplace setting. With this finer granularity, researchers can build more responsive and accurate audio understanding machine learning models.
The dataset contains around 57,000 one-second segments covering activities that occur in a workplace setting. We curated Google AudioSet and re-annotated the audio labels at one-second granularity using Amazon Mechanical Turk: crowd workers listened to one-second segments and chose the appropriate label. To ensure annotation quality, we excluded audio segments that did not reach majority agreement among the workers. Disclaimer: although we made every effort to verify the annotations, some segments may still be labeled incorrectly. Please account for such cases in your data pipelines.
Event | # of segments |
---|---|
Clicking | 88 |
Door | 113 |
Conversation | 120 |
Male Speech | 124 |
Female Speech | 136 |
Chatter | 158 |
Knock | 180 |
Walk | 220 |
Hubbub | 227 |
Television | 239 |
Clapping | 240 |
Silence | 251 |
Typing | 315 |
Applause | 360 |
Laughter | 530 |
Crowd | 1,218 |
Other, unidentifiable events | 24,081 |
Speech | 28,671 |
The structure of the dataset largely follows that of Google AudioSet and is organized as follows.
```
[audio_segment_id]/
┣ [audio_segment_id]_[index].wav
┣ [audio_segment_id]_[index]_labels.txt
┣ ...
┗ ...
```
As shown in the structure above, each directory contains one-second audio segments ([audio_segment_id]_[index].wav) and corresponding label files ([audio_segment_id]_[index]_labels.txt). The wav files can be used directly for training and testing a machine learning model, or converted into audio embeddings using the VGGish model, as sketched below.
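Below is a minimal sketch of converting one segment into a VGGish embedding. It assumes the TensorFlow Hub release of VGGish (https://tfhub.dev/google/vggish/1) and a hypothetical segment path `abc123/abc123_0.wav`; the dataset does not ship an embedding script, so treat this as one possible approach rather than an official pipeline.

```python
import librosa
import tensorflow as tf
import tensorflow_hub as hub

# VGGish expects a mono 16 kHz waveform with samples in [-1, 1];
# librosa resamples and normalizes on load.
waveform, _ = librosa.load("abc123/abc123_0.wav", sr=16000, mono=True)

# Load the TF Hub VGGish model (assumed URL; verify against TF Hub).
vggish = hub.load("https://tfhub.dev/google/vggish/1")

# The model returns one 128-D embedding per ~0.96 s frame,
# so a one-second segment typically yields a single embedding.
embedding = vggish(tf.constant(waveform, dtype=tf.float32))
print(embedding.shape)  # e.g. (1, 128)
```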
Each label file contains the start time of the event within the segment and the event ID/name, following the Google AudioSet Ontology convention. The files are structured as follows.

```
[start_seconds], [id], [name]
0,/m/09x0r,"Speech"
```
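The following sketch pairs each wav file with its labels by walking the directory layout described above. The root path `dataset_root` is a placeholder, the file-name patterns follow the structure shown earlier, and the parser skips a header row if one is present; verify these assumptions against your local copy of the dataset.

```python
import csv
import glob
import os

import soundfile as sf


def load_segments(root):
    """Yield (waveform, sample_rate, labels) for every one-second segment under `root`."""
    for wav_path in sorted(glob.glob(os.path.join(root, "*", "*.wav"))):
        label_path = wav_path[: -len(".wav")] + "_labels.txt"
        waveform, sr = sf.read(wav_path)

        labels = []
        with open(label_path, newline="") as f:
            for row in csv.reader(f):
                # Skip empty lines or a header row, if present.
                if not row or not row[0].strip().replace(".", "", 1).isdigit():
                    continue
                start_seconds, label_id, name = row
                labels.append((float(start_seconds), label_id, name))

        yield waveform, sr, labels


# Example usage: print the duration and labels of the first segment.
for waveform, sr, labels in load_segments("dataset_root"):
    print(len(waveform) / sr, labels)
    break
```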