While Google's AudioSet provides a large-scale audio dataset covering a wide range of events, each event is labeled only at the level of a 10-second audio segment (weakly labeled), which makes it difficult to know when the event actually occurred within the segment. We used a crowdsourcing approach to curate AudioSet down to one-second label granularity (strongly labeled) for typical activities in a workplace setting. With this finer granularity, researchers can build more responsive and accurate audio understanding machine learning models.
The dataset contains around 57,000 one-second segments covering activities that occur in a workplace setting. We curated Google AudioSet and re-annotated the audio labels at one-second granularity using Amazon Mechanical Turk: crowd workers listened to one-second segments and chose the appropriate label. To ensure annotation quality, we excluded audio segments that did not reach majority agreement among the workers. Disclaimer: although we made every effort to verify the annotations, some segments may still be labeled incorrectly. Please account for such cases in your data pipelines.
Event | # of segments |
---|---|
Clicking | 88 |
Door | 113 |
Conversation | 120 |
Male Speech | 124 |
Female Speech | 136 |
Chatter | 158 |
Knock | 180 |
Walk | 220 |
Hubbub | 227 |
Television | 239 |
Clapping | 240 |
Silence | 251 |
Typing | 315 |
Applause | 360 |
Laughter | 530 |
Crowd | 1,218 |
Other, unidentifiable events | 24,081 |
Speech | 28,671 |
The structure of the dataset largely follows that of Google AudioSet and is organized as follows.
```
[audio_segment_id]/
┣ [audio_segment_id]_[index].wav
┣ [audio_segment_id]_[index]_labels.txt
┣ ...
┗ ...
```
As shown in the structure above, each directory contains one-second audio segments ([audio_segment_id]_[index].wav) and corresponding label files ([audio_segment_id]_[index]_labels.txt). The wav files can be used directly for training and testing a machine learning model, or converted into audio embeddings using the VGGish model, as sketched below.
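Below is a minimal sketch of converting one segment into a VGGish embedding. It assumes the TensorFlow Hub release of VGGish (https://tfhub.dev/google/vggish/1) and a hypothetical segment path `abc123/abc123_0.wav`; the dataset does not ship an embedding script, so treat this as one possible approach rather than an official pipeline.

```python
import librosa
import tensorflow as tf
import tensorflow_hub as hub

# VGGish expects a mono 16 kHz waveform with samples in [-1, 1];
# librosa resamples and normalizes on load.
waveform, _ = librosa.load("abc123/abc123_0.wav", sr=16000, mono=True)

# Load the TF Hub VGGish model (assumed URL; verify against TF Hub).
vggish = hub.load("https://tfhub.dev/google/vggish/1")

# The model returns one 128-D embedding per ~0.96 s frame,
# so a one-second segment typically yields a single embedding.
embedding = vggish(tf.constant(waveform, dtype=tf.float32))
print(embedding.shape)  # e.g. (1, 128)
```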
Each label file contains the start time of the event within the segment and the event ID/name, following the Google AudioSet Ontology convention. The files are structured as follows.

```
[start_seconds], [id], [name]
0,/m/09x0r,"Speech"
```
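The following sketch pairs each wav file with its labels by walking the directory layout described above. The root path `dataset_root` is a placeholder, the file-name patterns follow the structure shown earlier, and the parser skips a header row if one is present; verify these assumptions against your local copy of the dataset.

```python
import csv
import glob
import os

import soundfile as sf


def load_segments(root):
    """Yield (waveform, sample_rate, labels) for every one-second segment under `root`."""
    for wav_path in sorted(glob.glob(os.path.join(root, "*", "*.wav"))):
        label_path = wav_path[: -len(".wav")] + "_labels.txt"
        waveform, sr = sf.read(wav_path)

        labels = []
        with open(label_path, newline="") as f:
            for row in csv.reader(f):
                # Skip empty lines or a header row, if present.
                if not row or not row[0].strip().replace(".", "", 1).isdigit():
                    continue
                start_seconds, label_id, name = row
                labels.append((float(start_seconds), label_id, name))

        yield waveform, sr, labels


# Example usage: print the duration and labels of the first segment.
for waveform, sr, labels in load_segments("dataset_root"):
    print(len(waveform) / sr, labels)
    break
```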