EMOPIA
The EMOPIA (pronounced ‘yee-mò-pi-uh’) dataset is a shared multi-modal (audio and MIDI) database focusing on perceived emotion in pop piano music, built to facilitate research on various tasks related to music emotion. The dataset contains 1,087 music clips from 387 songs, with clip-level emotion labels annotated by four dedicated annotators. Since several clips may come from the same song, the dataset can also be used for song-level analysis.
For details of the methodology used to build the dataset, please refer to our paper.
Examples from the dataset
 | Low Valence | High Valence
---|---|---
High Arousal | (audio example) | (audio example)
Low Arousal | (audio example) | (audio example)
Number of clips
The following table shows the number of clips in EMOPIA and their average length (in seconds and in number of tokens) for each quadrant of Russell’s valence-arousal emotion space.
Quadrant | # clips | Avg. length (in sec / # tokens) |
---|---|---|
Q1 | 250 | 31.9 / 1,065 |
Q2 | 265 | 35.6 / 1,368 |
Q3 | 253 | 40.6 / 771 |
Q4 | 310 | 38.2 / 729 |
Pipeline of data collection
Dataset Analysis
Emotion Classification
- We performed the emotion classification task in both the audio domain and the symbolic domain.
- Our baseline approach is logistic regression using handcrafted features. For the symbolic domain, note length, velocity, beat note density, and key were used; for the audio domain, the average of 20-dimensional mel-frequency cepstral coefficient (MFCC) vectors was used. A minimal sketch of the audio-domain baseline is given after this list.
- The deep learning approaches use the mel-spectrogram as the input representation in the audio domain, and the MIDI-like and REMI representations in the symbolic domain.
- We adopt A Structured Self-attentive Sentence Embedding as the symbolic-domain classification network, and the Short-chunk CNN + Residual as the audio-domain classification network; a simplified sketch of the latter is shown at the end of this subsection.
- An inference example is Sakamoto’s Merry Christmas Mr. Lawrence. The emotional change between the first and second halves of the song is striking: the first half is clearly low arousal, while the second half turns into high arousal. Interestingly, the audio and MIDI classifiers return different inference results.
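As a rough, hypothetical sketch of the audio-domain baseline described above (not the exact feature pipeline or code from the paper), the snippet below averages 20-dimensional MFCCs per clip with librosa and fits a scikit-learn logistic regression over the four quadrant labels; the file paths and label pairs are placeholders.

```python
# Hypothetical sketch of the audio-domain baseline: time-averaged 20-dim MFCCs
# fed to a logistic regression over the four quadrants. Paths and labels below
# are placeholders, not files shipped with EMOPIA.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_mfcc_mean(path, sr=22050, n_mfcc=20):
    """Load one clip and return its time-averaged 20-dim MFCC vector."""
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (20, frames)
    return mfcc.mean(axis=1)                                 # shape (20,)

# Hypothetical list of (audio_path, quadrant) pairs with quadrant in {1, 2, 3, 4};
# fill it with the real clip paths and 4Q labels.
clips = [("Q1_example_1.mp3", 1), ("Q3_example_2.mp3", 3)]

X = np.stack([clip_mfcc_mean(path) for path, _ in clips])
y = np.array([quadrant for _, quadrant in clips])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("4Q accuracy:", clf.score(X_te, y_te))
```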
For the classification code, please refer to SeungHeon’s repository.
The pre-trained model weights are also provided there.
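For intuition about how the audio-domain deep model consumes mel-spectrograms, here is a heavily simplified PyTorch sketch in the spirit of a short-chunk CNN with residual connections; it is not the architecture from the repository, and all layer sizes and the input shape are assumptions.

```python
# Simplified sketch (not the repository model): a small residual CNN that
# classifies short log-mel-spectrogram chunks into the four quadrants.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        h = self.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return self.relu(x + h)  # residual connection

class ShortChunkCNN(nn.Module):
    """Assumed input: (batch, 1, n_mels, frames) log-mel chunks a few seconds long."""
    def __init__(self, n_classes=4, ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, ch, kernel_size=3, padding=1), nn.BatchNorm2d(ch), nn.ReLU())
        self.blocks = nn.Sequential(
            *[nn.Sequential(ResBlock(ch), nn.MaxPool2d(2)) for _ in range(4)])
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(ch, n_classes))

    def forward(self, x):
        return self.head(self.blocks(self.stem(x)))

# Clip-level prediction would average (or vote over) the chunk-level logits.
model = ShortChunkCNN()
dummy = torch.randn(8, 1, 128, 256)  # 8 chunks from a 128-bin mel-spectrogram
print(model(dummy).shape)            # torch.Size([8, 4])
```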
Conditional Generation
- We adopt the Compound Word Transformer for emotion-conditioned symbolic music generation using EMOPIA. The CP+emotion representation is used as the data representation.
- In the data representation, we additionally introduce “emotion” tokens and make them a new token family. The prepending approach is motivated by CTRL; a minimal sketch of this conditioning idea is given at the end of this section.
- As the size of EMOPIA might not be large enough, we additionally use the AILabs1k7 dataset compiled by Hsiao et al. to pre-train the Transformer.
- You can download the model weights of the pre-trained Transformer from here.
- The following are some generated examples for each quadrant:
Q1 (High valence, high arousal)
(Audio examples: Baseline, Transformer w/o pre-training, Transformer w/ pre-training.)
Q2 (Low valence, high arousal)
(Audio examples: Baseline, Transformer w/o pre-training, Transformer w/ pre-training.)
Q3 (Low valence, low arousal)
(Audio examples: Baseline, Transformer w/o pre-training, Transformer w/ pre-training.)
Q4 (High valence, low arousal)
(Audio examples: Baseline, Transformer w/o pre-training, Transformer w/ pre-training.)
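To make the conditioning idea concrete, below is a minimal, hypothetical sketch of CTRL-style emotion conditioning on a flat token sequence: the emotion token of the target quadrant is prepended before autoregressive sampling. The vocabulary, model stub, and sampling loop are illustrative only and do not reproduce the actual CP+emotion representation, which groups tokens into families.

```python
# Hypothetical sketch of CTRL-style emotion conditioning (not the real CP+emotion
# codebase): the emotion token of the target quadrant is prepended to the event
# sequence, and an autoregressive model continues from there.
import torch

# Illustrative vocabulary: one emotion token per quadrant plus a few event tokens.
vocab = {"<PAD>": 0, "<BOS>": 1, "<EOS>": 2,
         "<Emotion_Q1>": 3, "<Emotion_Q2>": 4, "<Emotion_Q3>": 5, "<Emotion_Q4>": 6,
         "Bar": 7, "Position_1/16": 8, "Pitch_60": 9, "Duration_4": 10}

def generate(model, quadrant, max_len=64, device="cpu"):
    """Prepend the emotion token for `quadrant` (1-4) and sample autoregressively."""
    seq = [vocab[f"<Emotion_Q{quadrant}>"], vocab["<BOS>"]]  # conditioning happens here
    for _ in range(max_len):
        x = torch.tensor(seq, device=device).unsqueeze(0)    # shape (1, T)
        logits = model(x)[0, -1]                              # next-token logits
        next_id = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        seq.append(next_id)
        if next_id == vocab["<EOS>"]:
            break
    return seq  # token ids, to be decoded back into MIDI events

# Stand-in model for illustration only: in practice, any autoregressive Transformer
# mapping (batch, T) token ids to (batch, T, vocab_size) logits would be used.
class DummyLM(torch.nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, dim)
        self.out = torch.nn.Linear(dim, vocab_size)
    def forward(self, x):
        return self.out(self.emb(x))

tokens = generate(DummyLM(len(vocab)), quadrant=2)
print(tokens)  # starts with the <Emotion_Q2> id, followed by sampled event ids
```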
Authors and Affiliations
- Hsiao-Tzu (Anna) Hung
  Research Assistant @ Academia Sinica / MS CSIE student @ National Taiwan University
  r08922a20@csie.ntu.edu.tw
  website, LinkedIn
- Joann Ching
  Research Assistant @ Academia Sinica
  joann8512@gmail.com
  website
- Seungheon Doh
  Ph.D. student @ Music and Audio Computing Lab, KAIST
  seungheondoh@kaist.ac.kr
  website, LinkedIn
- Juhan Nam
  Associate Professor @ KAIST
  juhan.nam@kaist.ac.kr
  website
- Yi-Hsuan Yang
  Chief Music Scientist @ Taiwan AI Labs / Associate Research Fellow @ Academia Sinica
  affige@gmail.com, yhyang@ailabs.tw
  website
Cite this dataset
@inproceedings{EMOPIA,
author = {Hung, Hsiao-Tzu and Ching, Joann and Doh, Seungheon and Kim, Nabin and Nam, Juhan and Yang, Yi-Hsuan},
title = {{EMOPIA}: A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation},
booktitle = {Proc. Int. Society for Music Information Retrieval Conf.},
year = {2021}
}