Multimodal Humor Dataset: Predicting Laughter tracks for Sitcoms

Badri N. Patro1*  Mayank Lunayach1*  Deepankar Srivastav1  Sarvesh1  Hunar Singh1  Vinay P. Namboodiri1
1IIT Kanpur, *equal contribution

In WACV 2021

Task example

Download Dataset

Additional Plots/Figures

tSNE plots of Visual Dialogs
A tSNE plot made by randomly selecting 1500 images from each of the Humorous and Non-Humorous sets, where each image is the last frame of a visual dialog turn. Visual models can sometimes cheat by detecting a pattern specific to Humorous or Non-Humorous visual dialogs, such as a particular camera angle. The above plot hints at the absence of such patterns. To visualize the plot better, each image is represented by a dot and the corresponding plot is shown below. (The current plot is slightly scaled up for visibility.)
A green dot represents a humorous sample and a red dot represents a non-humorous sample. The dots appear to be randomly distributed, hinting at the absence of any such bias.

Bar plots of the word distribution of dialogs spoken by the top 6 speakers in our dataset. The similarity of the top-20 word sets across the plots suggests that humor/non-humor is not biased toward a particular speaker.
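One way to put a number on the "similarity in the top 20 set" is to compare the most frequent words of two speakers with a Jaccard overlap. The snippet below is only a minimal sketch on toy data: the speaker names, the whitespace tokenization, and the helper names `top_words` and `jaccard` are assumptions for illustration, not part of the dataset tooling.

```python
from collections import Counter


def top_words(utterances, k=20):
    """Return the k most frequent lowercase tokens across a speaker's utterances."""
    counts = Counter()
    for text in utterances:
        # Naive whitespace tokenization (an assumption; a real pipeline
        # would likely also strip punctuation and stop words).
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(k)]


def jaccard(a, b):
    """Jaccard overlap between two word lists, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Toy, hypothetical dialogs; the real dataset would supply these per speaker.
dialogs = {
    "speaker_a": ["i am not crazy", "i am fine"],
    "speaker_b": ["i am hungry", "are you not hungry"],
}

tops = {speaker: top_words(lines, k=3) for speaker, lines in dialogs.items()}
overlap = jaccard(tops["speaker_a"], tops["speaker_b"])
```

A high overlap across all speaker pairs would be consistent with the claim that word usage alone does not separate speakers, though it is a coarser check than inspecting the bar plots.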

Other Dataset Statistics
The figure showing average time per turn in a Dialog, across the Dataset.
The figure showing average dialog time, across the Dataset.
The figure showing contribution of each speaker in generating humor, across the Dataset.
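The three statistics above could be computed from per-turn records along the following lines. This is a sketch under stated assumptions: the record layout `(dialog_id, speaker, start_sec, end_sec, is_humorous)`, the toy values, and the function names are all hypothetical, and the actual annotation format of the dataset may differ (for instance, humor may be labeled per dialog rather than per turn).

```python
from collections import defaultdict

# Hypothetical per-turn records:
# (dialog_id, speaker, start_sec, end_sec, is_humorous)
turns = [
    ("d1", "A", 0.0, 2.0, False),
    ("d1", "B", 2.0, 5.0, True),
    ("d2", "A", 0.0, 4.0, True),
    ("d2", "B", 4.0, 5.0, False),
]


def average_turn_time(turns):
    """Mean duration of a single turn, in seconds."""
    return sum(end - start for _, _, start, end, _ in turns) / len(turns)


def average_dialog_time(turns):
    """Mean total duration of a dialog (sum of its turn durations)."""
    totals = defaultdict(float)
    for dialog_id, _, start, end, _ in turns:
        totals[dialog_id] += end - start
    return sum(totals.values()) / len(totals)


def humor_share_by_speaker(turns):
    """Fraction of humorous turns attributed to each speaker."""
    counts = defaultdict(int)
    for _, speaker, _, _, is_humorous in turns:
        if is_humorous:
            counts[speaker] += 1
    total = sum(counts.values())
    return {s: n / total for s, n in counts.items()} if total else {}
```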

MSAM model

The figure describes the proposed Multimodal Self Attention Model (MSAM) for the laughter detection task. We obtain features of each joint dialogue turn using a multimodal self-attention network. We then obtain the final feature vector using a sequential network and feed it to the binary classifier.
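The three stages named in the caption (self-attention per joint turn, a sequential network over turns, a binary classifier) can be illustrated with a small pure-Python sketch. It is not the authors' architecture: the attention uses no learned projections (Q = K = V), the sequential network is replaced by a simple elementwise tanh recurrence, and all weights, feature sizes, and function names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens."""
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights = [softmax([dot(q, k) / scale for k in tokens]) for q in tokens]
    attended = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(dim)]
        for row in weights
    ]
    return attended, weights


def turn_feature(text_tokens, visual_tokens):
    """Joint turn feature: attend over text and visual tokens, then mean-pool."""
    joint = text_tokens + visual_tokens
    attended, _ = self_attention(joint)
    dim = len(joint[0])
    return [sum(tok[j] for tok in attended) / len(attended) for j in range(dim)]


def encode_dialog(turn_features, w_in=0.5, w_rec=0.5):
    """Stand-in sequential network: tanh recurrence with fixed, made-up weights."""
    h = [0.0] * len(turn_features[0])
    for x in turn_features:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
    return h


def predict_humor(h, weights, bias=0.0):
    """Binary classifier: logistic regression on the final feature vector."""
    return 1.0 / (1.0 + math.exp(-(dot(weights, h) + bias)))


# Toy dialog of two turns; each turn is (text token vectors, visual token vectors).
dialog = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]),
    ([[0.2, 0.8]], [[0.9, 0.1]]),
]
features = [turn_feature(text, visual) for text, visual in dialog]
probability = predict_humor(encode_dialog(features), weights=[1.0, -1.0])
```

In a trained model the projections, recurrence, and classifier weights would be learned; the fixed values here only make the data flow visible.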

Qualitative results

Randomly sampled results (MSAM model) from each prediction category, (correct/incorrect) x (humor/non-humor). E.g., a Humor label in a red box means the ground truth label was non-humor but the predicted label was humor.
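The four prediction categories can be derived from a ground-truth label and a predicted label as sketched below. The label strings, the sample list, and the function name `prediction_category` are assumptions for illustration; the convention that the displayed label is the predicted one follows the example in the caption.

```python
from collections import Counter


def prediction_category(ground_truth, predicted):
    """Map a (ground truth, prediction) pair to (correct/incorrect, shown label).

    The shown label is the predicted one, so ("incorrect", "humor") means the
    ground truth was non-humor but the model predicted humor.
    """
    outcome = "correct" if ground_truth == predicted else "incorrect"
    return outcome, predicted


# Hypothetical (ground_truth, predicted) pairs.
samples = [
    ("humor", "humor"),
    ("non-humor", "humor"),
    ("humor", "non-humor"),
    ("non-humor", "non-humor"),
    ("non-humor", "humor"),
]

counts = Counter(prediction_category(gt, pred) for gt, pred in samples)
```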

Explaining humor

The left column shows visualization of attention at the word level and the right column shows attention visualization at turn level.

Baseline Models

Fusion Models
Text based Fusion Model (TFM)
Video based Fusion Model (VFM)
Attention Models
Text based Attention Model (TAM)
Video based Attention Model (VAM)