MHD

¹IIT Kanpur, *equal contribution

Task example

Dataset sample (consisting of randomly sampled 400 dialogues)
Full dataset (compressed in a zip file)

The dataset folder has the following directory structure:
|-- DT_{N}
|   |-- Raw
|   |   |-- S{M}
|   |   |   |-- The Big Bang_S0{M}{I}.json
|   |-- test.json
|   |-- train.json
|   `-- val.json

where N is the number of dialogue turns for that sub-dataset, M represents the season of the series (varies from 1 to 5), and I represents the episode number in that season (like 01, 02, and so on). Episode-level extracted dialogues are in the Raw folder. Dialogues split into the train, val, and test categories are in train.json, val.json, and test.json, respectively.
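The layout described here can be turned into file paths with a few lines of Python. This is a minimal sketch, not an official loader: the helper names (`split_paths`, `raw_episode_path`, `load_json`) and the `root` argument are hypothetical, and nothing is assumed about the record schema inside the JSON files.

```python
import json
from pathlib import Path


def split_paths(root, n_turns):
    """Return the train/val/test JSON paths for the DT_{N} sub-dataset."""
    base = Path(root) / f"DT_{n_turns}"
    return {name: base / f"{name}.json" for name in ("train", "val", "test")}


def raw_episode_path(root, n_turns, season, episode):
    """Return the path of one episode-level file in the Raw folder."""
    # File name pattern taken from the listing: The Big Bang_S0{M}{I}.json,
    # with the episode number I zero-padded to two digits (01, 02, ...).
    name = f"The Big Bang_S0{season}{episode:02d}.json"
    return Path(root) / f"DT_{n_turns}" / "Raw" / f"S{season}" / name


def load_json(path):
    """Load one JSON file; the structure of its contents is not assumed."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_json(split_paths("dataset", 5)["train"])` would read the training split of a sub-dataset with 5 dialogue turns, assuming the archive was unpacked into a folder named `dataset` and that such a sub-dataset exists.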

tSNE plots of Visual Dialogs

A tSNE plot of 1500 randomly selected images (from each of the Humorous and Non-Humorous sets), each taken as the last frame of a visual dialog turn. Visual models can sometimes cheat by detecting a pattern specific to Humorous or Non-Humorous visual dialogs, such as a particular camera angle. The above plot hints at the absence of such a pattern. To visualize the plot better, each image is represented by a dot and the corresponding plot is shown below. (The current plot is slightly scaled up for visibility.)

A green dot represents a humorous sample and a red dot represents a non-humorous sample. The dots appear to be randomly distributed, hinting at the absence of any such bias.

Other Dataset Statistics

The figure showing average time per turn in a dialog, across the dataset.
The figure showing average dialog time, across the dataset.
The figure showing the contribution of each speaker in generating humor, across the dataset.

The figure describes the proposed Multimodal Self Attention Model (MSAM) for the laughter detection task. We obtain features of each joint dialogue turn using a multimodal self-attention network. We then obtain the final feature vector using a sequential network and feed the resulting vector to the binary classifier.
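The data flow in this caption (per-turn joint features, self-attention, a sequential network, a binary classifier) can be illustrated with a small dependency-free sketch. It is not the MSAM implementation. It assumes the joint turn feature is a plain concatenation of text and visual features, uses self-attention with identity projections, replaces the sequential network with a simple tanh recurrence, and uses a single logistic unit as the classifier. All function names and parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(turns):
    """Scaled dot-product self-attention with identity projections (assumption)."""
    dim = len(turns[0])
    attended = []
    for query in turns:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in turns]
        weights = softmax(scores)
        attended.append([sum(w * turn[i] for w, turn in zip(weights, turns))
                         for i in range(dim)])
    return attended


def sequential_summary(turns, decay=0.5):
    """Stand-in for the sequential network: a tanh recurrence over turns, in order."""
    state = [0.0] * len(turns[0])
    for turn in turns:
        state = [math.tanh(decay * s + (1.0 - decay) * x)
                 for s, x in zip(state, turn)]
    return state


def humor_probability(text_feats, visual_feats, weights, bias=0.0):
    """Concatenate per-turn features, attend, summarize, apply a logistic unit."""
    # Assumption: the joint turn feature is the text vector followed by the visual vector.
    joint = [t + v for t, v in zip(text_feats, visual_feats)]
    summary = sequential_summary(self_attention(joint))
    logit = sum(w * s for w, s in zip(weights, summary)) + bias
    return 1.0 / (1.0 + math.exp(-logit))
```

A trained model would learn the projection, recurrence, and classifier weights. Here they are fixed or passed in by hand, so the sketch only shows the order of operations.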

Randomly sampled results (MSAM model) for each prediction category, (correct/incorrect) x (humor/non-humor). E.g., a Humor label in a red box means the ground truth label was non-humor but the predicted label was humor.
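The four prediction categories can be made concrete with a short sketch. It assumes labels are given as the strings "humor" and "non-humor", and that the displayed label is the predicted one, as in the example above. The function names are hypothetical.

```python
def prediction_category(truth, predicted):
    """Return (correct/incorrect, predicted label) for one sample."""
    status = "correct" if truth == predicted else "incorrect"
    return (status, predicted)


def bucket_predictions(truths, predictions):
    """Group sample indices by their prediction category."""
    buckets = {}
    for index, (truth, predicted) in enumerate(zip(truths, predictions)):
        key = prediction_category(truth, predicted)
        buckets.setdefault(key, []).append(index)
    return buckets
```

Under this convention, `("incorrect", "humor")` corresponds to a Humor label in a red box: the ground truth was non-humor but the model predicted humor. Sampling at random from each bucket would give one example set per category.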

The left column shows visualization of attention at the word level and the right column shows attention visualization at turn level.

Fusion Models

Text based Fusion Model (TFM)	Video based Fusion Model (VFM)

Attention Models

Text based Attention Model (TAM)	Video based Attention Model (VAM)