Delta Lab

Probabilistic framework for solving Visual Dialog

Badri N. Patro, Anupriy, Vinay P. Namboodiri

[ArXiv]



Abstract:

l0

The proposed Probabilistic Diversity and Uncertainty Network (PDUN) consists of three parts: a) the Probabilistic Representation Module, which encodes the image feature together with the question and history features in an attentive manner; b) the Diversity Module, which captures diversity and generates diverse answers using a Variational Auto-Encoder; and c) the Uncertainty Module, which predicts the uncertainty of the network.

In this paper, we propose a probabilistic framework for solving the task of "Visual Dialog". Solving this task requires reasoning over and understanding of the visual modality, the language modality, and common-sense knowledge. Various architectures have been proposed to solve this task using variants of multi-modal deep learning techniques that combine visual and language representations. However, we believe that it is crucial to understand and analyze the sources of uncertainty in solving this task. Our approach allows for estimating uncertainty and also aids the generation of diverse answers. The proposed approach consists of a probabilistic representation module that provides representations for the image, question and conversation history; a module that ensures that diverse latent representations for candidate answers are obtained given the probabilistic representations; and an uncertainty representation module that chooses the appropriate answer that minimizes uncertainty. We thoroughly evaluate the model with a detailed ablation analysis, a comparison with the state of the art, and a visualization of the uncertainty that aids in understanding the method. Using the proposed probabilistic framework, we thus obtain an improved visual dialog system that is also more explainable.

l2

The results show that the certainty of the correct class increases from the baseline model [9] to our proposed uncertainty model (PDUN). In this figure, we show the top-2 class confidence scores for the question "Is this a park?". The baseline model focuses on the woman, guitar and chair and incorrectly predicts "No", failing to determine whether the scene is a park. The PDUN model minimizes the uncertainty and predicts the correct answer, "Yes", with a high confidence score.


PDUN Model:

l3

Probabilistic Diversity Uncertainty Network (PDUN): a Bayesian CNN/LSTM is used to obtain the embeddings $g_i, f_i, h_i$, which are then fused by the Fusion Module to get $e_f$. The correlation between the fused embedding and each answer option embedding is then computed. Finally, the variance and logits are obtained using an MLP and passed to the Logits Reparameterization Trick (LRT) to get the final softmax output.
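The last step of the caption above (logits and variance fed through a reparameterization step before the softmax) can be illustrated with a minimal NumPy sketch. It assumes, as is common for aleatoric uncertainty estimation, that the MLP predicts one variance per answer option and that Gaussian noise scaled by the predicted standard deviation is added to the logits before averaging the softmax over several draws. The function names, the number of samples and the toy values are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def reparameterized_softmax(logits, variance, num_samples=50, rng=None):
    # Assumption: one predicted variance per answer option.
    rng = np.random.default_rng(0) if rng is None else rng
    std = np.sqrt(variance)
    # Draw standard normal noise of shape (num_samples, num_options).
    eps = rng.standard_normal((num_samples,) + logits.shape)
    # Reparameterization: noisy logits = mean + std * eps.
    noisy_logits = logits + std * eps
    # Average the softmax over all noisy draws.
    return softmax(noisy_logits, axis=-1).mean(axis=0)

# Toy example with three answer options (values are made up).
logits = np.array([2.0, 0.5, -1.0])
variance = np.array([0.1, 0.1, 0.1])
probs = reparameterized_softmax(logits, variance)
```

With zero variance the function reduces to a plain softmax over the logits, so the predicted variance controls how much the final distribution is smoothed.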


RUAM Model:

l4

Reverse Uncertainty based Attention Map (RUAM): We obtain the attention embedding $f_i$ from the attention network $G_f$ using the image, question and history embeddings $g_i, g_q, g_h$. We then classify into the answer class and obtain the uncertainty present in the data. Finally, we obtain the reverse uncertainty map, which is combined with the attention map to get better confidence on the attention map, as shown in the figure.
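The caption does not state how the reverse uncertainty map and the attention map are combined. As one plausible illustration only, the sketch below assumes the uncertainty map is rescaled to [0, 1], reversed into a certainty map, multiplied element-wise with the attention map, and renormalized. The function name, the combination rule and the toy arrays are all hypothetical.

```python
import numpy as np

def reverse_uncertainty_attention(attention, uncertainty, eps=1e-8):
    # Rescale the uncertainty map to the range [0, 1].
    u = uncertainty - uncertainty.min()
    u = u / (u.max() + eps)
    # Reverse it: regions with high uncertainty get low certainty.
    certainty = 1.0 - u
    # Assumed combination rule: element-wise product, then renormalize.
    combined = attention * certainty
    return combined / (combined.sum() + eps)

# Toy 2x2 spatial maps (values are made up).
attention = np.array([[0.4, 0.1],
                      [0.3, 0.2]])
uncertainty = np.array([[0.9, 0.1],
                        [0.2, 0.5]])
refined = reverse_uncertainty_attention(attention, uncertainty)
```

In this toy example the top-left region has the highest attention but also the highest uncertainty, so its weight is suppressed and the bottom-left region becomes the most attended one.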

Some examples of visual dialog using our method:


Results:

l5

The figure shows the difference between the aleatoric dialog results and the baseline dialog results. The first row shows the Grad-CAM visualization of the first example for the baseline visual dialog model, and the second row shows the Grad-CAM visualization of the same example for the aleatoric visual dialog model; the same scheme is followed for the next two rows. The first column shows the target image and its caption, and the remaining columns visualize the dialog rounds from round 1 to round 10.

l6


MC-Sampling for a particular turn of a particular example:

l7

We visualize multiple outputs from the Bayesian neural network. We draw 100 samples from the posterior distribution of the dialog model for a particular image and a particular question. The visualization shows how the Grad-CAM map varies across these samples for that image and question.
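The sampling procedure described above can be sketched as follows. The page does not say how posterior samples are drawn; this example assumes Monte Carlo dropout, a common approximation for Bayesian neural networks, applied to a toy linear classifier. The function `mc_predict`, the dropout rate and the random features and weights are hypothetical stand-ins for the actual dialog model.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def mc_predict(features, weights, num_samples=100, drop_prob=0.5, rng=None):
    # Assumption: posterior samples are approximated with MC dropout.
    rng = np.random.default_rng(0) if rng is None else rng
    samples = []
    for _ in range(num_samples):
        # A fresh dropout mask per pass gives one stochastic prediction.
        mask = rng.random(features.shape) >= drop_prob
        dropped = features * mask / (1.0 - drop_prob)
        samples.append(softmax(dropped @ weights))
    samples = np.stack(samples)            # shape: (num_samples, num_classes)
    mean_probs = samples.mean(axis=0)      # predictive mean
    # Entropy of the predictive mean as a simple uncertainty measure.
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return samples, mean_probs, entropy

# Toy input: an 8-dimensional feature vector and 3 answer classes.
rng = np.random.default_rng(1)
features = rng.standard_normal(8)
weights = rng.standard_normal((8, 3))
samples, mean_probs, entropy = mc_predict(features, weights, num_samples=100)
```

Each row of `samples` corresponds to one stochastic forward pass; in the setting above, a Grad-CAM map would be computed for each such pass to see how the explanation changes from sample to sample.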


MC-Sampling for a particular turn of a particular example: