Delta Lab

AUTO QA: The Question Is Not Only What, but Also Where

Sumit Kumar, Badri N. Patro, Vinay P. Namboodiri

[Paper] [Source Code]



Abstract:

Visual Question Answering can be a functionally relevant task if purposed as such. In this paper, we investigate and evaluate its efficacy for localization-based question answering, specifically in the context of autonomous driving, where this functionality is important. To this end, we provide a new dataset, Auto-QA. Built over the Argoverse dataset, it offers a truly multi-modal setting, with seven views per frame and LIDAR point-cloud data available for answering each localization-based question. We contribute localized-attention adaptations of the most popular VQA baselines and evaluate them on this task, and we also provide joint point-cloud and image-based baselines that perform well. In addition, we analyse whether the attention modules of the image-based VQA baselines attend to the correct regions. To summarize, through this work we thoroughly analyze localization abilities through visual question answering for autonomous driving and provide a new benchmark task for the same. Our best joint baseline model achieves a useful 74.8% accuracy on this task.

| Question                                        | Question_Type | Image_To_Attend | Answer     |
| Is there a pedestrian to my side right?         | Exist         | 3               | True       |
| What is the count of vehicle on my rear right?  | Count         | 4               | 1          |
| What is the nearby object to my side right?     | Closest       | 3               | Pedestrian |
| Is there any on road obstacle to my front?      | Exist         | 1               | False      |

An instance of Autonomous Question Answering: (a) an image scene consisting of seven images corresponding to seven different directions, (b) the 3D point cloud from the LIDAR sensor for the corresponding image scene, together with a variety of generated questions.


DATASET:

Download

The dataset is built over the Argoverse dataset. It consists of 113 logs with multi-view images and LIDAR point clouds. To download the training-set LIDAR point clouds and images from Argoverse, follow this link and download Argoverse tracking training logs 1, 2, 3 and 4 from here

For the test-set LIDAR point clouds and images, download the Argoverse tracking validation logs from here
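After extracting the downloaded archives, a quick sanity check of the data layout can be done with a few lines of Python. The sketch below is a minimal example and makes some assumptions: the root path data/argoverse-tracking is a placeholder, and the split names (train1..train4, val), the per-log lidar/ folder of .ply sweeps, and the ring_* camera folders follow the standard Argoverse tracking layout.

```python
from pathlib import Path

# Root of the extracted Argoverse tracking data (assumed path; adjust as needed).
ARGOVERSE_ROOT = Path("data/argoverse-tracking")

# Argoverse tracking splits: four training shards and one validation shard.
SPLITS = ["train1", "train2", "train3", "train4", "val"]

for split in SPLITS:
    split_dir = ARGOVERSE_ROOT / split
    if not split_dir.is_dir():
        print(f"{split}: not downloaded yet")
        continue
    for log_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        # Each log holds a lidar/ folder of .ply sweeps and one folder per ring camera (seven views).
        n_sweeps = len(list((log_dir / "lidar").glob("*.ply")))
        n_views = sum(1 for c in log_dir.iterdir() if c.is_dir() and c.name.startswith("ring_"))
        print(f"{split}/{log_dir.name}: {n_sweeps} LIDAR sweeps, {n_views} camera views")
```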


Code

See our code on GITHUB. There we provide code for scene generation and question generation. You can also test our baseline models built on the individual modalities (image and LIDAR) as well as the combined models that use both. See our GitHub repo for further instructions.


Question JSON format:

Generated questions are stored in JSON format. The structure of a question file is shown below.
Please refer to our GitHub page for the question generation process.

{
"info" : info,
"questions" : [question],
}

info {
"version" : str,
"split" : str,
"date_created" : datetime
}

question {
"question_family_index" : int,
"question_index" : int,
"lidar_index" : int,
"program" : list,
"split" : str,
"template_filename" : str,
"answer" : str,
"video" : int, #unique id of log from which question is generated
"question" : str
}

A sample question JSON file can be downloaded from here
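As a quick illustration, a question file with the structure given above can be loaded with the standard json module. The sketch below is only an example; the filename questions.json is a placeholder for whichever file you download or generate.

```python
import json
from collections import Counter

# Load a question file with the structure described above
# ("questions.json" is a placeholder filename).
with open("questions.json") as f:
    data = json.load(f)

print("version:", data["info"]["version"], "| split:", data["info"]["split"])

# Count questions per log ("video" is the unique id of the source log).
per_log = Counter(q["video"] for q in data["questions"])
print("questions in total:", len(data["questions"]))
print("logs covered:", len(per_log))

# Inspect one question record.
q = data["questions"][0]
print(q["question"], "->", q["answer"], "(lidar_index:", q["lidar_index"], ")")
```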

Some examples of attention visualization using our baseline models:

Q: Is there a vehicle to my side right? Ans: True
Q: What is the closest object on side left? Ans: Pedestrian

Some results from the dataset:

Sample Results