AUTO QA: The Question Is Not Only What, but Also Where
Sumit Kumar, Badri N. Patro, Vinay P. Namboodiri
Abstract:
Visual Question Answering can be a functionally relevant task if purposed as such. In this paper, we investigate and evaluate its efficacy for localization-based question answering, specifically in the context of autonomous driving, where this functionality is important. To this end, we provide a new dataset, Auto-QA. Built over the Argoverse dataset, it offers a truly multi-modal setting, with seven views per frame and LIDAR point-cloud data available for answering each localization-based question. We contribute localized-attention adaptations of the most popular VQA baselines and evaluate them on this task, and we also provide joint point-cloud and image-based baselines that perform well. In addition, we analyze whether the attention modules of the image-based VQA baselines attend to the correct regions. To summarize, through this work we thoroughly analyze localization abilities through visual question answering for autonomous driving and provide a new benchmark task for the same. Our best joint baseline model achieves a useful 74.8% accuracy on this task.
Question | Question_Type | Image_To_Attend | Answer |
---|---|---|---|
Is there a pedestrian to my side right? | Exist | 3 | True |
What is the count of vehicle on my rear right? | Count | 4 | 1 |
What is the nearby object to my side right? | Closest | 3 | Pedestrian |
Is there any on road obstacle to my front? | Exist | 1 | False |
An instance of Autonomous Question Answering: (a) an image scene consisting of seven images corresponding to seven different directions, (b) the 3D point cloud from the LIDAR sensor for the corresponding image scene, and a variety of generated questions.
DATASET:
Download
The dataset is built over the Argoverse dataset. It consists of 113 logs with multi-view images and LIDAR point clouds. To download the training LIDAR point clouds and images from Argoverse, follow this link and download Argoverse tracking training logs 1, 2, 3, and 4 from here.
For the test LIDAR point clouds and images, download the Argoverse tracking validation log from here.
Code
See our code on GitHub: GITHUB. There we provide code for scene generation and question generation. You can also test our baseline models that use the individual modalities, i.e. image and LIDAR, as well as the combined models that use both. See our GitHub repo for more instructions.
Question JSON format:
Generated questions are stored in JSON format. The following is the JSON structure of a question file.
Please refer to our GitHub page for the question generation process.
{
"info" : info,
"questions" : [question],
}
info {
"version" : str,
"split" : str,
"date_created" : datetime
}
question {
"question_family_index" : int,
"question_index" : int,
"lidar_index" : int,
"program" : list,
"split" : str,
"template_filename" : str,
"answer" : str,
"video" : int, #unique id of log from which question is generated
"question" : str
}
A sample question JSON file can be downloaded from here.
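As a quick sanity check, the snippet below shows one way to load and iterate over such a question file using Python's standard json module. It is a minimal sketch: the file name train_questions.json is only a placeholder for whatever file you download or generate, and the printed fields simply follow the schema above.

import json

# Placeholder file name; substitute the question file you downloaded or generated.
with open("train_questions.json", "r") as f:
    data = json.load(f)

# Top-level metadata as described in the "info" block above.
info = data["info"]
print(info["version"], info["split"], info["date_created"])

# Each entry in "questions" follows the question schema above.
for q in data["questions"][:3]:
    print(q["question_index"], q["question"], "->", q["answer"])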
Some examples of attention visualization using our baseline models:
Some results from the dataset: