Visually Precise Query
Abstract:
Training pairs for the proposed visually precise query (VPQ) generation task. The image and its title in the blue box are the inputs to the system. The outputs are the four keywords shown in the search box, which are drawn from the title itself. The acceptance criterion checks whether these words, when used as a search query, are able to retrieve the original image.
We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product descriptions. Given an image of a fashion item, what is the optimal search query that will retrieve the exact same or closely related product(s) with high probability? In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and produces a word-level extractive summary of the title, containing a list of salient attributes, which can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset that was created for a different task. Given an image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We provide a dataset of around 350K image, title and corresponding VPQ entries, which we will release to the research community. We give a detailed description of the data collection process and discuss future directions of research for the problem introduced in this work. We report standard text-only and visual-only baselines, as well as multi-modal baseline models, to analyze the proposed task. Finally, we propose a hybrid fusion model which we believe points to a promising direction of research for the multi-modal community.
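To make the task formulation concrete, the following is a minimal Python sketch (not the authors' released code) of how a VPQ target can be encoded: since the query is a non-contiguous subset of the title's words, it reduces to a binary label per title token. The example title and keywords below are hypothetical and only illustrate the encoding.

def encode_vpq(title, query_words):
    """Return a 0/1 label per title token: 1 if the token belongs to the VPQ."""
    query_set = {w.lower() for w in query_words}
    return [1 if tok.lower() in query_set else 0 for tok in title.split()]

def decode_vpq(title, labels):
    """Recover the query string by keeping the labelled tokens in title order."""
    return " ".join(tok for tok, keep in zip(title.split(), labels) if keep)

# Hypothetical example pair (illustrative, not taken from the released dataset):
title = "Floral printed sleeveless cotton maxi dress with belt"
labels = encode_vpq(title, ["floral", "sleeveless", "maxi", "dress"])
print(labels)                     # [1, 0, 1, 0, 1, 1, 0, 0]
print(decode_vpq(title, labels))  # Floral sleeveless maxi dress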
VPQ Data Collection:
The overall data generation pipeline. The dataset is split into two halves, and each half is turned into two uni-modal ANN indexes, one over text and one over images. For each item in one half, nearest neighbors based on both text and images are collected from the indexes built for the other half. Tag-based filtering then reduces the number of query candidates. Finally, the candidate queries are issued to a search engine and the results are analyzed to select the final VPQs.
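The cross-split matching step can be approximated as in the sketch below. This is a hedged sketch, assuming precomputed text and image embeddings (text_emb_A, text_emb_B, img_emb_A, img_emb_B are hypothetical N x d arrays; the actual feature extractors and ANN library used in the pipeline are not specified here), and it uses scikit-learn's exact NearestNeighbors as a stand-in for a true ANN index.

from sklearn.neighbors import NearestNeighbors

def build_index(embeddings, k=10):
    """Fit a uni-modal neighbor index (exact stand-in for an ANN index)."""
    return NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)

def cross_neighbors(query_emb, index):
    """For each item in one half, fetch neighbors from the index built on the other half."""
    _, idx = index.kneighbors(query_emb)
    return idx  # shape (n_queries, k): indices into the other half

# Build text and image indexes over half B, then query them with half A:
# text_nn_A = cross_neighbors(text_emb_A, build_index(text_emb_B))
# img_nn_A  = cross_neighbors(img_emb_A,  build_index(img_emb_B))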
Attribute distribution
Attribute-level distributions for the dress class.
Sub-category distribution
Sub-category distribution for the dress class.
From Dress catalogue to query dataset
Query candidate selection process. For the target image and title shown in blue, four possible short queries are shown (in red) with their retrieved results.
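The final acceptance check, in which a candidate query is kept only if its search results contain the original product, can be expressed as a short sketch. Here search(query) is a hypothetical wrapper around the search engine used in the pipeline, assumed to return a ranked list of product ids, and the top_k cutoff is an assumed parameter.

def is_valid_vpq(candidate_query, original_id, search, top_k=20):
    """Accept the candidate query only if the original product appears in the top-k results."""
    results = search(candidate_query)[:top_k]
    return original_id in results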
BibTex
@inproceedings{Dasgupta2020visually,
title={Visually Precise Query},
author={Dasgupta, Riddhiman and Tom, Francis and Kumar, Sudhir and Das Gupta, Mithun and Kumar, Yokesh and Patro, Badri N. and Namboodiri, Vinay },
booktitle={Proceedings of the 28th ACM International Conference on Multimedia (MM '20)},
year={2020}
}