Visual Question Answering with Multimodal Models
Abstract
This work addresses the open-ended, free-form task of Visual Question Answering (VQA). Given an image and a natural language question about that image, the goal is to produce an appropriate answer in natural language. Questions and answers are kept open-ended to reflect real-world use cases, such as assisting visually impaired users. Visual questions selectively target different aspects of an image, such as the underlying context and background details. Because of this, a system that excels at visual question answering typically requires a deeper understanding of the image and more advanced reasoning than a system that generates generic image descriptions. Furthermore, since many open-ended answers consist of only a few words, or can be drawn from a restricted set of candidates presented in a multiple-choice format, VQA lends itself to automatic evaluation. In this project, the RoBERTa model is used to extract features from the questions and answers, and the BEiT model is used to extract features from the images. The two feature sets are fused in a multimodal model that answers a given question about a given image. The multimodal model is trained on the VQA 2.0 dataset.
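
The following is a minimal sketch of the described fusion architecture, assuming the Hugging Face transformers library and PyTorch. The checkpoint names, the answer-vocabulary size of 3,129 (a common choice for VQA 2.0 classification heads), and the concatenation-based fusion head are illustrative assumptions, not design decisions specified by this abstract.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoImageProcessor, RobertaModel, BeitModel

class VQAFusionModel(nn.Module):
    def __init__(self, num_answers=3129):  # assumed answer-vocabulary size for VQA 2.0
        super().__init__()
        # RoBERTa encodes the question text.
        self.text_encoder = RobertaModel.from_pretrained("roberta-base")
        # BEiT encodes the image.
        self.image_encoder = BeitModel.from_pretrained("microsoft/beit-base-patch16-224")
        text_dim = self.text_encoder.config.hidden_size    # 768
        image_dim = self.image_encoder.config.hidden_size  # 768
        # Simple late fusion: concatenate pooled features, then classify
        # over the fixed answer vocabulary.
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 1024),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(1024, num_answers),
        )

    def forward(self, input_ids, attention_mask, pixel_values):
        text_out = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        image_out = self.image_encoder(pixel_values=pixel_values)
        # Use the first-token (CLS-position) hidden state from each encoder.
        text_feat = text_out.last_hidden_state[:, 0]
        image_feat = image_out.last_hidden_state[:, 0]
        fused = torch.cat([text_feat, image_feat], dim=-1)
        return self.classifier(fused)  # logits over the answer vocabulary

Usage would follow the standard tokenizer/image-processor pattern; the question text and image below are placeholders:

from PIL import Image

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
processor = AutoImageProcessor.from_pretrained("microsoft/beit-base-patch16-224")
model = VQAFusionModel()

question = tokenizer("What color is the umbrella?", return_tensors="pt")
image = processor(images=Image.new("RGB", (224, 224)), return_tensors="pt")
logits = model(question.input_ids, question.attention_mask, image.pixel_values)
predicted_answer_index = logits.argmax(dim=-1)

Training on VQA 2.0 would then treat answering as classification over the fixed answer set, with a cross-entropy or soft-label loss over these logits.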