A Semantic Weight Adaptive Model Based on Visual Question Answering
DOI: https://doi.org/10.63665/geq0hr62

Keywords
Flask, Multilingual Visual Question Answering (VQA), BLIP, Deep Learning, Natural Language Processing (NLP), Vision Transformer (ViT), Transformer Models, Video Question Answering, Image Understanding, Multilingual Translation, Web Application, Indian Languages, Real-Time AI Systems.

Abstract
This paper presents a multilingual Visual Question Answering (VQA) web application built on the Flask framework, integrating deep learning and Natural Language Processing (NLP) techniques. The system uses the transformer-based BLIP model, specifically the Salesforce/blip-vqa-base checkpoint, to answer user questions about uploaded images and short videos. BLIP pairs a Vision Transformer (ViT) encoder, which extracts semantic visual features, with a transformer-based text decoder that generates contextually accurate answers. Unlike traditional CNN-LSTM VQA systems, the proposed model performs joint multimodal learning, improving its understanding of both visual content and natural-language queries.
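The core inference step can be illustrated with a short sketch using the Hugging Face transformers library, which hosts the Salesforce/blip-vqa-base checkpoint named above; the image path and question below are illustrative placeholders rather than details taken from the system.

    # Minimal single-image VQA sketch with BLIP (Salesforce/blip-vqa-base).
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForQuestionAnswering

    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    image = Image.open("uploaded.jpg").convert("RGB")  # placeholder upload
    question = "What color is the car?"                # placeholder question

    # The processor tokenizes the question and prepares image pixel values.
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs)
    print(processor.decode(output_ids[0], skip_special_tokens=True))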
A key feature of the proposed system is its multilingual capability: users can interact with the application in several Indian languages, including Hindi, Telugu, Tamil, and Kannada. To support this, the system integrates a translation module that converts each user question into English before it is processed by the VQA model and then translates the generated answer back into the user's preferred language. Although the current implementation uses a simplified mock translation mechanism, the architecture is designed for future integration with advanced Neural Machine Translation (NMT) systems such as IndicTrans2.
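The translate-answer-translate loop can be sketched as follows. Here mock_translate is a hypothetical stand-in for the simplified mock translator (an identity mapping in this sketch), and processor and model are the BLIP objects loaded in the previous sketch; a production deployment would replace mock_translate with calls to an NMT system such as IndicTrans2.

    def mock_translate(text: str, src: str, tgt: str) -> str:
        # Hypothetical placeholder for the simplified mock translator:
        # the text is returned unchanged. A real deployment would invoke
        # an NMT model such as IndicTrans2 here.
        return text

    def answer_multilingual(image, question: str, user_lang: str) -> str:
        # 1. Convert the user's question into English for the VQA model.
        question_en = mock_translate(question, src=user_lang, tgt="en")
        # 2. Run BLIP VQA (processor/model loaded in the previous sketch).
        inputs = processor(image, question_en, return_tensors="pt")
        output_ids = model.generate(**inputs)
        answer_en = processor.decode(output_ids[0], skip_special_tokens=True)
        # 3. Translate the answer back into the user's preferred language.
        return mock_translate(answer_en, src="en", tgt=user_lang)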
The application also supports short video inputs by extracting keyframes from uploaded videos and generating context-aware responses from their visual content. Beam search decoding is employed during answer generation to produce coherent, grammatically well-formed, high-probability responses. In addition, the system incorporates secure file-upload validation and real-time processing within a scalable web environment.
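A sketch of these three mechanisms is given below, under stated assumptions: OpenCV (cv2) is assumed for frame sampling, the fixed-interval keyframe policy and the beam width of 5 are illustrative choices not specified above, and upload validation follows the common Flask pattern of an extension allowlist plus werkzeug's secure_filename. The processor and model objects again come from the first sketch.

    import os
    import cv2                                  # assumed frame-extraction library
    from PIL import Image
    from werkzeug.utils import secure_filename  # standard Flask upload hygiene

    ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "mp4"}  # illustrative allowlist

    def save_upload(file_storage, upload_dir: str = "uploads") -> str:
        # Sanitize the client-supplied name and enforce the extension allowlist.
        name = secure_filename(file_storage.filename)
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext not in ALLOWED_EXTENSIONS:
            raise ValueError("unsupported file type")
        path = os.path.join(upload_dir, name)
        file_storage.save(path)
        return path

    def extract_keyframes(video_path: str, every_n: int = 30) -> list:
        # Simple keyframe policy: sample one frame every `every_n` frames.
        frames, idx = [], 0
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                # OpenCV yields BGR arrays; convert to RGB for the BLIP processor.
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            idx += 1
        cap.release()
        return frames

    def answer_from_video(video_path: str, question_en: str) -> str:
        frame = extract_keyframes(video_path)[0]  # e.g., answer from the first keyframe
        inputs = processor(Image.fromarray(frame), question_en, return_tensors="pt")
        # Beam search keeps the five highest-probability candidate answers.
        output_ids = model.generate(**inputs, num_beams=5, early_stopping=True)
        return processor.decode(output_ids[0], skip_special_tokens=True)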
The proposed multilingual VQA system demonstrates the effectiveness of transformer-based multimodal AI models in creating accessible, intelligent, and interactive applications. The system has potential applications in education, assistive technologies, healthcare support, surveillance, and smart human-computer interaction systems, while also providing a foundation for future advancements in multilingual and multimodal artificial intelligence research.
References
1) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” in Proceedings of the International Conference on Machine Learning (ICML), 2022.
2) Ashish Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
3) Alexey Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in International Conference on Learning Representations (ICLR), 2021.
4) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
5) Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015.
6) Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
7) Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian, “Deep Modular Co-Attention Networks for Visual Question Answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
8) Jay Gala, Pranjal A. Chitale, Raghavan AK, et al., “IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for All 22 Scheduled Indian Languages,” Transactions on Machine Learning Research (TMLR), 2023.
9) Alec Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021.
10) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, “Show and Tell: A Neural Image Caption Generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
