A Semantic Weight Adaptive Model Based on Visual Question Answering
DOI: https://doi.org/10.63665/geq0hr62

Keywords
Flask, Multilingual Visual Question Answering (VQA), BLIP, Deep Learning, Natural Language Processing (NLP), Vision Transformer (ViT), Transformer Models, Video Question Answering, Image Understanding, Multilingual Translation, Web Application, Indian Languages, Real-Time AI Systems.

Abstract
This paper presents a multilingual Visual Question Answering (VQA) web application built on the Flask framework, integrating deep learning and Natural Language Processing (NLP) techniques. The system uses the transformer-based BLIP model, specifically the Salesforce/blip-vqa-base checkpoint, to answer user questions about uploaded images and short videos. BLIP pairs a Vision Transformer (ViT) encoder, which extracts semantic visual features, with a transformer-based text decoder that generates contextually accurate answers. Unlike traditional CNN-LSTM VQA systems, the proposed model performs joint multimodal learning, improving its understanding of both visual content and natural-language queries.
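The core inference step can be illustrated with a short sketch using the Hugging Face transformers library, which hosts the Salesforce/blip-vqa-base checkpoint named above; the image path and question below are illustrative placeholders rather than details taken from the system.

    # Minimal single-image VQA sketch with BLIP (Salesforce/blip-vqa-base).
    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForQuestionAnswering

    processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
    model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")
    model.eval()

    image = Image.open("uploaded.jpg").convert("RGB")  # placeholder upload
    question = "What color is the car?"                # placeholder question

    # The processor tokenizes the question and prepares image pixel values.
    inputs = processor(image, question, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs)
    print(processor.decode(output_ids[0], skip_special_tokens=True))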
A key feature of the proposed system is its multilingual capability: users can interact with the application in several Indian languages, including Hindi, Telugu, Tamil, and Kannada. To support this, the system integrates a translation module that converts each user question into English before it is processed by the VQA model and then translates the generated answer back into the user's preferred language. Although the current implementation uses a simplified mock translation mechanism, the architecture is designed for future integration with advanced Neural Machine Translation (NMT) systems such as IndicTrans2.
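The translate-answer-translate loop can be sketched as follows. Here mock_translate is a hypothetical stand-in for the simplified mock translator (an identity mapping in this sketch), and processor and model are the BLIP objects loaded in the previous sketch; a production deployment would replace mock_translate with calls to an NMT system such as IndicTrans2.

    def mock_translate(text: str, src: str, tgt: str) -> str:
        # Hypothetical placeholder for the simplified mock translator:
        # the text is returned unchanged. A real deployment would invoke
        # an NMT model such as IndicTrans2 here.
        return text

    def answer_multilingual(image, question: str, user_lang: str) -> str:
        # 1. Convert the user's question into English for the VQA model.
        question_en = mock_translate(question, src=user_lang, tgt="en")
        # 2. Run BLIP VQA (processor/model loaded in the previous sketch).
        inputs = processor(image, question_en, return_tensors="pt")
        output_ids = model.generate(**inputs)
        answer_en = processor.decode(output_ids[0], skip_special_tokens=True)
        # 3. Translate the answer back into the user's preferred language.
        return mock_translate(answer_en, src="en", tgt=user_lang)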
The application also supports short video inputs by extracting keyframes from uploaded videos and generating context-aware responses from their visual content. Beam search decoding is employed during answer generation to produce coherent, grammatically well-formed, high-probability responses. In addition, the system incorporates secure file-upload validation and real-time processing within a scalable web environment.
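A sketch of these three mechanisms is given below, under stated assumptions: OpenCV (cv2) is assumed for frame sampling, the fixed-interval keyframe policy and the beam width of 5 are illustrative choices not specified above, and upload validation follows the common Flask pattern of an extension allowlist plus werkzeug's secure_filename. The processor and model objects again come from the first sketch.

    import os
    import cv2                                  # assumed frame-extraction library
    from PIL import Image
    from werkzeug.utils import secure_filename  # standard Flask upload hygiene

    ALLOWED_EXTENSIONS = {"jpg", "jpeg", "png", "mp4"}  # illustrative allowlist

    def save_upload(file_storage, upload_dir: str = "uploads") -> str:
        # Sanitize the client-supplied name and enforce the extension allowlist.
        name = secure_filename(file_storage.filename)
        ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext not in ALLOWED_EXTENSIONS:
            raise ValueError("unsupported file type")
        path = os.path.join(upload_dir, name)
        file_storage.save(path)
        return path

    def extract_keyframes(video_path: str, every_n: int = 30) -> list:
        # Simple keyframe policy: sample one frame every `every_n` frames.
        frames, idx = [], 0
        cap = cv2.VideoCapture(video_path)
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                # OpenCV yields BGR arrays; convert to RGB for the BLIP processor.
                frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            idx += 1
        cap.release()
        return frames

    def answer_from_video(video_path: str, question_en: str) -> str:
        frame = extract_keyframes(video_path)[0]  # e.g., answer from the first keyframe
        inputs = processor(Image.fromarray(frame), question_en, return_tensors="pt")
        # Beam search keeps the five highest-probability candidate answers.
        output_ids = model.generate(**inputs, num_beams=5, early_stopping=True)
        return processor.decode(output_ids[0], skip_special_tokens=True)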
The proposed multilingual VQA system demonstrates the effectiveness of transformer-based multimodal AI models in creating accessible, intelligent, and interactive applications. The system has potential applications in education, assistive technologies, healthcare support, surveillance, and smart human-computer interaction systems, while also providing a foundation for future advancements in multilingual and multimodal artificial intelligence research.
References
1) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” in Proceedings of the International Conference on Machine Learning (ICML), 2022.
2) Ashish Vaswani et al., “Attention Is All You Need,” in Advances in Neural Information Processing Systems (NeurIPS), 2017.
3) Alexey Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” in International Conference on Learning Representations (ICLR), 2021.
4) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
5) Karen Simonyan and Andrew Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015.
6) Sepp Hochreiter and Jürgen Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
7) Zhou Yu, Jun Yu, Yuhao Cui, Dacheng Tao, and Qi Tian, “Deep Modular Co-Attention Networks for Visual Question Answering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
8) Jay Gala, Pranjal A. Chitale, Raghavan AK, et al., “IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for All 22 Scheduled Indian Languages,” Transactions on Machine Learning Research (TMLR), 2023.
9) Alec Radford et al., “Learning Transferable Visual Models From Natural Language Supervision,” in International Conference on Machine Learning (ICML), 2021.
10) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, “Show and Tell: A Neural Image Caption Generator,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
License
Copyright (c) 2026 Authors

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
