A Hybrid Deep Learning Approach for Image Captioning
Keywords:
Image captioning, Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM).

Abstract
The Image Caption Generator introduces an
innovative approach to automatically describing image
content by seamlessly integrating computer vision and
natural language processing (NLP). Leveraging recent
advancements in neural networks, NLP, and computer
vision, the model combines Convolutional Neural
Networks (CNNs), specifically the pre-trained Xception
model, for precise image feature extraction, with
Recurrent Neural Networks (RNNs), including Long
Short-Term Memory (LSTM) cells, for coherent
sentence generation.
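As an illustration of the feature-extraction step, a minimal sketch using the publicly documented tf.keras Xception API is given below; the pooling choice and image size follow the public documentation, not necessarily the authors' exact pipeline.

```python
# Minimal feature-extraction sketch, assuming the tf.keras Xception API;
# configuration details are assumptions, not the authors' exact code.
import numpy as np
from tensorflow.keras.applications.xception import Xception, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array

# Drop the ImageNet classification head; global average pooling yields a
# single 2048-dimensional feature vector per image.
feature_extractor = Xception(include_top=False, pooling="avg")

def extract_features(image_path):
    image = load_img(image_path, target_size=(299, 299))  # Xception's input size
    array = preprocess_input(np.expand_dims(img_to_array(image), axis=0))
    return feature_extractor.predict(array)  # shape: (1, 2048)
```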
Enhanced by the incorporation of a Beam Search algorithm and an Attention mechanism,
the model significantly improves the accuracy and
relevance of generated captions by dynamically
focusing on different parts of the image and exploring
multiple caption sequences.
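The beam search can be sketched generically as follows; `log_probs`, which scores the next token given a partial caption, is a hypothetical stand-in for the trained decoder rather than the paper's actual interface.

```python
import heapq
import math

def beam_search(log_probs, beam_width=3, max_len=20,
                start="<start>", end="<end>"):
    # Each beam is (cumulative log-probability, token sequence).
    beams = [(0.0, [start])]
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == end:              # finished captions carry over unchanged
                candidates.append((score, seq))
                continue
            for token, lp in log_probs(seq).items():
                candidates.append((score + lp, seq + [token]))
        # Keep only the top-k candidates by cumulative log-probability.
        beams = heapq.nlargest(beam_width, candidates, key=lambda b: b[0])
        if all(seq[-1] == end for _, seq in beams):
            break
    return max(beams, key=lambda b: b[0])[1]

# Toy step function for demonstration only.
def toy_log_probs(seq):
    return {"dog": math.log(0.3), "runs": math.log(0.2), "<end>": math.log(0.5)}

print(beam_search(toy_log_probs, beam_width=2, max_len=5))
```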
Trained over multiple epochs on the 8,000 images of the Flickr8K dataset, each paired with human-annotated captions, the model achieves a significant reduction in loss.
Additionally, it incorporates a text-to-speech module
using the pyttsx3 library to read the generated captions aloud, enhancing
accessibility for visually impaired individuals or users
who prefer audio output.
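The text-to-speech step reduces to a few calls to the pyttsx3 API; the caption string and speaking rate below are illustrative placeholders.

```python
import pyttsx3

def speak_caption(caption):
    engine = pyttsx3.init()          # selects the platform's default TTS driver
    engine.setProperty("rate", 150)  # words per minute; an illustrative value
    engine.say(caption)
    engine.runAndWait()              # blocks until speech playback finishes

speak_caption("a brown dog runs through the grass")
```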
Evaluation using BLEU score and METEOR metrics confirms the model's
proficiency in producing coherent and contextually
accurate image captions, marking a significant
advancement in image captioning technology.
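For reference, caption quality can be scored with NLTK's BLEU and METEOR implementations as sketched below; the reference and candidate captions are illustrative, not results from the paper.

```python
# Requires: nltk, plus nltk.download("wordnet") for METEOR.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# Pre-tokenized human references and one model-generated candidate (illustrative).
references = [["a", "dog", "runs", "through", "the", "grass"],
              ["a", "brown", "dog", "is", "running", "outside"]]
candidate = ["a", "brown", "dog", "runs", "through", "the", "grass"]

smooth = SmoothingFunction().method1  # avoids zero scores on short captions
bleu = sentence_bleu(references, candidate, smoothing_function=smooth)
meteor = meteor_score(references, candidate)
print(f"BLEU: {bleu:.3f}  METEOR: {meteor:.3f}")
```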