Emotion Recognition from Video, Audio, and Text

Authors

  • Ms. G. Srilakshmi, Associate Professor, Department of ECE, Bhoj Reddy Engineering College for Women, India
  • Putta Meghana, B.Tech Student, Department of ECE, Bhoj Reddy Engineering College for Women, India
  • Chakali Mythri, B.Tech Student, Department of ECE, Bhoj Reddy Engineering College for Women, India
  • Kattukuri Samhitha, B.Tech Student, Department of ECE, Bhoj Reddy Engineering College for Women, India

Abstract

Emotion detection from video, audio, and text has become a key focus in artificial intelligence and human-computer interaction. As digital communication increasingly involves multiple modalities, accurately interpreting human emotions is essential for enhancing user experience, supporting mental health diagnostics, and advancing affective computing. This paper provides a comprehensive overview of emotion recognition methods across video, audio, and text, exploring the unique contributions of each modality and their combined potential in multimodal systems.

The analysis begins by examining how each modality contributes to emotion detection. Video leverages facial expressions, gestures, and body language using computer vision, while audio focuses on vocal traits such as pitch and tone through signal processing and machine learning. Text-based emotion detection uses natural language processing to interpret sentiment and context from written content. When integrated, these modalities create more accurate and resilient emotion recognition systems that mirror the complexity of human emotion.
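To make the integration idea concrete, here is a minimal, hypothetical sketch of decision-level (late) fusion: each modality's classifier outputs a probability distribution over a shared set of emotion labels, and a weighted average produces the final prediction. The label set and fusion weights are illustrative assumptions, not values taken from the paper.

```python
# Illustrative late-fusion sketch (assumed labels and weights, not from the paper).
# Each modality classifier is presumed to output a probability distribution
# over the same emotion labels; a weighted average combines them.

EMOTIONS = ["happy", "sad", "angry", "neutral"]  # assumed label set

def fuse(video_probs, audio_probs, text_probs, weights=(0.4, 0.3, 0.3)):
    """Weighted average of the per-modality emotion distributions,
    renormalised so the result sums to 1."""
    fused = []
    for i in range(len(EMOTIONS)):
        score = (weights[0] * video_probs[i]
                 + weights[1] * audio_probs[i]
                 + weights[2] * text_probs[i])
        fused.append(score)
    total = sum(fused)
    return [s / total for s in fused]

def predict(video_probs, audio_probs, text_probs):
    """Return the emotion label with the highest fused probability."""
    fused = fuse(video_probs, audio_probs, text_probs)
    return EMOTIONS[fused.index(max(fused))]
```

For example, if all three modalities lean toward the first label, `predict([0.7, 0.1, 0.1, 0.1], [0.5, 0.2, 0.2, 0.1], [0.6, 0.2, 0.1, 0.1])` returns `"happy"`. Real systems often learn these weights, or fuse at the feature level instead.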
The paper also explores key challenges, including data synchronization, multimodal feature extraction, and the scarcity of diverse, annotated datasets. Advances in machine learning, especially deep learning techniques such as transformers and attention mechanisms, have significantly improved emotion detection performance. Potential applications include mental health monitoring, customer service, education, and interactive entertainment. The paper concludes by highlighting future research needs, including ethical considerations, generalizable models, and real-time emotion recognition, aiming to create AI systems that better understand and respond to human emotions.
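The attention mechanisms mentioned above can be sketched in miniature: a query vector is scored against key vectors, a softmax turns the scores into weights, and the weights mix the value vectors. The plain-Python sketch below shows generic scaled dot-product attention; it is an illustration of the general technique, not code from any system cited here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.
    Scores the query against each key, softmaxes the scores,
    and returns the weighted sum of the values plus the weights."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for j in range(len(v)):
            out[j] += w * v[j]
    return out, weights
```

In a multimodal setting, the query might come from one modality (say, text) and the keys/values from another (say, audio frames), letting the model focus on the moments most relevant to the words being spoken.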

Downloads

Download data is not yet available.

References

[1] Y. Zhang, J. Li, and H. Wang, "Hybrid LSTM-Attention and CNN Model for Enhanced Speech Emotion Recognition," Applied Sciences, vol. 14, no. 23, 2024.

[2] A. Kumar, S. Bhattacharya, and M. Singh, "EMERSK: Explainable Multimodal Emotion Recognition with Situational Knowledge," arXiv preprint arXiv:2306.08657, 2023.

[3] L. Chen, W. Zhou, and Q. Wu, "Recursive Joint Attention for Audio-Visual Fusion in Regression-Based Emotion Recognition," arXiv preprint arXiv:2304.07958, 2023.

[4] M. Patel, R. Sharma, and V. Gupta, "CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition," arXiv preprint arXiv:2307.15432, 2023.

[5] S. Lee, K. Park, and J. Kim, "An Ensemble 1D CNN-LSTM-GRU Model with Data Augmentation for Speech Emotion Recognition," Expert Systems with Applications, vol. 214, 2023.

[6] H. Zhang, X. Wang, and Y. Liu, "Multimodal Emotion Recognition Using Deep Learning: A Survey," IEEE Transactions on Affective Computing, vol. 14, no. 2, pp. 345–360, 2023.

[7] J. S. Park and M. Kim, "Deep Learning-Based Multimodal Emotion Recognition: A Review," Sensors, vol. 22, no. 4, 2022.

[8] R. Das, A. Dey, and S. Mukherjee, "A CNN-LSTM Based Framework for Multimodal Emotion Recognition," in Proc. IEEE Int. Conf. on Multimedia & Expo Workshops, 2023, pp. 1–6.

[9] M. Chen and Y. Zhang, "Random Forest Based Multimodal Fusion for Emotion Recognition," Journal of Ambient Intelligence and Humanized Computing, vol. 13, no. 5, pp. 2567–2578, 2022.

Published

2025-06-18

Issue

Section

Articles

How to Cite

Emotion Recognition from Video Audio and Text. (2025). International Journal of Multidisciplinary Engineering In Current Research, 10(6), 295-311. https://ijmec.com/index.php/multidisciplinary/article/view/809