AI-Powered Content Moderation: A Hybrid Defense Against Prompt Injection and Harmful Text

Authors

  • R K Anirudh Kashyap, Jain (Deemed-to-be University)

Keywords

AI ethics, deep learning, LLM, LSTM, prompt injection.

Abstract

With the extensive use of Large Language Models (LLMs) in many digital products, new measures must be taken to secure these models against potentially harmful input and adversarial use. This work introduces a hybrid AI content moderation system designed to protect an LLM from harmful, unethical, or manipulative prompts. The proposed system combines a rule-based layer built on regular expressions with a deep learning classifier based on the Long Short-Term Memory (LSTM) architecture. Together, these two layers allow the moderator to flag explicitly restricted content in real time while also identifying subtler, contextually hidden threats that a static filter would likely miss.
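
As a concrete illustration of the layered design described above, a minimal sketch of such a two-stage pipeline is shown below. The regex patterns, label names, threshold, and the lstm_predict interface are illustrative assumptions, not the authors' exact implementation.

    import re
    from typing import Callable

    # Illustrative restricted-content patterns (assumed; not the paper's actual rule set).
    RESTRICTED_PATTERNS = [
        re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
        re.compile(r"disregard your (system|safety) prompt", re.IGNORECASE),
        re.compile(r"\bjailbreak\b", re.IGNORECASE),
    ]

    def moderate(prompt: str,
                 lstm_predict: Callable[[str], float],
                 threshold: float = 0.5) -> dict:
        """Two-stage check: fast regex rules first, then an LSTM classifier.

        `lstm_predict` is assumed to return the probability that the prompt is
        restricted; any trained binary text classifier could stand in here.
        """
        # Stage 1: the rule-based filter catches explicitly restricted phrasing.
        for pattern in RESTRICTED_PATTERNS:
            if pattern.search(prompt):
                return {"verdict": "restricted", "stage": "rules",
                        "reason": pattern.pattern}

        # Stage 2: the LSTM classifier scores subtler, context-dependent threats.
        score = lstm_predict(prompt)
        if score >= threshold:
            return {"verdict": "restricted", "stage": "lstm", "score": score}

        return {"verdict": "safe", "stage": "lstm", "score": score}

Running the cheap rule layer first means obviously restricted prompts never reach the model, while the classifier handles paraphrased or obfuscated attempts.
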
The research dataset combines curated real-world examples with synthetically created prompt injections, categorized and labeled as either safe or restricted. The findings show that the proposed system is highly effective at identifying safe content, with Precision, Recall, and F1-score of 100%. The LSTM model was unable to accurately identify the restricted examples in the test batch, most likely because of class imbalance and the under-representation of the restricted class. The macro and weighted average scores were nonetheless 0.50 and 1.00 respectively, demonstrating that the system confidently and accurately validates safe input.
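
One way a macro average of 0.50 can coexist with a weighted average of 1.00 is when the restricted class contributes little or no support in the test batch. The short scikit-learn sketch below reproduces that pattern with assumed, illustrative counts; it is not the paper's actual evaluation data.

    from sklearn.metrics import classification_report

    # Assumed, illustrative test batch: every sample is "safe" and the model
    # predicts "safe" throughout, so the restricted class has zero support.
    y_true = ["safe"] * 100
    y_pred = ["safe"] * 100

    print(classification_report(
        y_true, y_pred,
        labels=["safe", "restricted"],
        zero_division=0,
    ))
    # safe:        precision 1.00, recall 1.00, F1 1.00
    # restricted:  precision 0.00, recall 0.00, F1 0.00 (no support)
    # macro avg F1    = (1.00 + 0.00) / 2 = 0.50
    # weighted avg F1 = 1.00 (weighted entirely by the safe class's support)
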
A Flask-based web application integrates the moderation engine for live online use, and transparency in decision-making is conveyed through visual feedback. Modularity, ethical AI alignment, and extensibility were design priorities. Future work will extend the system towards transformer models (e.g., BERT), add multilingual support, and improve its ability to withstand adversarial prompts. In summary, this work represents a first step towards a scalable, secure, and context-aware moderation system that can be deployed alongside contemporary LLMs in real-world scenarios.
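
For context on how such a moderation engine can be exposed to a live front end, a minimal Flask endpoint is sketched below. It assumes the moderate function from the earlier sketch is defined in the same file; the route name, JSON shape, and the stub lstm_predict are likewise illustrative assumptions rather than the authors' published API.

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/moderate", methods=["POST"])
    def moderate_endpoint():
        """Accept a JSON prompt, run the hybrid check, and return the verdict
        so the front end can render visual feedback (e.g. a safe/restricted badge)."""
        prompt = (request.get_json(silent=True) or {}).get("prompt", "")
        # `moderate` is the two-stage checker from the earlier sketch; the lambda
        # stands in for a real LSTM scorer and always returns a "safe" score.
        result = moderate(prompt, lstm_predict=lambda text: 0.0)
        return jsonify(result)

    if __name__ == "__main__":
        app.run(debug=True)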

DOI:  https://www.doi-ds.org/doilink/06.2025-42189678


Published: 2025-06-02


How to Cite

AI-Powered Content Moderation: A Hybrid Defense Against Prompt Injection and Harmful Text. (2025). International Journal of Multidisciplinary Engineering In Current Research, 10(6), 1-11. https://ijmec.com/index.php/multidisciplinary/article/view/747