AI-Powered Content Moderation: A Hybrid Defense Against  Prompt Injection and Harmful Text

R K Anirudh Kashyap

Authors

R K Anirudh Kashyap Jain Deemed to be university Author

Keywords:

AI ethics, Deep learning, LLM, LSTM, prompt injection.

Abstract

With the extensive use of Large
Language Models (LLMs) in many digital
products, new measures must be taken to help
secure these models from potentially harmful
input and adversarial use. This work introduces
an AI hybrid content moderation system designed
to protect the LLM from harmful, unethical, or
manipulative prompts. The proposed system
includes a hybrid approach, utilizing a rule-based
system with regular expression, alongside a deep
learning classifier design based on Long Short
Term Memory (LSTM) architecture. The
combination of these two layers allows the
moderation to flag harmful restricted content in
real-time, while identifying more subtle threats
and contextually hidden threats that would have
likely been missed in a static filter.
The research dataset, leverages curated real
world examples and created synthetic prompt
injections, categorized and labeled into safe and
restricted examples. The findings show that the
proposed system can achieve a high level of
effectiveness for determining safe content, with a
Precision, Recall and F1-score of 100%. The
LSTM model was unable to accurately identify the
restricted examples in the test batch, and it is
likely that it was due to the imbalance of the class
representation or under-representation of the
restricted class. The macro and weighted average
scores were still good, at 0.50 and 1.00
respectively; demonstrating good results at
confident accurately validating safe input.
ISSN: 2456-4265
IJMEC 2025
The Flask-based web application incorporates
the
moderation engine with live online
engagement, and the transparency in decision
making is signified through visual feedback. The
emphasis on modularity, ethics in AI alignment,
and extensibility was design priority. The future
works will expand towards transformer models
(e.g., BERT), facilitate multilingual support, and
improve the system's ability to respond to
adversarial prompts. In summary, this work
represents the first step towards establishing a
scalable, secure, contextual moderation system
that can be utilized alongside contemporary LLMs
across real-world scenarios.

DOI: https://www.doi-ds.org/doilink/06.2025-42189678

Downloads

Download data is not yet available.

References

Bagdasaryan, E., Hsieh, T.-Y., Nassi, B., &

Shmatikov, V. (2023). (Ab)using images and

sounds for indirect instruction injection in multi

modal

LLMs.

arXiv.

https://arxiv.org/abs/2307.10490

[2] Brown, T. B., Mann, B., Ryder, N., Subbiah, M.,

Kaplan, J., Dhariwal, P., ... & Amodei, D.

(2020). Language models are few-shot learners.

Advances in Neural Information Processing

Systems,

33,

1877–1901.

https://doi.org/10.48550/arXiv.2005.14165

[3] Daryanani, L. (2023). How to jailbreak

ChatGPT.

GitHub.

https://github.com/0xk1h0/ChatGPT-Jailbreak

[4] He, Y., Qiu, J., Zhang, W., & Yuan, Z. (2024).

Fortifying ethical boundaries in AI: Advanced

strategies for enhancing security in large

language

models.

arXiv.

https://arxiv.org/abs/2402.01725

[5] Li, H., Guo, D., Fan, W., Xu, M., & Song, Y.

(2023). Multi-step jailbreaking privacy attacks

on

ChatGPT.

arXiv.

https://arxiv.org/abs/2310.08419

[6] Perez, E., Huang, S., Song, F., Cai, T., Ring, R.,

Aslanides, J., ... & Irving, G. (2022). Red

teaming language models with language models.

arXiv. https://arxiv.org/abs/2202.03286

[7] Solaiman, I., & Dennison, C. (2021). Process for

adapting language models to society (PALMS)

with values-targeted datasets. Advances in

Neural Information Processing Systems, 34,

5861–5873.

https://proceedings.neurips.cc/paper/2021/hash/7

d6044a48e0d9f9bc7d74c2734da5d08

Abstract.html

[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit,

J., Jones, L., Gomez, A. N., ... & Polosukhin, I.

(2017). Attention is all you need. Advances in

Neural Information Processing Systems, 30.

https://papers.nips.cc/paper/2017/file/3f5ee24354

7dee91fbd053c1c4a845aa-Paper.pdf

[9] Weidinger, L., Mellor, J., Rauh, M., Griffin, C.,

Uesato, J., Huang, P.-S., ... & Gabriel, I. (2021).

Ethical and social risks of harm from language

models. arXiv. https://arxiv.org/abs/2112.04359

[10] Xu, W., Liu, X., & Yang, Y. (2021).

Understanding and improving rule-based content

moderation systems in NLP. Proceedings of the

AAAI Conference on Artificial Intelligence,

35(5),

4533–4541.

https://doi.org/10.1609/aaai.v35i5.16605

[11] Zhang, Y., Sheng, Y., Wu, J., & Liu, Y. (2021).

Toxic content detection with deep learning:

Challenges

and

opportunities.

ACM

Transactions on Intelligent Systems and

Technology,

12(4),

1–22.

https://doi.org/10.1145/3450922

[12] Gehman, S., Gururangan, S., Sap, M., Choi, Y.,

& Smith, N. A. (2020). RealToxicityPrompts:

Evaluating neural toxic degeneration in

language

models.

arXiv.

https://arxiv.org/abs/2009.11462

[13] Bender, E. M., & Friedman, B. (2018). Data

statements for natural language processing:

Toward mitigating system bias and enabling

better science. Transactions of the Association

for Computational Linguistics, 6, 587–604.

https://doi.org/10.1162/tacl_a_00041

[14] Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang,

P. (2016). SQuAD: 100,000+ questions for

machine

comprehension of text. arXiv.

https://arxiv.org/abs/1606.05250

[15] Mozes, M., Zhang, X., Kleinberg, B., & Wallace,

B. (2021). Detecting hate speech in the context of

code-mixed

language.

https://arxiv.org/abs/2105.01738

[16] Pavlopoulos,

J.,

Malakasiotis,

arXiv.

P.,

&

Androutsopoulos, I. (2017). Deeper attention to

abusive

user

content

moderation. arXiv.

https://arxiv.org/abs/1705.09820

[17] Kumar, R., Ojha, A. K., Malmasi, S., &

Zampieri, M. (2020). Evaluating aggression

identification

in

social

media.

arXiv.

https://arxiv.org/abs/1803.07557

[18] Chung, W., Kuzmenko, E., Leach, K., & Voss,

C. R. (2019). Increasing language model

robustness with adversarial training. In

Proceedings of the 2019 Conference of the North

American Chapter of the ACL.

Ghorbani, A., Abid, A., & Zou, J. (2019).

Interpretation of machine learning predictions using

influence functions and LIME. In Proceedings of the

36th International Conference on Machine Learning

AI-Powered Content Moderation: A Hybrid Defense Against Prompt Injection and Harmful Text

Authors

Keywords:

Abstract

Downloads

References

Downloads

Published

Issue

Section

How to Cite

Submission

Submission

Menu

visitors

Latest publications

Reach US

Ethics and Policies

Important Links

Downloads & Indexing