AI-Powered Content Moderation: A Hybrid Defense Against Prompt Injection and Harmful Text
Keywords:
AI ethics, Deep learning, LLM, LSTM, prompt injection.Abstract
With the extensive use of Large
Language Models (LLMs) in many digital
products, new measures must be taken to help
secure these models from potentially harmful
input and adversarial use. This work introduces
an AI hybrid content moderation system designed
to protect the LLM from harmful, unethical, or
manipulative prompts. The proposed system
includes a hybrid approach, utilizing a rule-based
system with regular expression, alongside a deep
learning classifier design based on Long Short
Term Memory (LSTM) architecture. The
combination of these two layers allows the
moderation to flag harmful restricted content in
real-time, while identifying more subtle threats
and contextually hidden threats that would have
likely been missed in a static filter.
The research dataset, leverages curated real
world examples and created synthetic prompt
injections, categorized and labeled into safe and
restricted examples. The findings show that the
proposed system can achieve a high level of
effectiveness for determining safe content, with a
Precision, Recall and F1-score of 100%. The
LSTM model was unable to accurately identify the
restricted examples in the test batch, and it is
likely that it was due to the imbalance of the class
representation or under-representation of the
restricted class. The macro and weighted average
scores were still good, at 0.50 and 1.00
respectively; demonstrating good results at
confident accurately validating safe input.
ISSN: 2456-4265
IJMEC 2025
The Flask-based web application incorporates
the
moderation engine with live online
engagement, and the transparency in decision
making is signified through visual feedback. The
emphasis on modularity, ethics in AI alignment,
and extensibility was design priority. The future
works will expand towards transformer models
(e.g., BERT), facilitate multilingual support, and
improve the system's ability to respond to
adversarial prompts. In summary, this work
represents the first step towards establishing a
scalable, secure, contextual moderation system
that can be utilized alongside contemporary LLMs
across real-world scenarios.
Downloads
References
Bagdasaryan, E., Hsieh, T.-Y., Nassi, B., &
Shmatikov, V. (2023). (Ab)using images and
sounds for indirect instruction injection in multi
modal
LLMs.
arXiv.
https://arxiv.org/abs/2307.10490
[2] Brown, T. B., Mann, B., Ryder, N., Subbiah, M.,
Kaplan, J., Dhariwal, P., ... & Amodei, D.
(2020). Language models are few-shot learners.
Advances in Neural Information Processing
Systems,
33,
1877–1901.
https://doi.org/10.48550/arXiv.2005.14165
[3] Daryanani, L. (2023). How to jailbreak
ChatGPT.
GitHub.
https://github.com/0xk1h0/ChatGPT-Jailbreak
[4] He, Y., Qiu, J., Zhang, W., & Yuan, Z. (2024).
Fortifying ethical boundaries in AI: Advanced
strategies for enhancing security in large
language
models.
arXiv.
https://arxiv.org/abs/2402.01725
[5] Li, H., Guo, D., Fan, W., Xu, M., & Song, Y.
(2023). Multi-step jailbreaking privacy attacks
on
ChatGPT.
arXiv.
https://arxiv.org/abs/2310.08419
[6] Perez, E., Huang, S., Song, F., Cai, T., Ring, R.,
Aslanides, J., ... & Irving, G. (2022). Red
teaming language models with language models.
arXiv. https://arxiv.org/abs/2202.03286
[7] Solaiman, I., & Dennison, C. (2021). Process for
adapting language models to society (PALMS)
with values-targeted datasets. Advances in
Neural Information Processing Systems, 34,
5861–5873.
https://proceedings.neurips.cc/paper/2021/hash/7
d6044a48e0d9f9bc7d74c2734da5d08
Abstract.html
[8] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit,
J., Jones, L., Gomez, A. N., ... & Polosukhin, I.
(2017). Attention is all you need. Advances in
Neural Information Processing Systems, 30.
https://papers.nips.cc/paper/2017/file/3f5ee24354
7dee91fbd053c1c4a845aa-Paper.pdf
[9] Weidinger, L., Mellor, J., Rauh, M., Griffin, C.,
Uesato, J., Huang, P.-S., ... & Gabriel, I. (2021).
Ethical and social risks of harm from language
models. arXiv. https://arxiv.org/abs/2112.04359
[10] Xu, W., Liu, X., & Yang, Y. (2021).
Understanding and improving rule-based content
moderation systems in NLP. Proceedings of the
AAAI Conference on Artificial Intelligence,
35(5),
4533–4541.
https://doi.org/10.1609/aaai.v35i5.16605
[11] Zhang, Y., Sheng, Y., Wu, J., & Liu, Y. (2021).
Toxic content detection with deep learning:
Challenges
and
opportunities.
ACM
Transactions on Intelligent Systems and
Technology,
12(4),
1–22.
https://doi.org/10.1145/3450922
[12] Gehman, S., Gururangan, S., Sap, M., Choi, Y.,
& Smith, N. A. (2020). RealToxicityPrompts:
Evaluating neural toxic degeneration in
language
models.
arXiv.
https://arxiv.org/abs/2009.11462
[13] Bender, E. M., & Friedman, B. (2018). Data
statements for natural language processing:
Toward mitigating system bias and enabling
better science. Transactions of the Association
for Computational Linguistics, 6, 587–604.
https://doi.org/10.1162/tacl_a_00041
[14] Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang,
P. (2016). SQuAD: 100,000+ questions for
machine
comprehension of text. arXiv.
https://arxiv.org/abs/1606.05250
[15] Mozes, M., Zhang, X., Kleinberg, B., & Wallace,
B. (2021). Detecting hate speech in the context of
code-mixed
language.
https://arxiv.org/abs/2105.01738
[16] Pavlopoulos,
J.,
Malakasiotis,
arXiv.
P.,
&
Androutsopoulos, I. (2017). Deeper attention to
abusive
user
content
moderation. arXiv.
https://arxiv.org/abs/1705.09820
[17] Kumar, R., Ojha, A. K., Malmasi, S., &
Zampieri, M. (2020). Evaluating aggression
identification
in
social
media.
arXiv.
https://arxiv.org/abs/1803.07557
[18] Chung, W., Kuzmenko, E., Leach, K., & Voss,
C. R. (2019). Increasing language model
robustness with adversarial training. In
Proceedings of the 2019 Conference of the North
American Chapter of the ACL.
Ghorbani, A., Abid, A., & Zou, J. (2019).
Interpretation of machine learning predictions using
influence functions and LIME. In Proceedings of the
36th International Conference on Machine Learning