Jailbreak LLM Response Classifier
This project, developed as part of my Bachelor's thesis, focuses on detecting and analyzing the responses that language models produce under adversarial prompts.
Overview
The project combines supervised learning techniques with unsupervised analysis to classify and understand how language models respond to jailbreak attempts and adversarial prompts.
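As a minimal sketch of the unsupervised side, the snippet below embeds a few LLM responses with a BERT encoder and groups them with K-means. The checkpoint name, pooling strategy, and cluster count are illustrative assumptions, not the configuration used in the thesis.

```python
# Sketch: cluster LLM responses by embedding them with BERT and grouping
# similar responses with K-means (model name and cluster count are assumptions).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden state to get one vector per response."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

responses = [
    "I cannot help with that request.",
    "Sure, here is how you would do it...",
]
vectors = embed(responses).numpy()
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(clusters)
```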
Technologies Used
- Python: Main programming language
- BERT: For natural language processing and response classification (see the sketch after this list)
- Machine Learning: Supervised and unsupervised techniques for response classification
- NLP: Natural language processing for text analysis
- PyTorch: Deep learning framework
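The following sketch shows how these pieces could fit together on the supervised side: a BERT model with a sequence-classification head labels a single response. The checkpoint, the two-label scheme, and the inference-only setup are assumptions for illustration, not the exact thesis pipeline.

```python
# Minimal sketch of a BERT-based response classifier (labels and checkpoint
# are placeholders, not the thesis configuration).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = safe refusal, 1 = jailbroken
)
model.eval()

def classify(response: str) -> int:
    """Return the predicted label index for one LLM response."""
    inputs = tokenizer(response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

print(classify("I'm sorry, but I can't assist with that."))
```

In practice the classification head would be fine-tuned on labeled responses before use; the sketch only shows the inference path.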
Key Features
- Detection of jailbreak attempts in LLM responses
- Classification of response types and safety levels
- Analysis of adversarial prompt effectiveness
- Comprehensive evaluation metrics and reporting (see the example below)
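As a hedged example of the evaluation step, the snippet below scores predicted labels against gold annotations with scikit-learn. The label names and the particular metrics shown are assumptions for illustration.

```python
# Sketch: compare predicted labels against gold annotations with standard
# classification metrics (label names and metric choice are assumptions).
from sklearn.metrics import classification_report, confusion_matrix

gold = [0, 0, 1, 1, 1, 0]       # 0 = safe refusal, 1 = jailbroken response
predicted = [0, 1, 1, 1, 0, 0]  # classifier output on the same responses

print(confusion_matrix(gold, predicted))
print(classification_report(gold, predicted,
                            target_names=["safe_refusal", "jailbroken"]))
```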
Research Focus
The research investigates how large language models behave when subjected to adversarial prompts designed to bypass their safety mechanisms, providing insights into model robustness and security.
Status
Currently in development as part of my Bachelor's thesis project for the 2024-2025 academic year.