Jailbreak LLM Response Classifier
This project, developed as part of my Bachelor's thesis, focuses on detecting and analyzing the responses that language models produce under adversarial prompts.
Overview
The project combines supervised learning techniques with unsupervised analysis to classify and understand how language models respond to jailbreak attempts and adversarial prompts.
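As a minimal sketch of the unsupervised side, the snippet below embeds a few LLM responses with a BERT encoder and groups them with K-means. The checkpoint name, pooling strategy, and cluster count are illustrative assumptions, not the configuration used in the thesis.

```python
# Sketch: cluster LLM responses by embedding them with BERT and grouping
# similar responses with K-means (model name and cluster count are assumptions).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.cluster import KMeans

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden state to get one vector per response."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state  # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

responses = [
    "I cannot help with that request.",
    "Sure, here is how you would do it...",
]
vectors = embed(responses).numpy()
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(clusters)
```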
Technologies Used
- Python: Main programming language
- BERT: For natural language processing and response classification (see the sketch after this list)
- Machine Learning: Supervised and unsupervised techniques for response classification
- NLP: Natural language processing for text analysis
- PyTorch: Deep learning framework
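The following sketch shows how these pieces could fit together on the supervised side: a BERT model with a sequence-classification head labels a single response. The checkpoint, the two-label scheme, and the inference-only setup are assumptions for illustration, not the exact thesis pipeline.

```python
# Minimal sketch of a BERT-based response classifier (labels and checkpoint
# are placeholders, not the thesis configuration).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # 0 = safe refusal, 1 = jailbroken
)
model.eval()

def classify(response: str) -> int:
    """Return the predicted label index for one LLM response."""
    inputs = tokenizer(response, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))

print(classify("I'm sorry, but I can't assist with that."))
```

In practice the classification head would be fine-tuned on labeled responses before use; the sketch only shows the inference path.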
Key Features
- Detection of jailbreak attempts in LLM responses
- Classification of response types and safety levels
- Analysis of adversarial prompt effectiveness
- Comprehensive evaluation metrics and reporting (see the example below)
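As a hedged example of the evaluation step, the snippet below scores predicted labels against gold annotations with scikit-learn. The label names and the particular metrics shown are assumptions for illustration.

```python
# Sketch: compare predicted labels against gold annotations with standard
# classification metrics (label names and metric choice are assumptions).
from sklearn.metrics import classification_report, confusion_matrix

gold = [0, 0, 1, 1, 1, 0]       # 0 = safe refusal, 1 = jailbroken response
predicted = [0, 1, 1, 1, 0, 0]  # classifier output on the same responses

print(confusion_matrix(gold, predicted))
print(classification_report(gold, predicted,
                            target_names=["safe_refusal", "jailbroken"]))
```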
Research Focus
The research investigates how large language models behave when subjected to adversarial prompts designed to bypass their safety mechanisms, providing insights into model robustness and security.
Status
Currently in development as part of my Bachelor's thesis project for the 2024-2025 academic year.