Jailbreak LLM Response Classifier

This project focuses on the detection and analysis of responses generated by language models under adversarial prompts, developed as part of my Bachelor's thesis.

Overview

The project combines supervised learning techniques with unsupervised analysis to classify and understand how language models respond to jailbreak attempts and adversarial prompts.
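As a rough illustration of the supervised side, the sketch below scores a single response with a fine-tuned BERT classifier via the Hugging Face transformers library. The checkpoint path and the safe/jailbroken label set are hypothetical placeholders, not the project's actual artifacts.

  # Minimal supervised sketch: classify one LLM response with a
  # fine-tuned BERT model (checkpoint path and labels are hypothetical).
  import torch
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  MODEL_DIR = "path/to/finetuned-bert"  # hypothetical checkpoint
  LABELS = ["safe", "jailbroken"]       # assumed binary labelling scheme

  tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
  model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
  model.eval()

  def classify_response(text: str) -> str:
      """Return the predicted safety label for a single response."""
      inputs = tokenizer(text, truncation=True, max_length=512,
                         return_tensors="pt")
      with torch.no_grad():
          logits = model(**inputs).logits
      return LABELS[int(logits.argmax(dim=-1))]

  print(classify_response("Sure, here is how to bypass the filter..."))

In practice the same call would run in batches over a full dataset of collected responses rather than one string at a time.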

Technologies Used

  • Python: Main programming language
  • BERT: For natural language processing and classification
  • Machine Learning: Supervised classification combined with unsupervised analysis (see the sketch after this list)
  • NLP: Text analysis of adversarial prompts and model responses
  • PyTorch: Deep learning framework
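As a sketch of the unsupervised side mentioned in the overview, one option is to embed responses with a plain BERT encoder and cluster the embeddings; the model name, pooling choice, and cluster count below are illustrative assumptions, not the thesis's actual settings.

  # Unsupervised sketch: mean-pooled BERT embeddings clustered with
  # k-means to surface groups of similar response behaviour.
  import torch
  from transformers import AutoTokenizer, AutoModel
  from sklearn.cluster import KMeans

  tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
  encoder = AutoModel.from_pretrained("bert-base-uncased")
  encoder.eval()

  def embed(texts):
      """Mean-pooled BERT embeddings for a batch of response strings."""
      inputs = tokenizer(texts, padding=True, truncation=True,
                         max_length=512, return_tensors="pt")
      with torch.no_grad():
          hidden = encoder(**inputs).last_hidden_state  # (batch, seq, dim)
      mask = inputs["attention_mask"].unsqueeze(-1)     # (batch, seq, 1)
      return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

  responses = ["I can't help with that.", "Sure, step one is..."]  # toy data
  clusters = KMeans(n_clusters=2, n_init=10).fit_predict(embed(responses))
  print(clusters)  # cluster id per response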

Key Features

  • Detection of jailbreak attempts in LLM responses
  • Classification of response types and safety levels
  • Analysis of adversarial prompt effectiveness
  • Comprehensive evaluation metrics and reporting (see the sketch below)
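The evaluation sketch referenced in the last item: a minimal example of the kind of metrics report the project could produce with scikit-learn, reusing the hypothetical safe/jailbroken labels from the classifier sketch above.

  # Reporting sketch: per-class precision/recall/F1 and a confusion
  # matrix over toy ground-truth and predicted labels.
  from sklearn.metrics import classification_report, confusion_matrix

  y_true = ["safe", "jailbroken", "safe", "jailbroken"]        # toy labels
  y_pred = ["safe", "jailbroken", "jailbroken", "jailbroken"]  # toy output

  print(classification_report(y_true, y_pred, digits=3))
  print(confusion_matrix(y_true, y_pred, labels=["safe", "jailbroken"]))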

Research Focus

The research investigates how large language models behave when subjected to adversarial prompts designed to bypass their safety mechanisms, providing insights into model robustness and security.

Status

Currently in development as part of my Bachelor's thesis for the 2024-2025 academic year.