Jailbreak LLM Response Classifier

Bachelor's Thesis Project - University of Milano-Bicocca

Supervisors: Prof. Claudio Ferretti, Prof.ssa Martina Saletta

Project Overview

This project detects and analyzes responses that language models generate under adversarial prompts. It addresses the challenge of automatically detecting when an LLM response results from a jailbreak prompt, combining supervised classification with unsupervised analysis.

Key Features

Binary Classification with a fine-tuned BERT model
Embedding Extraction using [CLS] token embeddings
Unsupervised Clustering (K-means & Hierarchical)
Dimensionality Reduction with PCA, t-SNE, and UMAP
Self-contained Google Colab notebooks
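
As a rough illustration of the embedding extraction feature listed above, the sketch below takes the hidden state at position 0 (the [CLS] token) of a BERT encoder's final layer. It is not the project's code: to stay runnable offline it builds a tiny, randomly initialized BertModel from a made-up BertConfig and feeds it hypothetical token ids. In practice one would load the fine-tuned checkpoint and its tokenizer instead; their names are not given here.

```python
import torch
from transformers import BertConfig, BertModel

# Assumption: a tiny random BERT stands in for the fine-tuned model.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
    max_position_embeddings=64,
)
model = BertModel(config)
model.eval()  # disable dropout so the output is deterministic for fixed weights

# Hypothetical token ids for two padded responses; id 0 is padding.
input_ids = torch.tensor([[1, 5, 6, 2, 0, 0],
                          [1, 7, 2, 0, 0, 0]])
attention_mask = (input_ids != 0).long()

with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)

# last_hidden_state has shape (batch, sequence_length, hidden_size);
# position 0 holds the [CLS] token, used here as the response embedding.
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print(cls_embeddings.shape)
```

The resulting matrix has one row per response and can be passed to clustering or dimensionality reduction.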

Technical Stack

Python, BERT, Transformers, scikit-learn, PyTorch, Google Colab, Jupyter

Methodology

Fine-tuned BERT Classification

Labels each response as jailbreak or non-jailbreak and reports a confidence score, which supports threshold analysis.
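
To make the thresholding idea concrete, here is a minimal sketch in plain Python. It assumes the classifier emits two logits per response with index 1 as the jailbreak class, and it treats the softmax probability as the confidence score. The function names and the example logits are hypothetical and are not taken from the project's notebooks.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_threshold(logits, threshold=0.5):
    """Return (label, jailbreak_probability) for one response.

    Assumption: index 1 of `logits` is the jailbreak class.
    """
    p_jailbreak = softmax(logits)[1]
    label = "jailbreak" if p_jailbreak >= threshold else "non-jailbreak"
    return label, p_jailbreak

# Hypothetical logits for three responses: [non-jailbreak, jailbreak]
example_logits = [[2.0, -1.0], [0.0, 0.4], [-1.5, 3.0]]

# Raising the threshold flips borderline responses to non-jailbreak.
for threshold in (0.5, 0.7):
    labels = [label_with_threshold(l, threshold)[0] for l in example_logits]
    print(threshold, labels)
```

The second response has a jailbreak probability of about 0.6, so it is labeled jailbreak at a threshold of 0.5 and non-jailbreak at 0.7, while the other two keep their labels.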

Clustering Analysis

Free (unconstrained) K-means with the number of clusters k chosen from the data, and hierarchical clustering for multi-level grouping.
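
The sketch below shows one common way to set this up with scikit-learn. The criterion used to pick k is not stated here, so the silhouette score is an assumption, as is Ward linkage for the hierarchical step. The data is a synthetic stand-in for response embeddings, and pick_k_by_silhouette is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(X, k_values, seed=0):
    """Fit K-means for each candidate k; return (best_k, scores by k)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for embeddings: three well-separated groups of 40 points.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((40, 2)) for c in centers])

best_k, scores = pick_k_by_silhouette(X, range(2, 7))

# Hierarchical clustering cut at two levels: a coarse split and a finer one.
coarse = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
fine = AgglomerativeClustering(n_clusters=best_k, linkage="ward").fit_predict(X)
print(best_k, len(set(coarse)), len(set(fine)))
```

Other selection criteria, such as the elbow method on inertia, would slot into the same loop by swapping the scoring line.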

Visualization

2D plots using PCA, t-SNE, and UMAP to assess cluster separation and structure.
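
A minimal sketch of how such 2D projections can be computed with scikit-learn follows. The data is synthetic, the perplexity value and the helper name project_2d are assumptions, and UMAP is left out because it needs the separate umap-learn package. Each returned array can be passed to a scatter plot and colored by cluster or class label.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

def project_2d(X, seed=0):
    """Return 2D coordinates of X under PCA and t-SNE."""
    pca_2d = PCA(n_components=2, random_state=seed).fit_transform(X)
    tsne_2d = TSNE(
        n_components=2,
        perplexity=15,      # must be smaller than the number of samples
        init="pca",
        random_state=seed,
    ).fit_transform(X)
    return {"PCA": pca_2d, "t-SNE": tsne_2d}

# Synthetic stand-in for embeddings: two groups of 30 points in 16 dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((30, 16)) + shift for shift in (0.0, 6.0)])

projections = project_2d(X)
for name, coords in projections.items():
    print(name, coords.shape)
```

PCA is linear and preserves global variance, while t-SNE emphasizes local neighborhoods, so the two views can disagree about how cleanly clusters separate.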

Project Structure

LLM_Response_Classifier/
├── CODE/
│   ├── FINAL_VERSIONS/
│   │   ├── free_clustering/
│   │   └── hierarchical_clustering/
│   └── PREVIOUS_STEPS/
├── DATASETS/
│   ├── fine_tuning_and_test/
│   └── kmeans_evaluation/
├── reports_and_findings/
└── FINAL_REPORT/
