Jailbreak LLM Response Classifier

Bachelor's Thesis Project - University of Milano-Bicocca

Supervisors: Prof. Claudio Ferretti, Prof.ssa Martina Saletta

Project Overview

This project detects and analyzes responses that language models generate under adversarial prompts. It addresses the challenge of automatically detecting when an LLM response results from a jailbreak prompt, combining supervised learning with unsupervised analysis.

Key Features

  • Binary Classification with a fine-tuned BERT model
  • Embedding Extraction using [CLS] token embeddings
  • Unsupervised Clustering (K-means & Hierarchical)
  • Dimensionality Reduction with PCA, t-SNE, and UMAP
  • Fully self-contained Google Colab notebooks

Technical Stack

Python, BERT, Transformers, scikit-learn, PyTorch, Google Colab, Jupyter

Methodology

Fine-tuned BERT Classification

Labels each response as jailbreak or non-jailbreak and outputs a confidence score that supports threshold analysis.
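
As a rough illustration of how a confidence score can drive threshold analysis, the sketch below turns a pair of classifier logits into a jailbreak probability and applies a cutoff. It is not the project's code: the label order (index 1 = jailbreak), the function names, the example logits, and the threshold values are all assumptions made for this example.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_with_confidence(logits, threshold=0.5):
    """Return (label, jailbreak probability) for one response.

    Assumes logits are ordered [non_jailbreak, jailbreak]; this ordering
    is hypothetical and depends on how the model was fine-tuned.
    """
    p_jailbreak = softmax(logits)[1]
    label = "jailbreak" if p_jailbreak >= threshold else "non-jailbreak"
    return label, p_jailbreak

# Made-up logits standing in for the classifier head's output.
example_logits = [[2.0, -1.0], [-0.5, 1.5], [0.1, 0.3]]

# Raising the threshold trades recall for precision on the jailbreak class.
for threshold in (0.5, 0.9):
    labels = [label_with_confidence(l, threshold)[0] for l in example_logits]
    print(threshold, labels)
```

Sweeping the threshold over a held-out set in this way is one common route to choosing an operating point; the value actually used in the project is not stated here.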

Clustering Analysis

Free K-means, with the optimal k determined from the data, and hierarchical clustering for multi-level grouping.
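
One possible way to determine k is to run K-means for several candidate values and keep the one with the highest silhouette score. The sketch below assumes that criterion, which may differ from the one used in the project (an elbow plot on inertia is another common choice). The synthetic data, the `pick_k` helper, and the candidate range are hypothetical stand-ins for the real embeddings and settings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(embeddings, k_values):
    """Fit K-means for each candidate k and score it with the silhouette coefficient."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        scores[k] = silhouette_score(embeddings, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Synthetic stand-in for [CLS] embeddings: three well-separated 2D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
embeddings = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

best_k, scores = pick_k(embeddings, range(2, 7))
print("best k:", best_k)
for k, s in scores.items():
    print(k, round(float(s), 3))
```

For hierarchical grouping, scikit-learn's `AgglomerativeClustering` could be applied to the same embedding matrix, though the linkage and depth used in the project are not specified here.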

Visualization

2D plots using PCA, t-SNE, and UMAP to assess cluster separation and structure.
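
As a minimal sketch of the PCA case, the code below projects a matrix of embeddings onto its first two principal components using a plain NumPy SVD. This is a dependency-light substitute for a library PCA class, not necessarily what the notebooks use. The random data, its shape, and the `pca_2d` helper are assumptions for illustration; t-SNE and UMAP are non-linear methods and are not covered by this sketch.

```python
import numpy as np

def pca_2d(embeddings):
    """Project rows of `embeddings` onto the top two principal components.

    Returns the 2D coordinates and the fraction of variance explained
    by each of the two components.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    variance = singular_values ** 2
    explained = variance[:2] / variance.sum()
    return coords, explained

# Random stand-in for 100 embeddings of dimension 16, with unequal feature scales.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 16)) * np.linspace(3.0, 0.5, 16)

coords, explained = pca_2d(embeddings)
print(coords.shape)
print(explained)
```

The resulting `coords` array can be passed to any scatter-plot routine and colored by cluster label to inspect separation.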

Project Structure

LLM_Response_Classifier/
├── CODE/
│   ├── FINAL_VERSIONS/
│   │   ├── free_clustering/
│   │   └── hierarchical_clustering/
│   └── PREVIOUS_STEPS/
├── DATASETS/
│   ├── fine_tuning_and_test/
│   └── kmeans_evaluation/
├── reports_and_findings/
└── FINAL_REPORT/