Bachelor's Thesis Project - University of Milano-Bicocca
Detection and analysis of responses generated by language models under adversarial prompts. This project addresses the challenge of automatically detecting when an LLM response results from a jailbreak prompt, combining supervised learning with unsupervised analysis.
Labels each response as jailbreak or non-jailbreak and reports a confidence score, so the decision threshold can be analyzed.
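As a rough illustration of how confidence scores can support threshold analysis, the sketch below sweeps a few thresholds over a toy set of scores and reports precision and recall at each one. It is a minimal example, not the project's code: the function names (`classify`, `precision_recall`, `sweep_thresholds`), the toy labels, and the scores are all hypothetical, and it assumes each score is the classifier's confidence that a response is a jailbreak.

```python
from typing import List, Sequence, Tuple


def classify(confidences: Sequence[float], threshold: float) -> List[int]:
    """Label a response as jailbreak (1) when its confidence reaches the threshold."""
    return [1 if c >= threshold else 0 for c in confidences]


def precision_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for the jailbreak class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def sweep_thresholds(
    y_true: Sequence[int],
    confidences: Sequence[float],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each candidate threshold."""
    results = []
    for t in thresholds:
        precision, recall = precision_recall(y_true, classify(confidences, t))
        results.append((t, precision, recall))
    return results


# Hypothetical toy data: 1 = jailbreak, 0 = non-jailbreak.
y_true = [1, 1, 0, 0, 1, 0]
jailbreak_conf = [0.95, 0.70, 0.40, 0.65, 0.55, 0.10]

for t, p, r in sweep_thresholds(y_true, jailbreak_conf, [0.5, 0.6, 0.9]):
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision: fewer responses are flagged, but the flagged ones are more likely to be genuine jailbreaks.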
Free K-means clustering with automatic determination of the optimal k, plus hierarchical clustering for multi-level grouping.
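One common way to determine k is to run K-means for several candidate values and keep the one with the highest mean silhouette score. The sketch below shows that idea in plain Python on toy 2D points. It rests on several assumptions: the README does not say which criterion the project uses (it could equally be the elbow method or another index), the helper names are hypothetical, and the deterministic farthest-point initialization is a simplification chosen to make the example reproducible.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def farthest_point_init(points: Sequence[Point], k: int) -> List[Point]:
    """Deterministic init (a simplification): start from the first point,
    then repeatedly add the point farthest from all chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points: Sequence[Point], k: int, max_iter: int = 100) -> List[int]:
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette score; needs at least two non-empty clusters."""
    clusters: Dict[int, List[Point]] = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items()
            if m != l
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


def best_k(points: Sequence[Point], k_values: Sequence[int]) -> int:
    """Pick the candidate k (each >= 2) with the highest mean silhouette."""
    return max(k_values, key=lambda k: silhouette(points, kmeans(points, k)))


# Hypothetical toy data: three well-separated groups of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (0, 10), (0, 11), (1, 10), (1, 11),
]
print(best_k(points, [2, 3, 4, 5]))
```

On real response embeddings the silhouette curve is rarely this clear-cut, so it is usually worth inspecting the scores for all candidate values rather than only the maximum.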
2D plots using PCA, t-SNE, and UMAP to assess cluster separation and structure.
LLM_Response_Classifier/
├── CODE/
│   ├── FINAL_VERSIONS/
│   │   ├── free_clustering/
│   │   └── hierarchical_clustering/
│   └── PREVIOUS_STEPS/
├── DATASETS/
│   ├── fine_tuning_and_test/
│   └── kmeans_evaluation/
├── reports_and_findings/
└── FINAL_REPORT/