ACL 2025 Tutorial

Guardrails and Security for LLMs:
Safe, Secure, and Controllable Steering of LLM Applications

Presenters:
Other contributors:
1NVIDIA, 2Allen Institute for AI, 3University of Washington, 4University of Illinois at Urbana-Champaign, 5University Politehnica of Bucharest, 6ITU University of Copenhagen

Sunday July 27 14:00 - 17:30 (CET) @ Austria Center Vienna, Hall C

About this tutorial

Pretrained generative models, especially large language models (LLMs), provide novel ways for users to interact with computers. Whereas earlier generative NLP research and applications targeted domain-specific or task-specific solutions, current LLMs and the applications built on them (e.g., dialogue systems, agents) are versatile across many tasks and domains. Although LLMs are trained to be helpful and aligned with human preferences (e.g., harmlessness), enforcing robust guardrails on them remains a challenge. Moreover, even when protected against rudimentary attacks, LLMs, like other complex software, can be vulnerable to sophisticated adversarial inputs.

This tutorial provides a comprehensive overview of key guardrail mechanisms developed for LLMs, along with evaluation methodologies and a detailed security assessment protocol, including auto red-teaming of LLM-powered applications. Our aim is to move beyond the discussion of single-prompt attacks and evaluation frameworks, and to address how guardrailing can be done in complex dialogue systems that employ LLMs.

We aim to provide an up-to-date and comprehensive overview of the deployment risks associated with LLMs in production environments. While the main focus is on how to protect effectively against safety and security threats, we also cover the more recent topic of dialogue and topical rails, including rails that respect custom policies. We also examine the new attack vectors introduced by LLM-enabled dialogue systems, such as methods for circumventing dialogue steering.
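To make the distinction between safety rails and topical rails concrete, here is a minimal sketch of how rails can wrap a model call. It is an illustration under stated assumptions, not the presenters' implementation or any particular toolkit's API: the names `input_rail`, `topical_rail`, `output_rail`, `guarded_reply`, and `stub_model`, as well as the phrase and topic lists, are hypothetical, and simple keyword matching stands in for the trained safety models a real system would use.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical policy lists, for illustration only.
BLOCKED_INPUT_PHRASES = ["ignore previous instructions"]
ALLOWED_TOPICS = ["billing", "shipping"]
BLOCKED_OUTPUT_TERMS = ["internal-api-key"]

REFUSAL = "Sorry, I can't help with that."
OFF_TOPIC = "I can only help with billing or shipping questions."


@dataclass
class RailResult:
    allowed: bool
    message: Optional[str] = None  # canned reply used when the rail blocks


def input_rail(user_text: str) -> RailResult:
    """Block inputs that match a known attack phrase (keyword stand-in)."""
    lowered = user_text.lower()
    if any(phrase in lowered for phrase in BLOCKED_INPUT_PHRASES):
        return RailResult(False, REFUSAL)
    return RailResult(True)


def topical_rail(user_text: str) -> RailResult:
    """Allow only inputs that mention one of the permitted topics."""
    lowered = user_text.lower()
    if any(topic in lowered for topic in ALLOWED_TOPICS):
        return RailResult(True)
    return RailResult(False, OFF_TOPIC)


def output_rail(model_text: str) -> RailResult:
    """Block model outputs that contain a disallowed term."""
    lowered = model_text.lower()
    if any(term in lowered for term in BLOCKED_OUTPUT_TERMS):
        return RailResult(False, REFUSAL)
    return RailResult(True)


def guarded_reply(user_text: str, model: Callable[[str], str]) -> str:
    """Run input-side rails, call the model, then run the output rail."""
    for rail in (input_rail, topical_rail):
        result = rail(user_text)
        if not result.allowed:
            return result.message
    draft = model(user_text)
    result = output_rail(draft)
    return draft if result.allowed else result.message


def stub_model(user_text: str) -> str:
    """Placeholder for an LLM call; a real system would query a model here."""
    return f"Here is some help with: {user_text}"
```

Calling `guarded_reply("Where is my shipping label?", stub_model)` passes every rail and returns the stub's answer, while an off-topic or adversarial input receives a canned reply without the model being called. Placing rails on both sides of the model call reflects the point that input filtering alone does not cover unsafe or policy-violating outputs.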

Schedule (tentative)

Tutorial topic Duration
Introduction [slides] 5 min
  Types of LLM guardrails
  Guardrails and LLM security
Content moderation and safety [slides] 35 min
  Taxonomies of safety risks
  Landscape of safety models and datasets
  Synthetic data generation for LLM safety
  Custom safety policies
  Safety and reasoning models
  System level considerations
LLM security [slides] 30 min
  Overview
  Tools for assessing LLM security
  Auto red-teaming
  Adversarial attacks
Alignment attacks [slides] 20 min
  Data poisoning and sleeper agents
  Instruction hierarchy
  Trojan horse and safety backdoors
Coffee break (15:30 - 16:00 CET) 30 min
Dialogue rails and security [slides] 20 min
  Dialogue and topical rails
  Evaluation of dialogue rails
  Multi-turn/dialogue attacks and protection
Multilingual guardrails [slides] 15 min
  Multilingual safety models
Inference-time steering for safety [slides] 20 min
  Activation-based steering
  Circuit breakers
  Inference-time steering for concept / topical guardrails
LLM agent safety [slides] 30 min
  Safety challenges and measures for different types of basic agents
  Assessing agent safety
  Multi-agent safety risks
  Multi-agents for enhancing AI safety
Final recommendations 5 min
Total (excluding coffee break) 180 min

Reading List

Papers in bold are suggested reading.

Content moderation and safety

Multilingual guardrails

Inference-time steering for safety