Guardrails and Security for LLMs: Safe, Secure, and Controllable Steering of LLM Applications

Guardrails and Security for LLMs:
Safe, Secure, and Controllable Steering of LLM Applications

¹NVIDIA, ²Allen Institute for AI, ³University of Washington, ⁴University of Illinois at Urbana-Champaign, ⁵University Politehnica of Bucharest, ⁶ITU University of Copenhagen

(Tentative) Sunday July 7 14:00 - 17:30 (CET) @ Austria Center Vienna, Hall C

About this tutorial

Pretrained generative models, especially large language models, provide novel ways for users to interact with computers. While generative NLP research and applications had previously aimed at very domain-specific or task-specific solutions, current LLMs and applications (e.g. dialogue systems, agents) are versatile across many tasks and domains. Despite being trained to be helpful and aligned with human preferences (e.g., harmlessness), enforcing robust guardrails on LLMs remains a challenge. And, even when protected against rudimentary attacks, just like other complex software, LLMs can be vulnerable to attacks using sophisticated adversarial inputs.

This tutorial provides a comprehensive overview of key guardrail mechanisms developed for LLMs, along with evaluation methodologies and a detailed security assessment protocol - including auto red-teaming of LLM-powered applications. Our aim is to move beyond the discussion of single prompt attacks and evaluation frameworks towards addressing how guardrailing can be done in complex dialogue systems that employ LLMs.

We aim to provide a cutting-edge and complete overview of deployment risks associated with LLMs in production environments. While the main focus will be on how to effectively protect against safety and security threats, we also tackle the more recent topic of providing dialogue and topical rails, including respecting custom policies. We also examine the new attack vectors introduced by LLM-enabled dialogue systems, such as methods for circumventing dialogue steering.

Schedule (tentative)

Tutorial topic	Duration
Introduction	10 min
Types of LLM guardrails
Guardrails and LLM security
Content moderation and safety	30 min
Taxonomies of safety risks
Landscape of safety models and datasets
Synthetic data generation for LLM safety
Custom safety policies
LLM security	30 min
Overview
Tools for assessing LLM security
Auto red-teaming
Adversarial attacks
Alignment attacks	25 min
Data poisoning and sleeper agents
Instruction hierarchy
Trojan horse and safety backdoors
Dialogue rails and security	30 min
Dialogue and topical rails
Evaluation of dialogue rails
Multi-turn/dialogue attacks and protection
Multilingual rails and security	25 min
Multilingual safety models
LLM security and dialogue rails
In-flight steering for safety	25 min
Activation-based steering
Circuit breakers
In-flight steering for topical rails
Final recommendations	5 min
Total	180 min

BibTeX

@article{park2021nerfies, author = {author = {Rebedea, Traian and Derczynski, Leon and Ghosh, Shaona and Sreedhar, Makesh Narsimhan and Brahman, Faeze and Jiang, Liwei and Li, Bo and Tsvetkov, Yulia and Parisien, Christopher and Choi, Yejin}}, title = {ACL 2025 Tutorial - Guardrails and Security for LLMs: Safe, Secure, and Controllable Steering of LLM Applications}, journal = {ACL}, year = {2025}, }

ACL 2025 Tutorial

Guardrails and Security for LLMs: Safe, Secure, and Controllable Steering of LLM Applications

About this tutorial

Schedule (tentative)

Reading List

BibTeX

Guardrails and Security for LLMs:
Safe, Secure, and Controllable Steering of LLM Applications