Steering off Course: Reliability Challenges in Steering Language Models

Published in ACL (Association for Computational Linguistics), 2025

Recommended citation: Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, and Sachin Kumar. 2025. Steering off Course: Reliability Challenges in Steering Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 19856–19882, Vienna, Austria. Association for Computational Linguistics.
Download Paper