Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models
Published in ICML, 2026
Abstract
This paper examines how fine-tuned language models can covertly encode information in their outputs using steganography. The authors introduce a steganographic scheme with reduced recoverability that replaces arbitrary mappings with mappings derived from the model’s embedding space. On models such as Llama-8B and Ministral-8B, the scheme increases secret recovery rates substantially (+78% to +123%). For detection, they propose a mechanistic interpretability approach that applies linear probes to model activations, achieving up to 33% higher accuracy than traditional steganalysis approaches at identifying the signatures of malicious fine-tuning.
Accepted at the International Conference on Machine Learning (ICML), 2026.
Recommended citation: Westphal, C., Navaie, K., & Rosas, F. E. (2026). Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models. To appear in Proceedings of the International Conference on Machine Learning (ICML).
