Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models

Published in ICML 2026

Abstract

This paper examines how fine-tuned language models can covertly encode information in their outputs through steganography. The authors introduce a steganographic scheme with reduced recoverability, replacing arbitrary mappings with ones derived from the model’s embedding space. On models such as Llama-8B and Ministral-8B, this raises secret recovery rates by 78% to 123%. For detection, they propose a mechanistic interpretability approach: linear probes on model activations, which identify signatures of malicious fine-tuning with up to 33% higher accuracy than traditional steganalysis approaches.
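As a rough illustration of what a linear probe on activations involves, the sketch below trains a logistic-regression probe to read a hidden bit out of synthetic vectors. It is not the authors' code or experimental setup: the activations are simulated Gaussian noise shifted along one fixed direction, and every name and parameter here (`make_synthetic_activations`, `train_linear_probe`, `shift`, the dimensions) is a hypothetical chosen for the example.

```python
import math
import random


def make_synthetic_activations(n=600, dim=16, shift=2.5, seed=0):
    """Toy stand-in for model activations (an assumption, not real model data).

    Each vector is Gaussian noise moved along one fixed unit direction,
    with the sign of the move set by a hidden bit.
    """
    rng = random.Random(seed)
    direction = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]

    acts, labels = [], []
    for _ in range(n):
        bit = rng.randint(0, 1)
        sign = 2 * bit - 1  # map {0, 1} to {-1, +1}
        acts.append([rng.gauss(0, 1) + sign * shift * d for d in direction])
        labels.append(bit)
    return acts, labels


def train_linear_probe(xs, ys, lr=0.1, steps=200):
    """Fit logistic regression with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = 1.0 / (1.0 + math.exp(-logit)) - y  # sigmoid(logit) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples where the sign of the logit matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += int(pred == y)
    return hits / len(ys)


# Train on the first half, evaluate on the held-out second half.
acts, labels = make_synthetic_activations()
split = len(labels) // 2
w, b = train_linear_probe(acts[:split], labels[:split])
test_accuracy = probe_accuracy(w, b, acts[split:], labels[split:])
print(f"held-out probe accuracy: {test_accuracy:.3f}")
```

In a real setting the synthetic vectors would be replaced by activations extracted from a chosen layer of the model under inspection, and the label would be whatever property the probe is meant to detect.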

Accepted at the International Conference on Machine Learning (ICML), 2026.

Download paper here

Recommended citation: Westphal, C., Navaie, K., & Rosas, F. E. (2026). Hide and Seek in Embedding Space: Geometry-based Steganography and Detection in Large Language Models. To appear in Proceedings of the International Conference on Machine Learning (ICML).


Charles Westphal
