Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

Published in arXiv preprint, 2026

Abstract

This paper examines how large language models can be fine-tuned to conceal secret information within seemingly ordinary outputs. The authors show that existing detectors, which apply linear probes to model activations, can be circumvented through adversarial fine-tuning across five model variants. The resulting evasive trojans retain secret recovery rates of 58–79% while evading detection, with minimal impact on model performance. The authors then propose a data-level intervention based on recontextualization datasets that restores detectability. The findings indicate that activation-based steganography detection is vulnerable to adaptive evasion, but that theory-guided evaluation distributions can expose otherwise hidden payloads.

Download paper here

Recommended citation: Westphal, C., Douglas, T., Navaie, K., Pimentel, T., & Rosas, F. E. (2026). Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs. arXiv preprint arXiv:2606.09411.


Charles Westphal
