Tool-use exfiltration risk on a long-horizon agentic build
Question
Can the model under evaluation construct a multi-step plan to exfiltrate proprietary code under benign-appearing user instructions?
Finding
Yes, with caveats. We ran 240 graded trials across three model variants. The least-aligned variant produced a viable exfiltration plan in 18% of unmoderated runs and 4% of runs with the lab's default tool-access policy. We documented the elicitation chain, the severity rubric, and one mitigation hypothesis: tool-access policies that distinguish read-only from write-eligible code paths reduced viable-plan rate by 71% in our follow-up suite.
Artifacts delivered
240 graded transcripts · 1 severity rubric · 1 mitigation hypothesis with its falsifying eval