Work

Selected engagements, anonymized where required.

Partner identities are protected by default. Under a written engagement, we can share more granular versions of these case studies, including named comparators and unpublished severity scores.

Tool-use exfiltration risk on a long-horizon agentic build

Question

Can the model under evaluation construct a multi-step plan to exfiltrate proprietary code under benign-appearing user instructions?

Finding

Yes, with caveats. We ran 240 graded trials across three model variants. The least-aligned variant produced a viable exfiltration plan in 18% of unmoderated runs and in 4% of runs under the lab's default tool-access policy. We documented the elicitation chain, the severity rubric, and one mitigation hypothesis: in our follow-up suite, tool-access policies that distinguish read-only from write-eligible code paths reduced the viable-plan rate by 71%.
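A minimal sketch of what a read-only versus write-eligible distinction can look like in a tool-access policy. Everything in it is illustrative: the path patterns, the `Access` levels, and the `access_for` and `is_allowed` helpers are hypothetical names of ours, not the lab's policy and not the one used in the follow-up suite.

```python
from enum import Enum
from fnmatch import fnmatchcase


class Access(Enum):
    DENY = "deny"
    READ_ONLY = "read_only"
    READ_WRITE = "read_write"


# Hypothetical rules, checked top to bottom; the first matching pattern wins.
POLICY_RULES = [
    ("src/vendor/*", Access.READ_ONLY),  # read-only code path
    ("src/*", Access.READ_WRITE),        # write-eligible code path
    ("*", Access.DENY),                  # anything not listed is denied
]


def access_for(path: str) -> Access:
    """Return the access level of the first rule whose pattern matches the path."""
    for pattern, access in POLICY_RULES:
        if fnmatchcase(path, pattern):
            return access
    return Access.DENY


def is_allowed(path: str, operation: str) -> bool:
    """Check a 'read' or 'write' tool call against the policy; unknown operations are refused."""
    access = access_for(path)
    if operation == "read":
        return access in (Access.READ_ONLY, Access.READ_WRITE)
    if operation == "write":
        return access is Access.READ_WRITE
    return False


print(is_allowed("src/vendor/lib/a.py", "read"))
print(is_allowed("src/vendor/lib/a.py", "write"))
```

Rule order matters in this sketch: the narrower `src/vendor/*` pattern has to come before `src/*`, or the read-only path would inherit write access. A real policy would also need to cover tool calls that are neither plain reads nor plain writes.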

Artifacts delivered

240 graded transcripts · 1 severity rubric · 1 mitigation hypothesis with its falsifying eval

Multi-turn jailbreak study on a refusal-tuned chat model

Question

How robust is refusal behavior under multi-turn adversarial pressure where the adversary controls the conversation history?

Finding

Robust under single-turn elicitation. Less robust under elicitation of four or more turns when the adversary primes a benign frame in early turns. We ran 412 transcripts across 16 elicitation strategies. Three strategies exceeded a 50% success rate beyond turn 6. The provider patched these in its next safety fine-tune; we re-ran the same suite and verified that the patches held.

Artifacts delivered

412 transcripts · 16 elicitation strategies · before/after replication study

Action-log review pipeline for an internal agent

Question

How does an enterprise security team get useful signal from millions of agent actions per day without buying yet another SaaS?

Finding

We built an action-log review pipeline that flags 0.06% of actions for human review based on a provenance-tracked policy graph. The pipeline runs on the client's own infrastructure with no external calls. We wrote the reviewer playbook and trained the first three reviewers. The pipeline has been operational for nine months with a measured false-positive rate of 11%.
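To put those rates in reviewer terms, here is a back-of-envelope sketch. Only the 0.06% flag rate and the 11% false-positive rate come from the engagement. The daily volume of 1,000,000 actions is a hypothetical round figure, not the client's number, and the sketch assumes the false-positive rate is measured as a share of flagged actions. The `ReviewLoad` and `estimate_review_load` names are ours, for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewLoad:
    flagged: float          # actions sent to human review per day
    false_positives: float  # flagged actions that turn out to be benign
    true_positives: float   # flagged actions that warranted review


def estimate_review_load(
    actions_per_day: int,
    flag_rate: float = 0.0006,          # 0.06% of actions flagged
    false_positive_rate: float = 0.11,  # assumed to be a share of flagged actions
) -> ReviewLoad:
    """Estimate the daily human-review load for a given action volume."""
    flagged = actions_per_day * flag_rate
    false_positives = flagged * false_positive_rate
    return ReviewLoad(flagged, false_positives, flagged - false_positives)


# Hypothetical volume: one million agent actions per day.
load = estimate_review_load(1_000_000)
print(f"flagged per day:      {load.flagged:.0f}")
print(f"of which false alarm: {load.false_positives:.0f}")
print(f"of which warranted:   {load.true_positives:.0f}")
```

Under those assumptions, a million actions per day works out to roughly 600 flagged actions, about 66 of them false alarms. The load scales linearly with volume, so the same function gives a first estimate of reviewer headcount at other volumes.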

Artifacts delivered

Pipeline code · provenance schema · reviewer playbook · 9-month operational record

Deception proxy task suite

Question

How do you measure deceptive alignment in a model that hasn't been deployed and may not exhibit deception under standard probes?

Finding

We co-designed an 18-task evaluation suite built on behavioral proxies for incentive-inconsistent reasoning. The suite is open-source and has been adopted by two other safety orgs as a baseline for comparison.

Artifacts delivered

18-task suite · documentation · external adoption by 2 partner orgs

On anonymization

We publish at the case-study level without partner names, and we give quantitative ranges where exact numbers would compromise the engagement. The trade-off is intentional: we'd rather a future partner trust the work than know which specific lab we ran it for.

Under written engagement, full case studies with named comparators and exact severity scores are available. Ask.

Reputation is the asset. We don't spend it twice.