An RL Environment for Regulatory Compliance Auditing: Design Decisions and Baseline Findings
A 72B-parameter model cannot tell a legal liability from a P&L leak. That was true in March 2026. It is not true in April.
On call_006, three baseline runs with Qwen2.5-72B scored 0.111. The same task with Gemma 4 31B IT scores 0.918. The Hero Agent trap, the single most interesting failure mode in this project, is nearly solved by a frontier open-source model zero-shot.
The scenario: an insurance agent invents an unauthorized retention discount to stop a cancellation. Customer is happy. Sentiment is positive. Ground truth has two violations. A high-severity churn_save_policy_breach and a medium-severity incorrect_hold_procedure. Qwen2.5-72B flags neither. It sees the positive sentiment, classifies the violation as the legally-adjacent unauthorized_commitment category it saw in training data, and misses the P&L leak entirely.
Gemma 4 31B IT catches both violations. It triages correctly, identifies the policy breach, and submits a clean report. Not perfectly. One run scores 0.909, another 0.922. But the floor is no longer 0.076. It is 0.909.
This post covers what RegTriage is, why I built it as an RL environment instead of a classifier, the architectural decisions, and what the April 2026 baselines revealed.
Why an RL Environment, Not a Classifier
Three reasons this is an RL problem, not a classification problem.
First: deterministic grading, not model opinions. The scoring function is severity-weighted F1 with an auto-fail cap. Missing a high-severity regulatory disclosure failure costs 3x as much as missing a low-severity hold procedure violation. Getting a high violation wrong is materially worse than getting a low one wrong. The grading function encodes this asymmetry directly. A compliance officer can verify it against a rubric. No subjectivity.
The grading formula:
| Component | Weight |
|---|---|
| Compliance verdict (pass/fail correct) | 0.20 |
| Violation F1 (severity-weighted: high=3x, med=2x, low=1x) | 0.60 |
| Efficiency bonus (budget_remaining / total_budget) | 0.20 |
Plus: severity calibration bonus (+0.02 per exact match), false positive penalties (-0.03 to -0.10 per FP), and an auto-fail cap. Miss every high-severity violation and your score locks at 0.30 regardless of other work. In real QA, missing a CFPB-mandated disclosure is an automatic audit failure. The grading function models that.
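The composite score can be sketched as a single function. This is a minimal illustration of the formula above, not the repo's actual grading code; the parameter names (`weighted_f1`, `fp_penalty`, `missed_all_high`, and so on) are assumptions for readability.

```python
# Illustrative sketch of the grading formula described above.
# Parameter names are assumptions, not the environment's real API.

def grade(verdict_correct: bool, weighted_f1: float,
          budget_remaining: int, total_budget: int,
          exact_severity_matches: int, fp_penalty: float,
          missed_all_high: bool) -> float:
    score = (0.20 * (1.0 if verdict_correct else 0.0)   # verdict component
             + 0.60 * weighted_f1                       # severity-weighted F1
             + 0.20 * (budget_remaining / total_budget))  # efficiency bonus
    score += 0.02 * exact_severity_matches  # calibration bonus per exact match
    score -= fp_penalty                     # -0.03 to -0.10 per false positive
    if missed_all_high:
        score = min(score, 0.30)            # auto-fail cap
    return max(0.0, min(1.0, score))
```

For example, a correct verdict with perfect F1, 20 of 80 budget units left, and two exact severity matches scores 0.89; a run that misses every high-severity violation is capped at 0.30 no matter what else it got right.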
Second: compute budget forces triage. The agent gets 50 + (total_turns x 3) budget units. A 10-turn call gets 80. A 30-turn call gets 140. Each tool has a cost:
| Tool | Cost |
|---|---|
| get_call_metadata | 5 |
| get_sentiment_timeline | 5 |
| get_transcript_length | 1 |
| read_transcript_chunk | 3 x turns read |
| analyze_turn | 10 |
| flag_violation | 2 |
| submit_report | 0 |
Reading the entire transcript is expensive. Reading targeted sections after triage is cheap. This is how expert QA supervisors actually work. Scan metadata, check sentiment hotspots, deep-dive into 3-5 specific turns. Budget exhaustion before submission means the agent didn't triage. It brute-forced. The efficiency bonus (20% of the final score) rewards agents that submit with budget remaining.
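The pricing scheme is small enough to sketch directly. The budget formula and tool costs come from the table above; the helper functions themselves are illustrative assumptions, not the environment's real implementation.

```python
# Action pricing from the tool-cost table above. The function shapes
# are illustrative; only the numbers come from the environment spec.

TOOL_COSTS = {
    "get_call_metadata": 5,
    "get_sentiment_timeline": 5,
    "get_transcript_length": 1,
    "analyze_turn": 10,
    "flag_violation": 2,
    "submit_report": 0,
}

def episode_budget(total_turns: int) -> int:
    """50 + (total_turns x 3)."""
    return 50 + total_turns * 3

def tool_cost(tool: str, turns_read: int = 0) -> int:
    if tool == "read_transcript_chunk":
        return 3 * turns_read  # priced per turn read, not per call
    return TOOL_COSTS[tool]
```

On a 30-turn call the budget is 140, and reading the full transcript costs 90 of it, which is exactly the brute-force trap: 50 units left buys only five `analyze_turn` calls and nothing else.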
Third: on-premise training. Financial institutions will not send compliance data to external APIs. SOC 2, PCI-DSS, and internal risk committees prohibit it. The environment runs as a Dockerized Gymnasium. Train on your own infrastructure, deploy on your own infrastructure. openenv validate passes on 2 vCPU, 8 GB RAM, no GPU required for the environment itself. LLM inference uses an external API (HuggingFace router by default), but the grading, budget, and state management are entirely local.
The Hero Agent Trap
Here is the most interesting finding from running baseline inference across all 12 tasks.
A customer calls to cancel their account. The agent says "I can knock your rate down by half a percent. Would you like to stay?" Customer is happy. Call ends positively. Sentiment analysis says: great call.
Ground truth says: two violations. churn_save_policy_breach (HIGH). The agent invented a retention discount not authorized by policy, bypassing the CRM retention system. And incorrect_hold_procedure (MED). The agent placed the customer on hold without asking permission first.
Both are policy violations. Both involve unauthorized promises. But they are categorically different:
- unauthorized_commitment: "I guarantee that late fee won't show on your credit report." The company gets sued.
- churn_save_policy_breach: "I'll knock your rate down by half a percent." The company loses money.
A 72-billion-parameter model cannot distinguish losing money from getting sued. At least, Qwen2.5-72B cannot. It reads the positive sentiment, classifies the violation as the legally-adjacent category it has seen in training data, and misses the P&L leak entirely. Three runs score 0.111. Zero percent detection rate on churn_save_policy_breach.
Qwen2.5-72B: 0.111 on call_006. Gemma 4 31B IT: 0.918 on call_006.
Gemma 4 31B IT, running with its native thinking mode enabled, catches both violations. It triages the call, reads the metadata, checks sentiment, deep-dives into the turns where the agent makes the unauthorized offer, flags the churn_save_policy_breach as HIGH severity, flags the incorrect_hold_procedure as MEDIUM, and submits a clean report. The scores are 0.909, 0.922, and 0.922 across three runs.
That is not a bug being fixed. That is a model generation gap. The question is no longer "can any model do this zero-shot?" The question is "can RL training on this reward function make a small model match a large one?"
That is still the exact narrative RL research cares about. The signal is weaker. The gap is smaller. But for Gemma 4 26B A4B IT (0.670 on call_006) and Qwen 3.5-35B-A3B (0.381 on call_006), there is still clear room for improvement.
Compute Budget as Action Pricing
Most RL environments use a flat step limit. "You get 20 actions, use them wisely." That is a toy constraint. It doesn't map to anything real.
RegTriage uses action pricing because that is how real QA supervisors operate. They don't get 20 minutes per call regardless of length. They triage. 2 minutes on an easy call, 15 on a complex one. The difference is allocation strategy, not raw time.
The budget formula 50 + (total_turns x 3) means:
- Short calls (10 turns): 80 budget, enough for metadata + sentiment + targeted reading + analysis + flag + submit
- Long calls (30 turns): 140 budget, enough for the same workflow with more reading room
- Brute-force reading (entire transcript): costs 3 x total_turns, which on a 30-turn call is 90 units just for reading. That leaves almost nothing for analysis.
The key insight: read_transcript_chunk is priced per-turn. Reading 2 targeted turns costs 6 units. Reading 15 turns costs 45. The agent that reads metadata and sentiment first, identifies 2-3 hot spots, and deep-dives into those turns will always outperform the agent that reads sequentially from turn 0. call_008 demonstrates this. An 18-turn transcript with 104 budget units. Gemma 4 31B reads 5 targeted chunks across the call, flags both violations (unauthorized commitment and incorrect hold procedure), and submits with 16 units remaining. Score: 0.851 across all three runs. The efficiency bonus directly rewards the triage discipline. No budget exhaustion, no auto-fail.
Other tools are priced to teach a specific investigation order. Cheap triage first (metadata 5 units, sentiment 5 units), then targeted reading (3 x turns), then deep analysis (10 units per turn). Flagging is cheap (2 units) to encourage the agent to flag freely rather than hoard findings. Submission is free (0 units). You should never be unable to submit because you ran out of budget investigating.
This is the domain-to-RL translation. QA supervisors don't read every word, they triage. The budget constraint forces the agent to learn the same discipline. The efficiency bonus (20% of the final score) directly rewards agents that submit with budget remaining. Proof they didn't brute-force.
Why Naive Accuracy Fails in Compliance
Common mistake when building compliance tools: optimize for accuracy. "Did the model find the violation?"
Accuracy is the wrong metric. Consider:
False positive on a HIGH severity claim: An agent flags a disclosure failure that never happened. A QA supervisor wastes 30 minutes reviewing a clean call. At scale, that is hours of wasted human time.
False negative on a HIGH severity claim: An agent misses a disclosure failure that did happen. CFPB receives a complaint. The institution faces a $100M+ enforcement action.
These are not symmetric errors. The grading function encodes this asymmetry through three mechanisms:
- Severity-weighted F1: High violations count 3x, medium 2x, low 1x. Getting a high violation wrong is materially worse than getting a low one wrong.
- Auto-fail cap: Miss every high-severity violation? Score is capped at 0.30 regardless of how many low-severity violations you correctly flagged. In compliance, missing every critical breach makes the entire audit worthless.
- False positive penalties scaled by claimed severity: Claiming a false HIGH costs -0.10. Claiming a false LOW costs -0.03. Overclaiming severity is punished proportionally.
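A minimal severity-weighted F1 sketch makes the asymmetry concrete. This assumes violations are matched by category (the type-only matching described later) and uses the stated weights; the function and field names are illustrative, not the repo's API.

```python
# Severity-weighted F1 sketch. Weights follow the post: high=3x,
# med=2x, low=1x. Matching is by category only.

WEIGHTS = {"high": 3.0, "med": 2.0, "low": 1.0}

def weighted_f1(truth: dict, flagged: dict) -> float:
    """truth/flagged map violation category -> severity."""
    tp = sum(WEIGHTS[truth[c]] for c in truth if c in flagged)
    fn = sum(WEIGHTS[truth[c]] for c in truth if c not in flagged)
    fp = sum(WEIGHTS[flagged[c]] for c in flagged if c not in truth)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

On the call_006 ground truth (one HIGH, one MED), catching only the medium-severity hold violation yields recall of 2/5 and an F1 of 4/7: missing the high-severity breach costs far more than the raw violation count suggests.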
The result: scores range from 0.381 to 0.918 across tasks and models. The environment is solvable (easy tasks score near 0.9) but not trivially solvable for all models. Exactly the dynamic you want in an RL benchmark. Clear room for improvement with an unambiguous signal.
Baseline Results
Three models, three runs each, zero-shot, no RL training. All via HuggingFace Inference Router (Novita provider):
| Task | Tier | Gemma 4 31B | Gemma 4 26B A4B | Qwen 3.5-35B-A3B | Key Finding |
|---|---|---|---|---|---|
| call_001 | Easy | 0.755 | 0.874 | 0.923 | Disclosure failure caught by all |
| call_002 | Hard (clean) | 0.017 | 0.550 | 0.020 | Clean call. Gemma 26B passes; 31B and Qwen hallucinate violations |
| call_003 | Easy | 0.883 | 0.836 | 0.863 | PII exposure caught by all |
| call_004 | Hard (clean) | 0.904 | 0.904 | 0.542 | Clean call. Gemma models pass; Qwen flags false positives |
| call_005 | Medium | 0.857 | 0.300 | 0.858 | Both violations found by Gemma 31B and Qwen; Gemma 26B misses |
| call_006 | Medium | 0.918 | 0.670 | 0.381 | Hero Agent trap. Gemma 31B nearly solves it. Qwen still struggles |
| call_007 | Medium | 0.881 | 0.736 | 0.864 | Both violations found by Gemma 31B and Qwen |
| call_008 | Medium | 0.851 | 0.833 | 0.419 | Budget management varies. Gemma models triage well; Qwen exhausts budget |
| call_009 | Hard | 0.859 | 0.720 | 0.783 | Caught 2 of 3 violations consistently |
| call_010 | Hard | 0.750 | 0.674 | 0.410 | Misses churn_save_policy_breach on longer transcripts |
| call_011 | Medium | 0.889 | 0.883 | 0.677 | High detection rate on obvious violations |
| call_012 | Hard | 0.900 | 0.641 | 0.476 | Misses churn_save_policy_breach vs unauthorized on complex calls |
| Model | Overall | Easy | Medium | Hard | call_006 |
|---|---|---|---|---|---|
| Gemma 4 31B IT | 0.789 | 0.819 | 0.879 | 0.686 | 0.918 |
| Gemma 4 26B A4B IT | 0.718 | 0.855 | 0.684 | 0.698 | 0.670 |
| Qwen 3.5-35B-A3B | 0.602 | 0.893 | 0.640 | 0.446 | 0.381 |
| Qwen 2.5-72B (Mar 2026) | 0.462 | 0.897 | 0.435 | 0.268 | 0.111 |
What Gemma 4 31B gets right: Near-perfect detection on easy and medium tasks. The Hero Agent trap, which floored every previous model at ~0.08-0.11, scores 0.918. It correctly distinguishes P&L leaks from legal liabilities. It triages efficiently, rarely exhausting budget. Clean-call precision is mixed: call_004 scores 0.904, but call_002 collapses to 0.017 on hallucinated violations.
What Gemma 4 31B gets wrong: Hard tasks with subtle violations still challenge it. call_010 and call_012, which mix churn_save_policy_breach with other violation types, score 0.750 and 0.900. Not failures, but not perfect either. The 0.017 on call_002 is a false positive problem. It flags violations on a clean call, likely because it is over-eager to find issues.
What the comparison reveals:
- Model architecture matters more than parameter count. Gemma 4 26B A4B IT (26B total, 4B active) outperforms Qwen 3.5-35B-A3B (35B total, 3B active) on every difficulty tier. The dense 31B model outperforms both. Google's reasoning-focused training and 256K context window give it an edge on long-form compliance analysis.
- The Hero Agent trap is model-dependent, not universal. Qwen 2.5-72B cannot do it. Qwen 3.5-35B-A3B partially does it (0.381). Gemma 4 26B A4B mostly does it (0.670). Gemma 4 31B nearly solves it (0.918). The trap is real, but it is a function of model capability, not task impossibility.
- The RL signal is still there, but it is weaker. If Gemma 4 31B scores 0.918 zero-shot, the maximum possible improvement from RL training is 0.082. For Gemma 4 26B A4B, the gap is 0.330. For a hypothetical 7B model, the gap could be 0.50+. The narrative shifts from "frontier models fail, RL fixes it" to "smaller models can match larger ones with RL training."
Architecture Decisions That Matter
A few choices that are not obvious from the README.
PII redaction before the agent sees data. SSN, account numbers, names, emails are regex-redacted before any transcript turn reaches the agent. 28 unit tests cover edge cases. Years not redacted (2024 preserved). Dollar amounts preserved ($500.00). Partial SSN mentions preserved (last four of your social). Idempotent redaction (running the pipeline twice produces identical output). The pii_exposure_risk violation tests whether the human agent requested excessive PII. Not whether PII exists in the text. This models production conditions. Enterprises won't send raw PII to LLM APIs (SOC 2 compliance), and the violation signal is about the agent's request behavior, not the data itself.
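The redaction behavior above can be sketched in a few lines. This is a deliberately simplified illustration of the stated properties (SSNs redacted, years and dollar amounts preserved, idempotent passes); the real pipeline's patterns, covered by 28 unit tests, are more thorough, and these two regexes are assumptions.

```python
import re

# Simplified redaction sketch. The minimum-length account pattern is
# an assumption that keeps years like "2024" and amounts like "$500.00"
# untouched while catching long digit runs.

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
ACCOUNT = re.compile(r"\b\d{10,16}\b")  # long runs only, so 4-digit years survive

def redact(text: str) -> str:
    text = SSN.sub("[SSN]", text)
    text = ACCOUNT.sub("[ACCOUNT]", text)
    return text
```

Idempotence falls out for free here: the replacement tokens contain no digits, so a second pass finds nothing new to redact.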
Type-only matching for violations. Violations match by category, not turn index. A churn_save_policy_breach spanning turns 8-14 gets credit regardless of which turn the agent points to. Multi-turn violations can legitimately be flagged at any participating turn. Deliberate design choice. The alternative (requiring exact turn matches) would make the environment artificially brittle. In real compliance work, the auditor needs to identify what happened, not pinpoint the exact utterance where it started.
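Type-only matching reduces to a set intersection over categories. A hedged sketch, assuming ground-truth violations and agent flags are dicts with a `category` field (the field names are illustrative):

```python
# Type-only matching: a flag earns credit when its category matches
# ground truth, regardless of which turn the agent cites.

def detected(ground_truth: list, flags: list) -> set:
    """Return the ground-truth categories the agent gets credit for."""
    truth_cats = {v["category"] for v in ground_truth}
    flagged_cats = {f["category"] for f in flags}
    return truth_cats & flagged_cats  # turn indices never consulted
```

A churn_save_policy_breach spanning turns 8-14, flagged at turn 10 or turn 14, is credited identically.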
Draft Incident Report, not a score. The output is a structured report with verdict (PASS/FAIL), recommended action (ESCALATE/REVIEW/ARCHIVE), triage efficiency percentage, estimated human review minutes, and a findings list with per-violation status (DETECTED/MISSED). The agent is a scout, not a judge. The human supervisor makes the final call. The ESCALATE/REVIEW/ARCHIVE routing is based on severity. High-severity findings trigger same-day SLA, medium triggers next-day, clean calls get archived without review. At 2M calls/month (a rough estimate assuming typical utilization), this cuts the human review workload from 2M calls to roughly 200K-400K escalated/flagged calls. An 80-90% reduction while achieving 100% coverage.
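The severity-based routing is the simplest piece to pin down. A minimal sketch, assuming findings carry a `severity` field (the function shape and field name are assumptions, the routing rules come from the post):

```python
# Severity-based routing for the Draft Incident Report.

def recommend_action(findings: list) -> str:
    severities = {f["severity"] for f in findings}
    if "high" in severities:
        return "ESCALATE"  # same-day SLA
    if severities:
        return "REVIEW"    # next-day SLA
    return "ARCHIVE"       # clean call, no human review
```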
OpenEnv-native architecture. The environment uses create_app() from openenv.core.env_server, standard Pydantic base types (Action, Observation, State), and the canonical Dockerfile base image (ghcr.io/meta-pytorch/openenv-base:latest). No custom HTTP handlers. The server exposes both stateless HTTP endpoints (/reset, /step, /state) and a stateful WebSocket (/ws) for multi-step episodes. The HTTP endpoints create fresh environment instances per request. Stateless by design. The WebSocket maintains a persistent session for full episode traces.
Where This Goes
I built this for the Meta PyTorch OpenEnv Hackathon. The code passes every validation gate. 6/6 openenv validate checks, Docker build, 28 unit tests on the PII redaction pipeline. But I missed the submission window. Disappointing, but the environment is real and the results above are reproducible. Open-sourcing it was the right next move regardless.
Next steps, in order of what I'd actually do:
1. PPO/GRPO training loop on Gemma 4 26B A4B. The 0.670 on call_006 leaves a 0.248 gap to the 31B model. If RL training can close that gap, a 26B model (4B active) matches a 31B dense model at 1/8th the inference cost. That is the efficiency narrative enterprises care about.
2. Real QA log ingestion. The 12 transcripts are GPT-4o generated. The next step is partnering with a BPO to ingest their anonymized QA failure logs and auto-generate scenario specifications from real patterns. Three files to customize: COMPLIANCE_RUBRIC (the 6 policies), transcripts.json (the scenarios), grading.py (severity weights and auto-fail threshold). Swap the curriculum, keep the gymnasium.
3. Multi-agent review. One agent audits the call, a second agent reviews the Draft Incident Report for consistency and false positives. This models the real QA workflow where a senior supervisor reviews the junior auditor's work. Path to fully autonomous QA with human oversight only on escalated cases.
How to Use It
The environment is OpenEnv-compliant. Run it locally with Docker or use the live instance.
Live environment: HuggingFace Space (/health endpoint confirmed active)
Source code + full setup: GitHub
For researchers: the baseline_results_multi/ directory in the repo contains all raw JSON results from the three-model comparison. Reproducibility is the point.
If you're doing RL research on tool-use agents, compliance, or budget-constrained decision-making, the reward function and the Hero Agent signal are the most immediately useful things to build on.
The verifiability stack
RegTriage is one piece of a three-repo compliance verification stack. The other two solve adjacent problems.
rubric-grader-eval handles deterministic evaluation. It compiles unstructured rubrics into machine-readable schemas, then evaluates documents against them with golden-set ground truth. Its write-up covers the architecture and the compiler-first philosophy.
auditguard-mcp handles the tool-call compliance layer. It gates every MCP tool execution through RBAC, PII detection, policy enforcement, and structured audit logging. Seven-step pipeline for auditable LLM tool use.
Scrutiny is the vertical application of the pattern. A 12-rule FDCPA/Reg F rubric compiled from statutory text. A dual-path evaluator that scores a collections call transcript in under 60 seconds. Its write-up covers the architecture and where it breaks.
Together: RegTriage trains the agent. rubric-grader-eval measures the accuracy. auditguard-mcp proves the pipeline ran compliantly. Scrutiny proves the pattern works as a vertical product. Verifiability from training to tool call to final report.
References
- Gemma 4 Technical Report. Model architecture and training details for Gemma 4 31B IT and Gemma 4 26B A4B IT.
- Qwen3 Technical Report. Covers the Qwen3 model family including MoE architectures like Qwen3-30B-A3B, predecessor to Qwen3.5-35B-A3B. Model page.
- OpenEnv Specification. The RL environment standard RegTriage is built on.
- HuggingFace Inference Router. The API used for all baseline inference runs.
- Pydantic. Data validation for action, observation, and state models.
- SOC 2 Type II. Service organization controls referenced in the on-prem constraint discussion.
- PCI-DSS. Payment Card Industry Data Security Standard.