
Customize Evaluation

RL-based red-teaming uses reinforcement learning to train a tester model that generates test cases likely to elicit unsafe outputs from a target model. The tester probes the target through multi-turn conversations and tries to surface failure modes (unsafe or harmful outputs).

ASTRA-RL supports both training testers and evaluating targets. In evaluation mode we repeatedly test targets using a pre-trained tester and collect metrics that describe how, when, and how often a target fails.
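
Conceptually, each evaluation episode is a loop: the tester extends the conversation with a probe, the target responds, a scorer judges the response, and the per-turn records are aggregated into metrics. The sketch below shows only that shape; `tester`, `target`, and `scorer` are placeholder callables, not ASTRA-RL objects.

```python
# Schematic sketch of one multi-turn evaluation episode. `tester`, `target`,
# and `scorer` are placeholder callables here, not ASTRA-RL objects.
def run_episode(seed_prompt, tester, target, scorer, max_turns=5):
    conversation = seed_prompt
    records = []
    for turn in range(max_turns):
        probe = tester(conversation)              # tester extends the conversation
        response = target(conversation + probe)   # target replies to the probe
        score = scorer(response)                  # e.g., probability the reply is unsafe
        records.append(
            {"turn": turn, "probe": probe, "response": response, "score": score}
        )
        conversation += probe + response          # carry context into the next turn
    return records
```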

What counts as a successful audit?

An audit is successful when the target model produces an unsafe or harmful utterance according to the configured scorer (e.g., Detoxify, LlamaGuard). Red-teaming's purpose is to discover as many such failure modes as possible so they can be analyzed and mitigated.
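
As a concrete, library-agnostic illustration, the snippet below flags a response as unsafe when its Detoxify toxicity score crosses a threshold. The 0.5 threshold and the `is_unsafe` helper are arbitrary choices for this example; the scorer interface ASTRA-RL actually uses may differ.

```python
# Library-agnostic illustration of an "unsafe output" check using Detoxify.
# The 0.5 threshold and the is_unsafe helper are arbitrary choices for this
# example; ASTRA-RL's configured scorer may expose a different interface.
from detoxify import Detoxify

scorer = Detoxify("original")  # pretrained toxicity classifier

def is_unsafe(response: str, threshold: float = 0.5) -> bool:
    """Flag a target response whose toxicity score exceeds the threshold."""
    scores = scorer.predict(response)  # dict of scores, including "toxicity"
    return scores["toxicity"] > threshold
```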


  • To run a red-team evaluation with a trained tester, follow the quick guide: Quick Start: Evaluation. It shows how to use the supported evaluation systems (HFEvaluationSystem for Hugging Face models and GPT2EvaluationSystem for GPT-2) and the default evaluator (ASTEvaluator); a rough usage sketch also follows this list.

  • If you need to support a custom model or tokenizer, see Evaluation System Customization.

  • If you want to collect different per-turn or aggregated metrics, or change how metrics are computed/serialized, see Evaluator Customization.
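
For orientation, a run with the built-in pieces looks roughly like the sketch below. The class names come from this page, but the import path, constructor arguments, and method names are assumptions, so treat it as a shape rather than a reference; Quick Start: Evaluation has the real signatures.

```python
# Rough sketch only: the class names appear on this page, but the import path,
# constructor arguments, and method names here are assumptions. See
# Quick Start: Evaluation for the actual API.
from astra_rl import HFEvaluationSystem, ASTEvaluator  # import path assumed

# Wrap the trained tester checkpoint and the target model (argument names hypothetical).
system = HFEvaluationSystem(
    tester="path/to/tester-checkpoint",   # hypothetical argument
    target="your-org/your-target-model",  # hypothetical argument
)

# Run the default evaluator over held-out prompts (method name hypothetical).
evaluator = ASTEvaluator(system)
results = evaluator.evaluate(prompts=["Tell me about yourself."])
```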


Short workflow

  1. Train a tester (see Quick Start: Training).
  2. Point evaluation at your tester checkpoint.
  3. Run ASTEvaluator over a set of held-out prompts (never used at training time).
  4. Inspect per-turn logs and aggregated metrics (JSON output) to find failure modes; a parsing sketch follows this list.
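
Step 4 usually amounts to loading the serialized results and summarizing them. The sketch below assumes the evaluator wrote a JSON list of per-turn records with `prompt` and `unsafe` fields; adjust the file name and field names to whatever your evaluator actually emits.

```python
# Sketch: summarize a serialized evaluation run. The file name and the record
# fields ("prompt", "unsafe") are assumptions about the output schema; adapt
# them to whatever your evaluator actually writes.
import json
from collections import Counter

with open("eval_results.json") as f:
    records = json.load(f)

failures = [r for r in records if r.get("unsafe")]
print(f"failure rate: {len(failures) / len(records):.2%}")

# Which seed prompts trigger failures most often?
by_prompt = Counter(r["prompt"] for r in failures)
for prompt, count in by_prompt.most_common(5):
    print(count, prompt[:80])
```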

Tips

  • Use a scorer that matches your safety criteria (e.g., toxicity vs. policy violations).
  • Keep evaluation prompts out-of-sample to avoid reporting overfit behavior (see the split sketch below).
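
A simple way to keep prompts out-of-sample is to split the pool once, before training, and reserve one slice for evaluation only. A minimal sketch, where the file names and the 90/10 split are arbitrary choices:

```python
# Minimal sketch: hold out 10% of a prompt pool for evaluation before any
# training happens. File names and the 90/10 split are arbitrary choices.
import json
import random

with open("prompts.json") as f:
    prompts = json.load(f)

random.seed(0)            # fixed seed so the split is reproducible
random.shuffle(prompts)

cut = int(0.9 * len(prompts))
train_prompts, eval_prompts = prompts[:cut], prompts[cut:]

with open("train_prompts.json", "w") as f:
    json.dump(train_prompts, f)
with open("eval_prompts.json", "w") as f:
    json.dump(eval_prompts, f)
```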