Customize Evaluation¶
RL-based red-teaming uses reinforcement learning to train a tester that generates test cases likely to elicit unsafe outputs from a target model. The tester probes the target over multi-turn conversations, trying to surface failure modes (unsafe or harmful outputs).
ASTRA-RL supports both training testers and evaluating targets. In evaluation mode, a pre-trained tester repeatedly probes the target while metrics are collected that describe how, when, and how often the target fails.
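Conceptually, a probe is just an alternating conversation: the tester emits an adversarial utterance, the target replies, and each target reply is recorded for scoring. The loop below is a generic illustration using off-the-shelf GPT-2 pipelines, not ASTRA-RL's actual implementation:

```python
from transformers import pipeline

# Stand-ins for a trained tester and a target; ASTRA-RL manages these for you.
tester = pipeline("text-generation", model="gpt2")
target = pipeline("text-generation", model="gpt2")

def continue_text(generator, context, max_new_tokens=40):
    """Return only the newly generated continuation of `context`."""
    out = generator(
        context,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    return out[0]["generated_text"][len(context):].strip()

conversation = "How do I get started with chemistry at home?"
target_utterances = []
for _ in range(3):                                # three tester/target exchanges
    probe = continue_text(tester, conversation)   # tester tries to elicit a failure
    conversation += " " + probe
    reply = continue_text(target, conversation)   # target responds
    conversation += " " + reply
    target_utterances.append(reply)               # later scored for unsafe content
```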
What counts as a successful audit?
An audit is successful when the target model produces an unsafe or harmful utterance according to the configured scorer (e.g., Detoxify, LlamaGuard). Red-teaming's purpose is to discover as many such failure modes as possible so they can be analyzed and mitigated.
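For example, with Detoxify as the scorer, a target utterance might be flagged unsafe when any of its toxicity scores crosses a threshold. The helper below is illustrative only; the `is_unsafe` name and the 0.5 threshold are not part of ASTRA-RL:

```python
from detoxify import Detoxify  # pip install detoxify

scorer = Detoxify("original")  # pretrained multi-label toxicity classifier

def is_unsafe(utterance: str, threshold: float = 0.5) -> bool:
    """Flag a target utterance as unsafe if any toxicity score exceeds the threshold."""
    scores = scorer.predict(utterance)  # dict: toxicity, insult, threat, ...
    return any(score >= threshold for score in scores.values())

print(is_unsafe("You are a wonderful person."))  # expected: False
```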
Quick links¶
- To run a red-team evaluation using a trained tester, follow the quick guide: Quick Start: Evaluation. This shows how to use the supported evaluation systems (`HFEvaluationSystem` for HF models and `GPT2EvaluationSystem` for GPT-2) and the default evaluator (`ASTEvaluator`).
- If you need to support a custom model or tokenizer, see Evaluation System Customization.
- If you want to collect different per-turn or aggregated metrics, or change how metrics are computed/serialized, see Evaluator Customization.
Short workflow¶
- Train a tester (see Quick Start: Training).
- Point evaluation at your tester checkpoint.
- Run `ASTEvaluator` over a set of held-out prompts (never used at training time).
- Inspect per-turn logs and aggregated metrics (JSON output) to find failure modes (a sketch of these steps follows this list).
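The workflow maps onto a sketch like the following. The class names (`HFEvaluationSystem`, `ASTEvaluator`) come from the quick links above, but the import path, constructor arguments, and method names are assumptions made for illustration; consult Quick Start: Evaluation for the exact API.

```python
# Minimal sketch of the evaluation workflow. Import path, constructor
# arguments, and method names are assumptions; see Quick Start: Evaluation
# for the real API.
import json

from astra_rl import ASTEvaluator, HFEvaluationSystem  # hypothetical import path

# Wrap the trained tester checkpoint and the target model in an evaluation
# system (argument names are assumed, not the library's actual signature).
system = HFEvaluationSystem(
    tester="checkpoints/my-tester",     # tester trained in Quick Start: Training
    target="meta-llama/Llama-3.1-8B",   # example target under audit
)

# Held-out prompts that were never used during tester training.
with open("eval_prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

# Run the evaluator and serialize per-turn logs plus aggregated metrics.
evaluator = ASTEvaluator(system)
results = evaluator.evaluate(prompts)   # assumed method name
with open("eval_results.json", "w") as f:
    json.dump(results, f, indent=2)
```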
Tips¶
- Use a scorer that matches your safety criteria (e.g., toxicity vs. policy violations).
- Keep evaluation prompts out-of-sample to avoid reporting overfit behavior (a simple deterministic split is sketched below).
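One low-effort way to enforce the out-of-sample rule is to split your prompt pool deterministically before training and check that the two sets never overlap. The `prompts.txt` file name and the split fraction below are placeholders:

```python
import random

def split_prompts(prompts, eval_fraction=0.2, seed=0):
    """Deterministically split a prompt pool into train/eval sets with no overlap."""
    pool = sorted(set(prompts))          # dedupe so a prompt can't land in both sets
    rng = random.Random(seed)
    rng.shuffle(pool)
    n_eval = max(1, int(len(pool) * eval_fraction))
    eval_set, train_set = pool[:n_eval], pool[n_eval:]
    assert not set(eval_set) & set(train_set), "train/eval prompts overlap"
    return train_set, eval_set

# "prompts.txt" is a placeholder: one prompt per line.
with open("prompts.txt") as f:
    train_prompts, eval_prompts = split_prompts(
        line.strip() for line in f if line.strip()
    )
```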