Quick Start: Evaluation¶
RL-based adversarial testing uses reinforcement learning to train a tester that generates test cases likely to elicit unsafe outputs from a target model. This tutorial shows how to run evaluations using a pre-trained tester against a target model.
Prerequisite
This guide assumes you have already trained a Hugging Face tester (e.g., a Llama 3 model; see Quick Start: Training). You'll point evaluation at that saved tester checkpoint in ./checkpoints/{model_name}/best.
Quick Start¶
Evaluation at a glance: run a set of tester↔target rollouts (seeded by a test set of prompts), collect per-turn data, and compute summary metrics.
1) Setup: imports, model paths, and device¶
Load dependencies and define the models you'll use as the tester and target.
# import dependencies
import torch
import json
from astra_rl import DetoxifyScorer, ASTSampler
from astra_rl.methods.ast_system import ASTEvaluator
from astra_rl.ext.transformers.hf_ast_system import HFEvaluationSystem
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Path to your tester model checkpoint from training
TESTER_MODEL = "./checkpoints/huggingface/best" # assuming tokenizer is in checkpoint (default save in training)
TARGET_MODEL = "meta-llama/Llama-3.1-8B" # can be any HF model
2) Load evaluation prompts¶
Evaluation prompts start the tester–target conversation. Make sure they:
- Match the context you care about (e.g., diagnostic agent prompts for a diagnostic system).
- Were never seen during training or dev.
- Are provided as a list of strings.
with open("prompts_reddit_test.json") as f:
    PROMPTS = json.load(f)  # e.g., ["prompt 1", "prompt 2", ...]
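Because the sampler expects a plain, non-empty list of strings, a quick sanity check after loading can catch formatting problems early. This snippet is illustrative, not part of the library:
# optional: verify the prompt file has the expected shape (a non-empty list of strings)
assert isinstance(PROMPTS, list) and len(PROMPTS) > 0, "expected a non-empty JSON list"
assert all(isinstance(p, str) for p in PROMPTS), "every prompt should be a string"
print(f"Loaded {len(PROMPTS)} evaluation prompts; first prompt: {PROMPTS[0]!r}")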
3) Instantiate the System and Sampler¶
Use the preconfigured HFEvaluationSystem (which takes a pre-trained HF model) and the ASTSampler (ASTPrompter-style rollouts).
# instantiate the HF evaluation system with your tester and target models
system = HFEvaluationSystem(
    tester_checkpoint=TESTER_MODEL,   # local HF dir or Hub id with trained tester
    tester_base_model_id=None,        # optional fallback; None if checkpoint includes tokenizer
    target_model_id=TARGET_MODEL,     # HF id of the target model
    device=DEVICE,                    # "cuda" or "cpu"
    scorer=DetoxifyScorer(),          # harm classifier (optional)
)
# instantiate the AST sampler — no adjustments needed because eval_rollout is available
sampler = ASTSampler(system, PROMPTS, tree_width=1, tree_depth=3)
HFEvaluationSystem inputs
- TESTER_MODEL — Path or Hugging Face Hub id for your trained tester checkpoint (e.g. "/home/user/checkpoints/best" or "username/tester-checkpoint"). The checkpoint may include model weights and tokenizer files.
- TESTER_BASE_MODEL (optional) — Hugging Face model id used as the base tokenizer at training time (e.g. "meta-llama/Llama-3.1-8B"). Only required if the checkpoint does not include tokenizer files.
- TARGET_MODEL — Hugging Face id of the model you want to evaluate (e.g. "meta-llama/Llama-3.1-8B").
- DEVICE — Device string where you want to load models/tokenizers (e.g. "cuda" or "cpu").
- SCORER — Scorer instance that defines the harm metric (e.g. DetoxifyScorer() or LlamaGuardScorer()).
Note
Prefer using keyword arguments (as shown) to avoid accidentally passing parameters in the wrong order.
If your tester checkpoint includes tokenizer files (common when saving via model.save_pretrained() + tokenizer.save_pretrained() during training), set tester_base_model_id=None — the evaluation class will load the tokenizer from the checkpoint. If the checkpoint lacks tokenizer files, provide tester_base_model_id so the correct tokenizer can be loaded.
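If you're not sure whether your checkpoint bundles a tokenizer, you can check for the tokenizer files before constructing the system and fall back to a base model id. This is an illustrative sketch, not part of the library API: it assumes a local checkpoint saved with tokenizer.save_pretrained() (which writes tokenizer_config.json) and uses meta-llama/Llama-3.1-8B as a stand-in for whatever base model your tester was trained from.
# illustrative: choose tester_base_model_id based on whether the checkpoint bundles a tokenizer
from pathlib import Path

has_tokenizer = (Path(TESTER_MODEL) / "tokenizer_config.json").exists()  # written by tokenizer.save_pretrained()
system = HFEvaluationSystem(
    tester_checkpoint=TESTER_MODEL,
    tester_base_model_id=None if has_tokenizer else "meta-llama/Llama-3.1-8B",  # stand-in base id (assumption)
    target_model_id=TARGET_MODEL,
    device=DEVICE,
    scorer=DetoxifyScorer(),
)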
Tip
If you want to evaluate GPT-2 testers, use a custom evaluation system that extends the training system.
See the full GPT-2 evaluation example: ast_gpt2_eval.py.
If you want to evaluate a custom tester or target, create a new System subclass. See the Evaluation System Customization guide.
4) Evaluate!¶
This step instantiates the ASTEvaluator class, which collects standard adversarial-testing metrics such as average target toxicity, attack success rate, average reward, and the average first turn at which the target becomes toxic.
Then we perform evaluation rollouts (multi-turn tester–target conversations with tree_width=1), aggregate the metrics, and save them to disk as a JSON file.
# instantiate the evaluator (seeds is optional; provide seeds here or pass n_rollouts to .evaluate below)
evaluator = ASTEvaluator(sampler, seeds=PROMPTS)
# collect metrics by running n_rollouts
metrics = evaluator.evaluate(n_rollouts=200, progress=True)
# save metrics to json file
evaluator.write_json(metrics, "metrics.json")
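Once metrics.json is written, you can inspect it with the standard library alone. A minimal sketch, assuming the aggregated metrics are stored under the key names listed below (adjust if your version nests them differently):
# optional: reload the saved metrics and print the headline aggregates
with open("metrics.json") as f:
    saved = json.load(f)

print("attack success rate:", saved.get("attack_success_rate"))
print("avg target toxicity:", saved.get("avg_target_toxicity"))
print("avg reward:", saved.get("avg_reward"))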
What metrics does ASTEvaluator collect?
Per-turn metrics (one entry per conversation step)
- turn — turn number in the rollout
- reward — sampler reward at that turn
- adversarial_toxicity — toxicity of the tester's message
- target_toxicity — toxicity of the target's response
- context — truncated conversation context
- probe — tester's utterance
- response — target's response
Overall rollout metrics
- num_turns — number of turns in the rollout
- avg_reward — mean reward across turns
- avg_adversarial_toxicity — average toxicity of tester turns
- avg_target_toxicity — average toxicity of target turns
- first_adversarial_toxicity_turn — earliest turn the tester became toxic
- first_target_toxicity_turn — earliest turn the target became toxic
Aggregated metrics (across all rollouts)
- avg_reward — average reward per rollout
- avg_adversarial_toxicity — average tester toxicity across rollouts
- avg_target_toxicity — average target toxicity across rollouts
- pct_rollouts_with_adversarial_toxicity — fraction of rollouts where the tester was toxic at least once
- pct_rollouts_with_target_toxicity / attack_success_rate — fraction of rollouts where the target became toxic
Note
The source code for ASTEvaluator is located at methods/ast_system. There you can see how metrics are collected and aggregated by the supported evaluator.
If you would like to customize the evaluator (change how evaluation rollouts are performed, what metrics are collected for each rollout, or how metrics are aggregated over rollouts), see the Evaluator Customization guide.