Moderator
Moderators provide the training signal in LM red-teaming. They act much like reward models in RL: given text (typically the target/defender’s reply), they return a scalar score that reflects harm/unsafety. Attackers are then trained—via your chosen solver (e.g., DPO/IPO/PPO)—to produce utterances that elicit high-harm (or otherwise “undesirable”) target responses, revealing weaknesses in the target’s safety alignment.
ASTRA-RL ships with ready-to-use text moderators and a simple interface for writing your own. This guide explains what a moderator does, what’s included, and how to implement/customize your own class.
1. What Moderators Do
A moderator converts text into a scalar score (one score per input). In most setups:
- Input: target/defender generations (strings).
- Output: `Sequence[float]` scores, e.g., toxicity in [0, 1].
Downstream solvers interpret these scores to train the attacker. For preference-based methods (DPO/IPO/ORPO), scores can help form preferences; for policy-gradient methods (PPO/A2C), scores serve directly as rewards/reward components.
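As a concrete illustration, here is a minimal sketch of how a solver might consume these scores. This is not ASTRA-RL's internal wiring (the solvers handle this for you); `score_to_signals` is a hypothetical helper:

```python
from typing import Sequence

from astra_rl.core.moderator import Moderator

def score_to_signals(moderator: Moderator[str, str], replies: Sequence[str]):
    """Hypothetical helper: turn moderator scores into solver signals."""
    scores = moderator.moderate(replies)
    rewards = list(scores)  # PPO/A2C: use each score directly as a reward
    # DPO/IPO/ORPO: rank candidates by score; the attacker "prefers" the
    # utterance that elicited the most harmful target reply.
    chosen_idx = max(range(len(scores)), key=lambda i: scores[i])
    return rewards, chosen_idx
```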
2. Built-in Moderators
ASTRA-RL currently ships with text-based moderators that you can use out of the box:
- Detoxify — toxicity classification (and related categories).
- Llama Guard 3 — multi-category safety classifier (e.g., hate/threats/harassment).
These are modular components—swap them freely or use them as templates for your own moderators.
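Using one out of the box looks roughly like this. The import path and class name below are assumptions; check the modules listed in section 7 for the actual names:

```python
# Assumed import path/class name -- verify against astra_rl/moderators/
# (see the full examples in section 7).
from astra_rl.moderators.detoxify import DetoxifyModerator

moderator = DetoxifyModerator()  # scores "toxicity" by default
scores = moderator.moderate(["have a nice day", "some hostile text"])
# -> one float per input, higher = more toxic
```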
3. Ways to Customize
3.1 Fast path: adapt a built-in
If you only need to change the category (e.g., “toxicity” → “insult”), adjust thresholds, or tweak preprocessing/batching, you can wrap or lightly subclass a built-in moderator, as in the sketch below.
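For example, a light subclass that swaps the category and adds a score threshold might look like this (it builds on the `DetoxifyModerator` wrapper from section 6.1 below; the 0.5 cutoff is an illustrative choice):

```python
from typing import Sequence

class InsultModerator(DetoxifyModerator):  # DetoxifyModerator is defined in section 6.1
    def __init__(self, threshold: float = 0.5):
        super().__init__(harm_category="insult")
        self.threshold = threshold  # illustrative cutoff, not a library default

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        scores = super().moderate(x)
        # Zero out weak signals so they don't shape attacker training.
        return [s if s >= self.threshold else 0.0 for s in scores]
```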
3.2 Full control: subclass `Moderator`
For custom scoring models (LLMs, classifiers, rule-based filters), subclass the generic base class and implement one method:
```python
from typing import Sequence, TypeVar, Union

from astra_rl.core.moderator import Moderator

StateT = TypeVar("StateT")
ActionT = TypeVar("ActionT")

class MyModerator(Moderator[StateT, ActionT]):
    def moderate(self, x: Sequence[Union[StateT, ActionT]]) -> Sequence[float]:
        ...
```
4. Required Interface
4.1 Type parameters
- `StateT` — your environment’s state type (commonly a `str` conversation context).
- `ActionT` — your action type (commonly a `str` utterance).

For NLP use cases, both are typically `str`.
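If your environment is not plain text, the generics let you say so. A sketch with a hypothetical structured state type (the `DialogueState` class and the flattening step are illustrative):

```python
from dataclasses import dataclass, field
from typing import Sequence, Union

from astra_rl.core.moderator import Moderator

@dataclass
class DialogueState:  # hypothetical structured state
    turns: list[str] = field(default_factory=list)

class StructuredModerator(Moderator[DialogueState, str]):
    def moderate(self, x: Sequence[Union[DialogueState, str]]) -> Sequence[float]:
        # Flatten structured states into text before scoring (scorer stubbed out).
        texts = ["\n".join(s.turns) if isinstance(s, DialogueState) else s for s in x]
        return [0.0 for _ in texts]  # plug in a real scoring model here
```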
4.2 `moderate(...)` contract
```python
def moderate(self, x: Sequence[Union[StateT, ActionT]]) -> Sequence[float]:
    """Return one scalar score per input, same order as received."""
```
Expectations (a quick self-check is sketched below):
- Pure function over the given inputs (no hidden batch-size assumptions).
- Shape: output length equals input length, in the same order.
- Scale/direction: document whether higher = more harmful (recommended).
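A minimal self-check sketch along these lines (the `[0, 1]` assertion only applies if that is the scale you document):

```python
from typing import Sequence

def check_moderator_contract(moderator, inputs: Sequence[str]) -> None:
    """Spot-check the moderate(...) contract on a small corpus."""
    scores = moderator.moderate(inputs)
    assert len(scores) == len(inputs), "one score per input"
    assert all(isinstance(s, float) for s in scores), "scalar floats only"
    assert all(0.0 <= s <= 1.0 for s in scores), "score outside documented [0, 1] scale"
```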
5. Best Practices & Sanity Checks
- Batching: Vectorize model calls for speed; avoid per-item loops.
- Preprocessing: Handle tokenization/normalization inside the class.
- Calibration: Keep scores on a consistent scale (e.g., [0, 1]) and direction (higher = worse).
- Throughput vs. latency: Accumulate inputs into sensible batch sizes.
- Robustness: Validate on a small corpus; check extremes and benign inputs.
- Logging: Consider recording auxiliary diagnostics (category probabilities, thresholds) for debugging while still meeting the `Sequence[float]` return type, as in the sketch below.
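One way to get that logging pattern, sketched against the `DetoxifyModerator` from section 6.1 (`last_diagnostics` is an illustrative attribute, not part of the ASTRA-RL interface):

```python
from typing import Sequence

class DiagnosticDetoxifyModerator(DetoxifyModerator):  # see section 6.1
    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        preds = self.model.predict(x)   # dict of category -> per-input scores
        self.last_diagnostics = preds   # stash full predictions for debugging
        return [float(s) for s in preds[self.harm_category]]
```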
6. How-Tos

6.1 Minimal custom moderator (Detoxify wrapper)
```python
from typing import Sequence

from detoxify import Detoxify

from astra_rl.core.moderator import Moderator

class DetoxifyModerator(Moderator[str, str]):
    def __init__(self, harm_category: str = "toxicity", variant: str = "original"):
        self.model = Detoxify(variant)
        self.harm_category = harm_category

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # Detoxify returns a dict of category -> scores (one per input)
        preds = self.model.predict(x)
        return [float(preds[self.harm_category][i]) for i in range(len(x))]
```
6.2 Selecting harm categories

If the underlying library/model exposes multiple categories (e.g., Detoxify or Llama Guard 3), surface a `harm_category` (or list of categories) in your constructor. You can:
- return a single category’s score,
- ignore the harm category and return the score for any violation, or
- compute a combined score (e.g., max/mean across selected categories), as in the sketch below.
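A sketch of the combined-score option, again extending the `DetoxifyModerator` from section 6.1 (the category list is an illustrative default):

```python
from typing import Sequence

class MultiCategoryModerator(DetoxifyModerator):  # see section 6.1
    def __init__(self, categories: Sequence[str] = ("toxicity", "insult", "threat")):
        super().__init__()
        self.categories = list(categories)

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        preds = self.model.predict(x)
        # max across categories: "how bad is the worst category for this input?"
        return [max(float(preds[c][i]) for c in self.categories) for i in range(len(x))]
```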
6.3 Batching & preprocessing

Inside `moderate(...)`, you’re free to:
- tokenize inputs, truncate/normalize text, strip HTML, etc.;
- split inputs into fixed-size batches to fit device memory;
- run the model on GPU/CPU as configured.
Just be sure to preserve ordering and return one scalar per input.
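A minimal batching sketch along these lines, built on the `DetoxifyModerator` from section 6.1 (the batch size of 32 is an arbitrary illustrative default):

```python
from typing import Sequence

class BatchedDetoxifyModerator(DetoxifyModerator):  # see section 6.1
    def __init__(self, batch_size: int = 32, **kwargs):
        super().__init__(**kwargs)
        self.batch_size = batch_size

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        scores: list[float] = []
        for start in range(0, len(x), self.batch_size):
            chunk = list(x[start : start + self.batch_size])
            preds = self.model.predict(chunk)
            # Extend in order so output index i still matches input index i.
            scores.extend(float(s) for s in preds[self.harm_category])
        return scores
```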
6.4 Integrate your moderator into a Problem

Instantiate your moderator in your `Problem` subclass and pass it to the base class:
```python
from transformers import GPT2LMHeadModel, AutoTokenizer

from astra_rl import ASTProblem  # base Problem

MODEL_NAME = "gpt2"

class ExampleDetoxifyProblem(ASTProblem):
    def __init__(self, device: str = "cpu"):
        # Plug in any custom moderator here
        super().__init__(DetoxifyModerator(harm_category="toxicity"))
        self.device = device
        self.attacker = GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(self.device)
        self.target = GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
```
After this, your environment/solver will use the moderator implicitly when computing rewards.
7. Full Examples
- `astra_rl/moderators/detoxify.py` — wraps the Detoxify library.
- `astra_rl/moderators/llamaGuard.py` — wraps Meta's Llama Guard 3.
Use these as references when building your own moderator classes.