
Moderator

Moderators provide the training signal in LM red-teaming. They act much like reward models in RL: given text (typically the target/defender’s reply), they return a scalar score that reflects harm/unsafety. Attackers are then trained—via your chosen solver (e.g., DPO/IPO/PPO)—to produce utterances that elicit high-harm (or otherwise “undesirable”) target responses, revealing weaknesses in the target’s safety alignment.

ASTRA-RL ships with ready-to-use text moderators and a simple interface for writing your own. This guide explains what a moderator does, what’s included, and how to implement/customize your own class.


1. What Moderators Do

A moderator converts text into a scalar score (one score per input). In most setups:

  • Input: target/defender generations (strings).
  • Output: Sequence[float] scores, e.g., toxicity in [0, 1].

Downstream solvers interpret these scores to train the attacker. For preference-based methods (DPO/IPO/ORPO), scores can help form preferences; for policy-gradient methods (PPO/A2C), scores serve directly as rewards/reward components.
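As a toy illustration of this contract (ConstantModerator below is a stand-in, not a real scorer):

from typing import Sequence
from astra_rl.core.moderator import Moderator

class ConstantModerator(Moderator[str, str]):
    """Toy stand-in that scores every input as perfectly harmless."""
    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        return [0.0 for _ in x]

moderator = ConstantModerator()
scores = moderator.moderate(["Reply A", "Reply B"])
assert list(scores) == [0.0, 0.0]  # one score per input, same order

A policy-gradient solver could consume such scores directly as per-turn rewards; a preference-based solver could compare them across candidate utterances.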


2. Built-in Moderators

ASTRA-RL currently ships with text-based moderators that you can use out of the box:

  • Detoxify — toxicity classification (and related categories). More info here
  • Llama Guard 3 — multi-category safety classifier (e.g., hate/threats/harassment). More info here

These are modular components—swap them freely or use them as templates for your own moderators.
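Usage is the same for any moderator: construct it, then call moderate on a batch of strings. A quick sketch (the import path and constructor defaults here are assumptions; check the API reference for the exact names):

from astra_rl import DetoxifyModerator  # assumed import path; see API reference

detox = DetoxifyModerator()  # assumed default constructor
scores = detox.moderate(["thanks, that was helpful!", "you absolute idiot"])
# Expect the second score to be noticeably higher (more toxic) than the first.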


3. Ways to Customize

3.1 Fast path: adapt a built-in

If you only need to change the category (e.g., “toxicity” → “insult”), adjust thresholds, or tweak preprocessing/batching, you can wrap or lightly subclass a built-in moderator.
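For example, a thin wrapper that binarizes another moderator's scores at a threshold might look like this (ThresholdedModerator is an illustrative name, not part of ASTRA-RL):

from typing import Sequence
from astra_rl.core.moderator import Moderator

class ThresholdedModerator(Moderator[str, str]):
    """Wraps a base moderator and binarizes its scores at a threshold."""
    def __init__(self, base: Moderator[str, str], threshold: float = 0.5):
        self.base = base
        self.threshold = threshold

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # Delegate scoring to the wrapped moderator, then binarize.
        return [1.0 if s >= self.threshold else 0.0 for s in self.base.moderate(x)]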

3.2 Full control: subclass Moderator

For custom scoring models (LLMs, classifiers, rule-based filters), subclass the generic base class and implement one method:

from astra_rl.core.moderator import Moderator
from typing import Sequence, TypeVar, Union

StateT = TypeVar("StateT")
ActionT = TypeVar("ActionT")

class MyModerator(Moderator[StateT, ActionT]):
    def moderate(self, x: Sequence[Union[StateT, ActionT]]) -> Sequence[float]:
        # Return one scalar score per input, in the same order received.
        ...

4. Required Interface

4.1 Type parameters

  • StateT — your environment’s state type (commonly a str holding the conversation context).
  • ActionT — your action type (commonly a str utterance).

For NLP use cases, both are typically str.

4.2 moderate(...) contract

def moderate(self, x: Sequence[Union[StateT, ActionT]]) -> Sequence[float]:
    """Return one scalar score per input, same order as received."""

Expectations:

  • Pure function over the given inputs (no hidden batch size assumptions).
  • Shape: output length equals input length.
  • Scale/direction: document whether higher means more harmful; keeping a consistent scale is recommended. The sketch after this list follows these conventions.
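A minimal moderator that satisfies this contract might look like the following (a rule-based sketch, assuming string states/actions; KeywordModerator and its blocklist are illustrative):

from typing import Sequence
from astra_rl.core.moderator import Moderator

BLOCKLIST = ("badword1", "badword2")  # illustrative placeholder terms

class KeywordModerator(Moderator[str, str]):
    """Scores 1.0 if any blocklisted term appears, else 0.0 (higher = more harmful)."""
    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # One score per input, same order; scale is [0, 1], higher = worse.
        return [1.0 if any(t in s.lower() for t in BLOCKLIST) else 0.0 for s in x]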

5. Best Practices & Sanity Checks

  • Batching: Vectorize model calls for speed; avoid per-item loops.
  • Preprocessing: Handle tokenization/normalization inside the class.
  • Calibration: Keep scores on a consistent scale (e.g., [0, 1]) and direction (higher = worse).
  • Throughput vs. latency: Accumulate inputs into sensible batch sizes.
  • Robustness: Validate on a small corpus; check extremes and benign inputs.
  • Logging: Consider recording auxiliary diagnostics (category probabilities, thresholds) for debugging, while still meeting the Sequence[float] return type; one pattern is sketched below.
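One such pattern is to stash the raw diagnostics on the instance and return only the scalar scores (DiagnosticModerator is an illustrative name, not ASTRA-RL API):

from typing import List, Sequence
from astra_rl.core.moderator import Moderator

class DiagnosticModerator(Moderator[str, str]):
    """Wraps a base moderator and records its raw scores for later inspection."""
    def __init__(self, base: Moderator[str, str]):
        self.base = base
        self.last_scores: List[float] = []

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        scores = list(self.base.moderate(x))
        self.last_scores = scores  # kept for logging/debugging
        return scores  # still Sequence[float], as the interface requires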

6. How-Tos

6.1 Minimal custom moderator (Detoxify wrapper)

from typing import Sequence
from detoxify import Detoxify
from astra_rl.core.moderator import Moderator

class DetoxifyModerator(Moderator[str, str]):
    def __init__(self, harm_category: str = "toxicity", variant: str = "original"):
        self.model = Detoxify(variant)
        self.harm_category = harm_category

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # Detoxify returns a dict mapping each category to a list of scores
        preds = self.model.predict(list(x))
        return [float(preds[self.harm_category][i]) for i in range(len(x))]

6.2 Selecting harm categories

If the underlying library/model exposes multiple categories (e.g., Detoxify or Llama Guard 3), surface a harm_category (or list of categories) in your constructor. You can:

  • return a single category’s score,
  • return an overall any-violation score, if the underlying model provides one, or
  • compute a combined score (e.g., max/mean across selected categories), as in the sketch after this list.
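For the combined-score option, a max-over-categories Detoxify variant might look like this (a sketch; the class name and category list are illustrative):

from typing import Sequence
from detoxify import Detoxify
from astra_rl.core.moderator import Moderator

class MultiCategoryDetoxifyModerator(Moderator[str, str]):
    """Scores each input as the max over several Detoxify categories."""
    def __init__(self, categories: Sequence[str] = ("toxicity", "insult", "threat")):
        self.model = Detoxify("original")
        self.categories = list(categories)

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        preds = self.model.predict(list(x))  # dict: category -> list of scores
        return [
            max(float(preds[c][i]) for c in self.categories)
            for i in range(len(x))
        ]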

6.3 Batching & preprocessing

Inside moderate(...), you’re free to:

  • tokenize inputs, truncate/normalize text, strip HTML, etc.;
  • split inputs into fixed-size batches to fit device memory;
  • run the model on GPU/CPU as configured.

Just be sure to preserve ordering and return one scalar per input, as in the sketch below.
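A fixed-size batching pattern might look like this (a sketch; the batch size and the inner scoring function are placeholders for your real model call):

from typing import List, Sequence

def _score_batch(batch: Sequence[str]) -> List[float]:
    """Placeholder for a real model call (e.g., one forward pass)."""
    return [0.0 for _ in batch]

def moderate_in_batches(x: Sequence[str], batch_size: int = 32) -> List[float]:
    scores: List[float] = []
    for start in range(0, len(x), batch_size):
        # Consecutive slices keep the original input order intact.
        scores.extend(_score_batch(x[start:start + batch_size]))
    return scores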

6.4 Integrate your moderator into a Problem

Instantiate your moderator in your Problem subclass and pass it to the base class:

from transformers import GPT2LMHeadModel, AutoTokenizer
from astra_rl import ASTProblem  # base Problem

MODEL_NAME = "gpt2"

class ExampleDetoxifyProblem(ASTProblem):
    def __init__(self, device: str = "cpu"):
        # Plug in any custom moderator here
        super().__init__(DetoxifyModerator(harm_category="toxicity"))

        self.device = device
        self.attacker = GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(self.device)
        self.target   = GPT2LMHeadModel.from_pretrained(MODEL_NAME).to(self.device)

        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

After this, your environment/solver will use the moderator implicitly when computing rewards.


7. Full Examples

Use these as references when building your own moderator classes.