Detoxify

astra_rl.moderators.detoxify

detoxify.py: Moderator to call into the Detoxify engine.

DetoxifyModerator

Bases: Moderator[str, str]

Moderator that wraps the Detoxify library for toxicity detection.

https://github.com/unitaryai/detoxify

Attributes:

harm_category (str): The category of harm to detect (default is "toxicity"); see below.

variant (str): The variant of the Detoxify model to use (default is "original").

Notes

Possible harm categories include "toxicity", "severe_toxicity", "obscene", "identity_attack", "insult", "threat", "sexual_explicit".

Possible variants include "original", "multilingual", "unbiased".
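
For example, a non-default harm category can be paired with a different model variant by passing both constructor arguments. This is a minimal sketch, assuming the import path matches the module name shown above:

from astra_rl.moderators.detoxify import DetoxifyModerator

# Score texts for insults using the multilingual Detoxify checkpoint.
moderator = DetoxifyModerator(harm_category="insult", variant="multilingual")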

Source code in src/astra_rl/moderators/detoxify.py
class DetoxifyModerator(Moderator[str, str]):
    """Moderator that wraps the Detoxify library for toxicity detection.

    https://github.com/unitaryai/detoxify

    Attributes:
        harm_category (str): The category of harm to detect (default is "toxicity"); see below.
        variant (str): The variant of the Detoxify model to use (default is "original").

    Notes:
        Possible harm categories
        include "toxicity", "severe_toxicity", "obscene", "identity_attack",
        "insult", "threat", "sexual_explicit".

        Possible variants
        include "original", "multilingual", "unbiased".
    """

    def __init__(self, harm_category: str = "toxicity", variant: str = "original"):
        self.model = Detoxify(variant)
        self.harm_category = harm_category

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # we ignore typing here because we don't actually have the ability
        # to get typing information from detoxify
        return self.model.predict(x)[self.harm_category]  # type: ignore
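
A minimal usage sketch with the defaults (the example strings and the printout are illustrative; Detoxify downloads the underlying checkpoint on first use):

from astra_rl.moderators.detoxify import DetoxifyModerator

texts = ["thanks for the helpful answer", "you are an idiot"]

moderator = DetoxifyModerator()  # defaults: harm_category="toxicity", variant="original"
scores = moderator.moderate(texts)

# `scores` is one toxicity score per input string, each a float in [0, 1].
print(list(zip(texts, scores)))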