Detoxify

astra_rl.moderators.detoxify

detoxify.py: Moderator to call into the Detoxify engine.

DetoxifyModerator

Bases: Moderator[str, str]

Moderator that wraps the Detoxify library for toxicity detection.

https://github.com/unitaryai/detoxify

Attributes:

harm_category (str): The category of harm to detect (default is "toxicity"); see below.

variant (str): The variant of the Detoxify model to use (default is "original").

Notes

Possible harm categories include "toxicity", "severe_toxicity", "obscene", "identity_attack", "insult", "threat", "sexual_explicit".

Possible variants include "original", "multilingual", "unbiased".
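
For example, a non-default harm category can be paired with a different model variant by passing both constructor arguments. This is a minimal sketch, assuming the import path matches the module name shown above:

from astra_rl.moderators.detoxify import DetoxifyModerator

# Score texts for insults using the multilingual Detoxify checkpoint.
moderator = DetoxifyModerator(harm_category="insult", variant="multilingual")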

Source code in src/astra_rl/moderators/detoxify.py
class DetoxifyModerator(Moderator[str, str]):
    """Moderator that wraps the Detoxify library for toxicity detection.

    https://github.com/unitaryai/detoxify

    Attributes:
        harm_category (str): The category of harm to detect (default is "toxicity"); see below.
        variant (str): The variant of the Detoxify model to use (default is "original").

    Notes:
        Possible harm categories
        include "toxicity", "severe_toxicity", "obscene", "identity_attack",
        "insult", "threat", "sexual_explicit".

        Possible variants
        include "original", "multilingual", "unbiased".
    """

    def __init__(self, harm_category: str = "toxicity", variant: str = "original"):
        self.model = Detoxify(variant)
        self.harm_category = harm_category

    def moderate(self, x: Sequence[str]) -> Sequence[float]:
        # we ignore typing here because we don't actually have the ability
        # to get typing information from detoxify
        return self.model.predict(x)[self.harm_category]  # type: ignore
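
A minimal usage sketch with the defaults (the example strings and the printout are illustrative; Detoxify downloads the underlying checkpoint on first use):

from astra_rl.moderators.detoxify import DetoxifyModerator

texts = ["thanks for the helpful answer", "you are an idiot"]

moderator = DetoxifyModerator()  # defaults: harm_category="toxicity", variant="original"
scores = moderator.moderate(texts)

# `scores` is one toxicity score per input string, each a float in [0, 1].
print(list(zip(texts, scores)))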