Debiasing language models can improve fairness in AI toxicity detection.
The article investigates why different methods for removing bias from language models have varying effects on downstream task performance. Using causal mediation analysis, the researchers examined how debiasing techniques affect a model's ability to detect toxic language. They found that debiasing methods should be evaluated against multiple bias metrics, and that the analysis should focus on changes in specific model components, such as the first two layers and the attention heads. A minimal sketch of this kind of component-level analysis appears below.
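The following is a minimal, hedged sketch of the core idea behind causal mediation analysis of a language model component: cache a layer's activations under a counterfactual input, patch them into a run on the original input, and measure how the output changes (the indirect effect of that layer). It assumes GPT-2 via Hugging Face `transformers`; the prompts, layer index, and target token are illustrative placeholders, not the study's actual setup.

```python
# Sketch: activation patching to estimate the indirect effect of one layer.
# Assumptions: GPT-2 from Hugging Face transformers; toy prompts of equal
# token length; layer index and target token chosen for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def next_token_prob(input_ids, token_id):
    """Probability the model assigns to `token_id` as the next token."""
    with torch.no_grad():
        logits = model(input_ids).logits[0, -1]
    return torch.softmax(logits, dim=-1)[token_id].item()

def run_with_patched_layer(input_ids, layer_idx, cached_hidden):
    """Run the model while replacing one block's output hidden states
    with activations cached from a counterfactual input of equal length."""
    def hook(module, inputs, output):
        # A GPT2Block returns a tuple; element 0 is the hidden-state tensor.
        return (cached_hidden,) + output[1:]
    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]
    finally:
        handle.remove()
    return torch.softmax(logits, dim=-1)

# Two inputs differing only in a group-identity term (hypothetical example).
ids_a = tokenizer("The woman said she was", return_tensors="pt").input_ids
ids_b = tokenizer("The man said he was", return_tensors="pt").input_ids
target = tokenizer.encode(" angry")[0]  # token whose probability we track
layer_idx = 5                           # illustrative choice of mediator

# Cache the chosen layer's activations under the counterfactual input B.
cached = {}
def cache_hook(module, inputs, output):
    cached["h"] = output[0]
h = model.transformer.h[layer_idx].register_forward_hook(cache_hook)
with torch.no_grad():
    model(ids_b)
h.remove()

# Indirect effect of the layer: run input A, but patch in B's activations.
p_base = next_token_prob(ids_a, target)
p_patched = run_with_patched_layer(ids_a, layer_idx, cached["h"])[target].item()
print(f"base p={p_base:.4f}  patched p={p_patched:.4f}  "
      f"indirect effect={p_patched - p_base:.4f}")
```

Repeating this measurement per layer (or per attention head, by patching head outputs instead of whole blocks) before and after debiasing is one way to localize where a debiasing method actually changes the model's behavior.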