In our latest paper, we present that it’s attainable to routinely discover inputs that elicit dangerous textual content from language fashions by producing inputs utilizing language fashions themselves. Our strategy gives one instrument for locating dangerous mannequin behaviours earlier than customers are impacted, although we emphasize that it ought to be seen as one part alongside many different methods that might be wanted to search out harms and mitigate them as soon as discovered.
Massive generative language fashions like GPT-3 and Gopher have a exceptional capacity to generate high-quality textual content, however they’re troublesome to deploy in the true world. Generative language fashions include a danger of producing very dangerous textual content, and even a small danger of hurt is unacceptable in real-world functions.
For instance, in 2016, Microsoft launched the Tay Twitter bot to routinely tweet in response to customers. Inside 16 hours, Microsoft took Tay down after a number of adversarial customers elicited racist and sexually-charged tweets from Tay, which have been despatched to over 50,000 followers. The result was not for lack of care on Microsoft’s half:
The difficulty is that there are such a lot of attainable inputs that may trigger a mannequin to generate dangerous textual content. Because of this, it’s arduous to search out the entire circumstances the place a mannequin fails earlier than it’s deployed in the true world. Earlier work depends on paid, human annotators to manually uncover failure circumstances (Xu et al. 2021, inter alia). This strategy is efficient however costly, limiting the quantity and variety of failure circumstances discovered.
We intention to enrich handbook testing and cut back the variety of essential oversights by discovering failure circumstances (or ‘pink teaming’) in an computerized method. To take action, we generate take a look at circumstances utilizing a language mannequin itself and use a classifier to detect varied dangerous behaviors on take a look at circumstances, as proven under:
Our strategy uncovers a wide range of dangerous mannequin behaviors:
- Offensive Language: Hate speech, profanity, sexual content material, discrimination, and many others.
- Knowledge Leakage: Producing copyrighted or non-public, personally-identifiable data from the coaching corpus.
- Contact Info Era: Directing customers to unnecessarily e mail or name actual folks.
- Distributional Bias: Speaking about some teams of individuals in an unfairly completely different method than different teams, on common over numerous outputs.
- Conversational Harms: Offensive language that happens within the context of a protracted dialogue, for instance.
To generate take a look at circumstances with language fashions, we discover a wide range of strategies, starting from prompt-based era and few-shot studying to supervised finetuning and reinforcement studying. Some strategies generate extra numerous take a look at circumstances, whereas different strategies generate tougher take a look at circumstances for the goal mannequin. Collectively, the strategies we suggest are helpful for acquiring excessive take a look at protection whereas additionally modeling adversarial circumstances.
As soon as we discover failure circumstances, it turns into simpler to repair dangerous mannequin habits by:
- Blacklisting sure phrases that incessantly happen in dangerous outputs, stopping the mannequin from producing outputs that comprise high-risk phrases.
- Discovering offensive coaching knowledge quoted by the mannequin, to take away that knowledge when coaching future iterations of the mannequin.
- Augmenting the mannequin’s immediate (conditioning textual content) with an instance of the specified habits for a sure form of enter, as proven in our latest work.
- Coaching the mannequin to decrease the probability of its unique, dangerous output for a given take a look at enter.
General, language fashions are a extremely efficient instrument for uncovering when language fashions behave in a wide range of undesirable methods. In our present work, we targeted on pink teaming harms that right this moment’s language fashions commit. Sooner or later, our strategy will also be used to preemptively uncover different, hypothesized harms from superior machine studying methods, e.g., as a consequence of inside misalignment or failures in goal robustness. This strategy is only one part of accountable language mannequin growth: we view pink teaming as one instrument for use alongside many others, each to search out harms in language fashions and to mitigate them. We consult with Part 7.3 of Rae et al. 2021 for a broader dialogue of different work wanted for language mannequin security.
For extra particulars on our strategy and outcomes, in addition to the broader penalties of our findings, learn our pink teaming paper right here.