Reasons to worry about AI safety via debate
The previous article in the sequence is Definition of the debate game.
From Section 5, “Reasons to worry”1 of AI safety via debate2:
We turn next to several reasons debate could fail as an approach to AI alignment. These include questions about
- training target (whether humans are sufficient judges to align debate),
- capability (whether debate makes agents weaker),
- our ability to find strong play in practice using ML algorithms, and
- theoretical and security concerns.
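To make the worries above concrete, here is a minimal sketch of the game they target, as the paper describes it: two agents alternate statements about a question, and a judge who sees the full transcript picks a winner. This is my own illustrative Python, not code from the paper; every name in it is hypothetical.

```python
from typing import Callable, List, Tuple

# Sketch of the debate game from Irving et al. (2018):
# agents alternate statements; a judge reads the whole
# transcript and declares a winner. Names are illustrative.
Agent = Callable[[str, List[str]], str]  # (question, transcript) -> next statement
Judge = Callable[[str, List[str]], int]  # (question, transcript) -> winner (0 or 1)

def play_debate(question: str, agents: Tuple[Agent, Agent],
                judge: Judge, num_rounds: int = 4) -> int:
    """Run one debate and return the index of the winning agent."""
    transcript: List[str] = []
    for turn in range(num_rounds):
        speaker = turn % 2  # agents alternate turns
        statement = agents[speaker](question, transcript)
        transcript.append(f"Agent {speaker}: {statement}")
    return judge(question, transcript)  # judge decides on the full transcript

if __name__ == "__main__":
    # Toy run: stub debaters and a judge that prefers the longer last argument.
    a0 = lambda q, t: "The sky is blue because of Rayleigh scattering."
    a1 = lambda q, t: "No."
    judge = lambda q, t: 0 if len(t[-2]) >= len(t[-1]) else 1
    print(play_debate("Why is the sky blue?", (a0, a1), judge))
```

In the paper's self-play setup, the judge's verdict is the only training signal, which is why the first worry on the list, whether human judges are good enough, cuts so deep.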
Reasons why debate could fail? Is that all? I’m far more pessimistic. I will go further and make this claim:
So, before proposing debate as a method to help with AI alignment, it seems wiser to start with a simpler goal. I will phrase it as a question:
I unpack this question in Interaction patterns for truth-seeking, the next article in this sequence.
Endnotes
1. I reformatted the first two sentences of Section 5 as a bulleted list for clarity.
2. G. Irving, P. Christiano, and D. Amodei, “AI safety via debate.” arXiv, Oct. 22, 2018. doi: 10.48550/arXiv.1805.00899.