Reasons to worry about AI safety via debate
The previous article in the sequence is Definition of the debate game.
From Section 5, “Reasons to worry”1 of AI safety via debate2:
We turn next to several reasons debate could fail as an approach to AI alignment. These include questions about
- training target (whether humans are sufficient judges to align debate),
- capability (whether debate makes agents weaker),
- our ability to find strong play in practice using ML algorithms, and
- theoretical and security concerns.
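To make the worries above concrete, here is a minimal sketch of the game they target, as the paper describes it: two agents alternate statements about a question, and a judge who sees the full transcript picks a winner. This is my own illustrative Python, not code from the paper; every name in it is hypothetical.

```python
from typing import Callable, List, Tuple

# Sketch of the debate game from Irving et al. (2018):
# agents alternate statements; a judge reads the whole
# transcript and declares a winner. Names are illustrative.
Agent = Callable[[str, List[str]], str]  # (question, transcript) -> next statement
Judge = Callable[[str, List[str]], int]  # (question, transcript) -> winner (0 or 1)

def play_debate(question: str, agents: Tuple[Agent, Agent],
                judge: Judge, num_rounds: int = 4) -> int:
    """Run one debate and return the index of the winning agent."""
    transcript: List[str] = []
    for turn in range(num_rounds):
        speaker = turn % 2  # agents alternate turns
        statement = agents[speaker](question, transcript)
        transcript.append(f"Agent {speaker}: {statement}")
    return judge(question, transcript)  # judge decides on the full transcript

if __name__ == "__main__":
    # Toy run: stub debaters and a judge that prefers the longer last argument.
    a0 = lambda q, t: "The sky is blue because of Rayleigh scattering."
    a1 = lambda q, t: "No."
    judge = lambda q, t: 0 if len(t[-2]) >= len(t[-1]) else 1
    print(play_debate("Why is the sky blue?", (a0, a1), judge))
```

In the paper's self-play setup, the judge's verdict is the only training signal, which is why the first worry on the list, whether human judges are good enough, cuts so deep.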
Reasons why debate could fail? Is that all? I’m far more pessimistic. I will go further and make this claim:
So, before proposing debate as a method to help with AI alignment, it seems wiser to start with a simpler goal. I will phrase it as a question:
I unpack this question in Interaction patterns for truth-seeking, the next article in this sequence.
Endnotes
1. I reformatted the first two sentences of Section 5 as a bulleted list for clarity.
2. G. Irving, P. Christiano, and D. Amodei, “AI safety via debate.” arXiv, Oct. 22, 2018. doi: 10.48550/arXiv.1805.00899.