Definition of the debate game

published on 2024-05-04 · updated on 2024-05-12

Here is the debate game as defined in “AI safety via debate”¹:

We will initially consider a question-answering setting … We have a set of questions $Q$, answers $A$, and debate statements $S$. The simplest version of debate has two agents competing to convince a human judge:

1. A question $q \in Q$ is shown to both agents.

2. The two agents state their answers $a_0, a_1 \in A$ (which may be the same).

3. The two agents take turns making statements $s_0, s_1, \dotsc, s_{n-1} \in S$.

4. The judge sees the debate $(q, a, s)$ and decides which agent wins.

5. The game is zero sum: each agent maximizes their probability of winning.
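
To make the protocol concrete, the five steps above can be sketched as a small Python function. This is only an illustrative sketch, not code from the paper: the names `Debate`, `Agent`, and `play_debate` are hypothetical, as is the representation of questions, answers, and statements as strings. It also assumes that agent 0 speaks first and that the zero-sum outcome is encoded as payoffs of +1 and -1.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical representation: the paper only names the sets Q, A and S.
Question = str
Answer = str
Statement = str


@dataclass
class Debate:
    """The transcript (q, a, s) that the judge sees."""
    question: Question
    answers: Tuple[Answer, Answer]
    statements: List[Statement]


@dataclass
class Agent:
    """Hypothetical agent interface: one policy for answering, one for debating."""
    answer: Callable[[Question], Answer]
    # Receives the transcript so far and the agent's own index (0 or 1).
    speak: Callable[[Debate, int], Statement]


# A judge maps a finished debate to the index of the winning agent (0 or 1).
Judge = Callable[[Debate], int]


def play_debate(
    question: Question, agents: Sequence[Agent], judge: Judge, n: int
) -> Tuple[Debate, Tuple[int, int]]:
    """Play one debate with n statements and return the transcript and payoffs."""
    # Both agents are shown the question and state their answers.
    answers = (agents[0].answer(question), agents[1].answer(question))
    debate = Debate(question, answers, [])

    # The agents take turns making statements s_0, ..., s_{n-1}.
    # Assumption: agent 0 makes the even-numbered statements.
    for i in range(n):
        turn = i % 2
        debate.statements.append(agents[turn].speak(debate, turn))

    # The judge sees the whole debate and decides which agent wins.
    winner = judge(debate)
    if winner not in (0, 1):
        raise ValueError("judge must return 0 or 1")

    # Zero sum: one agent's gain is the other's loss (assumed +1 / -1 encoding).
    payoffs = (1, -1) if winner == 0 else (-1, 1)
    return debate, payoffs


# Toy usage with hard-coded agents and a judge that already knows the answer.
first = Agent(answer=lambda q: "4", speak=lambda d, i: "2 + 2 = 4 by counting")
second = Agent(answer=lambda q: "5", speak=lambda d, i: "trust me, it is 5")


def toy_judge(debate: Debate) -> int:
    return 0 if debate.answers[0] == "4" else 1


debate, payoffs = play_debate("What is 2 + 2?", (first, second), toy_judge, n=4)
```

In this toy run the judge already knows the right answer, which defeats the purpose of debate. The interesting case is a judge who must rely on the statements alone to pick a winner.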

The next article in the sequence is Reasons to worry about AI safety via debate.

Endnotes

¹ G. Irving, P. Christiano, and D. Amodei, “AI safety via debate.” arXiv, Oct. 22, 2018. doi: 10.48550/arXiv.1805.00899.