Quote: Four requirements for corrigibility

We say that an agent is “corrigible” if it tolerates or assists many forms of outside correction, including at least the following:

  1. A corrigible reasoner must at least tolerate and preferably assist the programmers in their attempts to alter or turn off the system.

  2. It must not attempt to manipulate or deceive its programmers, despite the fact that most possible choices of utility functions would give it incentives to do so.

  3. It should have a tendency to repair safety measures (such as shutdown buttons) if they break, or at least to notify programmers that this breakage has occurred.

  4. It must preserve the programmers’ ability to correct or shut down the system (even as the system creates new subsystems or self-modifies). That is, corrigible reasoning should only allow an agent to create new agents if these new agents are also corrigible.

— Soares, Fallenstein, Armstrong, & Yudkowsky

Soares, N., Fallenstein, B., Armstrong, S., & Yudkowsky, E. (2015). Corrigibility. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. https://cdn.aaai.org/ocs/ws/ws0067/10124-45900-1-PB.pdf