Provable Safety

Posted on 2024-04-21

Notes and thoughts on the provable safety research agenda and outputs.

What is provable safety?

Atlas Computing, Mapping Architectural Solutions to AI Risk (Executive Summary)

Link

We’re operationalizing Safeguarded AI (formerly the Open Agency Architecture) to drive creation of AI-accelerated tools that can protect against AI-accelerated risks.

So OAA is now known as Safeguarded AI by the creators (Evan of Atlas and davidad). “Specification-based AI” is their more descriptive term for it.

There’s a really good image that describes what they have in mind:

Essentially:

  • you leverage AI to generate a formal solution spec
  • you use a generative AI system to generate both a solution and a proof that the solution meets the spec
  • you use a solution checker program to ensure that the solution meets the spec (?)

What would this look like concretely? Perhaps take a concrete use-case and elaborate on it.

The really impressive part of this agenda is that it actually does have extremely low capabilities externalities. If it succeeds, we have safe AI capabilities, and if it fails, it fizzles out (because you aren’t working directly on capabilities).

Atlas seems to be currently focused on demonstrating this for cybersecurity – which is a pretty sensible place to start off with, actually, since it doesn’t involve non-software domain experts.

Davidad, Programme thesis v1.1

Link

Autoformalization?

Davidad, Mathematics and modelling are the keys we need to safely unlock transformative AI

Link

Organizations working on this

The provable safety agenda is robust to philosophical confusions about values

How expensive will the process be, of creating this verifiably safe neural network, anyway?

I’d like some preliminary estimates on how much it would cost to take a SOTA model and do this to some realistic extent.

How do you adjust for the failure of formal methods in the software development industry?

Complexity costs

Pragmatics

Government backed projects most likely to use

Davidad has specifically marketed this as a moonshot