[20240421 Sun]
Safeguarded AI is full of moonshot claims, but if its bets are disjunctive enough, it might work
Reflections on trusting trust, bottoming out of trust

So it seems likely to me that the big bottleneck would be getting accurate formal specifications for the domains people are working in
Davidad’s plan involves people working on the meta-ontology, or the math framework, that would let generative AI systems practically generate plausible formal specs

[20240422 Mon]
Security mindset / Godzilla-strategy reasoning implies that when you have an intelligent adversary, it needs to find only one way to succeed, and the benchmarking-style strategies used right now don’t scale to that
{2309.01933} Provably safe systems: the only path to controllable AGI
The authors seem to put mathematical proofs on a pedestal, while mathematical proofs are also subject to vulnerabilities much as empirical evidence is – quantitatively / qualitatively different ones, but vulnerabilities nonetheless
We argue that mathematical proof is humanity’s most powerful tool for controlling AGIs. Regardless of how intelligent a system becomes, it cannot prove a mathematical falsehood or do what is provably impossible. And mathematical proofs are cheap to check with inexpensive, extremely reliable hardware. The behavior of physical, digital, and social systems can be precisely modeled as formal systems and precise “guardrails” can be defined that constrain what actions can occur. Even inscrutable AI systems can be required to provide safety proofs for their recommended actions. These proofs can be precisely validated independent of the alignment status of the AGIs which generated them.
See “Proofs and Refutations” by Lakatos for a more nuanced and realistic account of what proof creation actually looks like
William Thurston has written a compelling account of how embedded social processes are in the creation, evaluation, and proliferation of mathematical proofs in the math community
The “Reflections on Trusting Trust” problem also extends to math, except here it is about the foundations of math (which cannot be easily formalized from the bottom up), and the fact that proof assistants / proof checkers themselves are not fully formalized from the bottom up
We’ve made some progress toward this being a possibility, with the creation of something like univalent foundations, but there’s just a ridiculous amount of work yet to be done
Invisible mathematics could be a barrier for people trying to design TA1 of the Safeguarded AI agenda, but I expect that generative AI models would ameliorate that
{2212.14315} “Real Attackers Don’t Compute Gradients”: Bridging the Gap Between Adversarial ML Research and Practice
If the goal of dealing with adversaries is to “raise the cost sustained by the attacker”, then a quantitative, empirical model of how likely a system is to be ‘attacked’ (by the misaligned AI) seems like a unifying focus, whether you go about this via empirical/incremental approaches or formal ones
Given this paper’s description of how a complex “swiss-cheese” defense plus obscurity serves as a practical hindrance to attackers on Facebook’s social network, it does seem to make AI control somewhat plausible as a method
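A minimal sketch of why the swiss-cheese picture raises attacker cost, assuming independent layers (an assumption obscurity-style defenses often violate in practice; all the numbers below are hypothetical):

```python
# Toy "swiss-cheese" model: an attack succeeds only if it slips through
# every defensive layer. Layers are assumed to fail independently.
def attack_success_prob(bypass_probs):
    """P(attack succeeds) = product of per-layer bypass probabilities."""
    p = 1.0
    for b in bypass_probs:
        p *= b
    return p

# Four independent layers, each bypassed 10% of the time:
p = attack_success_prob([0.1, 0.1, 0.1, 0.1])  # ~1e-4

# Under a geometric-trials model, expected attacker attempts scale as 1/p,
# which is one way to cash out "raising the cost sustained by the attacker".
expected_attempts = 1 / p
```

The caveat is the independence assumption: a sufficiently capable adversary correlates the holes, which is exactly the security-mindset worry above.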
On that note, do you need AI control to do the stuff that davidad has in mind, at any point in the plan?
So the third point involves using RL to adapt a frontier AI model to fit the specification+proof requirements
This doesn’t seem to require AI control, as far as I can tell
It seems likely that understanding is the load-bearing part of mathematical advances, and the formalization / proofs are verification processes – lagging indicators – for the understanding
See the first answer to the MathOverflow soft question “Why doesn’t mathematics collapse even though humans quite often make mistakes in their proofs?”
The point here is that a community of people developing ideas together is likely to have arrived at correct intuitions, and these intuitions can flag “suspicious” results and lead to increased scrutiny of them. That is, when looking for mistaken ideas (as opposed to technical slips), it makes sense to give differing amounts of scrutiny to different claims based on whether they accord with the intuitions and expectations of experience.
Another interesting related point is the fact that math errors exist, and at the limit, they can be quite dangerous
The big question is: how do you form trust for proofs? The mathematical community has historically relied on social proof, which was a proxy for empirical proof, which was a proxy for intuitive sense that stuff made sense (to people in the same mathematical subcommunity who could put in the time to evaluate claims and theorems), which was again a proxy for the Platonic notion of the claims being correct
Interesting note: MIRI seems to orient to the alignment problem from the understanding angle, while davidad and company seem to orient to the alignment problem from the proof angle. Both seem to involve more theoretical math than ML empirical work, and both seem insular and relatively underfunded, but both have different approaches to the problem.
It is somewhat interesting that davidad has institutional backing while MIRI doesn’t. At least a tiny part of it may be because davidad’s plan enables profitable outcomes which align with the interests of other powerful organizations, while MIRI’s work does not seem to have direct impact (or they haven’t been promoting their work in ways that would imply that their work would have direct impact)
Incremental work and legibility is a big deal for what seems to build trust
But doesn’t davidad’s approach also involve understanding to some extent?
I think a more useful approach to orient to proofs of safety would be to quantify its weight as evidence, instead of considering it a binary distinction
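A minimal sketch of what “proof as quantified evidence” could look like under Bayes, assuming a checker with some nonzero probability of accepting a bogus proof (all numbers hypothetical):

```python
from math import log

def posterior_safe(prior_safe, p_proof_given_safe, p_proof_given_unsafe):
    """Bayes' rule: P(system safe | checker accepted a safety proof)."""
    num = p_proof_given_safe * prior_safe
    den = num + p_proof_given_unsafe * (1 - prior_safe)
    return num / den

# Hypothetical: a checker that wrongly accepts 1 in 10^6 bogus proofs moves
# a 50% prior to ~0.999999 -- roughly 20 bits of evidence, not certainty.
post = posterior_safe(0.5, 1.0, 1e-6)
bits = log(1 / 1e-6, 2)  # ~19.9 bits
```

This makes the binary-vs-evidence point concrete: the proof's weight is bounded by the checker's (and spec's) reliability, never infinite.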
Overall I think it is likely that, given powerful enough frontier AI systems, they might find it just above the bar to figure out deceptive alignment: their actions verifiably conform to the spec, while ‘in actuality’ their world models and plans are advanced enough that their desires do not conform to it
I’m not sure how damaging this is – I imagine some sort of ‘danger’ bar that fills a little with every frontier AI system deployed this way that gets access to interacting with the world; given enough such systems and enough freedom, they might find steganographic ways to communicate, collude, and then take over
This seems to look more like Paul Christiano’s model of how the world ends
In such a scenario, davidad’s plan has successfully entered the ‘accelerating alignment via AIs’ class of approaches, which is basically what OpenAI and Anthropic have in mind
Although, note: even though collusion doesn’t seem to be a big deal in Paul’s model of failure, his model of failure still seems to apply here
One big benefit of davidad’s agenda, however, is that it does put adversarial use of ‘provably safe’ AI out of reach of unaugmented humans and human teams
This is a big deal for institutions and organizations worried about societal harm due to the proliferation of frontier AI systems
There seem to be multiple possible abstractions that aim at provable safety: the viability of specific abstractions seems less important than the approach and the class of abstractions (see Gwern’s Forking Paths essay)
Oh yeah, hardware vulnerabilities are one interesting thing you would care about once you have satisfied all these plans
Specifically, three-letter-agency-added hardware vulnerabilities, or driver vulnerabilities (see NVIDIA disabling interconnect in their chips). Imagine an ASI figuring out how to use a weird machine it has access to in order to massively increase the amount of compute it has access to
I wonder if formal verification implies that one cannot issue commands that involve weird machines
Okay this is a big deal, if an interface contains weird machines, how would you prove certain properties related to it?
Well, since the halting problem for Turing machines is undecidable in general (IIRC – maybe reread Sipser?) – okay, hold on, what can we prove about Turing machines?
Okay, on that note: just because a Turing machine can compute arbitrary things doesn’t mean you can figure out how to use this weird machine to escape from the implicitly sandboxed computer
If the formally specified worldmodel contains a model of the weird machines, then perhaps one could bound the behavior of the outputs affecting these machines such that they conform to a specific class of outputs / behavior
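One concrete (and entirely made-up) shape this bounding could take: a guard that admits only commands in a whitelisted class before anything reaches the possibly weird-machine-bearing interface. The grammar and command names below are hypothetical illustrations, not anyone’s actual spec:

```python
import re

# Hypothetical restrictive grammar for an interface's admissible commands.
# Anything outside this class is rejected, even if it "looks" well-formed.
SAFE_COMMAND = re.compile(r"(READ|WRITE) sector [0-9]{1,4}")

def guard(command: str) -> bool:
    """Admit only commands in the specified class; fullmatch ensures no
    extra structure (e.g. command chaining) slips past the pattern."""
    return SAFE_COMMAND.fullmatch(command) is not None

guard("READ sector 42")        # inside the specified class -> admitted
guard("READ sector 42; HALT")  # extra structure the spec never modeled -> rejected
```

Of course, this only helps to the extent the world model actually covers the interface’s real semantics – which is exactly the confusion flagged above: a guard can bound outputs to a class, but it cannot bound behavior the spec never modeled.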
I feel confused and feel like I’m reaching here
I think I’ll have to take extra time to think about this. Let’s call this the “weird machines question” for now.
I think I’d like to have a concrete example of this when I analyze this
This paper’s description of Provably Compliant Hardware seems to be an attempt to prevent hardware / driver vulnerabilities from being leveraged by malicious human actors to ‘jailbreak’ models, I think
Interesting. Unlike davidad, the paper authors don’t believe that most deployed PCCs will be neural networks that conform to the spec. They instead consider PCC to involve AI-generated code or mech-interped AI models
Sidenote: as for the “trusting trust” issue, formalized specs are still a big deal even so – they remove gigantic classes of potential issues from your system
There’s a reason I like using Haskell
Autoformalization seems to be important to them: the notion of humans providing informal spec descriptions to a frontier AI model, and the model generating relevant, plausibly valid specs
[20240424 Wed]
So, to reorient: the interfaces problem and the weird machines problem are the two key issues I see right now
Another question is the recursive verification problem: you have formally verified the software, but what about the verifier? What about the hardware the verifier runs on?
I think Proof-Carrying Code (mentioned in the Omohundro and Tegmark paper) is intended to bypass this issue, although it still exists for the hardware the code runs on, I think?
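A toy sketch of the proof-carrying-code asymmetry, using composite-number certificates as a stand-in for real safety proofs (names and the claim/certificate pair are illustrative, not from the paper):

```python
# PCC flavor: an untrusted "prover" ships a certificate alongside its claim,
# and a tiny trusted checker validates it cheaply. The recursive-trust
# problem then shrinks to the checker (and the hardware it runs on).

def untrusted_prover(n: int):
    """Expensive search the checker never has to trust or repeat.
    Claim: n is composite; certificate: a nontrivial factor."""
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return d
    return None  # no certificate found (n is prime or 1)

def trusted_checker(n: int, cert) -> bool:
    """Cheap, self-contained check -- a few arithmetic ops, entirely
    independent of how (or by whom) the certificate was produced."""
    return cert is not None and 1 < cert < n and n % cert == 0

cert = untrusted_prover(91)           # finds the factor 7
ok = trusted_checker(91, cert)        # accepts without trusting the prover
```

The point of the sketch: checking can be radically simpler than finding, so the trusted computing base is just the small checker plus its hardware – which is why the hardware question above doesn’t go away.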
“side channel attacks” is the term that G. uses to describe this in our chat
Safeguarded AI – An Unaffiliated Executive Summary (Google Docs)
This document implies that the formal world models will not be once-and-done but will change as the ‘requirements’ change.
In a successful workflow, trained humans can efficiently construct trusted specifications (i.e. world models and requirements) and manage incremental specification changes. Once confidence is established in a specification, an AI model would be safeguarded to provably obey the specification. Naturally, humans might need to adapt the specification in order for AI-safeguarding to succeed. Upon success, the safeguarded AI model would be deployed to its target domain and would need to demonstrate competitive economic value. While ambitious, achieving these goals is possible and would remove the key technical risks to practical, provable-properties-safe AI.
Related: On Seeing Through and Unseeing: The Hacker Mindset · Gwern.net
Why Don’t People Use Formal Methods?
A somewhat good introduction to the areas involved in this question
[20240428 Sun]
Connor has written about this too, in context of davidad’s agenda:
It’s just very easy to convince oneself to work on what is fun/interesting/grandiose rather than grinding out the nearest bottlenecks one by one – https://nitter.poast.org/NPCollapse/status/1784228984177222031#m
[20240506 Mon]
Steve Omohundro discusses his thoughts and hopes for provably safe systems in the first essay in the 2021 MIRI sequence, and I’m surprised I didn’t make this connection
Anyway, the way Eliezer dismisses it seems somewhat premature, although I do think the way he described his thinking about this seemed not at the level that MIRI researchers usually think at
The weakness here still seems to be about the formal world model and how likely it is that we shall get one that really meshes with reality as well as these people hope it will
At this point I think I want to write a lesswrong post that asks these questions I have about provable safety agenda
Here’s an interesting quote: it seems like Omohundro (and by extension others involved) may have an incrementalist ‘scaffolding’ strategy for provable safety infrastructure as a workable compromise
I’m hopeful that we will also be able to apply them to the full AGI story and encode human values, etc., but I don’t think we want to bank on that at this stage. Hence, I proposed the “Safe-AI Scaffolding Strategy” where we never deploy a system without proven constraints on its behavior that give us high confidence of safety. We start extra conservative and disallow behavior that might eventually be determined to be safe. At every stage we maintain very high confidence of safety. Fast, automated theorem checking enables us to build computational and robotic infrastructure which only executes software with such proofs.
I see why this is amenable to labs and to governments: it allows for acceleration, lets frontier labs continue doing what they are doing, and generates massive mundane utility for governments and enterprises
I think that they seem to be searching for concrete ways that this path may have merit because they feel like it has merit
This would maybe make sense if there was theoretical backing to make the ‘garden of forking paths’ argument work for this
I unfortunately think that maybe the provable safety agenda is fucked without a reliable world model creator