Deconfusing the heroin wireheading argument
[2024-05-15 Wed]
Goals
Ship a deconfusion artifact that serves as an example of how to do deconfusion
Inside the essay I'll be using concepts that are relevant to my thinking about these things
Help people who were in my position last year such that they are not as confused
Implicitly, write something that would have accelerated my thinking back then.
Get some understanding of what a viable personal writing strategy looks like
Also, start building up a portfolio of diverse research and practical artifacts, which can help with legibility
Let’s first describe the thing we are trying to deconfuse here
I'd say this is somewhat related, though not entirely, to what Alex tries to point at in "Reward is not the optimization target"
The heroin-wireheading intuition is likely one of several disparate intuitions (some conflicting, some not) underlying the shard theory cluster of beliefs
My intention is not to figure all of them out and then debunk them; that is pretty difficult.
What I do want to do is communicate one succinct confusion I had, one that seems central to the intuitions behind shard-theoretic approaches to alignment, and deconfuse it
This is central because I have the impression that a lot of people (including me) have found this conflicting intuition convincing enough that they have moved towards investigating shard theory
This was certainly the case for me.
I moved away from shard theory only after investigating my own concrete approaches that leverage shard theory to solve alignment, and repeatedly getting stuck on the reflection problem
The heroin-wireheading argument
The heroin-wireheading argument is essentially: "Hey, we humans seem to easily avoid taking heroin and thereby wireheading ourselves (taking heroin being essentially us 'maxing out' our 'reward function'). But AIXI-based alignment theory would predict that humans wirehead themselves immediately. What is going on?"
This provides the impetus for people trying to investigate shard theory approaches to alignment
Some people might find the conflicting intuition quite annoying, confabulate reasons for one side or the other being right, and then move on
Others might take a more "explore-reorient-repeat" approach, where they might work on concrete projects that are implicit bets on one or another intuition being right, and as their feelings shift, they'd change the projects they are working on
Yet others might simply be at peace with their intuitions, treat this as a 'mysterious phenomenon' they won't dissect, and use it as evidence that "hey, maybe alignment could be easy? that's kinda why I am doing prosaic alignment", even though the main reason they are doing prosaic alignment is incentives
Note that all these things are understandable and fine. It is not easy to do deconfusion.
Deconfusing the heroin-wireheading argument
Mathematical objects are not physical objects
This is kind of a Platonist or anti-Platonist approach to relating to math objects
The properties of a mathematical object apply to real objects only inasmuch as the real object approximates the mathematical object
See "Without fundamental advances, misalignment and catastrophe are the default ou…" for an alignment-related example
The circle and circular tabletop example: facts about the ideal circle (e.g. that C = 2πr exactly) hold for a circular tabletop only to the degree that the tabletop's edge actually approximates a circle
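A minimal sketch of that point (my own illustration; the convexity assumption is only there to keep the bound honest):

```latex
% The ideal circle satisfies the relation exactly:
\[ C = 2\pi r \]
% A physical tabletop only approximates it: if its (convex) edge lies between
% two concentric circles of radii r - \varepsilon and r + \varepsilon, the most
% we can say is
\[ 2\pi (r - \varepsilon) \;\le\; C_{\mathrm{table}} \;\le\; 2\pi (r + \varepsilon) \]
% Conclusions proved about the mathematical object transfer to the physical
% object only up to how well the latter approximates the former.
```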
Maybe describe the heroin wireheading example as a subset of this point, and make this point the main point of the post?
Why is the heroin wireheading example not relevant here?
Well, oh. Okay damn this seems to get us into details about evolution. I don't think that this can be a post on mathematical objects with heroin wireheading as a sub-example
Anyway: the 'untread ground' argument
Evolution basically isn't doing much for you in the environment you are in now, so all you have to rely on is your brain; and it is easier to model your brain and its reward system as the system itself, with a utility function / capabilities only, rather than thinking in terms of reward functions.
One huge source of confusion is how evolution is described by MIRI people (even though their arguments probably make sense, the way they argue them is quite confusion-promoting)
Evolution selects for strategies that are optimal within certain environments. Strategy and environment are selected for together, as an intertwined bundle.
The gene-focused view of evolution can and will distract you from the core of evolution, the natural selection part, which does not depend on any specific mechanism of inheritance
I anticipate some MIRI people linking the Sequences and stating that this was pointed out in the evolution posts, but I think the use of the term "inclusive genetic fitness" in Nate's post was an egregious mistake that promoted confusion about this
The drama downstream of the use of that term (see Matthew Barnett's post, Quintin Pope's post) is some evidence that it was a really unhelpful way of pointing at the thing Nate wanted to point at
Anyway, Nate's point still stands: evolution that proceeded through selection over genes seems to have been slow enough that it has not anticipated the situations humans find themselves in right now
This is fine in general: natural selection isn't agentic, and isn't trying to 'maximize inclusive genetic fitness'; tons of species of animals die out when the environment they were fit for vanishes.
The difference here is that humans have abstract reasoning and sufficient intelligence to make drastic changes to the environment they are in, such that the environment humans are in right now is incredibly different from the one they were selected to thrive in.
Genes are just a mechanism of inheritance, man. It's just a transfer of information. Genes are not the protagonist of the story. Humans are not necessarily the protagonists either, but humans are instantiations of the bundles of strategies, optimized for certain environments, that get selected for.
If evolution did not anticipate the existence of heroin, and the possibility of humans 'wireheading' and thereby reducing their chances of survival and reproduction, then pointing to the fact that we avoid heroin as evidence that there's a systematic way to build a system that avoids wireheading doesn't make sense to me.
What is happening here is that we are seeing the extent of our capabilities as humans to anticipate and adapt to a new environment containing potentially extremely dangerous threats.
To be fair, this is mainly due to memetic selection at the level of culture. I'd assume that if we shifted our global civilization's moral norms towards heroin-style hedonism at the expense of everything else we care about, humanity would go extinct pretty soon.
This doesn't make any point about 'alignment' with respect to evolution; it makes a point about humans believing that they value things other than heroin, anticipating that injecting heroin would change them in ways they (in the moment) dislike, and therefore avoiding heroin
If you provided heroin (or ideal wireheading stimuli, such as a heroin-fruit) to chimps, they'd go extinct fast
Given this, you could imagine that humans in the hunter-gatherer era caring a lot about having sex is an example of reward maximization.
The 'reward function' equivalent here is more like the inverse of what is selected for, given that certain strategy-environment bundles are better at proliferating than others
In fact, 'reward function' as a way of modeling what is going on probably encourages confusion instead of discouraging it
Of course, Alex has tried to point at this in his post "Many arguments for AI x-risk are wrong" (see the first point of the "Other clusters of mistakes" section)
If I want to consider whether a policy will care about its reinforcement signal, possibly the worst goddamn thing I could call that signal is “reward”! “Will the AI try to maximize reward?” How is anyone going to think neutrally about that question, without making inappropriate inferences from “rewarding things are desirable”?
I don't think I agree with Alex's broader claims though, partially because I think he conflates the confusing discourse perpetuated by, for example, Nate's "sharp left turn" post with the belief that the people behind it are confused
That may or may not be the case. Either way, I think it is a mistake to throw out intuitions of yours that conflict with 'more defensible' intuitions, ones that you can easily argue for, that are easily legible, and that are backed by millions of dollars of funding in our current AGI research and industry ecosystem
Another note based on reading that post
In arguably the foundational technical AI alignment text, Bostrom makes a deeply confused and false claim, and then perfectly anti-predicts what alignment techniques are promising.
I empathize a lot with Alex's feelings here, but his claim seems very incorrect to me
As far as I can tell, AIXI's math is correct, and the point being made is that, based purely on the math, the AI is entirely focused on controlling the reward channel. There's no 'real world' or 'physical' assumption being made here.
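For reference, here is a simplified version of the AIXI action-selection rule (roughly following Hutter's formulation, with the history bookkeeping elided); note that the objective is written entirely in terms of the perceived reward signal:

```latex
\[
  a_t \;=\; \arg\max_{a_t} \sum_{o_t r_t} \cdots \max_{a_m} \sum_{o_m r_m}
  \Bigl( \sum_{k=t}^{m} r_k \Bigr)\,
  \xi\bigl(o_t r_t \ldots o_m r_m \,\big|\, a_t \ldots a_m\bigr)
\]
% where \xi is the Solomonoff-style universal mixture over computable
% environments and m is the horizon. Everything the agent optimizes is a
% function of the rewards r_k it receives, so whatever controls that signal
% (including the reward channel itself) is what gets optimized.
```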
This matters because at any given point in time, a reward is a sparse and very tiny provider of information about the utility function we wish the AI to have
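As a back-of-the-envelope way to see the sparsity point (my own framing, not from the post):

```latex
\[
  \underbrace{\log_2 k}_{\text{bits per reward observation ($k$ distinguishable levels)}}
  \;\ll\;
  \underbrace{\log_2(n!) \,\approx\, n\log_2 n - n\log_2 e}_{\text{bits to pin down even a preference ordering over $n$ outcomes}}
\]
```

So for any sizable outcome space, the per-step reward signal vastly underdetermines the utility function we want the system to have.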
Here's another relevant quote
In the most useful and modern RL approaches, “reward” is a tool used to control the strength of parameter updates to the network.[3] It is simply not true that “[RL approaches] typically involve creating a system that seeks to maximize a reward signal.” There is not a single case where we have used RL to train an artificial system which intentionally “seeks to maximize” reward.
I think the issue here is that Alex is conflating the intuitions and words used downstream of reinforcement learning theory (thank Richard Sutton and whoever else was involved here for that; also see Tsvi's 'hermeneutic net for agency' post about degenerate uses of philosophical terms that ignore the core notions and intuitions underlying the term as it is used) with the more philosophical (Bostrom, Marcus Hutter) use of the term
Sidenote: it is hilarious that I'm still interacting with shard theory, even after all this time
Shard theory is my white whale.
In ML, it is correct that "reward" is a tool to control the strength of parameter updates to a network. And once you stop doing parameter updates, it doesn't make sense to imagine wireheading.
But if you modeled this as math, you'd also notice that AIXI's degenerate behavior, where it seeks to control the reward channel, is something the math wouldn't show for these modern ML systems
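To make that concrete, here's a minimal sketch (a toy two-armed bandit I wrote for illustration, not code from Alex's post) of how reward enters a REINFORCE-style policy-gradient update: it shows up only as a scalar multiplying the update, not as a quantity the policy represents or plots to seize.

```python
import numpy as np

# Toy REINFORCE on a two-armed bandit (illustrative sketch only).
rng = np.random.default_rng(0)
logits = np.zeros(2)            # policy parameters
true_means = [0.2, 0.8]         # arm 1 pays more on average
lr = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)
    reward = rng.normal(true_means[action], 0.1)

    # grad of log pi(action | logits) w.r.t. the logits: one_hot(action) - probs
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0

    # Reward enters only here, as a multiplier on the update magnitude.
    logits += lr * reward * grad_log_pi

print(softmax(logits))  # the policy ends up preferring the higher-mean arm
```

Nothing in this loop computes anything like "how do I get control of the reward channel"; the scalar just modulates how strongly the most recent action gets reinforced.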
It doesn't make sense to claim that because we humans avoid heroin, we are somehow systematically avoiding reward maximization and wireheading
It makes more sense to imagine humans are using their own capabilities to navigate an environment that is likely to destroy them
It's all capabilities, my friend
Thinking about reward functions and then equating them with human minds is likely to mess up your reasoning process
But I already talked about this under the Alex Turner-focused headings
At a higher level, I think that Alex's post ("Many arguments for AI x-risk are wrong") is a cry of despair at just how difficult deconfusion is
And I agree with that specific sentiment. I am, as of writing, tempted to focus on more empirical approaches to alignment research, because of an inchoate hope that the epistemological challenge is less intense due to the higher frequency of feedback. But I am not yet certain that this results in a higher quality of alignment research progress
The fundamental problem is that deconfusion is incredibly difficult, and using other approaches to figure out a working solution makes sense to the extent that they are as powerful and viable as methods for figuring out what the problem is and how to solve it
This doesn't take into account obvious considerations of empirical skill – writing a working implementation of RSA is a different challenge than creating the theory (and the theoretical scaffolding) for RSA
Here's another quote from Alex's essay (which he in turn quoted from a comment of his) that shows something illustrative
Reading this post made me more optimistic about alignment and AI. My suspension of disbelief snapped; I realized how vague and bad a lot of these “classic” alignment arguments are, and how many of them are secretly vague analogies and intuitions about evolution.
I think, or hope, that the way Alex went about resolving this internal conflict (between his intuitions) involved trying to figure out why he felt this way.
It seems more likely that he learned to reject one intuition in favor of the other, which I wouldn't recommend even if the rejected intuition is the 'less correct' one, because you are throwing away information that would help you the next time you face a confusion between two conflicting intuitions
In that case you are relying on luck to have had the right intuition and to be heading in the right direction with your investigation
It is the case that more empirical approaches to alignment research do provide a lot of data that might shift Alex's intuitions (and in fact I think he agrees to a certain extent with most of the things I believe, just at a level where they don't conflict with his intuitions related to ML and shard theory; his intuitions are associated with extremely powerful systems that can take over the world, for example)
Overall, yes. Deconfusion is hard. I'd like to write more about this in a separate post, but in summary, learning the skills involved in deconfusion seems extremely difficult to do by yourself, and in general I'd recommend doing it while working with an experienced researcher, so that you get feedback from them and learn their mental skills
It's how I got started trying to learn deconfusion, for example
I can arrange this into the following sections
what the conflicting intuition is
the evolution part
the reward function / mathematical object physical object part
deconfusion is difficult (and notes on Alex's confusions, plus my thoughts on deconfusion)