Cryptanalyzing LLMs with Nicholas Carlini

‘Let us model our large language model as a hash function—’

Sold.

Our special guest Nicholas Carlini joins us to discuss differential cryptanalysis on LLMs and other attacks, just as the ones that made OpenAI turn off some features, hehehehe.

Watch episode on YouTube: https://youtu.be/vZ64xPI2Rc0

Links:

https://nicholas.carlini.com
“Stealing Part of a Production Language Model”: https://arxiv.org/pdf/2403.06634
“Why I attack”: https://nicholas.carlini.com/writing/2024/why-i-attack.html
“Cryptanalytic Extraction of Neural Network Models”, CRYPTO 2020: https://arxiv.org/abs/2003.04884
“Stochastic Parrots”: https://dl.acm.org/doi/10.1145/3442188.3445922
https://help.openai.com/en/articles/5247780-using-logit-bias-to-alter-token-probability-with-the-openai-api
https://community.openai.com/t/temperature-top-p-and-top-k-for-chatbot-responses/295542
https://opensource.org/license/mit
https://github.com/madler/zlib
https://ai.meta.com/blog/yann-lecun-ai-model-i-jepa/
https://nicholas.carlini.com/writing/2024/how-i-use-ai.html

This rough transcript has not been edited and may have errors.

Nicholas: We literally just like scrape the entire Internet and just like use whatever we can find and train on this. And so like now poisoning attacks become, you know, a very real thing that you have to worry about.

Thomas: Hello and welcome to Security Cryptography Whatever: Nights. I’m Thomas Ptacek and with me, as always, is Deirdre and David.

Deirdre: Yay.

Thomas: And we have a very special guest this week, which is what Deirdre Giuseta had to say. We’ve got with us Nicholas Carlini. Nicholas and I worked together at a little company called Matasano security 10, 15 years ago, always ago, Nicholas was one of the primary authors of a little site called Micro Corruption. So if you’ve ever played with Micro Corruption, you came across some of his work. But for the past many, many years, Nicholas has been getting his CS doctorate and developing a research career working on the security of machine learning and AI models. Did I say that right? That’s a reasonable way to put it.

Nicholas: That is a very reasonable way of putting it, yes.

Thomas: So this is like a podcast where like, we kind of flit in between security and cryptography, which is very mathematical. And one of the cool things about this is that the kinds of work that you’re doing on AI models, it’s much more like the cryptography side of things than a lot of other things that people think about when they think about AI security. So if you think about like AI or red teaming and alignment and things like that, those are people pen testing systems by chatting with them, which is a weird idea. Right. Like prompt injection is kind of a lot of that. But the work that you’ve been doing is more fundamentally mathematical. You’re attacking these things as mathematical constructs.

Nicholas: Yeah. When I give talks to people, I often try and explain, and security is always worth interesting. We’re thinking about things in multiple different angles. One of them is like, oh, this is a language model, it talks to you. The other is this is a function, numbers in, numbers out. And you can analyze it as a function, numbers in, numbers out. And so you can do interesting stuff to it. And I think, yeah, this is one of the fun ways to get lots of attacks, because most of the people who work on this stuff like to do the language side.

And so doing the actually math side gives you a bunch of avenues to do interesting work on this.

Thomas: So I think I have a rough idea of, from a security perspective, what the main themes of the attack that you’ve come up with are. And you probably have more because I haven’t read everything that you’ve Done. But before we get into that, I’m curious how you went from straight up pentester to breaking the mathematics of ML models.

Nicholas: Yeah. Right. So, yeah, so I’ve always really liked security and yeah, when I was an undergrad at Berkeley I really got into security and then I went Montesano for a couple summers. Yeah. Doing a bunch of stuff. And then I went to do my PhD and when I started my PhD, my PhD I started in system security. So my first couple papers were on return oriented programming and how to break some stuff where some people had, I don’t know, very silly defenses of like let’s monitor the, you know, control flow by looking at the like last 16 indirect branches and decide whether not that’s a ROP gadget or not. And there’s kind of some things that we rebroke.

And then I went to intel for a summer and I was doing some hardware security stuff and then I came back and I had to figure out what am I going to do a dissertation on. And I was getting lots of fun attacks, but fundamentally I wasn’t sure what’s a fundamentally new thing you could do because this field has been around forever and most of the research at the time was people were unwilling to implement some defenses because of the 2% performance overhead and like, how dare you make my C code 2% slower. I want my Python code that’s 100x slower or whatever. So it felt very silly. And so I was like, well, what’s something with this very little work on right now that seems like it is going somewhere? And it was machine learning. And so I decided to switch and try and write a dissertation in this new machine learning space. And yeah, then I just never stopped after that.

Deirdre: What year was this when you were actually switching over to focus on.

Nicholas: Yeah, I started to machine learning in 2016.

Deirdre: Okay.

Nicholas: Yeah. Which also means of course you get spectre. Like the next year after I go away from. I was very quickly like, should I go back? There’s all these fun new attacks now. But I stayed the course for a little while and no, I’m still following much of this stuff and it’s all very cool. But yeah, the thing in the back.

Thomas: Of my head is, were you this mathematically competent when you were working for us?

Nicholas: Yeah, well, so, yeah, so I did my undergrad in math and computer science and I got into computer science through cryptography initially. I. So in our high school it was an IB diploma thing that it’s kind of like you end up with some final project which is Kind of like an undergrad thesis, but a high school thesis, worse in many ways. And I did that on differential cryptanalysis.

Deirdre: Wow.

Nicholas: And it was like, okay, so it was like my S boxes were like four by four and whatever, but. So it was like a very small toy thing. But that’s how I started in security. And then that’s what got me to my advisor, Dave Wagner, who did cryptanalysis many years ago, and he ended up teaching my undergrad security class. And that’s how I ended up getting into research, because he had a bunch of cool stuff that he was doing. And so, yeah, that was sort of the flow of that. And so I’ve always.

Thomas: For your doctorate with Dave Wagner.

Nicholas: Yeah.

Thomas: Wow. Okay. Yeah, that’s pretty cool. So I guess an opening question I have for you is you’re highly mathematically competent and I’m highly mathematically incompetent. And one of the big realizations for me with attacking cryptography, which was a thing that we did a lot of at Monesano, was the realization that a lot of practical attacks were nowhere nearly as mathematically sophisticated as the literature or the foundations of the things that we were attacking. You needed basically high school algebra and some basic statistics. Right. Does that carry forward? Is that also true of attacking LLMs, or do I really need to have a good grip on like, you know, gradient descent and all that stuff?

Nicholas: Yeah. Okay. So you probably need a little bit more of some things, but I think just the things that you need are different. So, you know, it’s I think, probably first course in calculus, first course in linear algebra for the foundations and people. Okay. Of course, as with everything, there are lines of work on formally verified neural networks that are very, very far in the math direction that have proofs I don’t understand. And there’s lots of privacy theory that’s very far in the proof direction with differential privacy that I don’t understand. But to do most of the actual attacks, I feel like in some sense not understanding the deep math is helpful because lots of papers will say, here is some justification why this thing should work and it has complicated math.

And then they actually implement something. And the thing that they implement is almost entirely different than what they’re saying that actually it’s doing. And so in some sense, reading the paper is almost harmful because the code is doing something like whatever it’s doing, and you just look at the actually implementation and just, you need to understand some very small amounts of the gradients are 0 and 0 with floating point numbers is challenging as an object and so you should make the gradients not zero. And there’s a little bit of math which goes into making that happen, but it’s not some deep PhD level math thing that’s happening here.

Thomas: So tell me if I’m crazy to summarize the direction that you’re working in this way. One of the first things you got publicity for was an attack where you were able to sneak instructions into Alexa or whatever. Like just doing.

Nicholas: Sure, yes.

Thomas: It felt like almost steganographic what you were doing there, right? And then there’s a notion I think that’s pretty well known of you can poison a model by giving it, like when things are scraping all the images, trying to build up the image models or whatever, there’s poisoned inputs you can give there that’ll fuck the model up, right? And then after that there are attacks on those systems. Right? So if people try to poison models, you can do things to break the poisoning schemes. That’s a third kind of class of attacks there. And then finally, like the really hardcore stuff, from what I can tell sell is so for all of these models, they’ve been trained with basically the entire Internet times three worth of data, that’s that. And the huge amounts of compute that OpenAI or whatever have are the moat for these things, right? And the whole idea of these systems is you’re taking all that trading data and all that compute that’s running all that backpropagation or whatever it’s doing, right? And it’s distilling it down to a set of weights. And the weights are obviously much, much smaller than the corpus of training data there, small enough that you could exfiltrate it and then you would have the OpenAI model right there. And so that last class of attacks I’m thinking about are things where you have a public API of some sort, some kind of oracle essentially, to a running model and you can extract the weights so they haven’t revealed them to you. It’s not like a llama model or whatever where you downloaded it to your machine.

It’s running server side and it’s the crown jewels of whatever they’re doing there. And there are things that you can infer about those models just by doing queries. Is there another major area of kind of attacks there?

Nicholas: Yeah, this is the main ones that are on the security side. The one other thing that I do a little bit of is the privacy angle, which tries to say. So, yeah, so as you said, you have lots of data going into these models. Suppose that you have the weights to the model or you have API access to the model. What can I learn about the data that went into the model? And for most of the models we have today, this is not a huge concern because they’re mostly just trained on the Internet. But you can imagine that in the past people have argued they might want to do this and they don’t do it for privacy reasons. Now suppose you’re a hospital and you want to train a model on a bunch of patients medical scans and then release it to any other hospital so that they can do the scans in some remote region that doesn’t necessarily have an expert on staff. This might be a nice thing, absent other utility reasons that you might not do this, but this might be a thing you would want to do.

It turns out that models trained on data can reveal the data they were trained on. And so some of my work has been showing the degree to which it is possible to recover individual pieces of training data in a model, even though the model is 100x or 1000x smaller than the training data. For some reason the models occasionally pick up individual examples. So this is maybe the fourth category of thing that I’ve looked at.

Thomas: This is the thing I was thinking about when I was looking at your site is I think the natural thing to think about when you’re thinking about attacks against AI models is what can you do against a state of the art production model? Where state of the art means we’re at 4.0 or whatever’s past 4.0 and production means we have a normal API to it, we’re not getting the raw outputs of it or whatever, whatever the constraints are for actually hosting these things in the real world. But you have a presentation from several years ago, which I think was at Crypto, which was about an attack that was extracting hidden layers from. And it’s a simple neural network with RELU activation or whatever. And it sort of seems like compared to a full blown LLM model, like a toy problem. Right. Part of that is that we’ve kind of forgotten that until just a couple of years ago this was all AI stuff, was individual constructions around kind of basic neural network stuff. And that stuff still happens for application specific problems. So the ability to attack a system like that, if you can pull confident like people will build classifiers off of confidential information for the next 10, 20 years.

Thomas: Right? It’ll still happen.

Deirdre: And that’s definitely like a business selling point of like especially when they’re trying to, you know, if they localize it on a local device and they make it small enough and trainable enough. Like, you know, hopefully, quote, hopefully nobody but you is querying that model that you’ve trained on your local private data anyway. But maybe you sync it, maybe you back it up. Like, you know, maybe you want to allow your cloud provider to train on your private data store and it’s queryable from anywhere, quote, unquote, including them, because they want to help you and be, you know, proactive. And then the, the cat’s out of the bag. If you can extract the training data or even the weights from these models trained on your private data, that’s like.

Thomas: A fifth way to look at this thing. A fifth class of attacks. There is your natural way of thinking about an ML model might be like it’s a hash function of the entire Internet. It boils everything down to. And your research is based. One of your research targets is basically absolutely not right. It’s not a uniform random function across.

Nicholas: Yeah, exactly. It’s not a uniform. I mean it does behave like a hash function in many cases, but there are cases in which it does not behave this way. And this can matter if it’s a private example. And even in cases of models trained on all of the Internet, you can run into some of this where let’s suppose you have a model trained on all of the Internet and at that point in time someone had put something on the Internet by accident and then realized like, oops, I don’t want that to be there, I should take that down. And then they take that down, but the model stays around and now you can query the model and recover the data from the model and only the model, the leak has remained because you have trained in this meantime. And so there are still some other concerns there for these other reasons too. But yeah, there’s lots of fun angles of attack here.

Thomas: Okay, let’s zoom in on that fourth thing here. And then I think we can hit the poisoning and anti poisoning stuff later on because I think it’s super interesting. But also that paper is less headache inducing than the stuff we’re about to get into. Right. So yeah, I mean starting at like, so you know, your, your most recent, your most recent research that’s published is on actual production models like actually targeting things that are running at OpenAI or whatever. Right, but like that, that model problem from earlier of directly extracting hidden layers. Right. If I know enough to like, you know, reason about cryptography engineering, get me started along. Sure, yeah.

Nicholas: Okay. Okay, so here’s the motivation. So let’s suppose you’re A cryptography person. We have spent decades trying to design encryption algorithms that have the following property. For any chosen input, I can ask for an encryption of that and then make sure that you can learn nothing about the secret key. This is a very hard problem. Yeah, okay. Nothing, whatever.

Yeah, okay. Subject to math stuff about whatever. Yeah, okay, now let’s take the same problem and state it very differently. Let’s imagine a person who has an input. They’re going to feed it to a keyed algorithm, a machine learning model. They’re going to observe the output. And I want to know, can I learn anything about the keyed thing in this case, the parameters of the trained model? It would be wildly surprising if it happened to be the case by magic, that this machine learning model happens to be like a secure encryption algorithm. You couldn’t learn things about the weights.

That would just be wildly surprising. And so proof by intuition, it should be possible to learn something about the parameters. Now, of course, okay, what does it mean when you have an actual attack on a crypto system? It means with probability greater than 2 to the -128 or whatever, I can learn something about some amount of data, whereas for the parameters of the model, I actually want to actually do a full key recovery kind of thing. So they’re very different problems I’m trying to solve, but this is the intuition of why it should be possible. Okay, now let me try and describe very, very briefly how the attack we do works, which is almost identically an attack from differential cryptanalysis. And what we try and do is we feed pairs of inputs that differ in some very small amount through the model. We can trace the pairs through, and we can arrange for the fact that in these machine learning models you have these neurons that either activate or don’t. And you can arrange for one input to activate and the other one to not activate.

And then it’s actually sort of like a higher order differential, because you need four, because you want gradients, one on either side of the neuron. And then as a result of this, you can learn something about the specific neuron in the middle that happened to be activated or not activated. And you can learn the parameters going into them, that neuron. And in order to do this, you have to have assumptions that are not true in practice. Like I need to be able to Send floating point 64 bit numbers through which no one allows in practice. And the models have to be relatively small because otherwise, just like floating point error screws you over. And a couple other things you can’t do that Real models don’t have, but under some strong assumptions, you can then recover mostly the entire model. And this is the paper that we had at Crypto a couple years ago.

Thomas: So just to. Just to kind of bring us up to speed on the vocabulary here. Right. So we’re thinking about, like a simple neural network model right here, like a series of hidden layers and an input and its outputs. Like, each of those layers that you’re trying to recover is essentially a matrix and the parameters for each of the things. But each neuron is basically a set of weights.

Nicholas: Yeah, sorry. So parameters and weights, people generally mean roughly the same thing. So what’s going on here is each layer has a bunch of neurons in each neuron. Each neuron sort of has a connection to all the neurons in the previous layer. And so each neuron is doing something like a dot product with everything in the previous layer. And you’re doing this for every neuron in the layer. And so that’s where you get the matrix is because you’re doing a sum of all these dot products.

Thomas: And when we’re thinking about recovering parameters from the model, what we’re trying to do is recover those weights.

Nicholas: Exactly.

Thomas: Okay.

Nicholas: I was going to say there’s also these things called biases. So parameters are technically the combination of the weights and the biases, but in practice, the biases are like this. Oh, what? One term that doesn’t really matter too.

Thomas: Much, but a bias determines whether or not a particular neuron in a hidden layer is going to fire.

Nicholas: Yeah. What the equation for the neuron does is it multiplies the weights by everything in the previous layer and then adds this constant bias term.

Thomas: And that constant bias term is independent of the input.

Nicholas: Correct.

Thomas: Is the idea. That’s why it exists. That’s why it’s not just a factor of the weight. Right.

Deirdre: But this is also the stuff that’s like how you tweak your deep neural net. Your model to be different than any other is these little biases and these.

Nicholas: Neurons and the weights. Yes, both of them.

Deirdre: Yeah. And the weights. But these are all the parameters that actually make your.

Nicholas: This is what, when you’re training your model, this is what you’re adjusting. You’re adjusting the weights and the biases.

Thomas: And as an intuition for how these actually work. Right. Like, it feels a little similar to the DESX boxes. Right. Where like you’re wondering where the hell these weights come from. And the reality is they start out random and they’re trained with inputs, with, like, labeled inputs or whatever the training Method is right. And you’re basically using stochastic methods over time to figure out what the set of weights are. And there’s a huge number of weights in a modern model, right, that optimized for some particular app with the S boxes.

It was optimizing for non linearity. And for this you’re optimizing for correctly predicting whether a handwritten letter number three is actually three or whatever it is that you’re doing there, correct?

Nicholas: Yes. And you go through some process, you train them and then someone sort of hands them down to you and says here are the weights of the thing that works.

Thomas: Okay, so I guess I have two questions, right? First, I’m curious for the model attack, for when we’re directly extracting model weights for the toy attack, is what I need to say. What is unrealistic about that? What constraints would you have to add to that to be in a, you know, a more realistic attack setting? And then second, what are the mechanics of that attack? What does that look like?

Nicholas: Yeah. Okay, so let’s. So what I need to do is I need to be able to estimate the gradient of the model at arbitrary points. So for this I need a high precision math evaluated an entire model in order to get this good estimate of the gradient. And in practice, yeah, this means I need like floating point 64 inputs, floating point 64 outputs. I need the model to not be too deep or just the number of floating point operations start to just accumulate errors. I need the model to not be too wide or you run into problems that just because of how the attack works, you just start running into compounding errors. You also need to be able to have a model that is only using a particular kind of activation function.

So after you do these matrix multiplies, if you stacked matrix multiplies, this is just linear layer on top of linear layer. You need some kind of non linearity. And in cryptography you have these xboxes or whatever. And in neural networks you just pick some very simple function like the max of x and 0. And this is called the relu activation function. And the attack only works if you’re using this particular activation function. And modern language models use all kinds of crazy like Sui gelu, which is some complicated. Who knows even what it has like some sigmoid thing in there and it’s like doing crazy stuff.

And it doesn’t work for those kinds of things. Because what the mechanics of the attack need is I need to be able to have neurons that are either active or inactive.

Deirdre: Okay.

Nicholas: And as soon as you have this like activation function that’s doing some like, you know, sigmoid like thing. It’s no longer like a threshold effect of either it’s active or it’s inactive. It’s now like, you know, a little bit more active or a little bit less active. It’s sort of much more complicated. So that’s when the attacks break down. And so, yeah, so these attacks that we have, they’re very. For the full model extraction, these attacks we have, they’re very limited in this way. I will say one thing, which is that last year we wrote an attack that does work on big giant production models.

If all you want is one of the layers. So it can only get you one layer, it can only get you the last layer. But if you want the last layer, it does work on production models. And we did implement this and ran it on some OpenAI models and confirmed that. So we worked with them. I sort of stole the weights, I sent them the weights. I said, is this right? And they said yes. And so you can do that, but like it only worked for that one particular layer.

And it’s very non obvious that you can extend this past multiple layers.

Thomas: Were they, by the way, were they surprised? What was their reaction?

Nicholas: Yeah, they were surprised. Yes. The API had some weird parameters that were around for historical reasons that maybe were bad ideas, if you’re thinking about security. And they changed the API to prevent the attack as a result of this. Which sort of, I think is a nice. This is how, you know, when you succeeded as a security person, you’ve convinced the non security people that they should make a breaking change to their API, which loses you some utility in order to gain some security. And I feel like this is a nice demonstration that you’ve won as a security person.

Thomas: Okay.

Deirdre: Because I was just about to ask, what did you prompt inject. What did you say in human language to extract the last layer?

Nicholas: Yeah, so again, I don’t talk to the model at all like a model. I think of it still like a mathematical function. Okay. And you know, it’s.

Deirdre: You do something like 001-001.

Nicholas: Yeah, this one was like a known plaintext attack kind of thing. And I needed to collect a bunch of outputs and then you do some singular value decomposition, something, something, something, and weights pop out. And yeah, magic happens for only one layer, but it kind of works.

Thomas: This is not the logit bias thing, right?

Nicholas: This is the logit bias, which is why that no longer exists on APIs. I have killed logit bias. I killed logit Bias and log probes at the same time.

Thomas: We should talk about logits real quick because it seems like one of the big constraints that you have attacking these systems, right, is that you have this series of matrix multiplies and biases and sigmoid functions or relus or whatever the hell they’re using, and it all ends up with a set of outputs. And those outputs get normalized to a probability distribution function, right? You just run some function over it. Everything is normalized to between 0 and 1. Everything adds up to 1. It’s a probability distribution, right?

Nicholas: Those are the logits, those are the probabilities, okay? And then you take the log of that and you get logits.

Thomas: And the logits are raw outputs from the model, but they’re not what you would see as a normal person would see as the output of the model.

Nicholas: Correct. So for like a language model, what happens is you get these probabilities and then each probability corresponds to the likelihood of the model emitting another particular word. So, you know, hi, my name is. If it knows who I am, the word Nicholas should have probability like 99%. And, you know, maybe it’s going to put something else that’s a lower probability. Maybe. Probably the next token would be Nick, because they have some semantic understanding that, like, this is the thing that’s associated with my name. And then probably the next token is going to be something you know, is like another adjective or a verb or something that, like, describes.

Anyway, so this is the way that these models work. And usually you don’t see these probabilities. The model just picks whichever one is most likely and returns that word for you and hides the fact that this probability existed behind the scenes. And so when people say that language models are stochastic or random, what they mean is that as a person who runs the model, I have looked at the probability distribution and sampled an output from that and then picked one of those outputs. The model itself gives you an entirely deterministic process. It’s just I then sample randomly from this probability distribution and return one word and then you repeat the whole process again to the next token and next token after that. Right?

Thomas: And so, like, it’s not a typical API feature to get the raw output. It’s usually. But There was an OpenAI. There was an OpenAI feature where you could set manually biases for particular tokens or.

Nicholas: Okay, so because of. Because the OpenAI API was created when GPT3 was launched, when this was not a thing that, like, was like a production tool. This was like a Research artifact for people who wanted to play with this new research toy. And one of the things you might want to do is suppose that I’m asking the model yes or no questions and suppose that you can’t, like, you know, say you should be equally likely to say yes or no or whatever. One thing you can do is you can ask the model a bunch of yes, no questions and then realize, hey, the model’s Giving me yes 60% of the time, no 40% of the time. You could just bias the other token to be a little more likely. And then now you can make the model be 50, 50 likely to give you yes or no. And so for that reason and others like it, people had this logit bias thing that would let you shift the logits around by a little bit if you wanted to.

And no one really uses it that much anymore, but because in the past it was a thing that could be done, it was just like hanging around.

Thomas: And yeah, okay, tell me I’m wrong about this. But like, so you can’t generally get the raw outputs and you would, as an attacker, like to get the raw outputs for these models. And the logit bias API feature was a way for you to, with active, like oracle type queries, approximate getting the logits out of it.

Nicholas: Yeah, okay, so let’s. So in the limit, what you could do is you can say, I’m going to use the logic bias and I’m going to do binary search to see what value I need to set of this token in order for this to no longer be the most likely output from the model. And then you could do this sort of token by token. So initially I say my name is and the model says Nicholas. And I say, okay, great, set the logic bias on the value Nicholas to negative 10. So this is now less likely. And I say my name is, and the model still says Nicholas. And so I know now that the difference between the token Nicholas and whatever comes next is at least 10.

And so then I said, okay, what about 20? And now it becomes something else. And I say, okay, fine, 15 and 15 is back to Nicholas. And you could do this and you can recover the values for each of these. Now, okay, it turned out they also have something else that’s called top K or logprops for top K lets you actually directly read off the top five logits, but only the top five. So what you can do is you can then just read off the top five and then you can logic bias those five down to negative infinity and then new things become the top Five and then you just take those and you push those down to negative infinity and then you can repeat this process through that way. And so what OpenAI did is they allow the binary search thing to still work because it’s a lot more expensive, but they say you can no longer do log probes and logic bias together. And so you can get one or the other, but not both. It’s a standard security trick where you don’t want to break all the functionality, but you can do this one or that one.

Thomas: And you see why I love this, right? Because this is almost exactly like running like a CVC pattern Oracle attack against LLM.

Nicholas: Yeah, exactly. Yes.

Thomas: It’s the same attack model.

Nicholas: Yeah. No, this is like. You see this all the time in these security things where. Yeah, it’s. Yeah. People sometimes ask, how did you come up with this attack? And it’s like, well, because you have done all this security stuff before that there’s no new ideas in anything. You look at this idea over there and you go, well, why can’t I use that idea? And this is a new domain. And then you try it and it works.

Deirdre: It’s new enough.

Thomas: So for OpenAI type production models. From the last paper I read of yours, it didn’t seem like you could do a lot of direct weight extraction from it the way you could.

Nicholas: No. Yeah, just one layer.

Thomas: But the logit leak thing that they came up with, that API problem was giving you directly the last layer.

Nicholas: Correct. That’s what was going on there.

Thomas: So I guess I still don’t feel like I have much of the intuition for what the simplest toy version of this attack looks like. Not with the top K logic bias thing that I get, which is awesome. Right. But in the best case scenario for you. What are you doing?

Nicholas: Okay. Yeah. Actually, the techniques that we use for this one on the real model are very, very different than the techniques that we use when we’re trying to steal an entire model. Parameter for parameter. They almost have nothing to do with each other. Which one are you asking about, the real Model 1 or the small Model 1?

Thomas: The small Model 1.

Nicholas: Okay, sure. Let me try and describe what’s going on here and I’ll check that things are okay with you. Let’s suppose I want to learn the first layer of a model, what I’m going to do. And let’s suppose this model, maybe the very simple case, let’s suppose this model is entirely linear. There’s zero hidden layers, input matrix, multiply output. How could I learn what the weights are, what I can do is I can send zero everywhere and one in one coordinate.

Thomas: Exactly how you would attack a hill cipher or whatever you just do on matrix invert.

Nicholas: Yeah, exactly. And so you just look at this value and you can read off the weights. Okay, so this works very well. Let’s now suppose that I gave you a two layer model. So I have a single matrix multiply. I then do a linear, I have my non linearity and then another matrix multiply. Okay, now let’s suppose that I can find some input that has the property that one of the neurons in the middle is exactly zero. That’s the activation at that point and what this means.

So remember that the activation function for this relu activation has the maximum between the input and zero. So this is like it’s a function that you know is flat until it’s while it’s negative and then it becomes like this linear like Y equals X function that goes up after that. Okay, so what I’m going to do is I’m going to try and compute the gradient on the left side and the gradient on the right side. This is just the derivative in higher dimensions on the left side and the derivative on the right side. And I’m going to subtract these two values and it turns out that you can show that the difference between these two values is exactly equal. After some appropriate rescaling to the values of the parameters going into the model, I’m going to pause and let you think about this for a second and then ask me a question about it.

Thomas: We’re going to pause and let the year ask a question.

Deirdre: No, I’m just like, okay.

David: I used to always joke. I was like, I stopped learning calculus because it’s not real. In the real world we operate on discrete things. Have you ever seen half a bit? But you’re making a very strong case here that I should have paid more.

Thomas: Calculus.

Nicholas: This is what makes things harder here in this case compared to cryptography where we can operate on half a bit of things that we have these gradients we can feed through. If you had to do the same attack on discrete inputs, it would be very hard. But because we can do this, it becomes a lot easier.

Thomas: I’m fixed on two things here. One is that you were describing the pure linear attack of this and I’m like, oh yeah, it’s exactly like reversing a purely linear cipher. Which is the worst thing for me to possibly say here because attacking a purely linear cipher is also a classic CTF level. It’s like, here’s AES, but we screwed up the S boxes. So it’s purely linear, right? And how do you attack that? Right? And it’s like, I know that because I’m that kind of nerd, like really specifically practically exploitable dumb things like that. But there’s no reason why anyone would have an intuition for that. So it’s like, well, you put the one here and the zero there and they’re like stop talking about the thing I don’t understand. Right? But the other thing here is just like I was also struck by again, this is not that important to the explanation that you have.

But I’m just saying, like if you take the purely linear model and it’s like, well, that’s trivially recoverable, that’s also the intuition for how post quantum cryptography with lattices works, right? Is like you take a problem that would be trivially solvable with like Gaussian elimination, right? And then you add an error element that breaks that simple thing and then there you go.

Deirdre: But also you just keep adding layers and layers and layers and layers and layers and dimensionality and the error. And then all of a sudden you’ve got, you know, n dimensional lattices and you’ve got post quantum learning with errors or short spectre problem or anything like that.

Nicholas: So yeah, no, and this is all. Yeah. And you know, a very realistic person when talking about these attacks, you know, so in practice people often run these models at like eight bits of precision. And so a very realistic argument for like why these attacks that I’m talking about don’t apply for these things is because at eight bits of precision you’re essentially just in learning with errors sort of world where like you’re just like adding a huge amount of noise every time you do you compute a sum.

Deirdre: Okay, but it does work. But it doesn’t work. It does work at the last layer.

Nicholas: Which because so that feels they were using very different attack approaches. Okay, so that’s the important thing to understand here is the two attacks achieve similar goals, but the methodology has almost nothing to do with each other. And so at the last layer it’s. Because what we’re doing is there’s only one linear layer that we’re trying to steal. And because it’s only one linear layer, I can do something to arrange for the fact that all I need to solve is a single system of linear equations. I compute the singular value decomposition. All the noise is noise which is there correctly. But you can just average out noise as long as you’re in one layer.

As soon as you know, imagine you’re doing learning with error, but you could had only a single linear thing and you could just average out all the Gaussian noise. Right. This is no longer a hard problem.

Deirdre: No, easy.

Nicholas: Right. And so this is the thing that we were doing, and this is why we think it basically doesn’t work at two layers, is because you have this noise of things you don’t know and you have to go through a non linearity and then like, all bets are off.

Thomas: This is why the OpenAI attack, this is why the logit bias plus top k thing doesn’t work over multiple layers. Is that correct? But the toy problem attack, where you’re trying to recover multiple. Where you’re trying to recover weights directly. Kind of like the forward version of the attack going from inputs through it, right. That extends to arbitrary numbers of layers.

Nicholas: That goes arbitrarily deep. As long as you’re working in high precision and you don’t have sort of these errors accumulate, which is, you know, so, so our attack works up to, I don’t know, like 8ish layers and after that just like GPU randomness just starts to like, you know, have. Give you a hard time.

Thomas: Is that like a floating point thing or is that like a GPU thing?

Nicholas: No, I mean, it’s a floating point thing, but, you know, you want to run these things on GPUs. And so.

Thomas: Yeah, gotcha. I mean, so like, generally like the forward attack doesn’t seem super practical. It’s more like a theoretical finding.

Nicholas: Yeah, I mean, this is the, it’s the kind of attack that, you know, crypto, right. Like, you know, we say, here’s the thing that you can do, like proof of a proof of concept. And our initial attack was sort of even our attack was exponential in the width of the model. And that was pretty bad. And among other people, IDI Schmier showed how to make a polynomial time in the width, which was a fun thing.

Thomas: The width being the number of layers.

Nicholas: No, sorry. The depth is the number of layers. The width is how wide each layer is. Number of neurons.

Deirdre: Yeah. How many neurons in each layer?

Thomas: Gotcha.

Nicholas: Yeah, yeah. Imagine that the thing is going top down and people sort of imagine the depth being the number of layers you have and the width being how. How many neurons, like make the layer wide.

Thomas: So is there stuff that you can do in the forward direction with a more realistic attack setting?

Nicholas: Not that I know of.

Deirdre: Oh, okay.

Nicholas: Yeah, yeah.

David: Okay. So that’s sucking like weights out of the model. We’ve, you’ve also done some getting the.

Nicholas: Training data something else. Now, the polite way of saying that.

David: Fix one input and switch to the other input, which is like, how do we get. In what cases does this training data sort of leak directly out of the model?

Deirdre: How can we suck the training data out of the model?

Nicholas: Yeah, so let’s imagine a very simple case. Let’s suppose that you’re training a model on all of the Internet and it happened to contain the MIT license like a billion times, which is very plausible. This software is provided, like when they ask them all to continue, what’s it going to do? The training objective of a language model is to emit the next token that is most likely to follow. What is the next most likely token after all caps this software is provided is the MIT license. So this is what we’re doing with a training data extraction attack, because it’s no longer some kind of fancy mathematical analysis of the parameters. It’s using the fact that the thing that this model was trained to do was essentially just to emit training data. It incidentally happens to be the case that it also does other things like generate new content. But the training objective is maximize the likelihood of the training data.

And so that’s what in some sense you should expect. And this is. Okay, so that’s like, again, that’s the intuition. This is not, of course, what actually happens because these models behave very differently. But this is morally why you should believe this is possible, in the same way that morally why you should believe model stealing should be feasible. Now the question is, how do you actually do this? And so in practice, the way that we do these attacks is we just prompt the model with random data that, I don’t know, just a couple random things and just generate thousands and thousands of tokens. And then when models emit training data, the confidence in these predictions tends to be a lot higher than the confidence in the outputs. When models are not giving you training data.

And so you can distinguish between the cases where the model is emitting training data from the cases, when the model is generating novel stuff with relatively high precision. And that’s what this attack is going to try and do.

Deirdre: And what’s the diff between the confidence between I am regurgitating training data, I am 99% or versus not returning data, it’s like 51%.

Nicholas: Yeah. So it depends exactly on which model and on what the settings are for everything. But it’s like the kind of thing where within any given model you can have, I don’t know, like 90 plus 99% precision at like, you know, 40% recall or something. So like, you know, as long as you like. There exists a way to have a high enough confidence in your predictions that you can make this thing work for some reasonable fraction of the time.

Deirdre: Okay, so generally across models, if it’s regurgitating training data, it’s in the high 90s of confidence, and if it’s not, it’s not.

Nicholas: Okay, sorry, what I meant is, so yes, this is usually true, but I guess what I was saying was, okay, so you have to have a reference. So what you want to do is basically you’re doing some kind of distribution fitting thing where you sample from the model a bunch of times and then you have one distribution of what the outputs look like for the normal case and then you have another distribution which what the distribution looks like when it’s doing training data. And then you have a bunch of challenges where occasionally or the sequence, 1, 2, 3, 4, 5, 6, 7, 8, 9, whatever, has very high probability on the next token. But this is not memorized. So what you want to do is you want to normalize the probability that the model gives you according to some reference distribution of how likely is the sequence actually. And so one thing you can do is you can normalize with respect to a different language model and you can compute the probabilities of one, compute the probability on another model, divide the two, and if one model is unusually confident, then it’s more likely be training data because it’s not just some boring thing. Another thing we have in a paper is just compute the zlib compression entropy and just be like, if zlib really doesn’t like it, but your model really likes it, then maybe there’s something going on here.

Thomas: Okay, hold on, hold on. Let me check myself here. You first and I’ll try to show.

David: No, I’m going to take us off on a compression tangent, so you should finish asking. All right, questions here.

Thomas: So the ATTCK model here is, I’m assuming in like the research that you’re doing, the attack model is I have the whole model, I have this big blob of weights or whatever. And the idea is that that blob of weight should not be revealing sensitive data, right? It’s all just weights and stuff and it’s all very stochastic and statistical and whatever, right. And so what I’m going to do is I’m going to feed the model. Feeding through the model is basically just a set of matrix multiplications through each of those things, right? More than that, but not much more than that, producing an output that is a set of predictions on what the next token is going to be. Right. So what I’m going to do is where normally I would give it very structured inputs, right, which are going to follow, like, not predictable, but like a structured path through the model. I’m going to give it random inputs, which it’s not what it is designed to do. And then I’m going to look at the outputs and I’m going to discriminate between shit that it made up that are just random collections of tokens or whatever, like garbage in, garbage out versus like traces that will show up there with high confidence outputs that reveal I didn’t give you any structure in, and you gave me a confident prediction of this out.

So that was there in the model. You gave me that to begin with. And then it sounds like what you’re saying is one of the big tricks here is what is that distinguishing. How do you look at those outputs and say this is training data versus not training data? Because Dierdre had asked, is it a 50 to 90% thing? And it sounds like that’s generally not the case. It’s not like it literally lights up, here’s the training data. But one thing to do there is diff it against another language model and see if you get like, okay, and then I can just look at the differences between those two things and catch things that were singularly, not just probably singularly inputs to the one language model, but also just the way it was trained. Right. So it could be training on both models.

But like the way the training worked out in this particular random input, it just happens to. That’s awesome.

Nicholas: Yeah, exactly. That is exactly what’s going on. And you know, and yeah, you can try and bias exactly which reference class you’re using. And so, yeah, so another thing you could do is, yeah, just use zlib. I don’t know if you want to go into this. We do. Okay, go ahead, David.

David: Yeah, so this kind of let me go off on a little bit of a tangent here. A while back, I remember reading a blog post, I think it was from Matt Green, where he was like trying to do an introduction to cryptography and like random oracles. And he said he would ask his student, like in his intro crypto class every year or his intro grad crypto of like, you have an amount like a truly random sequence. There is no way to compress it. There’s no way to store it other than to like store the whole sequence. And then like, some student would. He would say this in a way that some student would always like, want to argue with them. And then like, that student loses.

And this is like kind of a fake setup because it’s basically the definition of a random sequence is that you can’t do that. So you can’t really argue that. You could if you’ve defined it out of the problem. But this just kind of always sat in the back of my head. And when I first started seeing, like, attacks on ML, like pre LLMs, probably ones that you were doing, to be honest, like, I. Something kind of connected in my head where I was like, okay, like, we have all this training data and then we have like the kind of the entire world that like, maybe the training data is supposed to represent. And basically we’re creating this model. And this model is like morally equivalent to a compression function in my head in the sense that, like, we have a bunch of data and we want to have a smaller thing that then outputs all the other stuff, which means that, like, there’s some amount of like, actual information that exists in the training data and all of the, like, data we’re trying to represent.

And like, this model can’t possibly, like, if this model is smaller than that data, it’s going to lose information somewhere. And so my mental model of ML attacks like pre LLMs was just like, well, duh. Like at some point you’re losing information. Like, I don’t understand the math. I’m glad that Nicholas is doing it. But like, if you’re losing information at the model step, like, you’re necessarily going to be able to find some input that’s going to give you a wrong answer. Otherwise you would have to store the entire random sequence. Is that a legitimate way of thinking about things? And I bring this up when you mentioned Zlib, because compressing a obfuscated binary and seeing if it compresses the same as the unobfuscated binary is like, I think this was a trick from Halvar Flake on how to do deobfuscation.

Nicholas: Yeah, no, this is a valid, entirely valid way of thinking about this as what’s going on for these things. And yeah, this is like one of the intuitions you may have for why models have these attacks is they’re necessarily this lossy function. Now, there are caveats to this where this intuition breaks down in practice for some reasons where it looks like, okay, so let me give you one fun example of this train. You train your model on your machine, whatever you want, whatever architecture, whatever data, mostly you Want I train my model, my machine, same task, but like my model, my machine, whatever I want. How can I construct an input that makes your model make a mistake? I can construct an input that makes my model make a mistake and then send it to you. And some reasonable fraction of the time, I don’t know, like between like 10 and 50% of the time, it will fool your model too. If it was just the case that it was pure random points that happened because you got unlucky, the hash function hit a different point, you would not expect this to work. But this transferability property shows you that there is actually a little bit of real actual signal here, that there’s actually a little bit of a reason why this input was wrong.

Nicholas: And it’s not just a noise function. But there’s, yeah, something going on here.

Thomas: Are you satisfied, David?

David: Yeah, but I would say, well, clearly no, but the answer was no. But I don’t know, maybe this is a way to kind of pivot into like, hardening against prompt injection in the sense of not in the like, jailbreak. Oh, like ask my grandma to do it or ask the prompt to do this thing for me because I don’t have any hands, which I think is one that I saw on your, your site before. But I was just like, trying to get models to be more accurate. In general, at some point, if you have a lossy function, you’re going to always be able to find something that’s going to give the wrong answer. Because at the end of the day, we’re just kind of computing a rolling average. Is that an accurate way of thinking about things?

Nicholas: I mean, yeah. This is one of the reasons why I think these kinds of attacks are relatively straightforward to do, is you have some statistical function and it is wrong with some probability. And as an adversary, you just arrange for the coins to always come up heads and then you win more often than you lose. I don’t know.

Thomas: I want to get back to the idea of what it takes to harden a model against these kinds of attacks, what it would take to foreclose on them, just because I think that’s another indication I get of the, you know, how serious or fundamental the attack is. And also as an attacker, I like reading those things backwards and working out the attacks. But before we get there, I do want to hit. And this is totally, probably, totally unrelated to the stuff that we were just talking about, but the two things we were talking about, the two earlier, the first two broad attacks that we came up with was poisoning models and then breaking, poisoning Schemes which you had a brief full disclosures bat over, which is another reason I want to hit it. But yeah, so like, there’s a general notion of it’d be valuable to come up with a way to generate poisoned images that would break models, because that would be a deterrent to people scraping and doing style replication of people’s art and things like that. Right. My daughter’s an artist and she’s hearing me saying this out loud and wanting to throw things at me. So I think I understand the motivation for those problems.

Thomas: Right. But what does that look like in terms of the computer science of the attacks and the defenses there?

Nicholas: Yeah. Okay, so there’s a problem of data poisoning, which maybe I’ll state precisely, which is, suppose that I train a model on a good fraction of the Internet. Someone out there, I don’t know, is unhappy with you in particular, just wants to watch the world burn. Whatever the case, they’re going to put something up that’s bad. You train your model including, among other things, this bad data. And as a result, your model then goes and does something that’s bad. And whatever the objective is because dependent on whatever the adversary wants and how often it works is dependent on how much data they can put on the Internet and these kinds of things. This is a general problem that exists.

It’s funny, it didn’t used to be a thing that I really thought was a real problem, because back in the day you would train on MNIST. MNIST was collected by the US government in 1990. The probability of they had the foresight to be like, some people are going to train handwritten digit classifiers on this. In 20 years, we’re going to inject bad data into this so that when that happens, we get to do something bad. This is not an adversary you really should be worried about. But then we no longer are doing that. Now we literally just scrape the entire Internet and just use whatever we can find and train on this. And so now poisoning attacks become a very real thing that you might have to worry about.

Okay, so hopefully that explains the setup for this. Now, what’s the question you had?

Thomas: How do I poison something?

Nicholas: Okay. Okay. Yeah. Okay, great. Yeah. Okay. This is very. Okay, so let’s suppose that I want to make some particular person’s.

When I put in my face, then it is recognized as this person should be allowed into the building. And I think someone’s going to train a model on the Internet and then use this for some access control or something. What I would just do is the very, very Dumb thing of I’ll upload lots of pictures of my face to the Internet and I’ll put the text caption or whatever next to it is like, I’m the President of the United States. And you just repeat this enough times and then you ask the model, who is this person? And it will say, the President of the United States. There’s no clever engineering thing going on for most of these attacks. I mean, it can be done. You can do a little bit of fancy stuff to make these attacks get a little bit more successful. But the basic version is very, very simplistic where you just.

You repeat the thing enough times and it becomes true.

Thomas: Okay, but that’s not what, like, the UChicago poisoning scheme was doing.

Nicholas: Okay, okay. So this is. Yeah, so this scheme is a different paper where what they’re trying to do is they’re trying to make it so that. Suppose that I’m an artist and I want someone to. I want to be able to show people my art because I want people to, like, find my art and pay me, but I don’t want to allow these machine learning models to be trained on my art so that someone can, say, generate me art in the style of person, and then I don’t have to pay the person because the model can already do that. Which is an entirely realistic reason to be frustrated with these models. And I think it will not just be the artists arguing this point in some small number of years because they’ve sort of gotten lucky to be the first people to encounter this problem. But I think this will become a problem for many other domains in the near term too.

But, yeah, so what their poisoning scheme does is it tries to put slight modifications of these images so that I can upload them so they look more or less good to other people, but so that if a machine learning model trains on them, it doesn’t learn to generate the images in that person’s style. So, like, it looks good to people, but it does not. It does not. It cannot be used reasonably well for training of the training models.

Deirdre: It’ll always look like a jank knockoff.

Nicholas: Yeah, well, so the idea is just when someone else says, you know, generate art in the style of Nicholas, it just like, doesn’t. It can’t do that. Like, it’s almost as if I’d never trained on my data.

Thomas: Is this like. Is this a situation where it’s kind of like reverse steganography where, like, the style transfer thing is dependent on a really precise arrangement of the pixels in the image? Right. And if you do things that are psychometrically the same, it looks the same to me, but the arrangement of the pixels is essentially randomized or encode something weird in it, then it’s not going to pick up the actual.

Nicholas: And then when you break it the.

Thomas: Same way that you would break steganography, which is like, if you shake it a little bit, it breaks.

Nicholas: Yeah, yes, yes, that’s exactly. Almost like it’s all right. So. So you don’t just. It’s not so precise this way. But yeah, so the way that you add this noise is you generate noise that’s like this transferable in some way. So it fools your models and you hope that because it fools your models, it will fool other models too. And you do this only at the low level to high order bits of pixels.

And then. Yeah. So what’s the attack? Literally, the attack is shake it and then try again. Is like you take the image and you add a little more noise in some other way or in some cases just rerunning the training algorithm a second time. You can get lucky.

Thomas: But is that practical? If you were OpenAI, would it be practical to build that into your ingestion pipeline?

Nicholas: Yeah. Okay, so I think there’s two reasons why you might want to do this as a poisoning thing. One is I’m worried against someone who is like accidentally scraping my data. Like, they, like, they want to play by the rules. But like, I’ve put my data on my website and I want to make sure and my website has like, you know, don’t crawl this whatever thing. But then someone else copies my images to their website and the crawler crawls theirs. In this case, like, presumably OpenAI is going to play by the rules, more or less. They’re not going to like, you know, do the extra little stuff to like, make sure they can evade these poisoning people.

And I think in that case, you’ve basically done more or less the right thing. But if you’re worried about someone being like, I hate artists, they have wronged me, I am going to intentionally create these models so that I can put the artist out of jobs. Then this person you’re not going to win against. And so the question is, who’s the adversary? I think the adversary is most likely to be the OpenAI adversary. But the problem with these schemes that we found for many of these schemes, so people have done this in the past, not only for art, but for images of people’s face that just like, they don’t want models to learn what they look like because they’re just worried about surveillance kinds of things which also again entirely valid thing to be worried about. The problem with these schemes is that the person who is adding the poisoning must necessarily go first and they need to predict what model is going to be using to train on their data in a year. And it turns out for many of these schemes that if you just change the underlying algorithm that they no longer remain quite so effective at this poisoning objective. So they’re fragile and just like if you get unlucky and it turns out that OpenAI discovers a slightly better method to doing this training that they’re not out to get you but like it just is a change to the algorithms, oftentimes these things no longer work quite as well. And so we’ve seen this.

Thomas: I mean it’s sort of the same sense as like, you know, to break, you know, GIF R attacks when people used to embed Java jars into gifs and things like that, you would just like you would thumbnail and unthumbnail images or just do pointless transformations just because like it was some totally, totally irrelevant other wasn’t because of anything about the image itself. It was just like part of your pipeline became for best practices reasons that and OpenAI could do basically the same thing for whatever reason without and break the scheme. What’s your like as a researcher, your intuition for how successful that avenue is going to be long term?

Nicholas: Yeah. So I’m somewhat skeptical that this is the kind of thing that will work long term. I think this is the kind of thing that you can do in the short term and like will maybe make a little bit of a difference. My concern with this line of work is I don’t want people, there’s a kind of person who is worried about this kind of thing and as a result will not upload their images to the Internet, for example. And then someone tells them, I have a defense for you which is going to work and will protect your images. You should upload your images with my defense and therefore you will be safe. And then they do this and then a year and a half goes by, the algorithms change and then it turns out that they’re now in a worse off position because they otherwise would have been safer. But they relied on a thing that does not work.

And this is, I guess my primary concern with these kinds of things is I think if you’re in a world where you’re already going to put it online, it literally can’t hurt you. It’s just like it’s not going to make things any Worse. But if you’re in a world where you’re a paranoid person and your next best alternative was not put it on the Internet and now you’ve decided to do it because you trust this thing now I think there are problems and so I think this is the kind of stuff that people should use but be aware what they’re actually getting and what they’re not and not just use blindly.

Deirdre: Yeah. As opposed to like encode or whatever. Render your art in this way. It is untrainable. It’s more like eh. It’s.

Nicholas: We tried our best.

Deirdre: Yeah. It’s like a best effort and like maybe.

Nicholas: Yeah. And it’s the kind of thing where.

Deirdre: Your stuff behind a Patreon or whatever.

Nicholas: Yeah, right, exactly. This is the kind of thing where you know, almost like, you know, imagine that, you know, we’re going to encrypt things but everything is going to be like limited to like, you know, let’s say like a, I don’t know, 70 bit key or something. Yeah, well like you know, someone wants it, like they’re gonna get it. It will delay them a little while maybe but like it’s like good to do like it’s strictly better to encrypt under 70 bit key than not. But like, you know, it’ll take someone who really cares a little while and they’ll get the data back. And so like if your alternative was, was not then like yeah, I guess do it. But if it’s like, you know, if it like is important to you that no one gets it, then we don’t yet have something that actually adds significant security here. Yeah.

Thomas: So if I’m building models, if I’m doing like production work on this at a very low level at a place like you know, anthropic or OpenAI, open API. What is your work telling me I should be doing differently right now?

Nicholas: What am I?

Thomas: Which of. Besides not exposing logits directly as outputs?

Nicholas: Yeah, I don’t know. I think a big thing is like the attacks we have right now are quite strong and we have a hard time doing defense stuff. I think this goes back to one of the things you were mentioning a little while ago of like what, what, what, what works? We don’t have great defenses even in the classical security setting. The classical just make the image classifier produce the wrong output things that we’ve been working on for a decade. Dave Evans has a great talk. He has a slide where he talked about, okay, so cryptographer things work when it’s 128 bits of security and cryptographer 126 bits of security. Like, you know, like, run for the hills, destroy the algorithm, start again from scratch. Like, okay.

But like, okay, if someone were to do an attack on AES like that, like, worked for the full sort of full round, like, someone would start to try and have another version at some point. Like, people would get scared. Like, you know, small numbers of bits. Okay. Okay. In system security, you usually only have, I don’t know, let’s say like, 20 bits. Like, you know, stat canary pointer authentication, something like this. Like, you have some reasonable number of bits and it’s broken.

If it’s five bits of security or something, you have to really get it down. Okay. In machine learning, when you have defenses that work, the best defenses we have that work in image classifiers have one bit of security. The best defenses, they work half the time. And when I have an attack paper, the attack paper brings it down to zero bits. It works every time. So this is the thing that we’re operating in. Even if the best defenses we have now were to be applied, like, it goes from like, you know, they’re tissue paper not working to working half the time.

So, like, you know, we’re like, really in a space where, like, these are things, the things that we have as tools are not reliable. And so I think, you know, the big thing people are trying to do now is to build the system around the model so that even if the model makes a mistake, like the system puts the guardrails in place.

Deirdre: Yeah.

Nicholas: So that, you know, it just, like, won’t allow the bad action to happen, even if the model tells it to do the dumb thing or something like this. And this, I think, is for the most part the best that we have. There is hope that maybe with these language models, there are smarter things you can do. This was very early. I don’t know how well this is going to go, but there is some small amount of hope that because language models are. I hate the smart in some sense is the argument people make. They can think their way out of these problems or something like this. This is like, they see people that.

Deirdre: A human doesn’t likely to see or something like.

Nicholas: Yeah, and so. Right. But like, I think, you know, fundamentally, the tools we have right now are very brittle, and at some point maybe we’ll fix the problem in general. But for right now, we’re just trying to find ways of using them so that you can, like, make them work in many settings which, like, so, for example, if you just don’t use them in Settings where when mistakes happen, bad stuff goes wrong. Like this is fine, like this, like this is like all these chat bots right now, right? Yeah, okay, yeah, let’s not do that. But like, you know, like if I’m going to use the, one of these chatbots now and like I’m asking if like write some code for me. Like, you know, either maybe it makes a mistake or an adversary is trying to like, you know, go after me or whatever. Like if it’s going to give me some code, I’m going to look at the output, I’m going to verify and then I’m going to run it.

Yeah, and like this is like it’s not like going to be a fundamental thing but if instead I was just like, you know, you know, ask a question, pipe to language model, pipe to like pseudo bash. Like, you know, this would be bad like if you know, like this was like the thing that you had in your, like somewhere in your, like that like actually problems would go wrong. And so like I think the thing people are doing right now is just like, well just don’t do that. Like just not take, let the model take action. But we are very slow, very maybe more quickly than I would want moving into a world in which this is something that actually does happen and these attacks start to start to matter in these ways. And I think this is a big part of what we’re trying to figure out is at the very least what I would like to not be surprised by attacks. This would be if you can’t fix anything, let’s at least be in a world where the person makes a decision and says, I know this might be risky, I am going to do it anyway as opposed to here’s a thing that I’m going to do and I didn’t realize I could be shooting myself in the foot, but I have. And at the very least there is some acceptance of the risk.

And a person has sort of weighed the trade offs and decided it is worth it in some small fraction of cases that maybe they get attacked because maybe it turns out that the value you get from doing this is just very, very large. And I am willing to pay the cost of fraud in order to have some other amazing thing.

Deirdre: And do you think there’s any correlation between the size of these models and the size of their training set on the likelihood of things going wrong? If you have a much tighter target with a much tighter type of capability and a much tighter narrower training set, is it less likely to just suddenly go off in a random direction? Because for reasons or whatever.

Nicholas: Yeah. So it is the case that larger things are more or less harder to attack, but it’s not by orders of magnitude. The smallest handwritten digit classifier things we have, we still can’t get that problem. Right. Like, we still can’t better than humans classify handwritten digits. And we ought to be able to figure this out. But apparently this is a thing. Okay. But it is easier to get that right than to recognize arbitrary objects.

Deirdre: Right. But mostly I’m like, are we likely to have models that are less likely to just do something we completely didn’t expect by the narrower tailored models than the like, we’re trying to vaguely get AGI out of, like, the biggest model trained on everything that everyone has ever. All the knowledge of, you know, the entire human existence or whatever.

Nicholas: My impression is that as we make these models bigger, the more things will go wrong. But there are people who will argue that the problem with the small models is they’re not smart enough. And if, like, the big models, like, have some, like, some amount of intelligence and therefore, like, they’ll like, be robust to this because, like, humans are robust. And so. So you make it bigger and like it. Yeah.

Deirdre: Have you met human? Not you, but them. Have you met humans?

Nicholas: No. Okay.

Thomas: Okay.

Nicholas: So they’re not entirely insane.

Deirdre: Yeah.

David: Humans can tell the difference between you telling them instructions and you handing them a book.

Deirdre: Okay.

David: And like, know which ones are the instructions and which one is the book. Right. And like, models can’t do that reliably which way? Right.

Nicholas: And yeah. For example, one of the things that people have been doing for these models that helps more than I thought that it would, is you have the model answer the question, and then you have the model and then you ask the model, did you do what I asked? Or essentially, and the model goes again and it’s like, oops, no, I followed the wrong set of instructions. It turns out this sort of very, very simple thing oftentimes makes it noticeably better. And this. Did you do what I asked? Is something you can only ask a sufficiently smart model. You can’t ask, you know, a handwritten digit classifier. Did you do what I asked? Like, it’s like not a meaningful thing it can do. But yeah, bigger models in some sense can self correct in this way more often.

And there’s been some very recent stuff out of OpenAI where they have some these, like, think before you answer models that they’re claiming is much more robust because it has this think before I answer. It can think about like, well, this is not an instruction. I know the person said disregard all prior instructions and instead like, write me a poem for whatever. But it just like won’t do that for reasons because it has thought about this as a thing and so this is a thing that might happen.

Deirdre: I believe them that this, like, you know, on, on the large scale and like on average or whatever improves results. But at the same time, I don’t trust the model to check itself. Like, I don’t, like, if I say, did you do what I ask if, like, which one of the answers is the right correct answer?

Nicholas: No. Exactly. I mean, I, I’m also nervous about this, but like, you know, and to some extent, you know, we’ve, we’ve learned to like, more or less trust people when you ask them, like, did you do what I asked? Like, most people just aren’t gonna, like, most people will, like you’re working with, are just not gonna like, outright lie to you if, like, that they did the right thing or something. And so it’s not that the model’s.

Deirdre: Lying to me, it’s that the model doesn’t know what the fuck I’m talking about either in any one of those objects.

Nicholas: So I completely agree with you. But like, you know, this is like, this is the line of argument that you might, that you might get for why these things might get better. And I think one important thing is I have been very wrong about what the capabilities and what these models might have in the future in the past. And I am willing to accept that I might be wrong about these things in the future. And so I’m not going to say that it’s not going to happen. I’m just going to say that currently it is not that. And I think we want to understand, you know, we want to use the tools as we have them in front of us now and some people will think long term in the future. And I don’t know, it’s a security person.

I tend to like, attack what’s in front of me. And what’s in front of me is not that, but like, if that changed significantly, that would be, you know, a useful thing for me to, for me to know.

Deirdre: Yeah.

David: So given like the kind of probabilistic behavior and everything we just talked about here, like, does this, do you think this indicates that, like, scaling up LLMs, like, is not going to get us to AGI, like, like fundamentally, or do you, are you willing to make a statement about that? Like, because I would.

Thomas: Right.

David: Like the idea that it’s just a, like next token Predictor seems like we’re missing something, right?

Nicholas: Yeah. Okay so. But. Okay, so I would have given that answer to you four years ago but in that, in that answer I would have not expected the current level capabilities if you had asked me four years ago will a model be able to write a compiling Rust program?

Thomas: That was very difficult. We do so much LLM generated stuff generating Rust very difficult for LLMs.

Nicholas: Okay, I agree but it can be done. This is, I agree very.

Thomas: It’s not even usable by AI’s.

Nicholas: Very.

Deirdre: Teeny tiny Rust programs with very little.

Nicholas: As long as you have but the fact my intuition for next token predictor was very good statistical model of the data but could not have any actual ability to help me in any meaningful way. And I would as a result have said that where we are today is.

Thomas: Not possible today I can tell Claude or whatever I need to build a one to many ssh to WebSockets proxy and it will do that well from scratch with nothing but prompts. Just like three or four prompts.

Nicholas: Yeah, exactly. Right, right. And so this is. Yeah, okay, so this is my statement is to say I don’t know what’s going to happen in the future, but the world that we’re living in right now based on the impression of just statistical language model was wrong in the past. Now I still believe maybe wrongly that we’re not going to get to AGI on this sort of trend at least this is like. I don’t know, the skeptical person in me is like it, I don’t know, feels like we’ll run into some wall, something’s going to happen. I don’t know. Data training exponentials don’t always go forever.

This kind of thing. It feels to me like something would go wrong. But I would have said the same thing five years ago and I would not have told you that what we can do today is possible. So I have much less confidence in the statement I am making now. And I think that mostly it’s maybe just me being wanting it to be true and not actually something that I believe in some sense. But I don’t know, I. I have been very wrong in the past and I feel like I still think it’s the case but I’m willing to accept that I may be wrong in the future in this way.

Thomas: And of course the correct answer is there will never be AGI. This stuff is unbelievably cool. I think one of the things I like about it most is that a lot of the background technique stuff for how you’re coming about this is that we come from kind of the same basic kind of approach to this stuff. Right. Is like pen testing, finding and exploiting vulnerability stuff. And a lot of the machine learning and AI model stuff is kind of inscrutable from an outsider. And kind of the same way cryptography was for me too. Right.

A lot of the literature for that stuff. If you don’t know which things to read, it’s really hard to find an angle into it. But the work that you’re doing is for somebody like me, just a perfect angle into this. Right. You basically have the equivalent of a CBC pattern Oracle attack against an AI model. And working out how that actually works involves learning a bunch of stuff, but it’s a really targeted path through it. I think it’s unbelievably cool stuff and I’m thrilled that you took the time to talk to us about this. This stuff is awesome.

I’ll also say you wrote a blog post which I think pretty much everyone in our field should read, which is how I use it. I’m. I got the title wrong. That’s like How I use LLMs to Code or whatever.

Nicholas: Yeah. I don’t remember what it called it. Something like this.

Thomas: Wherever I talk to. At this point, whenever I talk to an AI skeptic, it’s like, go read this article. And it’s a 100% hit rate for converting people to. If not like sold on just the idea of AI and LLM, it’s just like, yeah, clearly this stuff is very useful.

Deirdre: Yeah.

Nicholas: I wrote this for myself three years ago. I was like the. I used to not believe these things were useful. I tried them. They became useful for me. It’s the same idea as giving this whole AGI kind of whatever thing. I don’t believe that this is something many people did not believe this would be possible. The world we’re living in today and you ought to just try it and see.

Nicholas: And I think the way that many people look at these tools and say but it can’t do. And then they list obscure things that it can’t do. That like, when was the last time you really needed to count the number of Rs in Strawberry? Tell me honestly. People will still go on this as a thing it can’t do, but it can’t correctly do arithmetic. Do you not have a calculator?

Deirdre: This is the problem with calling it AI though.

Thomas: This is the way that we’re going to defeat it when it becomes runaway AGI is that we’re going to defeat it by Knowing that it can’t count the number of Rs in Strawberry. So it is important to call these things out.

Nicholas: No, exactly. No. Right. I think, yeah. The terminology around everything is terrible. As soon as something’s possible, it’s no longer AI. The whole way of everything has always been. But I think that people should be willing to look at what the world actually looks like and then adjust their worldview accordingly.

And you know, you don’t have to be willing to predict into the future far. Yeah. And just as long as when a new thing becomes capable, people don’t look at it and say that can’t be done. Or like I think the most useful thing for people to do is to like actually think for themselves to make predictions about like what would actually surprise me if it was capable now because then in a couple of years they can actually check themselves a little bit where if the thing that you thought wasn’t possible, it still isn’t possible. Good. You’re relatively calibrated for your thoughts. But if you’re. No way is a model going to be able to do X and then someone trains something that makes that happen, you should be willing to update that.

Okay. Maybe I’m still a little confident in this one direction or the other and it’s generally a useful way of getting that.

Deirdre: Speaking of predictions.

Thomas: Yeah.

Nicholas: Yes.

Deirdre: Do you have predictions of the next n years of attacking LLMs or do.

David: You perhaps have an aggregator of predictions?

Nicholas: I do not know what I will expect for the next 10 years of attacking models. My expectation is that they will remain vulnerable for a long time and the degree to which we attack them will depend on the degree to which they are adopted. Where. If it turns out that models are put in all kinds of places because they end up being more useful than I thought or they become reliable enough in benign settings, then I think the attacks will obviously go up. But if it turns out that they’re not as useful and when people actually try and apply them, they just on auto distribution data, they don’t get used whatever, then I expect that the attacks will not happen quite the same way. I think that it’s fairly clear where we are now that at least these things in front of us are useful and will continue to be used. It’s not like they’re going to go away. I just don’t know whether they’re going to be everywhere or going to be in the place where people who need them for particular purposes use them for those purposes and not anyone else.

And I think the world in which they go everywhere. It doesn’t have to be because everyone wants them. It might just be the. We live in a society where it’s a capitalistic world where they’re just more economically efficient than hiring actual human. And so the companies are going to be like, I don’t care. You get a better experience on customer support talking to a real human. The model is cheaper, and even though it makes an error, one out of a thousand cases, that’s less than the cost of the person was in order to do the thing in that world, I think attacks happen a lot. And so it’s very dependent on.

On how big these things go, which I think is hard to predict.

Thomas: So it turns out it is possible to cryptanalyze an LLM, which is an unbelievably cool result. Thank you so much for taking the time to catch us. This is amazing.

Nicholas: Yeah, yeah, thanks. Of course.

Deirdre: Thank you! All right, I’m going to hit the tag. Security Cryptography Whatever is a side project from Deirdre Connolly, Thomas Ptacek and David Adrian. Our editor is Nettie Smith. You can find the podcast online @scwpod and the hosts online @durumcrustulum, @tqbf and @davidcadrian. You can buy merch online at merch.securitycryptographywhatever.com. If you like the pod, give us a five star review wherever you rate your favorite podcast. Also, now we’re on YouTube, with our faces, our actual human faces, human faces not guaranteed, on YouTube. Please subscribe to us on YouTube if you like to see our human faces. Thank you for listening!

Cryptanalyzing LLMs with Nicholas Carlini

Latest Posts

Vegas, Baby!

E2EE Storage Done Right with Matilda Backendal Jonas Hofmann and Kien Tuong Truong