What Does AI Need to Conduct Automated Code Review at Scale?
Chris Anley of NCC Group on LLMs, False Positives and DARPA's AI Cyber Challenge
How can generative artificial intelligence be adapted to automatically pinpoint and fix software vulnerabilities in large amounts of critical code?
Finding answers to that question is one of the "exciting prospects" tied to the AI Cyber Challenge recently announced by the White House, said Chris Anley, chief scientist at British cybersecurity consultancy NCC Group.
The $20 million challenge, run by the U.S. Defense Advanced Research Projects Agency, features Anthropic, Google, Microsoft and OpenAI contributing not just infrastructure but also in-house expertise to help guide participants (see: White House Debuts $20M Contest to Exterminate Bugs With AI).
"No doubt, there's going to be a huge amount of really exciting research that's going to come out of this," Anley said. "It's a really exciting effort squarely focused on a problem that's essential for the U.S. and U.K. national security and the security of our allies."
Participants will face challenges in the DARPA contest. Anley said large language models such as ChatGPT shouldn't currently be used to conduct code review, not least because of their propensity to hallucinate and make up answers. But finding ways to marry existing static code analysis tools' capabilities with LLMs could ultimately facilitate large-scale automated AI-driven code reviews.
In this video interview with Information Security Media Group, Anley also discussed:
- Current challenges that preclude DevSecOps teams from using AI to conduct code reviews;
- The potential offered by marrying code analysis tools with LLMs' natural language interface;
- How AI could help overcome limits with static analysis tools, provide targeted guidance and markedly reduce false positives in code review.
Anley is chief scientist at NCC Group. He has been carrying out security audits since 1996, performing thousands of penetration tests, code reviews and design reviews on a variety of platforms, languages and architectures for many of the world's largest companies. He promotes, advises and assists with NCC Group research programs, and he carries out independent research into new and emerging security threats.
Note: This transcript has been edited for clarity.
Mathew Schwartz: Hi, I'm Mathew Schwartz with Information Security Media Group. Earlier this year, the White House announced a new AI Cyber Challenge. The focus is to rapidly find and fix flaws, using artificial intelligence and machine learning. Joining me to discuss this initiative is Chris Anley, chief scientist with NCC Group. Chris, thanks for being in the studio today.
Chris Anley: Well, thank you very much, Mat. It's a pleasure to be here.
Mathew Schwartz: Great to have you here. Now you have published a fair amount of research on large language models such as ChatGPT, looking at what they can be used for and what they can't be used for. And with that approach, if you will, we've been talking about secure coding for a long time. Do you think AI/ML is potentially well suited? Or is there a question mark there, when it comes to helping address some of these things that we're going to get into about the DARPA challenge that the White House has announced?
Chris Anley: Absolutely. So it's interesting, I think large language models give us new capabilities in code review. There are questions over their reliability. With AI code review, folks are probably familiar with the tools by now or at least have heard about them. There's a behavior called hallucination, where the model will produce an apparently factual response to a prompt where the fact is entirely fabricated. And obviously, if you're looking for security vulnerabilities, or you're writing code, that's the very last thing you want. Because one incorrect line in a 100,000-line program can cause the entire thing to collapse. So the factual accuracy of the models is extremely important. But having said that, the flexibility and the natural language interface of language models can really help provide some new capabilities, especially in terms of understanding broader context. So for the first time, we can integrate bug reports that are written in natural language, and then present some code and arguably have some kind of connection between the two, which is extremely useful. So it's that sort of bringing together different sources of information, and integrating things. It's not just a large language model in isolation, looking at code. And so that's the exciting part, I think, or one of the exciting ones.
Mathew Schwartz: What I've been hearing, from talking to experts in domains where they might be able to use generative AI tools, is that they are best seen - at least currently - as a supplement for experts, helping guide them or helping ensure they're using their time to the best advantage. And I really want to dive into that with you in this discussion. I wonder, though, if we should step back. And maybe I should have asked at the very beginning if we should baseline a little bit exactly what we're talking about when we're talking about LLMs - large language models - and how they can be applied across various domains, including security?
Chris Anley: I'll give a brief explanation of how large language models work. So first of all, what they do is they take as input a sequence of words - and we should say tokens, which is sort of words or word fragments, but we'll just say words, because that's easier. So we take a sequence of words as input, and the model outputs, for all possible words, the likelihood of a given word being the next word. So "the cat sat on the," and then it outputs all possible words, and "mat" would have a high probability of being the next word. And if you want to generate a longer sequence, you just take the entire sequence plus that new word, and then you submit all of that as the input again, and then you repeat and repeat and repeat and you get a longer and longer output until the model says, "end of sequence, we're done, that's all I have to say on this," or until you choose to stop. So that's what they do.
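The generation loop described above can be sketched in a few lines of Python. This is a toy illustration only: the probability table, the two-word context and the names `NEXT_WORD_PROBS`, `predict_next` and `generate` are all assumptions made for the example. A real model scores every token in its vocabulary using the whole input sequence, and usually samples rather than always taking the most likely word.

```python
END = "<eos>"  # hypothetical end-of-sequence marker

# Toy stand-in for a trained model: maps the last two words to the
# probability of each candidate next word. Values are invented.
NEXT_WORD_PROBS = {
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
    ("cat", "sat"): {"on": 0.8, "down": 0.2},
    ("sat", "on"): {"the": 0.9, "a": 0.1},
    ("on", "the"): {"mat": 0.7, "sofa": 0.3},
    ("the", "mat"): {END: 1.0},
}


def predict_next(words):
    """Return a probability for each candidate next word."""
    return NEXT_WORD_PROBS.get(tuple(words[-2:]), {END: 1.0})


def generate(prompt, max_new_words=10):
    """Repeatedly append the most likely next word and feed the result back in."""
    words = prompt.split()
    for _ in range(max_new_words):
        probs = predict_next(words)
        next_word = max(probs, key=probs.get)  # greedy choice
        if next_word == END:
            break  # the model says it is done
        words.append(next_word)  # the longer sequence becomes the next input
    return " ".join(words)


print(generate("the cat"))
```

The `max_new_words` cap plays the role of "until you choose to stop" in the description above.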
A couple of questions occur with this. People have conversations with these models. How does it remember what I've said earlier on in the conversation? The answer to that is simply that the entire conversation is provided as the prompt. So the entire conversation in one input is provided to the model. And then it responds from that point on. So that's the "what do they do?"
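As a rough sketch of that point, the snippet below flattens a conversation into one prompt string. The `speaker: text` layout and the function name `build_prompt` are assumptions for illustration; real chat systems each use their own prompt format.

```python
def build_prompt(history, new_user_message):
    """Flatten the whole conversation into a single text prompt.

    history is a list of (speaker, text) pairs. The "speaker: text"
    layout is a hypothetical format chosen for this example.
    """
    turns = history + [("user", new_user_message)]
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    lines.append("assistant:")  # the model continues from this point
    return "\n".join(lines)


history = [
    ("user", "My name is Sam."),
    ("assistant", "Nice to meet you, Sam."),
]
print(build_prompt(history, "What is my name?"))
```

Nothing is stored between calls: the earlier turns are simply part of the text sent each time, which is also why long conversations eventually run into a model's token limit.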
How do they do it is equally straightforward. Perhaps unsurprisingly, we use a technique called deep learning. And this involves training a neural network using a large number of the sequences, a very large number of the sequences. In the case of large language models, all of the text written by anyone, anywhere over the last few centuries. We're talking about a very large volume of text. And then we take each sequence, and we input it into the model, we measure the difference between what we expected it to predict as the next token, because we know what the next token is. So we measure the difference between that and what we actually saw. And then we apply corrections to the network to the weights in the network, and we propagate those corrections back through the network.
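The "measure the difference, then apply corrections" step can be illustrated with a deliberately tiny example. This sketch assumes a three-word vocabulary and a single adjustable score (logit) per word for one fixed context, so there is no network to propagate corrections back through. It shows only the core update, with cross-entropy as the measure of difference. The vocabulary, learning rate and function names are invented for the example.

```python
import math

VOCAB = ["mat", "dog", "moon"]  # hypothetical candidates after "the cat sat on the"


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent step on cross-entropy loss for one example."""
    probs = softmax(logits)
    loss = -math.log(probs[target_index])  # difference between prediction and truth
    # Gradient of the loss with respect to each logit: prob - one_hot(target).
    new_logits = [
        logit - learning_rate * (p - (1.0 if i == target_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]
    return new_logits, loss


logits = [0.0, 0.0, 0.0]  # start with no preference between words
losses = []
for _ in range(20):
    logits, loss = train_step(logits, VOCAB.index("mat"))
    losses.append(loss)

print(round(losses[0], 3), round(losses[-1], 3))
print([round(p, 3) for p in softmax(logits)])
```

Each pass nudges the scores so that the known next word becomes a little more likely. A real model applies the same kind of correction to a very large number of weights across many layers.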
The network gradually learns to predict, or gradually statistically drifts toward predicting, the next token as accurately as possible. So the response is slightly better the next time, and then we put more text in and it's slightly better and slightly better, and so on. So the deep learning part of this is that as this process continues, the weights in the model begin to drift towards representations that help it more accurately predict the next word. For example, some words tend to be used in the same part of a sentence. So, nouns or verbs. Now the model wouldn't represent that as nouns or verbs, it would just have weights that pointed it towards a group of words as the next possible word. So, "the cat sat on the," analogy goes here, kind of thing. And then the magic of deep learning is that as the training process proceeds, and it becomes harder and harder for the model to get better and better, the abstractions that it forms become higher and higher level. So you start off with, perhaps verbs, nouns, groups of words that hang together, and then you have portions of sentences or phrases, conjunctions, and negations. And then into things like the past, present and future tense, and then into perhaps styles - prose, song, poetry, speak like a pirate, whatever the style of language.
And this then gets to the point where the text can be happy or sad, or joyous or stern or whatever, and even left wing or right wing, and some, some fairly abstract properties of the text get teased out. So, for example, you could take ChatGPT, and ask it to describe a forest in a paragraph. And it will tell you about the dappled sunlight through the trees, and the babbling brook, and the little bunny rabbits and all of those things. But then you can ask it to make its description more right wing, for example, and it will then talk about responsible management of resources, managing your forests, responsible land ownership, perhaps hunting as a leisure activity, independence, the importance of outdoor activities. If you ask it to make it more left wing, it might talk about the importance of preserving natural spaces for the common good, the importance of green spaces for mental health, or whatever. The model has drifted towards that concept. So, very high-level abstractions end up being represented in the weights in the model. And that's surprising.
So there's two surprising things about this. The first thing is deep learning itself. It's surprising that the models drift towards these nested, higher and higher levels of abstraction. That's quite surprising. But the second surprising thing is that we're able to calculate with words in this way, right? It's surprising that we get such a good result simply by supplying such a large amount of text and having these abstractions get teased out of the text. Then we arrive at a point where we can ask the model to write a poem about whatever's in the news today, and it will. That's sort of the surprising thing, I suppose: the mathematical, logical nature of our language, how straightforward it is, and the fact that we can calculate with words in this way.
So that's what they do and how they do it. It's a very oversimplified explanation, but that's the gist. Then we apply this to code. And this is where things start to become quite interesting, because code is a mathematical expression of a sequence of actions in an imperative programming language. So do this, then do this, then do this. And in code, again, we build nested abstractions, we have functions that do a thing, and then a function calls that function, and so on and so forth.
There's potential for the discovery of the underlying abstractions in code in a way that wasn't present previously, simply because of the way that deep learning works, and the way that large language models work. So that's a sort of future potential.
So, coming back down to Earth for a moment, the ways in which these technologies can be directly applied are through natural language, the bridge between natural language and code, those areas where we have verbal bug reports, or comments. All of these things that we weren't previously able to parse in static analysis tools, we can now get useful signals from, about the security properties of code. In particular, things like comments about, "we'll fix this," or "we should probably add authentication here," or something like that. That comment, in a large language model, becomes a very strong signal for where there might well be a security issue. So there's some exciting possibilities in terms of the combination of static analysis and large language models, and large language models' capacity to potentially incorporate natural language text around the security issues. So your GitHub list, or your bug tracking database, whatever it might be, all of that long-form text that's been written about a security issue, can become useful now. Whereas we couldn't really use it before, other than having a human reader reacting to whatever that comment might be.
Mathew Schwartz: Great explanation. Thank you. There's a lot of excitement around the possibilities that this offers. And I know that you've conducted some security tests around ChatGPT, to see where some of the limits are today, and to question what will need to change to apply this more broadly. Obviously, you're the one who's published these results. But one of the things that jumped out to me was that you noted that generative AI often has a tough time seeing the big picture. I think you were alluding to this just a little bit before, when you were talking about how "authentication would go well here." These are a little more targeted sorts of interventions in a code base, as opposed to writing a whole code base from scratch.
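To make the idea of comments as a signal concrete, here is a minimal sketch that pulls out comments hinting at unfinished security work. It uses plain keyword matching as a stand-in, not a large language model, and the hint phrases and the names `HINT_PATTERN` and `flag_comments` are assumptions for illustration. The comment detection is naive: it treats any `#` or `//` as the start of a comment, so it will misfire on URLs and string literals.

```python
import re

# Hypothetical hint phrases; a real system would not rely on a fixed list.
HINT_PATTERN = re.compile(
    r"\b(todo|fixme|hack|we'll fix|should probably|add auth\w*)\b",
    re.IGNORECASE,
)
# Naive comment detector: everything after '#' or '//' on a line.
COMMENT = re.compile(r"(?:#|//)\s*(.*)")


def flag_comments(source):
    """Return (line_number, comment_text) for comments that contain a hint."""
    findings = []
    for number, line in enumerate(source.splitlines(), start=1):
        match = COMMENT.search(line)
        if match and HINT_PATTERN.search(match.group(1)):
            findings.append((number, match.group(1).strip()))
    return findings


SAMPLE = """def delete_user(user_id):
    # TODO: we should probably add authentication here
    db.delete(user_id)  # remove the row
"""

for number, text in flag_comments(SAMPLE):
    print(number, text)
```

A keyword list like this only catches phrasings someone thought of in advance. The promise described above is that a language model can pick up the same kind of signal from free-form text without one.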
Chris Anley: Yeah, so the big-picture point is that depending on the model you're using, there can be a fairly short token limit on the amount of text that you can have in that prompt. The size of most modern code bases far exceeds that token limit. Finding innovative ways to focus in on a piece of code, and yet retaining in the prompt the broader context, is going to be important, I think, in applying these tools.
Mathew Schwartz: And is that token limit to do with something inherent in the model of the generative chatbot that you're working with? Do you think that that will get longer or does that become so computationally intensive that we would have been sacrificing entire rain forests with each query?
Chris Anley: You're absolutely right: there are computational concerns about the size of models and all of that, but there are architectural answers to the token limit problem to greatly extend it from, say, 4,000 tokens up to 32,000 tokens and beyond. But I think it's fair to say there's still a disconnect between the size of a code base you would want to review and the token limit of a large language model. So that problem of carrying the context with the prompt? There's going to have to be some engineering there in order for these tools to be usable.
Having said that, they do a respectable job. They'll point a human code reviewer in the right direction, provided the security vulnerability is at the level of a single function, or a handful of functions. So you can take a small bit of code that you're concerned about, and then you've got a good chance of getting a pointer in the right direction, in terms of code review. There are problems - as I said in the paper that I published - in that they sometimes miss issues. There are problems in that they sometimes falsely identify issues. But this is across the board with large language models; hallucination is an issue. But in the hands of a human code reviewer, someone who knows what they're looking for already, and just needs help in locating the specifics, I think they definitely can be useful to point you in the right direction. So, as you were saying, it's more of a decision-support tool than a decision-making tool, a supplement.
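One simple piece of that engineering can be sketched as follows: split a source file into function-sized pieces and check each against a token budget. This is a minimal sketch under several assumptions. It handles only top-level Python functions, it counts whitespace-separated words as a crude stand-in for real tokens, and the names `rough_token_count` and `chunk_functions` are invented for the example. It does nothing to carry the broader context along with each piece, which is the harder part of the problem.

```python
import ast


def rough_token_count(text):
    """Crude stand-in for a tokenizer: count whitespace-separated words."""
    return len(text.split())


def chunk_functions(source, token_limit):
    """Split Python source into top-level functions and sort them by budget.

    Returns (chunks, oversized), each a list of (function_name, source_text).
    """
    tree = ast.parse(source)
    chunks, oversized = [], []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            text = ast.get_source_segment(source, node)
            fits = rough_token_count(text) <= token_limit
            (chunks if fits else oversized).append((node.name, text))
    return chunks, oversized


SOURCE = (
    "def small():\n"
    "    return 1\n"
    "\n"
    "def big():\n"
    "    return 'a b c d e f g h i j k l'\n"
)

chunks, oversized = chunk_functions(SOURCE, token_limit=6)
print([name for name, _ in chunks], [name for name, _ in oversized])
```

Functions that fit can be reviewed one at a time, while anything in `oversized` needs a different strategy, such as further splitting or summarizing.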
Mathew Schwartz: For somebody who already has some domain expertise?
Chris Anley: Absolutely.
Mathew Schwartz: On to DARPA's AI Cyber Challenge. Obviously, you don't bring in DARPA if you're not aiming high, trying to solve some problem that has eluded our ability to solve it for a long time. In this case, what will it take to make this a reality? If we knew, we wouldn't need DARPA, would we? So when you look at this cyber challenge, there is no guarantee it will deliver. But if it does, what would be some of the big points you think would need to happen, for this to work?
Chris Anley: I mean, looking at the launch announcement, it's an incredibly, incredibly interesting project. There are some amazing partners in industry, DARPA themselves, as you say, are an incredible organization. So if anyone, if any organization, can address this, they're it.
With the challenge itself - the competition, the level of funding of $20 million - if you look at the revenues of large static analysis tools for security firms that exist at the moment, it's not really comparable. It's clear that this is a prize fund and a competition that people are entering. This isn't the scale of investment of a large SAST tool.
Having said that, there were comments in the launch announcement that I thought were really interesting, especially from Dave Weston at Microsoft, who runs the Operating System Security Division. He was pointing out that one of the most potentially useful applications of large language models is to bridge the gap between existing SAST tools and large language models' text understanding. Specifically, he mentioned false positive reduction, which is absolutely, 100% correct.
We audit software on a daily basis, we find hundreds of bugs. And those hundreds of bugs are the result of winnowing down many, many false positive results. So Dave's comment on that is absolutely right. And that is one of the exciting things that we may well be able to do, is to whittle down the false positives, again, perhaps using the interaction between language and code as logic. At the moment with static analysis tools, we have approaches based on regular expressions, searching texts, particular constructs, sometimes parsing the text. So going most of the way through to compiling the text, even compiling the code, creating these intermediate representations, and then reasoning about that in an automated way.
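The regular-expression end of that spectrum, and the false positives it produces, can be shown with a short sketch. The function list, the sample C code and the names `naive_scan` and `scan_ignoring_line_comments` are assumptions for illustration. The second function is a simple rule-based filter, not the language-model triage discussed above, and splitting on `//` would itself misfire on string literals containing those characters.

```python
import re

# A few C functions historically associated with memory-safety bugs.
DANGEROUS_CALL = re.compile(r"\b(strcpy|strcat|sprintf|gets)\s*\(")


def naive_scan(c_source):
    """Flag every place a risky function name is followed by '('."""
    return [
        (number, match.group(1))
        for number, line in enumerate(c_source.splitlines(), start=1)
        for match in DANGEROUS_CALL.finditer(line)
    ]


def scan_ignoring_line_comments(c_source):
    """Same scan, but drop text after '//' first to remove one kind of false positive."""
    stripped = "\n".join(
        line.split("//", 1)[0] for line in c_source.splitlines()
    )
    return naive_scan(stripped)


SAMPLE = """void copy(char *dst, const char *src) {
    // strcpy(dst, src) was replaced in the last release
    strncpy(dst, src, 15);
    strcpy(dst, src);
}
"""

print(naive_scan(SAMPLE))
print(scan_ignoring_line_comments(SAMPLE))
```

The naive scan reports the commented-out call on line 2 alongside the real one on line 4. Every such case is something a human has to winnow out by hand, which is why false positive reduction matters so much at scale.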
Bridging the gap between the natural language and that logical representation that we have today in SAST tools is a really exciting prospect. I think, as always, Dave Weston is right on the money. It's exactly where the benefit is. I think it's going to be hugely exciting and this is very close to my heart, because pretty much my entire professional career has been finding and helping fix security vulnerabilities of exactly this kind. It's a hugely important issue for corporations, but also for national security - the U.S., the U.K. and our allies. This issue is very much front-of-mind at the moment. And this technology may well offer significant benefits in addressing this problem.
It's great that DARPA has announced this challenge, that the president has announced this challenge, and with the kickoff sessions and the approaches, and with the partners they have on board, it seems like if anyone can make this successful, then DARPA can. So it's a very optimistic project.
Mathew Schwartz: Excellent. Well, one of the challenges for me with secure coding is it seems to be such an obvious imperative. And yet, we still see so much difficulty getting organizations to do it, getting people excited about it. I love what you said about how this can help somebody who is looking for these flaws more quickly and accurately focus their efforts, that there could be shortcuts there to help say, "Look, we've gotten rid of all the junk, you can focus on what actually needs to get focused on." That has got to be exciting for anybody who is doing this. It also sounds like this is a bit more of a bite-sized thing. So far, you've articulated some things that AI/ML can help with. It won't necessarily give us this automated all-in-one defensive system, correcting vulnerabilities in real time, that DARPA is maybe hoping it reaches as an end state. Which is a long way of my asking: Does all of this reinforce for you the need to remember to do DevSecOps? That the cleaner you can get your code in the first place, the easier the maintainability, the safer and more secure it's going to be in the long run?
Chris Anley: Oh, 100%. That's absolutely on point. The majority of code bases that we see have what you might call code hygiene issues that are relatively straightforward - a hundred issues that are straightforward to detect, but just fixing them costs money. So, for example, lots of code has credentials embedded within it, for various reasons to do with time, or ease, perhaps. Developers, in some cases, may not appreciate the security impact of having credentials in your code. It can be a very serious issue, of course, because code isn't handled in the same way that credentials are handled. The entire development team will have access to the code. With tech industry turnover at around 20% per year, it means a fifth of your company's walking out the door every year with administrative credentials to your systems.
Those are the sorts of issues that people don't necessarily consider when they're trying to solve a problem in an extremely short period of time and their team lead's telling them: you've got to get this feature out today. So that's how credentials end up in code: there isn't a straightforward process for the developer to address the issue.
So I should probably explain what the right way is. The right way is to use a credential vault specifically designed for that purpose, and for the organization to have provided a kind of golden path for the developer. So the developer doesn't need to seek authorization. There's plenty of examples in the code base that they're working with of exactly how to do this right. You go to this place, you call this API, the credentials are gathered at runtime, we end up with no credentials in the code; that's a win.
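The shape of that pattern, with the secret looked up at runtime rather than written into the source, can be sketched as below. This is a stand-in, not a vault integration: it reads from environment variables because that needs no extra software, whereas a real credential vault would be reached through its own client library. The names `get_credential`, `MissingCredentialError` and `DB_PASSWORD` are hypothetical.

```python
import os


class MissingCredentialError(RuntimeError):
    """Raised when a required secret has not been provisioned."""


def get_credential(name, environ=os.environ):
    """Fetch a secret at runtime instead of embedding it in the source.

    environ defaults to the process environment; a mapping can be passed
    in for testing. A real deployment would query a credential vault here.
    """
    value = environ.get(name)
    if not value:
        raise MissingCredentialError(f"credential {name!r} is not configured")
    return value


# Usage with an explicit mapping, so the example runs anywhere.
fake_environment = {"DB_PASSWORD": "example-only-value"}
password = get_credential("DB_PASSWORD", fake_environment)
print(len(password))
```

Failing loudly when the secret is missing is a deliberate choice in this sketch: it removes the temptation to fall back to a hardcoded default, which would put the credential straight back into the code.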
But like I say, the sort of economic reasons behind why these things happen - credentials in code is a great example - are a very hard problem to solve. It's partly education, but it's also partly top-down. We need to have executive buy-in in order for the issue to go away. We need to have education so that the developers understand what the issue is. And ideally, we need to have some kind of automated mechanism for detecting whether these issues are in the code. Because otherwise, we can't address the issue at scale. It's doing a manual code review over everything for all of these issues, and that simply won't scale.
So automated solutions are extremely important. Other kinds of code hygiene issues are things like out-of-date dependencies, and calling potentially vulnerable, potentially bad functions. Meaning functions that have been historically associated with vulnerabilities like unbounded string copy functions in C, or there's a whole litany of potentially dangerous functions. Concatenating strings into a SQL query, causing SQL injection, and then a data breach. All of these sorts of issues, the answers to how to fix them are relatively well understood.
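The SQL injection case mentioned above is easy to demonstrate with Python's built-in `sqlite3` module. The table, the data and the function names are invented for this example; the contrast between building the query by concatenation and passing the value as a bound parameter is the point.

```python
import sqlite3


def make_database():
    """Create a throwaway in-memory database with two example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn


def find_user_unsafe(conn, name):
    """Vulnerable: user input is concatenated into the SQL text."""
    query = "SELECT name, secret FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    """Safer: the value is passed as a bound parameter, never parsed as SQL."""
    query = "SELECT name, secret FROM users WHERE name = ?"
    return conn.execute(query, (name,)).fetchall()


conn = make_database()
payload = "x' OR '1'='1"  # classic injection string

print(len(find_user_unsafe(conn, payload)))  # every row leaks
print(len(find_user_safe(conn, payload)))    # no user has that literal name
```

The fix is well understood, as the passage above says. The difficulty is finding every concatenated query across a large code base, which is where automated detection earns its keep.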
But at scale, it becomes quite difficult for the organization to manage these issues, which is why automation is important, because that gives you very much a force multiplier, and that's why this DARPA challenge is so exciting, bringing real focus on the importance of this issue. So absolutely, DevSecOps is absolutely important. Code hygiene, in terms of security, is important. And these things, although they sound simple and they're relatively easily described, they can be very difficult to guarantee at scale in an organization. So that's kind of a challenge at the executive level, rather than necessarily individual dev teams.
Mathew Schwartz: So multiple challenges, some of them cultural, that will be perhaps needed to enable this DARPA AI Cyber Challenge to succeed. But it sounds like there's some optimism: we might get to some good places, even if we don't know exactly what they are yet.
Chris Anley: Well, no doubt, there's going to be a huge amount of really exciting research that's going to come out of this. The partners involved are being very, very free with access to their models, access to their technologies and their people. It's a really exciting effort squarely focused on a problem that's essential for the U.S. and U.K. national security and the security of our allies. So, absolutely, it's very, very important.
Mathew Schwartz: Well, Chris, thank you so much for giving us some insights into what we can do today, what we hope to do tomorrow when it comes to building more secure code, possibly with the use of generative technologies. So thank you so much.
Chris Anley: You're very welcome. Thank you for inviting me.
Mathew Schwartz: I was speaking with Chris Anley, chief scientist at NCC Group. I'm Mathew Schwartz with ISMG. Thank you for joining us.