Many security research bottlenecks aren’t simply a matter of limited bug-detecting capability—they’re about deciding which attack surface to examine in the face of constrained time and attention. When staring at hundreds of changed functions in a patch diff, or trying to prioritize among thousands of Linux kernel subsystems to audit, manual intuitive exploration doesn’t scale.

What follows is a recording and transcript for my talk at the inaugural Offensive AI Con, the first conference dedicated to offensive AI cybersecurity (San Diego, 5-8 Oct ‘25), in which I present a method for transforming hard security prioritization problems into listwise ranking problems that LLMs can solve efficiently and cheaply. Instead of trying to score every item independently (which tends to inflate) or compare every pair of items (which explodes combinatorially), we use a listwise document ranking algorithm to find the most relevant items in a data set through iterative sampling and convergence detection.

As part of this research, I developed two open-source tools that you can use to automate target selection, prioritize vulnerability candidates, or make any “which option?” decision at scale:

  • Slice: Finds vulnerabilities through build-free static analysis by strategically arranging context for LLMs to examine. Successfully reproduced discovery of a use-after-free in the Linux kernel SMB server for ~$3 per run.
  • Raink: Ranks arbitrary data sets using LLMs with O(N) complexity. Can prioritize ~2,700 GitHub repos, ~3,000 kernel subsystems, or ~1,500 patch diff functions and consistently surface the most security-relevant items.

Intro

What’s up, Oceanside? Looking at this title¹ now, I can see it’s quite a mouthful; I think I went with “too long to be wrong.” If you want the shorter version, we’re talking about scaling vulnerability research with large language models.

A bit of an intro about me: I’m Caleb Gross. I got started doing offensive cyber ops for the US government; now I do offensive security and application security in the private sector. I’m super interested in using AI for initial access vulnerabilities. You can find me at the handle noperator most places online.

An ambitious goal

Let’s go back to the basics and consider a very simple goal: What if we could find and fix all the vulnerabilities in open source software? This is a pretty ambitious goal, admittedly, but we should be ambitious. There’s a reason why we all showed up at a conference that sounds like it came straight out of a sci-fi novel, right? I think it’s because we all have this belief at some level that we are standing at an inflection point for solving classically difficult problems in the security space by leveraging AI effectively. It may be time to revisit long-standing assumptions about what is and is not possible when it comes to solving security in a big way.

If we were to develop a very simple pipeline for how this could work, it would look, first of all, like locating your targets. Which repos do we need to run static analysis on? Dissect the code base, chunk it up into small enough pieces that an AI can perhaps help find bugs, do some triage, and then verify reports and fix them.

I’ve been spending the better part of the past year developing tools that can easily fit into a vulnerability researcher’s workflow to help use AI to assist with some of these problems. I published a blog post a month or two ago on this tool called Slice. It’s an open source tool (Aaron very kindly referenced it in his talk yesterday, so thank you). The core idea here was: Can we try to test this assumption that I think a lot of us have, which is if you arrange the right minimal pieces of context in front of a large language model—the same context that would cause me or you to say, yeah, there’s clearly a vulnerability here—then maybe an AI would come to the same conclusion.

That’s the core premise of Slice. It can very reliably reproduce discovery of a use-after-free in the Linux kernel. This is the same bug that Sean Heelan blogged about back in May in the Linux kernel SMB server implementation. It’s pretty fast and cheap—like three bucks per run, and it runs in just a couple of minutes. While I think this is very exciting and interesting, it’s also not terribly surprising. I have had the feeling for a while that if we just get the context right and if the model is sufficiently capable, then it should be able to find complex vulnerabilities like this without doing dynamic analysis. (Can it just read the code?)

I also feel like we have sufficiently covered this idea of using LLMs to find bugs here at Offensive AI Con. I feel like we’ve substantiated the claim pretty thoroughly through some really excellent presentations this morning and yesterday. What I want to focus on is, if we keep this problem in mind of finding all the bugs in all the repos, we should recognize we are resource constrained. Our brains are only so big. We only have so many GPUs. There are only so many hours in the day. Even I am time constrained—I have 20 minutes left in this talk!

A more reasonable goal

We should consider maybe the more reasonable goal, a reasonable compromise: Can we just find the highest-impact vulnerabilities? Interestingly, that’s both a bit easier, but also quite a bit harder. Easier because the set of impactful vulnerabilities is going to be at most a subset of all of the vulnerabilities, so presumably we should be doing less work if we’re just finding the high-impact ones. But it’s more difficult because—what is “impactful”? How do you quantify that? There have certainly been attempts at assigning a number or quantity to this idea of “impactful.” We use CVSS scores as one of the main ways we do that—but it’s not trivial.

What I think is worth spending our time on here: let’s talk specifically about target selection, and also about which vulnerabilities we might choose to verify at the end of the day if we have a candidate list of a thousand vulnerabilities that fall out of our Slice tool (or whatever else we’re using). The core idea here that we need to latch onto is that this is a matter of resource constraint, where we have to prioritize and optimize for impact while recognizing we just don’t have the resources to do everything. This is a problem that we experience in a lot of ways, and we’re going to talk about that more.

What would be really useful here is to have a tool (like Slice, a tool for finding bugs) that could just intake a list of things and use an AI to just find “the best thing in the list of things.” That is very intentionally a very hazy, fuzzy problem—but I think it’s important because it allows us to be very flexible. Ideally, since we’re talking about big scale, it should also operate very efficiently, even at huge input. So those are two qualities we’re going to revisit throughout this talk.

Whichcraft and wizardry

We just had lunch, so let’s wake up a bit and talk about a fun topic: Magic. “Whichcraft” and wizardry. We have a “which” problem in the security industry—have you noticed? It’s not this kind of witch, but this kind, which is that you have multiple options in front of you and you need to decide which option to choose. Often in really difficult security problems, this is not just choosing between two options, but rather between many. There’s overwhelming cognitive overload when trying to decide which target to select or which path to take when doing vulnerability research.

We should consider some notable statements from some of the talks we’ve heard so far yesterday and today (I was scribbling down notes as I was listening to various talks): What do we pay attention to? How do we find signal in the noise? I think Kyle mentioned—I thought it was super interesting—that we’ve done a lot of work on scanning for sensitive information in files, but not on selecting which files you should actually pay attention to. Smart targeting inside of an internal network assessment. Trying to identify high-risk targets for fuzzing. Voting on parallel thoughts, as we heard today. And Ruikai and Olivier got two quotes because “in the interest of our time and sanity”—I thought it was a really useful way to represent how it feels to try to decide which of these vulnerabilities is actually worth deep diving on and validating and reporting. The OODA loop.

There is a common theme between all of these problems expressed across a lot of the domains that we’ve heard about, and it is: trying to decide which thing to do when there’s an overwhelming number of options, and it generally requires a person to figure it out. This is common to a lot of roles in the offense (but also defense) industry. I think the role of a SOC analyst is super interesting here, given that there’s an overwhelming number of alerts and the role of an analyst is to decide—leveraging both their SOAR system and their own intuition—which one do I investigate? This is very difficult to do.

It’s very difficult to automate answers to these questions—so we rely on wizards. Rob Joyce today talked about those exploit developers that just have the “magic glow.” This is your seasoned expert that just has the right instinct about which items to gravitate toward when there are a ton of options and it’s not clear or deterministic which option you should choose.

I personally encountered this problem a lot when patch diffing for N-day vulnerabilities. If you’ve ever used BinDiff to examine one patch versus another, you’ll arrive at this screen, which shows you all of the functions that changed in the patch. There are some similarity scores you can look at, you can see how much the function changed, you can kind of glance at the symbols—but it’s really up to you to just follow the “code smell.” That can be effective, but at a really large scale, I would ask: Do we feel like this is really working for us? SSL VPN appliances are constantly exploited in the wild. I would say that relying solely on a human’s intuition is useful—it’s certainly a very fun thing to do, and it’s personally gratifying—but to solve problems at the degree of scale that we need to, I think it is insufficient.

Notably, this is not a matter of what. We know what to do. We know how to fix a format string vulnerability. We know how to fix a command injection. It’s not a matter of not knowing what to do—it is a matter of not knowing which to do, given that we are constrained in our resources and we have to prioritize. That’s a super difficult thing to do. You should recognize I’m repeating that theme because it is at the core, I think, of a lot of really difficult security problems, especially when you try to scale.

Pitfalls in common approaches

When it comes to finding signal in the noise, let’s consider the easiest way to do this:

If you have a thousand vulnerabilities, just try to quantify what you care about. Score each one and then sort the vulnerability scores (i.e., in a pointwise manner). That’s a pretty naive approach—it works on some level, but there’s also a tendency to inflate toward the upper end of the scale. We see this a lot in the CVSS data set where, so far this year, there are over 200 CVEs all with a CVSS score of 10 (the max score). That kind of breaks down when you still have to figure out, “Okay, well, among all the top-scoring vulnerabilities, which one should I focus on?”

You could instead consider: What if we do a pairwise A/B comparison of every vulnerability against every other vulnerability? Large language models are pretty good at this kind of thing. They can take like data and unlike data and give you an indication of which one is more impactful (or worse). They’re pretty good at hotdog-not-hotdog (or cancer-not-cancer, if we want to get morbid like we did yesterday). But this breaks down at large scale: comparing every pair grows quadratically with the number of items, so the calls to an LLM just explode, and even if you plug an LLM comparator into a traditional sorting algorithm, an inconsistent decision early on in the sorting process can really throw off results at the end. So this maybe is a matter of just finding a better harness.
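
To make that scaling concrete, here’s a rough back-of-the-envelope comparison of LLM call counts for the three approaches. This is a sketch, not a benchmark, and the batch size and pass count for the listwise case are illustrative assumptions rather than measurements from any particular tool.

```python
# Rough comparison of LLM call counts for scoring N items pointwise,
# comparing every pair, and ranking in small listwise batches.
# Batch size and pass count are illustrative assumptions, not measurements.

def pointwise_calls(n: int) -> int:
    return n  # one scoring call per item

def pairwise_calls(n: int) -> int:
    return n * (n - 1) // 2  # one comparison call per unordered pair

def listwise_calls(n: int, batch_size: int = 10, passes: int = 5) -> int:
    return passes * -(-n // batch_size)  # ceil(n / batch_size) batch calls per shuffled pass

for n in (100, 1_500, 10_000):
    print(f"N={n:>6}: pointwise={pointwise_calls(n):>9,} "
          f"pairwise={pairwise_calls(n):>11,} listwise≈{listwise_calls(n):>6,}")
```

At ~1,500 items (the size of the patch diff we’ll look at in a moment), exhaustive pairwise comparison is already over a million calls, while a handful of batched listwise passes stays in the hundreds.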

A few claims

I want to make a few claims. These are the main things that I really want you to walk away with from this talk:

  1. Firstly, this repeated theme that a lot of our classically difficult problems are probably just prioritization problems in the face of severe resource constraints—and generally we solve those problems just by putting human eyeballs on the problem.
  2. But the second, more important claim is that if we can transform these problems into listwise ranking problems, then we can use LLMs to solve them very consistently, quickly, cheaply, and to great effect.

Listwise ranking with LLMs

To help illustrate this, I’m going to demonstrate an open-source tool called Raink (appropriately named). A lot of my tools have short, imperative names—I kind of follow that pattern. We’re looking at a patch diff in SonicOS. This is the operating system that underlies SonicWall firewalls. We have an advisory which says there’s an auth bypass. It’s very vague: “An auth bypass in this SSL VPN, something something, related to Base64-encoded session cookies and not doing it right.” The patch is quite enormous—it’s like ~1,500 changed functions. When you’re patch diffing, what you’re hoping for is fewer than ~10 changed functions, because you can keep all that context in your head and make a pretty quick intuitive decision about which function to pursue. With 1,500, that’s hours, maybe days of hunting through a patch.

So instead, I first used Binary Ninja to extract decompiled code for each of the functions. Given that this is a stripped binary (no symbols), I did a quick summary of each one, and then ranked the resulting functions according to how relevant they seemed to be to this advisory.

It works very well. The item highlighted in red is the correct function. The right-hand mini-map (it’s like your Sublime Text code map) shows you the shape of the whole data set. You can kind of see that as we sample the data set over and over, an inflection point naturally emerges, which gives us a point to distinguish the very relevant items from the ones that aren’t.

Imagine turning the graph on its side: on the left-hand side (or at the top, depending on orientation), you have the relevant items, where there’s a clear spike in the relevance scores. Honestly, this feels like magic to me even though I’ve been doing this for the better part of a year. I just ran this tool over and over just to watch it work, and I’d hear in the background, “ping, ping”—emails from OpenAI saying, “Your credit balance has been topped off.” It’s super cool, and it’s a CLI tool, so you can just pipe it into your normal Unix pipeline. That’s the way I like to work, and I assume a lot of others do too.

How it works

This uses document ranking. It’s just like doing a Google search for the best pizza in San Diego, where your documents are all the restaurants or web pages or whatever. In our case, the advisory is the search query, the changed functions in the patch are the documents, and the most relevant ones go to the top. We sample the data over and over. We find the items that most often rise to the top when they’re relatively ranked in a small batch, and we collect all of those top-performing items. We throw away the ones that aren’t relevant, double down on the ones that clearly have some signal, recurse and rank those, and keep closing the gap until we’re left with the highest-ranking items.
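
This isn’t Raink’s actual implementation, but here’s a minimal sketch of that loop in Python, assuming a rank_batch function that wraps a single listwise LLM call (a hypothetical version of it is sketched later, in the section on computational complexity). The batch size, pass count, and halving threshold are illustrative parameters, not Raink’s defaults.

```python
import random

def rank_batch(query: str, batch: list[str]) -> list[str]:
    """Placeholder for one listwise LLM call: return `batch` reordered from
    most to least relevant to `query` (a hypothetical version appears later)."""
    raise NotImplementedError

def listwise_rank(query: str, items: list[str],
                  batch_size: int = 10, passes: int = 5) -> list[str]:
    """Shuffle, rank in small batches, tally which items consistently rise to
    the top, keep the strongest half, and recurse until few items remain."""
    pool = list(items)
    while len(pool) > batch_size:
        scores = {item: 0.0 for item in pool}
        for _ in range(passes):
            random.shuffle(pool)
            for i in range(0, len(pool), batch_size):
                batch = pool[i:i + batch_size]
                for rank, item in enumerate(rank_batch(query, batch)):
                    scores[item] += len(batch) - rank  # nearer the top, more points
        pool.sort(key=scores.get, reverse=True)
        pool = pool[: max(batch_size, len(pool) // 2)]  # toss the bottom, double down on the rest
    return rank_batch(query, pool)  # final ordering of the survivors
```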

This is clearly illustrated by choosing a domain name. I saw someone on Twitter who said, “I want a math-y domain name.” That’s a weird problem to try to solve. It’s very fuzzy. Like, what makes a domain name “math-y”? I don’t know—it’s hard to objectively score, but you can ask, relatively, which domain name is “math-ier” than all the others?

Let’s take a small data set of 25 TLDs, split them up into five groups of five. Make five calls to an LLM, just ranking the small batches. We’ll see that .xxx goes to the top (“math-y”, apparently!) along with .simp and .xyz. If we run a few more trials (shuffle the whole data set, re-sample it over and over, see which items come to the top), we see that .simp and .xyz consistently, no matter which sample they landed in, go to the top. The data is beginning to speak to us through the LLM. Those are the items that are most relevant. If we were to graph this, we would see that .simp and .xyz are there on the top. This is the same shape of graph that I was showing in the tool demo a moment ago.

If we run a few more trials, we can see the shape of the data in higher resolution: there’s not just a linear relevance curve, but a clear inflection point where the relevant items are clearly distinguished from the ones that are not. The shape gets even clearer if we look at the rate of change (think of a second derivative, or acceleration). Near the top, the score increases much more rapidly, which gives us a clear cutoff point to say, “Okay, probably everything under this line, we can toss it. Everything on top, we probably want to double down on, recurse, and start ranking more aggressively.”
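
Here’s one simple way you might detect that cutoff programmatically, assuming you’ve already aggregated per-item relevance scores across trials. This is a sketch of the idea, not Raink’s actual convergence logic: sort the scores and cut where the drop between consecutive items is steepest.

```python
def find_cutoff(scores: list[float]) -> int:
    """Return how many top items to keep: sort the scores descending and cut
    at the steepest drop between consecutive items (where the curve 'bends')."""
    ordered = sorted(scores, reverse=True)
    drops = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    return drops.index(max(drops)) + 1  # keep everything above the biggest drop

# Toy example: a few clearly relevant items followed by a long, flat tail.
scores = [9.8, 9.1, 8.7, 3.2, 3.1, 3.0, 2.9, 2.8]
print(find_cutoff(scores))  # -> 3
```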

We can take the whole TLD data set (roughly 600 TLDs) and throw it into this ranking tool. This is operating in real time. It’s very fast, super token-efficient, and probably costs like a dime to run. We see that .academy, .university, and .institute go to the top. If you don’t want that “academic” leaning, you can just refine your prompt and say, “I want the harder math-y stuff.” It might give you .int (very interestingly) or .plus.

Computational complexity

We should address the namesake of this talk. The computational complexity here allows us to really tackle the scale problem. If we’re using a traditional sorting algorithm like merge sort or quick sort, we’re making a very expensive guarantee about the data set that we don’t need to make for a lot of security problems; we don’t need to guarantee the entire ordering of the whole data set, but rather we just need to find the top items. If we recognize that, then we can use some pretty useful tricks—like cutting the data set in half every time, or even converging early and not running more trials if we see that the position of that inflection point stabilizes.
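
To see why halving keeps things cheap, here’s a small sketch (again with assumed batch size and pass counts) that adds up the batched LLM calls as the pool shrinks each round. The total stays on the order of N because N/10 + N/20 + N/40 + … is a geometric series.

```python
def total_calls(n: int, batch_size: int = 10, passes: int = 5) -> int:
    """Total batched LLM calls if each round ranks the pool `passes` times in
    batches of `batch_size`, then keeps only the top half for the next round."""
    calls = 0
    while n > batch_size:
        calls += passes * -(-n // batch_size)  # ceil(n / batch_size) batches per pass
        n //= 2                                # keep the top half, toss the rest
    return calls + 1                           # one last call to order the survivors

for n in (1_500, 3_000, 10_000):
    print(f"{n:>6} items -> {total_calls(n):>5} LLM calls")
```

Doubling N roughly doubles the number of calls, which is the linear scaling in the talk’s title.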

This is really powerful because there is no fine-tuning here, there are no domain-specific models, there’s none of that. It’s just an off-the-shelf open-weight or frontier model. Just hit the API. The key point here is that the complexity of the entire problem is localized in that one small call to an LLM for 10 items at a time. That is, if it can rank your data 10 items at a time, then it can rank 10,000 items.
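
For illustration, here’s what that one small call might look like against the OpenAI chat API. The prompt wording, output format, and model choice are my own assumptions for the sketch, not Raink’s actual prompt, and a real harness would validate and retry on malformed responses.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rank_batch(query: str, batch: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """One listwise call: ask the model to order a small batch of items by
    relevance to the query, then map its answer back to the original items.
    Prompt wording and output parsing here are illustrative assumptions."""
    numbered = "\n".join(f"{i + 1}. {item}" for i, item in enumerate(batch))
    prompt = (
        f"Rank the following items from most to least relevant to: {query}\n\n"
        f"{numbered}\n\n"
        "Respond with the item numbers only, most relevant first, comma-separated."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    order = [int(tok) - 1 for tok in resp.choices[0].message.content.split(",")]
    return [batch[i] for i in order if 0 <= i < len(batch)]
```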

We can also compare unlike items, so a SOC analyst can rank their alerts and get a meaningful answer to the question, “Which one is more important? This weird phishing email that I got, or a brute-force attempt on my public login form?” You could also feasibly employ this as an explicit decision-making framework in an offensive workflow. Consider that for smart targeting and operating inside of an internal network, it’s a matter of constantly taking all of your available context and deciding, “Would it make sense to go deep, or keep looking broadly? Which target would bring me toward my goal?” Given that this is a large language model, it can just explain itself, which is a very useful property of using large language models for this purpose!

Case studies

Let’s look at a few case studies. If we go back to the original problem—we want to find all the vulnerabilities (or at least, the impactful ones)—then our first job is to try to identify which target we want to look at. In this case, I took the ~2,700 top-starred C, C++, and Python repos on GitHub and asked, “Which of these are most widely used? Which are doing security-sensitive or safety-critical stuff? Which parse untrusted inputs? What are the kinds of targets I feel like I would want to consider if I were intuitively navigating this data set—and can an LLM help me rank them?”

When we look closely at the results, unsurprisingly the Linux kernel goes to the top. Redis is up there, too; there was just a CVSS-score-10 RCE that dropped in Redis last night! I was very pleased that openpilot landed in this set—that’s an open source autonomous driving framework. Chromium is in there as well.

Let’s take the top item (the Linux kernel) and ask, “Within this project, what are the subsystems that we would actually want to target for vulnerability research?” If we look at the MAINTAINERS file, we’re looking at ~3,000 distinct subsystems in the Linux kernel, where we can ask roughly the same question: “Which of these is most likely to allow something nasty if I were able to exploit it? Is there something that crosses trust boundaries, or reads data off the network?” We see KVM, NFS, KSMBD (which has been the focus of a lot of research over the past year), WiFi drivers—that’s the kind of stuff we would want to see.

I want to be clear: The point I’m making here is not that an LLM can do this and you never could. The idea is just that it’s super expensive and not feasible to keep humans at the gate of making prioritization decisions like these if we want to scale really aggressively. What we would want to see, naively, is: Are we at least getting the kind of results that I feel like I would want to see if I were the one ranking this data set? In a lot of cases, almost universally—yes! And if I don’t see exactly what I want, I just tweak the prompt until the data starts to take the shape that I would expect it to.

If we go back to that example of CVSS scores, there are over 200 vulnerabilities that are all scored at 10. If we run this data set through the ranking tool, we see that the top items are things like “RCE in Cisco appliances.” You can see item #11 is that Redis use-after-free RCE that I just mentioned. The idea here is that instead of just manually browsing a list (maybe you can overlay other data points like EPSS scores, but you’re still solving the same kind of problem—trying to quantify something that really wants to inflate toward the upper end of whatever scale you give it), and instead of seeing a pre-auth command injection on an IP camera (which is a lot of the stuff you will see if you look at the top CVSS scores), you see items that are the kind of things I would really be concerned about.

Closing thoughts

A couple of closing thoughts. What I would really like to communicate here is: Please examine the shape of the hard problems that you’re trying to solve, and ask, “Might this be feasibly solved with an AI if I were just to approach it from a different angle? Or transform it into a kind of problem that we already know how to solve, and that a large language model can do at scale quite effectively?” For a while I have felt that I should just wait for models to get better rather than improving my scaffold or harness; I’m starting to feel instead that a lot of models, even open-weight models, are already quite capable of handling the kinds of problems we need to solve. There’s a lot of super impactful work that we could do right now if we just recognize the kind of problem we have and try to solve it appropriately.

The things we have that work are build-free analysis at scale using Slice (you don’t have to compile the code or anything; it just reads the code and works pretty well!) and ranking a data set with convergence detection to operate really efficiently. I still want to apply more focus to dynamic verification of those results. Also, the cURL project has unfortunately been a bit of a sacrificial guinea pig for a lot of AI-assisted disclosure, but I’ve seen some encouraging signs just in the past few days that Daniel Stenberg has gotten better results—he is somehow still patiently reading through and saying, “Yeah, you know what, AI actually helped here.” So thank you, Joshua [Rogers]; that was a really cool milestone moment.

You don’t have to take my word for any of this. Both tools, Raink and Slice, are open source. I’m super excited about where this is going. I can tangibly feel that we’re at an inflection point in the industry. Thanks so much for having me.


  1. O(N) The Money: Scaling LLM-Based Vulnerability Research via Static Analysis and Document Ranking