TL;DR: Raink—a novel, general-purpose listwise document ranking algorithm using an LLM as the ranking model—can be used to solve non-trivial security problems.

A very simple explanation of how the Raink algorithm works (a minimal code sketch follows the list):

  • Split big list of items into small groups (e.g., 10 items per group)
  • Ask the LLM to rank/order each small group according to prompt relevance
  • Shuffle everything and repeat this process several times
  • Keep track of how each item performs across different groups
  • Focus more attention on the items that consistently rank highly
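To make this concrete, here's a minimal Python sketch of the core loop. The `llm_rank_batch` function is a hypothetical placeholder for whatever chat-completion call you prefer, and the batch size, pass count, and scoring scheme are illustrative defaults rather than raink's exact implementation:

```python
import random
from collections import defaultdict

def rank(items, query, llm_rank_batch, batch_size=10, passes=10):
    """Listwise ranking sketch: shuffle, batch, ask the LLM to order each
    batch, and accumulate per-item scores across passes."""
    scores = defaultdict(float)

    for _ in range(passes):
        shuffled = random.sample(items, len(items))  # fresh random order each pass
        batches = [shuffled[i:i + batch_size]
                   for i in range(0, len(shuffled), batch_size)]

        for batch in batches:
            # llm_rank_batch(query, batch) -> the same items, best first,
            # as judged by the LLM against the query/prompt.
            ordered = llm_rank_batch(query, batch)
            for position, item in enumerate(ordered):
                # Lower position = better; normalize so batches of different
                # sizes contribute comparably.
                scores[item] += position / len(ordered)

    # Smallest accumulated score = most consistently top-ranked.
    return sorted(items, key=lambda item: scores[item])
```

Raink then re-runs this process on the top slice of the results (the recursive refinement mentioned below), so the strongest candidates receive the most scrutiny.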

The approach is fast, cheap, and handles far more items than would fit in a typical LLM’s context window. Key features of the algorithm and its implementation:

  • Linear computational complexity. Maintains a breakthrough listwise O(N) complexity, compared to O(N log N) for quicksort and O(N^2) for pairwise comparison.
  • Statistical robustness. Uses a multi-pass Monte Carlo random shuffling approach to avoid sensitivity to initial input ordering and prevent a single incorrect judgement from throwing off the whole ranking process.
  • Dynamic optimization. Automatically sizes input batches to fit in a limited context window, based on estimated input token counts (a rough sketch of this idea follows the list).
  • Output token efficiency. Very conservative in the number of output tokens emitted per API call (e.g., only 90 output tokens per call for the default batch size of 10).
  • Recursive refinement strategy. Ensures the most important items receive the most scrutiny.
  • Fast performance. Aggressive concurrency for “embarrassingly parallel” API calls.
  • Error mitigation. Validates LLM response contents with missing item detection and retry capability.
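As an illustration of the dynamic batch sizing above (not raink's exact logic), a rough token estimate is enough to pick a batch size that keeps every call inside the model's context window. The 4-characters-per-token heuristic, context window, and overhead figures below are assumptions for the sketch:

```python
def estimate_tokens(text):
    # Crude heuristic: ~4 characters per token for English text.
    # Swap in a real tokenizer for accuracy.
    return max(1, len(text) // 4)

def pick_batch_size(items, context_window=128_000, prompt_overhead=500,
                    max_batch_size=10):
    """Shrink the batch size until even a batch of the largest items,
    plus the fixed prompt, is estimated to fit in the context window."""
    per_item = max(estimate_tokens(item) for item in items)
    batch_size = max_batch_size
    while batch_size > 1 and prompt_overhead + batch_size * per_item > context_window:
        batch_size -= 1
    return batch_size
```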

Application to security

Okay, the algo and tool might sound great—but what can we actually do with this technique?

N-day vulnerability identification

That is, take an obscure security advisory and reverse-engineer the root cause by examining the patch.

  • The problem: A large firmware patch of 1,651 changed functions in a stripped binary (so, no function names or symbols to help us deduce what’s changed).
  • The goal: Identify which function in that patch actually fixed a vulnerability.
  • The solution: Raink successfully identified the correct function in the patch as a top-ranked item (subsequently confirmed by a human expert security analyst).

Empirical results for an input dataset of 1,651 items to be ranked:

  • 3,280 LLM API calls (81.9% fewer than quicksort, 99.9% fewer than pairwise ranking)
  • 295,200 output tokens (only 17 cents for GPT-4o mini; see the arithmetic after this list)
  • 5 minutes total runtime
  • 30 cents total cost (accounting for input tokens)
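The output-token and cost figures follow directly from the per-call numbers above; here's the arithmetic, assuming GPT-4o mini's output price of roughly $0.60 per 1M tokens at the time:

```python
calls = 3_280
output_tokens_per_call = 90                    # per-call figure from the feature list above
output_tokens = calls * output_tokens_per_call
print(output_tokens)                           # 295,200

# Assumed GPT-4o mini output pricing: ~$0.60 per 1M tokens.
print(round(output_tokens / 1_000_000 * 0.60, 3))   # ~0.177, i.e. the ~17 cents quoted above
```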

Note that this scales linearly with dataset size, so cost and runtime grow only proportionally rather than exploding for even larger datasets. I’ve applied this analysis to multiple firmware vendor patches from Fortinet, Citrix, Check Point, SonicWall, and others. The beauty here is that:

  • The algorithm is general-purpose and can be applied to any list ranking problem.
  • The underlying model can be easily hot-swapped for any LLM (including local models).
    • Even non-SOTA models (e.g., GPT-4o mini or 14B Phi-4) can perform well if the problem is decomposed well enough.
  • There are no special requirements for:
    • The LLM (no fine-tuning or special training).
    • The shape of input data (no special formatting or pre-processing).

Fuzzing native code

There’s been some past work on using LLMs to write fuzzing harnesses (which I’ve also had success with), but LLMs can also be applied to other steps in the overall fuzzing chain:

  • Ranking recently updated and 100+ starred C++ repos for fuzzing consideration
  • Referencing build documentation from a GitHub repo and automating the build/instrumentation process
  • Ranking functions from source code and build artifacts (e.g., shared library exports) to identify suitable fuzzing targets (a sketch of this step follows the list)
  • Writing a harness for the top-ranked fuzzing targets
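As a concrete, hypothetical example of the target-identification step, the exported symbols of a built shared library can be dumped with `nm` and treated as one candidate "document" per function. The library path and ranking query here are made up for illustration:

```python
import subprocess

def exported_functions(shared_lib):
    """List exported function symbols from a shared library using `nm`.
    Symbol type 'T'/'t' marks symbols defined in the text (code) section."""
    out = subprocess.run(
        ["nm", "-D", "--defined-only", shared_lib],
        capture_output=True, text=True, check=True,
    ).stdout
    funcs = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("T", "t"):
            funcs.append(parts[2])
    return funcs

# Each export becomes one "document"; the ranking query might be something like:
# "Which of these exported functions most likely parses attacker-controlled input?"
for name in exported_functions("./libexample.so"):
    print(name)
```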

I’ve been using Raink for fuzzing target identification, with the aim of completely automating the end-to-end testing of unfuzzed open-source projects—all the way from initial repo identification to crash triage and root-cause analysis.

Using this technique, I’ve recently found a heap overflow in the MIDI-parsing logic of open-source music software.

Web application pentesting

We can extend the same decomposition-and-ranking process to web applications. Consider that each of a webapp’s potential injection points (e.g., cookies, headers, GET query string parameters, POST body data) can be treated as an individual document. While looking for SQL injection vulnerabilities in a live web application, we can provide the names, observed values, and web page context of these injection points and ask an LLM, “Which of these injection points may be most likely to be stored in a database on the backend?” I’m currently applying this technique to the labs on PortSwigger’s Web Security Academy.
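Here's a minimal sketch of that decomposition, with hypothetical helper and parameter names: each query parameter, cookie, header, and body parameter observed during a crawl becomes one document, and the query asks which is most likely to reach the database.

```python
from urllib.parse import parse_qsl, urlsplit

def injection_point_documents(url, cookies=None, headers=None, body_params=None):
    """Flatten a single observed request into one 'document' per potential
    injection point, ready for listwise ranking."""
    docs = []
    for name, value in parse_qsl(urlsplit(url).query):
        docs.append(f"GET query parameter {name}={value} at {url}")
    for name, value in (cookies or {}).items():
        docs.append(f"cookie {name}={value} at {url}")
    for name, value in (headers or {}).items():
        docs.append(f"header {name}: {value} at {url}")
    for name, value in (body_params or {}).items():
        docs.append(f"POST body parameter {name}={value} at {url}")
    return docs

query = ("Which of these injection points is most likely to be stored in, "
         "or used to query, a database on the backend?")
docs = injection_point_documents(
    "https://example.com/search?category=gifts&sort=price",
    cookies={"session": "abc123", "tracking-id": "xyz"},
)
```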

In my preliminary testing of those labs, it takes only 1 min 14 sec to identify SQL injection by:

  • Crawling a webapp
  • Extracting and ranking injection points
  • Passing the top-ranked injection points to sqlmap for high-signal vulnerability discovery (sketched below)
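The last step can be as simple as shelling out to sqlmap for the handful of top-ranked candidates. This sketch assumes the ranking step produced a best-first list of (url, parameter) pairs and uses only standard sqlmap flags:

```python
import subprocess

def test_with_sqlmap(ranked_points, top_n=3):
    """Run sqlmap non-interactively against the highest-ranked injection points."""
    for url, param in ranked_points[:top_n]:
        subprocess.run(
            ["sqlmap", "-u", url, "-p", param, "--batch", "--level", "2"],
            check=False,  # sqlmap's exit code isn't a clean pass/fail signal here
        )

# e.g., with the top-ranked point from the previous step:
# test_with_sqlmap([("https://example.com/search?category=gifts", "category")])
```

Limiting sqlmap to a few high-signal parameters, rather than spraying every parameter on the site, is what keeps the end-to-end run fast.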

I’ll soon follow up with a visual demonstration of this technique in action.

SOC incident case triage

The problem of analyzing SOC cases maps very cleanly to document ranking. Many products (e.g., SOAR) advertise “AI-assisted triage,” which generally just amounts to deterministic rules for sorting quantitative case attributes. Instead of looking for the absolute highest-scored case (a quantitative/pointwise problem using estimated severity scores, etc.), we need only look for the relative top-ranked case (a qualitative/listwise problem comparing case contexts). I haven’t directly tested this defensive application of Raink, but I expect it to be trivial to implement successfully—especially since this use case has no “right answer” but rather is meant to guide allocation of manual (or automated) effort downstream. SOC cases naturally provide a lot of context, which can be overwhelming for a human but instrumental for an LLM.

Network traffic capture analysis

The “documents” would be individual packets, or TCP conversations, etc. The “query” would be, “Which of these network streams looks anomalous (or malicious) compared to the others?”

Attack surface management

Which of these internet-facing web applications deserves a manual pentest more than all the others (based on its age, complexity, apparent business function, etc.)? This is a huge problem that I’ve personally worked on for years; it’s very similar to the SOC use case in terms of shifting focus from quantitative attributes to qualitative context.

Source code review

Which of these function call chains from the app’s entry point seems to invoke sensitive functionality that would warrant deeper security testing? Lots of overlap with N-day analysis.

The list goes on

The critical step in all of these examples is reducing a complex, domain-specific security problem to a general document ranking problem. A creative security engineer will be able to add many more use cases for Raink beyond the limited scenarios I’ve described here.

Shortcomings and future work

A few pain points:

  • Context window size. When individual input items are very large, even small batches may not fit into the context window. I’ve mitigated this by ranking context-rich summaries of input items (instead of the raw items themselves), but this is still a limitation that requires extra work.
  • Verification. When using an LLM to build a target C++ project, did the build succeed? What are our criteria for detecting that? When ranking candidate changed functions in a firmware patch, how can we verify that the top result is actually the right one? This currently still requires some human analysis, but I think LLMs can be applied to this stage as well. For example, we can run iterative deep analysis across the top N results from the ranking algorithm and then additionally rank those analyses (sketched after this list); that doesn’t solve the verification problem, but it does get us a step closer.
  • Insertion efficiency. I suspect that we can achieve linear insertion performance into an already ranked list, but this still needs exploration. Nailing it down will be critical for “continuous security” scenarios like attack surface management, where new assets must be inserted into an existing ranked list.
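For the verification point, a rough sketch of the “analyze the top N, then rank the analyses” idea might look like the following; `llm_analyze` and `llm_rank_batch` are hypothetical placeholders:

```python
def verify_top_candidates(ranked_items, query, llm_analyze, llm_rank_batch, top_n=10):
    """Deep-dive each of the top N ranked items individually, then rank the
    resulting analyses to surface the most convincing one for human review."""
    analyses = [llm_analyze(query, item) for item in ranked_items[:top_n]]
    # e.g., "Which analysis most convincingly identifies the patched vulnerability?"
    return llm_rank_batch(query, analyses)
```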

I think Raink—as an algorithm but also in its existing implementation—has enormous potential to cause a “leapfrog” effect by transforming the way we use LLMs to solve problems efficiently (including domains beyond security). Applying Raink to external, black-box web application testing is likely to generate the most traction in this space, since it’s a problem shared by every company with an internet connection. I expect that recording statistics like time spent, network requests sent, LLM API calls executed, and vulnerabilities discovered will help demonstrate how impactful a simple ranking algorithm can be.