OpenAI’s GPT-4 model is consistently better and faster than human experts at exploiting cybersecurity vulnerabilities. According to the academics behind the finding, this skill is a recent addition to AI models’ wheelhouse, and it will only improve over time.
Researchers at the University of Illinois Urbana-Champaign (UIUC) recently published a paper on this use case, pitting several large language models (LLMs) against one another at exploiting security vulnerabilities. Given a vulnerability description from CVE, a public database of common security flaws, GPT-4 successfully attacked 87% of the vulnerabilities tested. Every other language model tested (GPT-3.5, OpenHermes-2.5-Mistral-7B, Llama-2 Chat (70B), and others), along with dedicated vulnerability scanners, failed to exploit even a single one of the provided vulnerabilities.
The LLMs were given “one-day” vulnerabilities to test (so called because they are dangerous enough to need patching the day after they are discovered). Cybersecurity experts and exploit hunters have built entire careers around finding (and fixing) one-day vulnerabilities. So-called white-hat hackers are hired by companies as penetration testers to outwit the malicious actors hunting for vulnerabilities to exploit.
Fortunately for humanity, GPT-4 was only able to attack vulnerabilities it had already been told about (those described in CVE entries). Without the CVE description, its success rate at identifying and then exploiting bugs dropped to just 7%. In other words, anyone who can craft a good ChatGPT prompt doesn’t hold the key to the hacking apocalypse (yet). That said, GPT-4 remains uniquely concerning: it not only has a theoretical understanding of a vulnerability, but can also autonomously carry out the steps of an exploit through an automation framework.
And unfortunately for humanity, GPT-4 is already ahead of us in the exploitation race. Assuming cybersecurity professionals are paid $50 an hour, the paper states that “using an LLM agent [to exploit security vulnerabilities] is already 2.8 times cheaper than human labor. LLM agents are also trivially scalable, in contrast to human labor.” The paper further estimates that future LLMs, such as the upcoming GPT-5, will be even stronger at these capabilities, and perhaps at vulnerability discovery as well. That is sobering, considering that vulnerabilities like Spectre and Meltdown still loom large in the technology industry’s memory.
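The cost comparison above can be sanity-checked with some back-of-the-envelope arithmetic. A minimal sketch: the $50/hour wage and the 2.8× ratio come from the article; the implied hourly-equivalent cost of the LLM agent is simply derived from those two figures, not a number reported by the paper.

```python
# Back-of-the-envelope check of the article's cost claim.
# Inputs from the article; the derived figure is illustrative only.

HUMAN_HOURLY_RATE = 50.0  # USD/hour, assumed pay for a cybersecurity professional
COST_RATIO = 2.8          # paper's estimate: LLM agent is 2.8x cheaper than human labor

# Implied cost of the LLM agent per hour-equivalent of human work
llm_hourly_equivalent = HUMAN_HOURLY_RATE / COST_RATIO

print(f"Human labor:      ${HUMAN_HOURLY_RATE:.2f}/hour")
print(f"LLM agent (impl.): ${llm_hourly_equivalent:.2f}/hour-equivalent")
```

Under these assumptions the agent works out to roughly $17.86 per hour-equivalent, and unlike a human contractor, many such agents can be run in parallel.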
As AI adoption continues, the world will keep changing irrevocably. OpenAI specifically asked the paper’s authors not to publish the prompts used in the experiment; the authors agreed and said they would share the prompts “only upon request.”
AI queries also put considerable strain on the environment, so be careful about trying to replicate this (or anything else) in ChatGPT yourself: a single ChatGPT request consumes nearly ten times as much energy as a Google search. If you are mindful of that energy gap and would rather run an LLM yourself, here is how one enthusiast ran an AI chatbot on a NAS.