Since it was first released to the public late last year, ChatGPT has successfully captured the attention of many. OpenAI’s large language model chatbot is intriguing for a variety of reasons, not the least of which is the manner in which it responds to human users. ChatGPT’s language usage resembles that of an experienced professional. But while its responses are delivered with unshakeable confidence, its content is not always as impressive.
Before proceeding to the research results, it is important to understand that artificial intelligence systems encompass a variety of different techniques and technologies to make decisions and predictions. Without a full understanding of the technologies behind OpenAI’s chatbot, it is impossible to make a truly accurate assessment about the scope of its capabilities. In lieu of that, the best that we can do is treat it as a black box and assess its responses to various prompts.
For this blog, we tested ChatGPT’s ability to perform basic static code analysis on some vulnerable code snippets. At first glance, the responses it delivered were astounding, but as with any good research, it was necessary to scratch beneath the surface to see their true value.
Buffer Overflow in C
The first piece of code tested was a simple buffer overflow example. ChatGPT quickly determined that it contained a vulnerability because the length of the printed string exceeded the size of the fixed-length buffer. When asked to categorize the issue, it assigned a CVSSv2 vector of AV:L/AC:L/Au:N/C:P/I:P/A:P and labeled it as CWE-119: Improper Restriction of Operations within the Bounds of a Memory Buffer, while recognizing the limitations of assigning these labels without further context. When asked how to make the code more secure, ChatGPT suggested increasing the size of the buffer, using a more secure function (snprintf in place of sprintf), or dynamically allocating memory for the buffer based on the length of the string. More impressively, ChatGPT could refactor the code based on any of the fixes that it suggested – for example, using dynamic memory allocation.
Figure 1. The vulnerable code refactored using dynamic memory allocation.
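To put the vector that ChatGPT assigned into perspective, it can be turned into a numeric base score. The following is a minimal sketch of the CVSS v2 base equation using the metric weights from the CVSS v2 specification; the function name cvss2_base_score is our own and is not something ChatGPT produced.

```python
# Metric weights from the CVSS v2 base equations.
ACCESS_VECTOR = {"L": 0.395, "A": 0.646, "N": 1.0}
ACCESS_COMPLEXITY = {"H": 0.35, "M": 0.61, "L": 0.71}
AUTHENTICATION = {"M": 0.45, "S": 0.56, "N": 0.704}
IMPACT_WEIGHT = {"N": 0.0, "P": 0.275, "C": 0.660}


def cvss2_base_score(vector: str) -> float:
    """Return the CVSS v2 base score for a vector like 'AV:L/AC:L/Au:N/C:P/I:P/A:P'."""
    # Split "AV:L/AC:L/..." into {"AV": "L", "AC": "L", ...}.
    metrics = dict(part.split(":") for part in vector.split("/"))

    impact = 10.41 * (
        1
        - (1 - IMPACT_WEIGHT[metrics["C"]])
        * (1 - IMPACT_WEIGHT[metrics["I"]])
        * (1 - IMPACT_WEIGHT[metrics["A"]])
    )
    exploitability = (
        20
        * ACCESS_VECTOR[metrics["AV"]]
        * ACCESS_COMPLEXITY[metrics["AC"]]
        * AUTHENTICATION[metrics["Au"]]
    )
    # The specification zeroes the score when there is no impact at all.
    f_impact = 0.0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f_impact, 1)


print(cvss2_base_score("AV:L/AC:L/Au:N/C:P/I:P/A:P"))  # 4.6
```

A score of 4.6 falls in the medium range, which is consistent with a locally exploitable overflow that has only partial impact on confidentiality, integrity and availability.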
Going further, we asked ChatGPT to show how an attacker could exploit this vulnerability, and it did not disappoint. Since we did not give it any constraints, it updated the code snippet with a variable the size of the buffer and shellcode to echo ‘Hello World.’ That approach only works if the modified code is compiled and executed, and the compiler is itself a primary safeguard: modern compilers like the GNU Compiler Collection (GCC) can prevent code from writing explicitly declared variables into smaller buffers, although this protection can easily be disabled. When asked how it would exploit the vulnerability after the code was compiled, ChatGPT generated a Python script with the same shellcode.
Figure 2. Python script with the same shellcode.
Most interestingly, when asked how to exploit the code from a Linux shell, ChatGPT provided detailed step-by-step instructions on how to use objdump, gdb and msfvenom to generate the payload.
Figure 3. Step-by-step instructions on how to use objdump, gdb and msfvenom to generate the payload.
It could have used a simpler method, but knowing the offset of the buffer, the AI took a much more aggressive approach by spawning /bin/sh, declaring:
‘This effectively gives the attacker a command prompt with root privileges.’
This would be true only if the vulnerable code were already running with root privileges, which is not always the case. When asked why it assumed that the user was root, ChatGPT explained that the user is inherited from the parent process running the code, and conceded:
‘I apologize for any confusion. […] in a realistic scenario, the user who executes the shellcode may not have root privileges.’
DOM-Based Cross-Site Scripting
The second piece of code we tested was an example of DOM-based Cross-Site Scripting (XSS). This time, ChatGPT would not commit to a particular category because it found multiple issues, including XSS if the value of ‘name’ is not properly sanitized. When asked specifically about the XSS, it gave a CVSSv2 vector of AV:N/AC:M/Au:N/C:P/I:P/A:N and labeled it as CWE-79: Cross-Site Scripting (XSS) or CWE-494: Download of Code Without Integrity Check. Additionally, ChatGPT suggested multiple ways to improve the security of this code, including sanitizing user input using certain PHP functions, running the code in strict mode, and using a content security policy (CSP). Once again, it was able to quickly refactor the code to make it more secure, describing the changes that it made.
Figure 4. ChatGPT found multiple issues with very little context.
What was interesting about the refactored code is that ChatGPT wrote new functions to implement these changes rather than just expanding the existing code block. This suggests that ChatGPT is at least aware of some basic software development practices and when to apply them.
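The snippet we tested and ChatGPT’s refactored version are not reproduced here, but two of the suggestions, escaping untrusted input and sending a CSP, can be sketched with small helper functions in the same spirit. This is a Python analogue using the standard library’s html.escape, not the PHP functions ChatGPT proposed; the function names and the policy string are hypothetical, and only the ‘name’ parameter comes from the tested code.

```python
import html

# Hypothetical policy; a real one depends on what the page needs to load.
CONTENT_SECURITY_POLICY = "default-src 'self'; script-src 'self'"


def render_greeting(name: str) -> str:
    """Return an HTML fragment with the untrusted 'name' value escaped."""
    # html.escape converts &, <, > and, with quote=True, both quote characters.
    safe_name = html.escape(name, quote=True)
    return f"<p>Hello, {safe_name}!</p>"


def security_headers() -> dict:
    """Return response headers that restrict where scripts may load from."""
    return {"Content-Security-Policy": CONTENT_SECURITY_POLICY}


print(render_greeting("<script>alert(1)</script>"))
# <p>Hello, &lt;script&gt;alert(1)&lt;/script&gt;!</p>
```

Escaping on output addresses the injected markup itself, while the CSP limits the damage if something slips through. For a purely DOM-based flaw, the equivalent fix would live in the client-side script rather than on the server.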
Code Execution in Ruby
The last piece of code we tested contained a potential code execution vulnerability in Discourse’s AWS notification webhook handler. ChatGPT first described the code as shown below:
‘This code appears to be a Ruby script that defines a Sidekiq job for confirming an Amazon SNS subscription. The purpose of this job is to verify that the SNS message received is authentic and then confirm the subscription by visiting the SubscribeURL.’
Although this description is a bit vague, it is a good enough starting point to describe the code. ChatGPT also provided a line-by-line breakdown of the code. Each description covers the methods being used and how they relate to their parameters, and while these can be helpful, ChatGPT could not infer any more context. Interestingly, when supplied with the same code with the variables and identifiers changed, it provided an almost identical response, suggesting that it was doing more than just looking up definitions based on naming conventions.
Initially, ChatGPT did not find any issues with this code:
‘Overall, this code appears to be secure and well-written, using appropriate error checking and an AWS SDK to verify the authenticity of the SNS message.’
ChatGPT finding no issues with this code is not surprising, since exploitation relies on creating a custom endpoint and injecting a crafted X509 certificate. While the code itself is not obviously problematic, it is a little too trusting of user-supplied input, which could result in code execution when calling the ‘open’ method under the right conditions. The complexity of this vulnerability is much higher than that of the previous examples, and taking advantage of it requires an understanding of the AWS SDK. However, when specifically asked about security issues, ChatGPT was able to identify the root causes of the vulnerability, even if it did not know the exact method of exploitation.
Figure 5. Descriptions of the changes that ChatGPT made when asked to improve its security.
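The root cause described above, user-supplied input reaching an ‘open’ call, is usually addressed by checking where a URL points before anything fetches it. The following is a minimal Python sketch of that idea, not the Ruby fix that Discourse or ChatGPT applied. The function name is hypothetical, and the hostname pattern is an assumption about what SNS endpoints look like that should be adjusted to the regions actually in use.

```python
import re
from urllib.parse import urlsplit

# Assumed shape of an SNS endpoint hostname; not taken from the tested code.
_SNS_HOST = re.compile(r"sns\.[a-z0-9-]+\.amazonaws\.com")


def is_trusted_sns_url(url: str) -> bool:
    """Return True only for https URLs whose host looks like an SNS endpoint."""
    try:
        parts = urlsplit(url)
        host = parts.hostname or ""
    except ValueError:
        # Malformed input, such as an unterminated IPv6 literal.
        return False
    if parts.scheme != "https":
        return False
    # fullmatch rejects look-alike hosts that merely contain the pattern.
    return _SNS_HOST.fullmatch(host) is not None


print(is_trusted_sns_url("https://sns.us-east-1.amazonaws.com/?Action=ConfirmSubscription"))  # True
print(is_trusted_sns_url("https://sns.us-east-1.amazonaws.com.attacker.example/"))  # False
```

Only values that pass a check like this would be handed to the code that opens the URL; anything else, including strings that are not URLs at all, is rejected up front.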
Though these insights are basic, they can be helpful when finding vulnerabilities. In this instance, the code block was quite small – only 16 lines without whitespace – but that is not often the case. When trying to find and/or exploit a vulnerability, the starting point is often much broader in scope. Additionally, since any block of code often contains multiple issues, an elegant description of problematic elements can help filter out the noise when searching for the cause of a particular behavior.
When starting with a large block of code, an AI-powered static analysis tool could be valuable in helping researchers reduce the amount of time and effort required to narrow the search.
The Right Tool for the Job?
Although only a few tests are highlighted here, we provided ChatGPT with a lot of code to see how it would respond, and the results were mixed. With the three examples above, it did quite well finding potential issues. These examples were chosen because they are relatively unambiguous, so ChatGPT would not have to infer much context beyond the code that it was given.
To get the most out of ChatGPT, it is important to be as clear and specific as possible. When we supplied it with larger code blocks or less straightforward issues, it did not do very well at spotting the problems, but that is no less true of humans trying to do the same job.
Although static analysis tools have been used for years to identify vulnerabilities in code, they are limited in their ability to assess broader security aspects and sometimes report vulnerabilities that are impossible to exploit. ChatGPT demonstrates greater contextual awareness and can generate exploits, which supports a more comprehensive analysis of security risks. The biggest flaw when using ChatGPT for this type of analysis is that it is incapable of interpreting the human thought process behind the code.
For the best results, the user will need to give ChatGPT more input describing the code’s purpose in order to elicit a contextualized response.
The responses that ChatGPT delivers are not always accurate. In OpenAI’s defense, ChatGPT’s purpose is to simulate human conversation, and, in that regard, it is wildly successful. As a tool specifically designed to generate chat, it should be no surprise that ChatGPT is particularly good at writing clear and concise responses. Additionally, ChatGPT could be particularly useful for generating skeleton code and unit tests since those require a minimal amount of context and are more concerned with the parameters being passed – another thing at which ChatGPT excelled in these tests.
It is flexible enough to respond to many different requests, but it is not necessarily suited to every job that is asked of it. That said, ChatGPT offers an intriguing preview of more specialized AI-powered tools to come. Still, as with any exciting new tool, it is necessary to push the limits to find the most suitable use cases. There is already a variety of ChatGPT-powered plugins for different software, from Integrated Development Environments (IDEs) to disassemblers, so it is only a matter of time until some applications rise to the top.