TL;DR: We used ChatGPT to perform automated token reviews using our ERC20 token review framework and compared the results to standard token security reviews. While ChatGPT did not hit the accuracy bar to clearly demonstrate that it should be integrated into our asset review process, it did show enough potential that we will continue to investigate further.
One of the functions performed by our Blockchain Security team is researching the most efficient and effective ways to review token contracts and make decisions on whether to list assets on our centralized exchange. Whenever we find things that aren’t safe, we ask project teams to take the appropriate measures to mitigate those risks. If they don’t, we don’t list their tokens.
The Blockchain Security team leverages in-house automation tools developed to aid security engineers in reviewing ERC20/721 smart contracts at scale. With the emergence of ChatGPT by OpenAI and the buzz around its ability to detect security vulnerabilities, we wanted to test how well it would work as a frontline tool applied at scale rather than as a one-off code reviewer. In this blog post, we detail our approach to an experiment with ChatGPT and share the results of our token security experiment along with lessons learned.
ChatGPT
For those unfamiliar with ChatGPT, refer below for the output of ChatGPT when prompted “how would you describe yourself?”:
“As an artificial intelligence language model, I don't have a physical form or a personal identity, but I can help answer your questions and provide information on a wide range of topics. I'm programmed to understand and process natural language, so I can communicate with you through text-based conversations. I'm designed to provide helpful and informative responses, and I can draw upon a vast amount of knowledge and data to answer your questions to the best of my ability.”
ChatGPT has shown promise at improving productivity across a wide range of development and engineering tasks. The tool can optimize code, identify vulnerabilities, and perform various other tasks based on the prompt or prompts it is given.
Experiment approach
The goal of the experiment was to determine ChatGPT’s token security review accuracy by comparing its review results with the results of a standard review performed by a blockchain security engineer. Random ERC20 tokens were selected from publicly available sources to conduct this experiment. Blockchain Security engineers leverage our internal proprietary secure trait analyzer tool to conduct an automated review of every function within a token’s smart contract and output a risk score based on the risks tagged to those functions (see here for more information on our ERC20 review framework). Similarly, to compare ChatGPT’s accuracy to that of a standard review, we needed ChatGPT to produce a risk score for the randomly selected tokens.
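The risk-scoring step described above can be pictured as a simple roll-up: each function in a contract is tagged with zero or more risks, and the tags aggregate into a single score. The sketch below is purely illustrative; the actual Coinbase analyzer, its risk taxonomy, and its weights are proprietary, so the tag names and weights here are hypothetical.

```python
# Hypothetical roll-up of per-function risk tags into one contract score.
# Tag names and weights are illustrative, not Coinbase's real framework.
RISK_WEIGHTS = {
    "mintable": 3,            # owner can inflate supply
    "pausable": 2,            # transfers can be frozen
    "blacklist": 2,           # addresses can be blocked
    "proxy_upgradeable": 3,   # logic can change after review
}

def risk_score(tagged_functions: dict[str, list[str]]) -> int:
    """Sum the weights of every risk tag found across a contract's functions."""
    return sum(
        RISK_WEIGHTS.get(tag, 0)
        for tags in tagged_functions.values()
        for tag in tags
    )

print(risk_score({"mint": ["mintable"], "pause": ["pausable"]}))  # 5
```

For the experiment to be meaningful, ChatGPT would need to produce a score comparable under this same kind of rubric, which is what motivates the prompt-engineering work below.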
Apples to oranges
If we simply asked ChatGPT to produce a risk score based on a token’s smart contract security, the tool would not have sufficient information or context to produce a response that we could compare against the automated review using our proprietary tooling, or against manual reviews. This is because ChatGPT and our in-house reviews would not be assessing smart contract security the same way.
The Blockchain Security team maintains an industry-leading ERC20 review framework to identify and mitigate risks to digital asset custody. ChatGPT, however, has no visibility into this framework. You can ask ChatGPT ‘how would you determine a security risk score if you were given a smart contract to review?’, but the results show that it does not assess a token’s security risk using the same review framework.
For ChatGPT to produce a risk score under the same framework a blockchain security engineer uses to review the token, we had to feed the tool more information. Specifically, we had to teach it how to identify the risks defined by our security review framework.
Prompt engineering
Prompt engineering is a developing AI field that is maturing alongside tools like ChatGPT. To get an AI tool to produce the intended results, one needs to clearly articulate how the task should be performed.
To have ChatGPT produce a risk score using our ERC20 security review framework we gave it the following prompt:
I want you to act as a blockchain security engineer. Your task is to identify security risks within a token smart contract based on the risk associated with its functions. Here is our framework [ + risk framework]. Are there any of these risks within the following smart contract?
[+smart contract code]
We defined our framework in the prompt and asked ChatGPT whether any risks were present. Using this prompt, we were able to get the tool to indicate which risks were present, which then allowed us to calculate a security risk score. We settled on this prompt for the experiment.
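Because the same prompt is reused for every token under review, it can be assembled programmatically rather than typed by hand each time. The sketch below shows one way this templating could look; the template wording mirrors the prompt above, while the framework text and contract source are placeholders, not the real review framework.

```python
# Sketch of assembling the review prompt for each token at scale.
# The framework and contract arguments are placeholders, not real inputs.
PROMPT_TEMPLATE = (
    "I want you to act as a blockchain security engineer. "
    "Your task is to identify security risks within a token smart contract "
    "based on the risk associated with its functions. "
    "Here is our framework: {framework}. "
    "Are there any of these risks within the following smart contract?\n\n"
    "{contract_code}"
)

def build_prompt(framework: str, contract_code: str) -> str:
    """Fill the review template for one token's contract."""
    return PROMPT_TEMPLATE.format(framework=framework, contract_code=contract_code)

prompt = build_prompt("<risk framework>", "contract Token { /* ... */ }")
```

Keeping the prompt in one template also makes the prompt-maintenance concern discussed later more tractable: a wording change happens in one place and applies to every review.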
Experiment results
We compared risk scores for 20 random smart contracts between ChatGPT and our automated security review leveraging internal tools. ChatGPT produced the same result as our reviews 12 times. However, of the 8 misses, 5 were cases where ChatGPT incorrectly labeled a high-risk asset as low-risk, which is the worst-case failure: underestimating a risk score is far more detrimental than overestimating one.
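The headline numbers (20 tokens, 12 agreements, 5 high-risk assets scored low) can be tallied with a small helper that separates overall agreement from the dangerous false-negative case. The label pairs below are illustrative stand-ins chosen to match those counts, not the actual experiment data.

```python
# Illustrative tally of agreement between the engineer baseline and ChatGPT.
# The label pairs are stand-ins matching the published counts, not real data.
def summarize(pairs):
    """pairs: list of (engineer_label, chatgpt_label), one per token."""
    matches = sum(1 for eng, gpt in pairs if eng == gpt)
    high_as_low = sum(1 for eng, gpt in pairs if eng == "high" and gpt == "low")
    return {
        "total": len(pairs),
        "matches": matches,
        "agreement": matches / len(pairs),
        "high_scored_as_low": high_as_low,  # the worst-case failure mode
    }

# 12 agreements, 5 dangerous misses, 3 other misses: mirrors the 20-token run
pairs = [("low", "low")] * 12 + [("high", "low")] * 5 + [("low", "high")] * 3
print(summarize(pairs))
# {'total': 20, 'matches': 12, 'agreement': 0.6, 'high_scored_as_low': 5}
```

Tracking the high-scored-as-low count separately, rather than accuracy alone, is what surfaces the asymmetry called out above: a 60% agreement rate matters less than the 5 contracts whose risk was understated.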
While the efficiency of a ChatGPT review is remarkable, there are still some limitations that impair the tool’s accuracy. First, ChatGPT is not capable of recognizing when it lacks the context to perform a robust security analysis. This results in coverage gaps where additional dependencies go unreviewed; an initial triage would be required to scope the review for the tool each time in order to prevent them. Second, the tool can be inconsistent: when we asked the same question multiple times, we would not always get the same answer. It also appeared to be influenced by comments in the code and seemed to occasionally defer to comments rather than function logic. Finally, OpenAI continues to iterate on ChatGPT versions, resulting in additional output instability. Detailed prompts that provided consistent outputs in the past could produce different outputs following a version change. Prompt maintenance and output quality control may be needed to ensure consistent responses and avoid operational failures.
While ChatGPT shows promise in its ability to quickly assess smart contract risks, it does not meet the accuracy requirements to be integrated into Coinbase security review processes. We anticipate that with further prompt engineering we can improve the tool’s accuracy. However, the tool cannot be solely relied upon to perform a security review.
Assuming we are correct that we can increase the accuracy, we expect a good first use case for the tool would be to serve as a secondary QA check. In the future, security engineers could leverage the tool to perform an additional control check to catch any risk that may have been overlooked. ChatGPT prompts are saved for future use by engineers, and we plan to improve them over time.
Reducing security risks in Web3
Coinbase makes security a priority and maintains industry-leading security frameworks to review all tokens before they are listed on our centralized exchange. Simply put: we review the code so you don’t have to.
As part of the growing web3 ecosystem, Coinbase is keen on securing the future of web3. To that end, Blockchain Security works directly with asset issuers to mitigate critical risks when identified. As we continue to look for valuable ways to introduce AI improvements into our security process, we will continue to experiment with emerging tools that show promise. Until an automated and/or intelligent process can show the same degree of accuracy as our existing processes, these tools will only be used as a supporting tool, rather than a primary tool. We are excited to explore new tools that can help us further protect the future of our Crypto Economy.
The experiment was performed using ChatGPT-3 and an arbitrary selection of ERC20 tokens from Etherscan. Special thanks to Von Tran for his work on the ChatGPT experiment.
Company, Dec 18, 2024