A Comparative Look at Static Code Analysis and Large Language Models

Introduction

Static code analyzers are widely used in security and safety-related software systems. These tools identify code patterns and structures that may be hazardous or exploitable. Addressing these violations will improve code security, safety and quality. Various standards (such as SEI CERT Coding Standards, MISRA-C, etc.) define rules and recommendations for avoiding certain code patterns.

These rules and recommendations focus on well-defined syntax patterns and are independent of the code's semantics: they have no connection to the implemented functionality or to the environment in which the code runs. Consequently, static code analyzers do not comprehend semantics, which leads to numerous false positives.

"If there is one thing I hate about SAST tools, is false positives..."

A false positive is a finding that, after manual inspection, turns out to be irrelevant. It is flagged by the checker but poses no harm, and no correction is required.
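
To make this concrete, here is a minimal sketch of the kind of finding that often turns out to be a false positive. The helper percent_to_byte is hypothetical and not taken from any analyzed code base, and whether a given analyzer actually flags it depends on the tool and rule set; the point is only that a syntax-oriented rule about implicit narrowing cannot see that the clamp makes truncation impossible.

```c
#include <stdint.h>

/* Hypothetical helper: maps a percentage (0..100) to a byte (0..255). */
uint8_t percent_to_byte(uint8_t percent)
{
    if (percent > 100u) {
        percent = 100u; /* clamp, so the product below is at most 25500 */
    }

    /* A pattern-based checker may flag this implicit narrowing from
     * unsigned int to uint8_t. After the clamp the quotient is at most
     * 255, so no bits can be lost and no correction is required. */
    uint8_t scaled = (percent * 255u) / 100u;

    return scaled;
}
```

Judging this finding requires knowing the value range of percent, which is exactly the semantic context a pattern match does not have.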

In this article, we investigate how Large Language Models (LLMs) can assist in evaluating static code analysis violations. Our goal is to differentiate between true and false positives. Furthermore, we provide an overview of how LLMs can improve code quality by understanding the semantics. They can identify issues beyond the capabilities of syntax-oriented static code analyzers.

Judging the violations of static code analyzers

Manual examination and evaluation of a static code analysis violation is cumbersome. We need to consider lots of things to properly evaluate a finding. Once judgment is passed, the appropriate correction can take considerable time. Here are two perspectives we need to take into account when examining findings. Note that both of these heavily rely on semantic understanding.

Understanding the violation
First, it is necessary to clarify why a given violation arises. What was the original intent of the rule? Why is it beneficial to comply with this rule? What problems may arise from not following it? What are the consequences of a fix?
Evaluation of the context
Understanding the implemented functionality and environment of the code is essential. Without it, a blind correction can easily lead to the introduction of additional bugs.

This list is not complete, and it is easy to dive deeper. Code corrections should only be made after evaluating all perspectives. Doing this manually is exhausting.

While SAST tools can recognize vulnerable patterns, the above two perspectives are mostly unknown to them. That's where LLMs come in.

Example

Let’s consider the following task. Here is the specification / pseudocode.

Take an integer input (assume input > 1). If the input and 10 are relatively prime, take the reciprocal of the input and return the length of the repeating cycle of its decimal expansion. Otherwise, return -1. Note: the reciprocal of a positive integer that is relatively prime to 10 always has a repeating cycle in its decimal expansion. Example: the input is 7; 7 and 10 are relatively prime; the reciprocal is 1/7 = 0.142857142857...; the repeating cycle is 142857, which is 6 digits long, so the function returns 6.

Here is a simple implementation of this function.

1  sint16_t calculate(uint32_t input)
2  {
3    uint32_t remainder = 1;
4    uint32_t dividend = 10;
5    uint32_t quotient, new_remainder;
6    uint16_t i;
7
8    if (input % 2 == 0 || input % 5 == 0) {
9      return -1; // input and 10 are not relative prime
10   }
11
12   for (i = 0; i < input; i++) {
13     quotient = dividend / input;
14     new_remainder = dividend % input;
15 
16     if (new_remainder == remainder) {
17       return i + 1;
18     }
19
20     dividend = new_remainder * 10;
21   }
22
23   return -1; // cycle not found
24 }

The violations raised by the static code analyzer [1] are shown in the table below. I asked GPT-4 to evaluate the findings one by one: to decide whether each violation is relevant, and to classify what fixing it would affect, for instance code quality, readability, maintainability, or functionality.

When constructing the prompt, I gave it the code, the findings and some instructions. The relevance and suggestion are generated by GPT-4.

#1 Line 3 - Reserved name 'remainder' used
Relevance: Yes, it is better to avoid reserved names to improve code readability and maintainability.
Suggestion: Change the variable name 'remainder' to 'initial_remainder' or another non-reserved name.

#2 Line 3,4 - Unsigned variable 'remainder' and 'dividend' is initialized by signed constant
Relevance: Yes, it's better to keep consistent types to avoid potential issues and improve code readability.
Suggestion: Change the constant to an unsigned value: uint32_t dividend = 10u;

#3 Line 5 - Variable 'quotient' is set but not used
Relevance: Yes, removing unused variables can improve code readability and maintainability.
Suggestion: Remove the unused variable 'quotient'.

#4 Line 8 - Use parentheses when different operators are used in expression
Relevance: Yes, using parentheses can improve code readability and prevent potential issues with operator precedence.
Suggestion: Add parentheses around the conditions: if ((input % 2 == 0) || (input % 5 == 0)) {

#5 Line 8 - The operands of a logical '||' in function 'calculate' should be parenthesized
Relevance: Yes, this is a duplicate of the previous violation, and adding parentheses will improve code readability.
Suggestion: Same as above.

#6 Line 8 - Implicit conversion between signed and unsigned type shall not be used
Relevance: Yes, avoiding implicit conversions can prevent potential issues and improve code readability.
Suggestion: Change the input type to uint16_t (since it won't exceed the maximum value of uint16_t) and update the function signature accordingly: sint16_t calculate(uint16_t input)

#7 Line 12 - Implicit conversion between signed and unsigned type shall not be used
Relevance: Yes, avoiding implicit conversions can prevent potential issues and improve code readability.
Suggestion: Change the loop variable type to uint32_t: uint32_t i;

#8 Line 17 - Implicit conversion between signed and unsigned type shall not be used
Relevance: Yes, avoiding implicit conversions can prevent potential issues and improve code readability.
Suggestion: Change the return type of the function to uint16_t and update the function signature accordingly: uint16_t calculate(uint16_t input). Also, update the error return values to a valid unsigned value like UINT16_MAX.

#9 Line 20 - Implicit conversion between signed and unsigned type shall not be used
Relevance: Yes, avoiding implicit conversions can prevent potential issues and improve code readability.
Suggestion: Change the calculation to use an unsigned constant: dividend = new_remainder * 10u;

#10 Line 9, 17, 23 - Provide only one exit point in 'calculate' function
Relevance: Partially relevant. Having a single exit point can improve code maintainability and readability, but it may not always be necessary.
Suggestion: Consider using a 'result' variable to store the return value and return it at the end of the function. However, note that the current implementation is also acceptable.

If you skipped the table, please go back and read it. The power of LLMs is immediately visible when checking the evaluation of GPT-4. It gives clear reasoning about the relevance and the suggested fixes. This is helpful in finding the right path during the correction of the violations.

On the other hand, verifying GPT-4's evaluation is essential, as it is not always correct. GPT-4 treated violations #4 and #5 as duplicates, but a closer look shows that they are different:

#4 refers to the % and == operators inside the input % 2 == 0 expression, whose proper correction is (input % 2) == 0, while
#5 refers to the operands of the || operator, where the proper correction is the one GPT-4 suggested for #4: (input % 2 == 0) || (input % 5 == 0).

So the proper correction of this line is: ((input % 2) == 0) || ((input % 5) == 0)
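
The two forms can be compared side by side. The sketch below wraps the original condition and the fully parenthesized one in two helper functions; the names shares_factor_original and shares_factor_corrected are made up for this illustration. Because % binds tighter than ==, and == binds tighter than ||, both functions compute the same result, so this particular fix affects readability only, not behavior.

```c
#include <stdbool.h>
#include <stdint.h>

/* Condition as written on line 8, relying on operator precedence. */
bool shares_factor_original(uint32_t input)
{
    return input % 2 == 0 || input % 5 == 0;
}

/* Fully parenthesized form that addresses both #4 and #5. */
bool shares_factor_corrected(uint32_t input)
{
    return ((input % 2) == 0) || ((input % 5) == 0);
}
```

Both helpers return true when the input shares a factor with 10, for example for 4, 10 or 15, and false for inputs such as 7 or 21.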

Beyond static code analysis - understanding the semantics

The evaluation of the static code analysis violations by GPT-4 is a significant help. Let's go even further. GPT-4 can also analyze the code without any starting point. I asked it to find semantic issues and security vulnerabilities in the same, unchanged code, and it reported the following findings.

Integer overflow: In the for loop (line 12), when i < input, there is a possibility of integer overflow if input is larger than the maximum value of uint16_t. It would be better to change the data type of i to uint32_t.
→ Score! This is an integer overflow, a serious functionality issue. This overflow was invisible to the SAST tool.
Possible infinite loop: The for loop (line 12) may continue indefinitely if the condition new_remainder == remainder is not met. It is recommended to have an upper bound for the loop to prevent infinite loops.
→ Score! For large inputs the code can get stuck in an infinite loop (because of the overflow). Again, missed by the static analyzer.
Unused variable: quotient is calculated at line 13, but it is never used elsewhere. Consider removing it if it's not needed.
→ True, but this one is easy. It was found by the static analyzer as well.
Return type mismatch: The function returns a signed 16-bit integer sint16_t, but potentially, i + 1 could be larger than the maximum value of sint16_t. As a suggestion, you may want to use int32_t as the return type instead.
→ Score! This one is interesting. Both tools found the bug, but the reasoning isn't the same. The SAST tool reported the inconsistency between the type of i and the return type, while GPT-4 reasoned that a large value of i + 1 could be interpreted as a negative signed integer. This clearly demonstrates superior understanding of both the context and the problem.
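
The counter problem behind the first two findings can be reproduced in isolation. The following is a minimal sketch, not code from the analysis: the helper name bound_is_reached and the max_steps cap are assumptions added so that the experiment always terminates. It mimics the loop condition i < input with a uint16_t counter and reports whether that condition ever becomes false.

```c
#include <stdint.h>

/* Returns 1 if a uint16_t counter starting at 0 ever stops satisfying
 * "i < bound" within max_steps iterations, and 0 otherwise. */
int bound_is_reached(uint32_t bound, uint32_t max_steps)
{
    uint16_t i = 0u;

    for (uint32_t step = 0u; step < max_steps; step++) {
        if (!(i < bound)) {
            return 1; /* the loop condition would end the loop here */
        }
        i++; /* wraps from 65535 back to 0 */
    }

    return (i < bound) ? 0 : 1;
}
```

For a bound of 1000 the condition fails after 1000 iterations, as intended. For a bound of 70000 the counter wraps at 65536 and never reaches the bound, so in the original function only the return inside the loop can terminate the iteration.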

GPT-4 also provided the corrected C code for the issues above.
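
As a rough illustration of what such a correction could look like, here is one possible version that addresses the findings above. This is a sketch under assumptions and not GPT-4's output: the function name cycle_length, the 64-bit return type, the 64-bit intermediate value and the assert on the precondition are all choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical corrected version of 'calculate'.
 * Returns the length of the repeating cycle of 1/input, or -1 if input
 * and 10 are not relatively prime. */
int64_t cycle_length(uint32_t input)
{
    int64_t result = -1; /* single exit point: returned at the end */

    assert(input > 1u); /* precondition taken from the specification */

    if (((input % 2u) != 0u) && ((input % 5u) != 0u)) {
        /* 64-bit remainder: rem < input, so rem * 10 cannot overflow. */
        uint64_t rem = 1u;

        /* 32-bit counter: it can reach any uint32_t bound. */
        for (uint32_t i = 0u; i < input; i++) {
            rem = (rem * 10u) % input;
            if (rem == 1u) {
                /* Remainder is back at its start value: cycle closed. */
                result = (int64_t)i + 1;
                break;
            }
        }
    }

    return result;
}
```

With this version, cycle_length(7) returns 6 and cycle_length(10) returns -1. The wider return type is used because the cycle can be almost as long as the input itself, which does not fit into a signed 16-bit value for large inputs.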

Conclusion

Augmenting SAST workflows with LLMs capable of semantic understanding, such as GPT-4, can significantly decrease the time spent on identifying false positives. While the evaluation of SAST output by GPT-4 is not perfect, and we can't rely on it completely, it provides great help. It is not an out-of-the-box solution, at least not yet.

But the real strengths of LLMs are revealed when we let them work ‘on their own’, as they can find semantic issues, security vulnerabilities, and functionality-related problems as well. Semantic examination by GPT-4 revealed findings that were invisible to the SAST tool. But again, be aware that blindly accepting the proposed solutions can lead to further problems.

Thanks to Andras Szell, Istvan Szenasi and Daniel Szpisjak for their contributions and feedback.

References

[1] The measurement was made with a certified commercial static analyzer