Verification of dynamic behavior with Large Language Models

Bence Limbay
May 5, 2023

Introduction

In a previous post, we looked at the code analysis capabilities of Large Language Models (LLMs). But verification and validation of code is more than static code analysis; the software industry uses many other techniques as well. In this post, I will focus on testing the dynamic behavior of code.

Dynamic testing means executing the code with different inputs and verifying correct behavior by checking internal states and/or outputs.
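To make the idea concrete, here is a toy example of my own (not the code we will actually test): we run the unit under test with chosen inputs and check its outputs.

#include <assert.h>

/* Toy unit under test. */
static int add(int a, int b)
{
  return a + b;
}

int main(void)
{
  assert(add(2, 3) == 5);   /* normal operation */
  assert(add(-1, 1) == 0);  /* boundary around zero */
  return 0;
}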

According to testing theory, behavior during normal operation is usually error free; bugs mostly hide around edge cases. These often lie in the most complex parts of the code and require tricky inputs to reach. Occasionally, developers, testers, and reviewers all miss an edge case. The consequences are well known: significantly increased correction effort and cost. The later in the lifecycle a bug is found, the more expensive it is to fix. And while customers are great testers, they usually don't like finding bugs.

LLMs can help with dynamic testing. You will see that not only can GPT-4 find errors, it is also capable of generating tests, fixes, and verification tests. Wow!

Let's get started. 

Our target is a relatively simple, standalone C function. We'll generate a whole test environment with test cases. For clarity, we'll keep things simple. At the end, I will show you a couple of ideas you could explore further.

I used the Chatbot UI framework to access OpenAI's GPT-4 model. It works much like ChatGPT but has a couple of handy extras, such as saved prompts and search.

Setting up our toolset

Chatbot UI enables the creation of prompts with variables. Once a prompt is phrased, we can save it and reuse it later. When reusing a saved prompt, all we need to do is fill in the variables and off we go. Chatbot UI is open source; check it out on GitHub.
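As a quick illustration, a saved template might look like this (a hypothetical, much simpler template than the real one in the Appendix); the double-brace placeholder is a variable you fill in each time the saved prompt is run:

You are a C code reviewer.
Summarize the following function in one sentence:
{{code}}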

Here is what we want GPT-4 to do: 

  1. Analyze the source code semantics from a security point of view.
  2. Propose corrections for bugs found.
  3. Create a test environment in which the code can be tested. Also include test cases so we know we don't break existing functionality.
  4. Provide test cases which fail for the original code but pass for the corrected code, proving that the bug is eliminated.

I created the following prompt to inspect and test any C function. The whole prompt is available in the Appendix.

Defining a prompt template in Chatbot UI

The prompt variable is the C function itself; we provide it when using the saved prompt.

Running the prompt template defined above

Action!

The code under test is the following.

int16_t calculate(uint32_t input)
{
  uint32_t remainder = 1;
  uint32_t dividend = 10;
  uint32_t quotient, new_remainder;
  uint16_t i;

  if (input % 2 == 0 || input % 5 == 0) {
    return -1; // input and 10 are not relative prime
  }

  for (i = 0; i < input; i++) {
    quotient = dividend / input;
    new_remainder = dividend % input;

    if (new_remainder == remainder) {
      return i + 1;
    }

    dividend = new_remainder * 10;
  }

  return -1; // cycle not found
}
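A quick note on what the function computes: calculate returns the length of the repeating block in the decimal expansion of 1/input (the multiplicative order of 10 modulo input), or -1 if there is none. For example (these values match the test cases GPT-4 will generate below):

calculate(3)  == 1    /* 1/3  = 0.(3)      */
calculate(7)  == 6    /* 1/7  = 0.(142857) */
calculate(11) == 2    /* 1/11 = 0.(09)     */
calculate(4)  == -1   /* 1/4 terminates: 4 shares a factor with 10 */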

GPT-4 analyzed the function and listed its findings (step #1). It found several issues and proposed corrections. Here is the most interesting one: the data type of the iterator i and the return value of calculate were widened from 16 to 32 bits to avoid the possibility of overflow. The corrected source code (step #2):

int32_t calculate_fixed(uint32_t input)
{
  uint32_t remainder = 1;
  uint32_t dividend = 10;
  uint32_t new_remainder;
  uint32_t i;

  if (input % 2 == 0 || input % 5 == 0) {
    return -1; // input and 10 are not relative prime
  }

  for (i = 0; i < input && i < UINT32_MAX; i++) {
    new_remainder = dividend % input;

    if (new_remainder == remainder) {
      return i + 1;
    }

    dividend = new_remainder * 10;
  }

  return -1; // cycle not found
}
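To see why the 16-bit types are dangerous, here is a minimal sketch of my own (not part of GPT-4's output):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  uint16_t i = 65535;          /* the maximum of the original loop counter */
  i++;                         /* wraps around to 0 */
  printf("%u\n", (unsigned)i); /* prints 0 */

  /* The expected result for input 65537 is 65536, but the original
     function returns it as int16_t; converting an out-of-range value
     to a signed type is implementation-defined (commonly it wraps to 0). */
  int16_t truncated = (int16_t)65536L;
  printf("%d\n", truncated);
  return 0;
}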

Let's see how it did with generating test cases.

Generating test scenarios

GPT-4 provided the following test environment and gave detailed information on how to run it.

#include <stdio.h>
#include <stdint.h>

int16_t calculate(uint32_t input)
{
  // Original (unchanged) function code
}

int32_t calculate_fixed(uint32_t input)
{
  // Corrected function code (as provided above)
}

void test_calculate()
{
  struct TestCase
  {
    uint32_t input;
    int32_t expected_result;
  } test_cases[] = {
    {1, -1},
    {2, -1},
    {3, 1},
    {5, -1},
    {7, 6},
    {11, 2},
    {13, 6},
    {65537, 65536},
    {78137, 78136},
  };

  size_t n_test_cases = sizeof(test_cases) / sizeof(test_cases[0]);

  for (size_t i = 0; i < n_test_cases; i++)
  {
    struct TestCase *test = &test_cases[i];

    int16_t  result = calculate(test->input);
    int32_t  result_corrected = calculate_fixed(test->input);

    if (result != test->expected_result)
    {
      printf("Test case %zu FAILED for original code: Expected %d for input %u, but got %d\n", i + 1, test->expected_result, test->input, result);
    }
    else
    {
      printf("Test case %zu PASSED for original code\n", i + 1);
    }

    if (result_corrected != test->expected_result)
    {
      printf("Test case %zu FAILED for corrected code: Expected %d for input %u, but got %d\n", i + 1, test->expected_result, test->input, result_corrected);
    }
    else
    {
      printf("Test case %zu PASSED for corrected code\n", i + 1);
    }
  }
}

int main()
{
  test_calculate();
  return 0;
}

This isn't too complicated, but it doesn't need to be. The two functions are clearly visible, as are the test cases, and we get some nice logging of the results. The table-driven structure also makes adding a new case a one-line change. The instructions on how to use it were the following.

Suggested steps to run the test suite

Almost off-the-shelf, and easy to follow: copy the original and the corrected code into the file, then compile and run it. Only the names needed to be customized, and I used Windows instead of Linux (although, had I mentioned it, I would probably have gotten Windows-specific instructions 🙂). The test environment itself works immediately, with zero-shot prompting.
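For reference, the steps boil down to something like this (assuming GCC; the file name is my choice, and GPT-4's exact instructions are in the screenshot above):

gcc -std=c11 -Wall -Wextra -o test_calculate test_calculate.c
./test_calculate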

Test cases were also provided for the edge cases that fail for the original code and pass for the corrected code. It is worth noting that this needed a bit more guidance: an additional question was required, after which GPT-4 corrected itself, and the newly received input vector could be added to the list. To be fair, with a bit of prompt engineering we could probably make this work out of the box. And with prompt chaining, it would be very simple to generalize. The follow-up was along the lines of the sketch below.
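Illustrative wording only; the actual exchange is in the screenshot below:

Provide a concrete input for which the original calculate function returns a
wrong result, while calculate_fixed returns the correct one. Explain why.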

Specifically asking for a test case

Running the test environment, I got the following.

Running the test suite

Boom, exactly what I wanted! These tests verify the proper dynamic behavior during normal operation and reveal the vulnerability of the original code by demonstrating the overflow (test cases #8 and #9): for input 65537, the expected result of 65536 simply does not fit into the original 16-bit return type. The corrections also work as expected.

Impressive, isn't it?

Conclusion

Generative AI is very powerful. We started out with a single implemented C function, and voilà: with an okay prompt, we got a test environment, bug reports, fixes, and test cases. Notice the "okay" in the previous sentence; we could have spent some time fine-tuning the prompt to get even better results.

The examined code was very basic. Even with its simplicity, it demonstrated the power of LLMs and their semantic understanding. Naturally, there are many paths you can take from here: refine the prompt further, chain prompts to automate the whole flow end to end, or point the same template at more complex, real-world code.

On the other hand, we should never forget to verify the answers of LLMs. They may contain mistakes and incomplete parts, just as we saw in the edge case generation above. LLMs are great tools, but for now they need supervision for certain tasks. That will probably change, but no one knows when. And remember: a year ago, most AI experts thought that what we have today was still 5-10 years away.

Appendix

The following prompt template was provided to GPT-4.

You are a world-class software engineer and tester.
Given the C code: 
{{code}}
1. Go and do a detailed inspection of the code and identify semantic issues or security vulnerabilities in the code. Explain the bugs. Respond with a numbered list.
2. Provide the fixed C code that addresses the problematic points. The name of the function shall be the original name with '_fixed' postfix.
3. Create a C test environment with the following requirements.
- It shall contain test cases to verify normal behavior.
- It shall contain test cases for edge cases which point to the weaknesses of the original code. It means for these test cases original code shall fail, meanwhile corrected code shall pass.
- It shall contain both original and fixed functions so that it can be immediately compiled and run.
