Constructing AI-Proof Coding Tests: All You Need to Know 

by Codeaid Team

AI is reshaping landscapes across all industries, and the tech sector is no exception. In fact, about 92% of US-based developers already use AI-based tools for code writing and debugging.

Do AI-Proof Coding Tests Exist?

But when it comes to software development recruiting, the biggest challenge is applicants' increasing use of AI during coding assessments. Its use makes it hard to tell how much of the test was solved by the applicant and how much by AI.

In that case, how do you gauge a candidate's true capabilities? Is there such a thing as an AI-proof coding test? Or do we need to find a way to assess coding skills even when candidates use AI?

In this article, we answer these and other questions based on key trends and our own experience striving to create authentic, AI-proof tests. To provide in-depth insights, we put some of our developers to the test and challenged them to use AI to solve our coding challenges.

Keep reading to find out what happened. 

Understanding AI’s Weaknesses in Coding  

One thing we know for sure: AI thrives on precision.  

Give it clear, single-step tasks or ones with only a few well-defined steps, and it performs admirably. However, when complexity and creativity come into play, AI falls short.

So far, we have noticed that AI tools fail in a few use cases. For example, they struggle to: 

  • Solve complex, multi-step problems: AI still struggles to follow more complex logic when working with multiple parameters. 
  • Reuse code: When an open-source library already solves part of the problem, AI tends to write the code from scratch instead of using it. It may still solve the task, but it generates more code in the process, which is less efficient.
  • Interpret images: ChatGPT and other AI tools are still not sophisticated enough to understand image input. However, some creative solutions to this obstacle have already started emerging.  
  • Produce code in volume: ChatGPT 3.5, for example, has a context limit of roughly 3,000 words (about 4,096 tokens), and that limit includes the prompt you provide. This does not leave much room for generated code.
  • Conceive novel solutions: If a coding challenge requires developers to create a new solution (rather than apply a well-known one that already exists), AI will most probably struggle to invent something new. AI relies on patterns learned from its training data to answer questions and solve problems, so when presented with something that does not resemble anything it has seen, it can produce faulty or incomplete results.
  • Efficiently handle edge cases: AI often fails to cover edge cases that have special requirements or are not listed explicitly. Solving such problems usually takes some degree of creativity and an out-of-the-box mindset, which AI still doesn't possess.

Strategies to Construct AI-Proof Tests 

Crafting AI-proof tests is no simple task. But with the right strategies, and by turning AI's weaknesses to your advantage, constructing coding assessments that stand strong against AI's problem-solving abilities is possible.

Let’s delve deeper into these strategies: 

Creating Longer Multi-Step Tests 

One effective method is the introduction of lengthier, multifaceted tests that demand a sequence of logical steps to solve.  

AI tools, such as ChatGPT or Bard, thrive when given precise, single-step tasks but struggle as the complexity and number of steps increase. Based on our observations, the AI's success rate drops significantly with each additional logical or functional step.

To make things even harder for AI, you can utilize lengthier tests with visual input.   

Key takeaway: Constructing longer tests that require a series of interrelated problem-solving steps and integrating visuals can make AI less effective in solving them. 

Using Edge Cases 

A good developer should be able to handle edge cases – scenarios that occur outside the typical operation of the software. AI often falls short in this aspect. It might efficiently solve the “happy flow” (the simplest possible scenario) but falter when edge case handling is expected.  

Key takeaway: By integrating a significant number of edge cases in your tests, you make it harder for developers to leverage AI to their advantage during coding tests. Such cases also help showcase how well developers grasp coding problems and what their problem-solving approach is. 
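Below is a minimal sketch of what an edge-case-heavy test might look like. The challenge, function name, and cases are hypothetical and only illustrate the idea: the happy flow is a single assertion, while the edge cases (odd player counts, empty input, a single player) are where AI-generated solutions tend to break down.

```python
# Hypothetical challenge: split a list of player ratings into two balanced teams.
# split_into_teams is the candidate's function; the stub below only fixes the
# assumed signature so the tests can illustrate the edge cases we care about.

def split_into_teams(ratings):
    """Return a (team_a, team_b) tuple of rating lists (candidate-implemented)."""
    raise NotImplementedError

def test_happy_flow():
    # Even number of players with an obvious balanced split
    team_a, team_b = split_into_teams([5, 5, 3, 3])
    assert sorted(team_a + team_b) == [3, 3, 5, 5]

def test_edge_cases():
    # Odd number of players: team sizes may differ by at most one
    team_a, team_b = split_into_teams([4, 7, 9])
    assert abs(len(team_a) - len(team_b)) == 1

    # Empty input: return two empty teams rather than raising
    assert split_into_teams([]) == ([], [])

    # Single player: one team of one, one empty team
    team_a, team_b = split_into_teams([8])
    assert len(team_a) + len(team_b) == 1
```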

Coming up with Unique Problems 

AI is great at solving well-known, generic problems. However, it still struggles when faced with unique challenges that demand creative and innovative solutions.  

But why is that? 

While AI tools like ChatGPT can generate novel combinations of existing ideas, truly innovative or revolutionary concepts are currently beyond their capabilities. So, when AI is given a problem that doesn’t closely match something in its training data, it can struggle. 

Key takeaway: Creating custom problems can help you test the inventive capabilities of the candidate, as AI is not currently capable of creative or inventive thought in the same way humans are. 

Limiting Dependency on Open-Source Libraries 

Currently, AI systems are not adept at identifying and utilizing open-source libraries.  

When faced with a problem that could be solved using an existing library, AI would choose to write code from scratch.  

Additionally, even if these models were to leverage open-source code, they would not know about the latest open-source content out there. That's because their knowledge is frozen at a training cutoff: at the time of writing, both ChatGPT 3.5 and GPT-4 have knowledge cutoffs in 2021 (September 2021 for GPT-4).

Key takeaway: Using tests that clearly instruct candidates to use an open-source library is a great way to reduce their reliance on AI to solve the coding challenge.  
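To illustrate, here is a small hypothetical example of how such an instruction shifts the work. The task wording, function, and data are assumptions made for this sketch; the point is that the candidate has to know and apply an existing library (networkx here) rather than let an AI generate a from-scratch implementation.

```python
# Hypothetical task: "Using the networkx library, return the cheapest route
# between two cities." Requiring the library turns the exercise into applying
# a known API instead of re-implementing Dijkstra's algorithm.
import networkx as nx

def cheapest_route(edges, start, end):
    """edges: iterable of (city_a, city_b, cost) tuples."""
    graph = nx.Graph()
    graph.add_weighted_edges_from(edges)
    # networkx ships shortest-path algorithms out of the box
    return nx.shortest_path(graph, source=start, target=end, weight="weight")

# Example usage
routes = [("A", "B", 4), ("B", "C", 1), ("A", "C", 10)]
print(cheapest_route(routes, "A", "C"))  # ['A', 'B', 'C']
```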

Considerations to Keep in Mind About Using AI in Coding Tests

Even though AI faces these challenges, it can still be used as a coding tool. Detecting its use may be difficult, but you should be well-informed about how it can affect test scores.

Here are a few things to take into consideration:  

Variations in AI-Generated Code 

AI’s ability to generate diverse variations of code means it could be hard to detect its use when grading coding assessments.  

However, recruiters can turn certain characteristics of AI-generated code to their advantage when building a grading system. Code produced by AI usually shows higher similarity across submissions than code written independently by humans. Candidates could instruct the AI to generate several distinct versions, but most developers are unlikely to do so.

As a result, AI-assisted submissions will probably score high on similarity checks. From an employer's perspective, this can serve as a marker of AI use and be viewed negatively and penalized.
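As a rough illustration, here is a minimal similarity check using only Python's standard library. Real grading pipelines typically rely on dedicated plagiarism-detection tooling; the function name and threshold below are assumptions made just to show the idea of flagging unusually similar pairs of submissions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_similar_submissions(submissions, threshold=0.85):
    """submissions: dict mapping candidate name to source code string.

    Returns (name, name, similarity) tuples for suspiciously similar pairs.
    """
    flagged = []
    for (name_a, code_a), (name_b, code_b) in combinations(submissions.items(), 2):
        ratio = SequenceMatcher(None, code_a, code_b).ratio()
        if ratio >= threshold:
            flagged.append((name_a, name_b, round(ratio, 2)))
    return flagged
```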

Adapting to the World of AI 

Given the sophistication of current AI models and their continually evolving capabilities, detecting their involvement in coding tests can be tricky.  

However, recruiters can do a few things to make their screening process more effective. For example, they can watch for indicators of possible AI use, such as an unusually quick submission time, and put flagged candidates through additional evaluations.

But while you can adapt your grading system and add more steps to your evaluation process to assess whether candidates are using AI to pass coding tests, the question remains: Can AI models like ChatGPT actually solve complex coding tests?  

That's why we decided to experiment with our own tests, putting them up against GPT-4, to further evaluate the strategies we've been developing and refining.

So, here’s what happened when we applied theory into practice. 

The Final Verdict: We Ran Our Tests Through GPT-4, and This Is What Happened

In a quest to understand AI's capabilities, we ran all of Codeaid's coding challenges through ChatGPT.

What happened exactly? 

ChatGPT managed to score 100% on our simplest and shortest challenge – the Shape Data Model test, which we have since pulled from our platform. This only proved our point that short coding tests are generally not AI-proof and are not effective for candidate screening.

Then we ran our more complex challenges through ChatGPT. Its scores ranged from 20% to 50%, and even that was achieved only after many iterations (on the first attempt, ChatGPT usually scored 2-3%).

For example, getting to 50% on our Player Team Generator challenge took about two hours. However, it's important to note that this timeframe was achieved by a senior developer who knows the challenge very well and how to effectively instruct the AI tool. Someone new to the challenge would likely take three times as long and, without a firm grasp of the task's requirements, may struggle to make effective use of the code generated by GPT-4.

Where Does ChatGPT Fail? 

While it is possible to get higher scores by iterating with the AI on improvements and corrections, it still tends to fail. In our Player Team Generator challenge, for example, it fell short in the following areas:

  • Recognizing unwritten requirements that are general best practices: ChatGPT implemented the endpoint handling properly but only returned the HTTP status codes explicitly requested in the challenge description. It responded with just 200 and 400, omitting the other relevant error codes, and used a generic error message instead of a descriptive one.
  • Recognizing edge cases and special conditions: The challenge expects a player to be returned even when the requested skills do not perfectly match what is available – for example, if a requested skill is missing from every player, the system should still return the player with the best available skill. This is a hard case to work around by relying solely on ChatGPT, because you need to understand and essentially solve the problem yourself first; only then can you use ChatGPT to facilitate writing the code to your instructions, rather than relying on it to come up with the solution from scratch. (A sketch of both behaviors follows this list.)
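To make the two gaps concrete, here is a hedged sketch (not Codeaid's reference solution – the route, data, and field names are assumptions) of what handling them might look like: descriptive HTTP errors beyond 200/400, and a best-match fallback when no player has the exact requested skill.

```python
# Hypothetical sketch of the two behaviors the AI-generated code missed,
# using FastAPI for illustration.
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Assumed in-memory data, for illustration only
PLAYERS = [
    {"name": "Alice", "position": "defender", "skills": {"defense": 90, "speed": 60}},
    {"name": "Bob", "position": "defender", "skills": {"speed": 80}},
]

@app.get("/team/best-player")
def best_player(position: str, skill: str):
    candidates = [p for p in PLAYERS if p["position"] == position]
    if not candidates:
        # Unwritten best practice: a 404 with a descriptive message,
        # not a generic 400
        raise HTTPException(
            status_code=404,
            detail=f"No players found for position '{position}'",
        )

    exact = [p for p in candidates if skill in p["skills"]]
    if exact:
        return max(exact, key=lambda p: p["skills"][skill])

    # Edge case: the requested skill is missing everywhere, so fall back to
    # the player whose strongest skill is highest instead of failing
    return max(candidates, key=lambda p: max(p["skills"].values()))
```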

But even with a 20% to 50% success rate, a question remains: Is it GPT-4 passing the test, or the human operator who knows precisely what to ask for and how to refine the generated code? The line between AI as a tool and the human's role in guiding it becomes blurred.

So, where does this leave us?  

Our experience affirms the initial notion that constructing complex problem-solving tasks with multiple edge cases is a viable strategy for crafting GPT-resistant coding challenges.  

Until evidence to the contrary surfaces, this remains the best way to guard against AI assistance and ensure a fair evaluation of candidates’ coding skills. 

AI-Proof Coding Tests: The Takeaway 

AI’s role in coding tests is complex and constantly evolving. With the potential to solve intricate problems but also some major weaknesses, AI pushes us to construct more effective coding test assessments.  

Understanding AI's limitations helps us create more solid AI-proof tests and get a true picture of every candidate's real abilities. At Codeaid, we embrace the challenge, continually adapt, and strive to shape the future of coding tests.

If you are looking for top-tier coding assessment platforms, trust Codeaid. Our latest AI Interviewer tool ensures you can confidently recruit the finest developers.
