Constructing AI-Proof Coding Tests: All You Need to Know 

by Codeaid Team

AI is reshaping the landscape across all industries, and the tech sector is no exception. In fact, about 92% of US-based developers already use AI-based tools for writing and debugging code.

Do AI-Proof Coding Tests Exist?

But when it comes to software development recruiting, the ultimate challenge is applicants’ increased use of AI during coding assessments. It makes it increasingly hard to tell how much of the test was solved by the applicant and how much by AI. 

In that case, how do you gauge a candidate’s true capabilities? Is there such a thing as an AI-proof coding test? Or do we need to find a way to assess coding skills even when candidates use AI? 

In this article, we answer these and other questions based on key trends and our own experience striving to create authentic AI-proof tests. To provide in-depth insights, we also put some of our developers to the test and challenged them to use AI to solve our coding challenges. 

Keep reading to find out what happened. 

Understanding AI’s Weaknesses in Coding  

One thing we know for sure: AI thrives on precision.  

Give it clear, single-step or narrowly scoped multi-step tasks, and it performs admirably. However, when complexity and creativity come into play, AI falls short. 

So far, we have noticed that AI tools fail in a few use cases. For example, they struggle to: 

  • Solve complex, multi-step problems: AI still struggles to follow complex logic when a task involves multiple parameters and dependent steps. 
  • Reuse code: When an open-source library could solve the problem, AI tends to write the code from scratch instead of using it. It may still solve the task, but it generates far more code in the process, which is less efficient. 
  • Interpret images: ChatGPT and other AI tools are still not sophisticated enough to understand image input. However, some creative workarounds for this obstacle have already started emerging. 
  • Produce code in volume: For example, ChatGPT 3.5 has a length limitation of around 3,000 words, and that includes the initial input provided to the tool, so there is little room left for generated code. 
  • Conceive novel solutions: If a coding challenge requires developers to create a new solution rather than apply a well-known existing one, AI will most probably struggle to invent something new. AI draws on patterns in its training data to answer questions and solve problems, so when presented with something outside that data, it can produce faulty or incomplete output. 
  • Efficiently handle edge cases: AI often fails to cover edge cases that have special requirements or are not spelled out explicitly. Solving such problems usually takes a degree of creativity and an out-of-the-box mindset that AI still doesn’t possess. 

Strategies to Construct AI-Proof Tests 

Crafting AI-proof tests is no simple task. But with the right strategies, turning AI’s weaknesses to your advantage, constructing coding assessments that stand strong against AI’s problem-solving abilities is possible. 

Let’s delve deeper into these strategies: 

Creating Longer Multi-Step Tests 

One effective method is the introduction of lengthier, multifaceted tests that demand a sequence of logical steps to solve.  

AI tools, such as ChatGPT or Bard, thrive when given precise, single-step tasks but struggle as the complexity and number of steps increase. Based on our observations, the AI’s success rate drops significantly with each additional logical or functional step. 

To make things even harder for AI, you can utilize lengthier tests with visual input.   

Key takeaway: Constructing longer tests that require a series of interrelated problem-solving steps and integrating visuals can make AI less effective in solving them. 
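
To make this concrete, here is a skeleton of the kind of chained, dependent steps we mean (a hypothetical task outline, not one of our actual challenges). Each step consumes the previous step’s output, so an AI tool has to hold the entire pipeline in mind rather than answer a single prompt:

```python
# Hypothetical multi-step task outline: every step depends on the one before it.

def load_transactions(path: str) -> list[dict]:
    """Step 1: parse a CSV of transactions into records (date, account, amount)."""
    ...

def group_by_account(transactions: list[dict]) -> dict[str, list[dict]]:
    """Step 2: group the records from Step 1 by account ID."""
    ...

def flag_anomalies(grouped: dict[str, list[dict]]) -> dict[str, list[dict]]:
    """Step 3: within each account from Step 2, flag amounts that deviate
    strongly from that account's average."""
    ...

def render_report(flagged: dict[str, list[dict]]) -> str:
    """Step 4: render a plain-text report of the transactions flagged in
    Step 3, sorted by date."""
    ...
```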

Using Edge Cases 

A good developer should be able to handle edge cases – scenarios that occur outside the typical operation of the software. AI often falls short in this aspect. It might efficiently solve the “happy flow” (the simplest possible scenario) but falter when edge case handling is expected.  
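To illustrate with a toy example (hypothetical, not taken from our actual tests): a function that splits a bill among team members is trivial in the happy flow, but the edge cases are exactly where first-draft AI code tends to break:

```python
def split_bill(total_cents: int, people: int) -> list[int]:
    """Split a bill (in cents) across a number of people.

    Happy flow: split_bill(9000, 3) -> [3000, 3000, 3000].
    Edge cases a first AI draft typically misses:
      - people == 0 must raise an error, not divide by zero
      - negative totals (refunds) must still sum exactly to total_cents
      - remainders: split_bill(100, 3) -> [34, 33, 33], losing no cents
    """
    if people <= 0:
        raise ValueError("people must be a positive integer")
    base, remainder = divmod(total_cents, people)
    # divmod keeps 0 <= remainder < people even for negative totals,
    # so distributing the remainder one cent at a time stays exact.
    return [base + 1 if i < remainder else base for i in range(people)]
```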

Key takeaway: By integrating a significant number of edge cases in your tests, you make it harder for developers to leverage AI to their advantage during coding tests. Such cases also help showcase how well developers grasp coding problems and what their problem-solving approach is. 

Coming up with Unique Problems 

AI is great at solving well-known, generic problems. However, it still struggles when faced with unique challenges that demand creative and innovative solutions.  

But why is that? 

While AI tools like ChatGPT can generate novel combinations of existing ideas, truly innovative or revolutionary concepts are currently beyond their capabilities. So, when AI is given a problem that doesn’t closely match something in its training data, it can struggle. 

Key takeaway: Creating custom problems can help you test the inventive capabilities of the candidate, as AI is not currently capable of creative or inventive thought in the same way humans are. 

Limiting Dependency on Open-Source Libraries 

Currently, AI systems are not adept at identifying and utilizing open-source libraries.  

When faced with a problem that could be solved using an existing library, AI would choose to write code from scratch.  

Additionally, even if these models were to leverage open-source code, they would not know about the latest releases out there. Their training data has a cutoff date: at the time of writing, both GPT-3.5 and GPT-4 were trained on data only up to September 2021. 

Key takeaway: Using tests that clearly instruct candidates to use an open-source library is a great way to reduce their reliance on AI to solve the coding challenge.  
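
As a hypothetical illustration of such an instruction: the task below mandates the open-source python-dateutil library (our choice here; any well-known library works). A human can satisfy the requirement in a couple of lines, while an AI draft often ignores it and reinvents the parsing logic:

```python
# Hypothetical task: "Parse user-supplied timestamps in arbitrary common
# formats. You MUST use the open-source python-dateutil library."
from datetime import datetime

from dateutil import parser  # pip install python-dateutil

def parse_timestamp(raw: str) -> datetime:
    # dateutil handles "2023-05-01", "May 1st 2023", "01/05/2023 14:30", etc.
    # An AI draft tends to hand-roll chains of strptime() calls instead.
    return parser.parse(raw)
```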

Considerations to Keep in Mind About Using AI in Coding Tests 

Even though AI faces some challenges, it still has real potential as a coding tool. And while detecting its use may be difficult, you should be well-informed about how it can affect test scores. 

Here are a few things to take into consideration:  

Variations in AI-Generated Code 

AI’s ability to generate diverse variations of code means its use can be hard to detect when grading coding assessments. 

However, when building an AI-proof grading system, recruiters can leverage certain characteristics of AI-generated code to their advantage. AI-produced code usually shows much higher similarity across submissions than code written independently by humans. A candidate could instruct the AI to generate several distinct versions, but most are unlikely to bother. From an employer’s perspective, a high similarity score can therefore serve as a marker of AI use and be viewed negatively and penalized. 
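
As a rough illustration of what a similarity score measures (a toy check built on Python’s standard difflib, not Codeaid’s actual grading pipeline):

```python
import difflib

def similarity(code_a: str, code_b: str) -> float:
    """Return a 0..1 ratio of how alike two code submissions are."""
    return difflib.SequenceMatcher(None, code_a, code_b).ratio()

# Two AI-generated answers to the same prompt tend to score noticeably
# higher against each other than two independently written human solutions.
submission_1 = "def add(a, b):\n    return a + b\n"
submission_2 = "def add(x, y):\n    return x + y\n"
print(f"similarity: {similarity(submission_1, submission_2):.2f}")
```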

Adapting to the World of AI 

Given the sophistication of current AI models and their continually evolving capabilities, detecting their involvement in coding tests can be tricky.  

However, recruiters can do a few things to make their screening process more effective. For example, they can look out for indicators of AI use, such as an unusually quick submission time, and put flagged candidates through additional evaluations. 

But while you can adapt your grading system and add more steps to your evaluation process to assess whether candidates are using AI to pass coding tests, the question remains: Can AI models like ChatGPT actually solve complex coding tests?  

That’s why we decided to experiment with our own tests, putting them up against GPT-4 to further evaluate the strategies we’ve been developing and refining. 

So, here’s what happened when we put theory into practice. 

The Final Verdict: We Ran Our Tests Through GPT-4, and This Is What Happened 

In a quest to understand AI’s capabilities, we ran all of Codeaid’s coding challenges through GPT-4. 

What happened exactly? 

ChatGPT managed to score 100% on our simplest and shortest challenge – the Shape Data Model test, which we have since pulled from our platform. This only proved our point that short coding tests are generally not AI-proof and are not effective in the candidate screening process. 

Then we ran our more complex challenges through ChatGPT. The scores it got varied between 20% and 50%, and even that was achieved only after many iterations (on the first attempt, ChatGPT usually scored 2-3%). 

For example, getting to 50% on our Player Team Generator challenge took about 2 hours. However, it’s important to note that this timeframe was achieved by a senior developer who knows the challenge very well and how to instruct the AI tool effectively. Someone new to the challenge would likely take three times as long and, without a firm grasp of the task’s requirements, may struggle to make effective use of the code GPT-4 generates. 

Where Does ChatGPT Fail? 

While it is possible to reach higher scores by iterating with the AI on improvements and corrections, it still tends to fail. For example, in our Player Team Generator challenge, it fell short in the following areas: 

  • Recognizing unwritten requirements that are general best practices: ChatGPT implemented the endpoint handling properly but returned only the HTTP status codes explicitly requested in the challenge description, 200 and 400, omitting the other error codes a well-built API should return. It also used a generic error message instead of a meaningful one. 
  • Recognizing edge cases and special conditions: In our Player Team Generator challenge, the system should still return players when the requested skills don’t perfectly match what’s available. For example, if you request a player with a specific skill that no player has, the system should fall back to the player with the best available skill (see the sketch after this list). This is hard to work around relying solely on ChatGPT, because you essentially need to understand and solve the problem yourself first. Only then can you use ChatGPT to facilitate writing the code based on your instructions, as opposed to relying on it to come up with the solution from scratch. 
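
Below is a minimal sketch of that fallback behavior (the data model and names are illustrative; this is not the actual Player Team Generator code):

```python
# Illustrative data model: each player has a position and skill ratings.
PLAYERS = [
    {"name": "Ana",  "position": "defender",   "skills": {"speed": 90, "strength": 60}},
    {"name": "Bo",   "position": "midfielder", "skills": {"speed": 70, "stamina": 85}},
    {"name": "Cleo", "position": "defender",   "skills": {"strength": 95}},
]

def pick_player(position: str, skill: str) -> dict:
    """Return the player in `position` with the highest `skill` rating.

    The edge case that tripped up ChatGPT: if no player in that position
    has the requested skill at all, don't fail -- fall back to the player
    whose best skill (whatever it is) rates highest.
    """
    candidates = [p for p in PLAYERS if p["position"] == position]
    if not candidates:
        raise LookupError(f"no players for position {position!r}")
    with_skill = [p for p in candidates if skill in p["skills"]]
    if with_skill:
        return max(with_skill, key=lambda p: p["skills"][skill])
    # Fallback: rank candidates by their single best skill rating.
    return max(candidates, key=lambda p: max(p["skills"].values()))

# pick_player("defender", "stamina") -> Cleo: no defender has stamina, and
# Cleo's best skill (strength, 95) is the highest fallback rating.
```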

But even with a 20% to 50% success rate, a question arises: is it GPT-4 passing the test, or is it the human operator who knows precisely what to ask for and how to refine the generated code? The line between AI as a tool and the human’s role in guiding it becomes blurred. 

So, where does this leave us?  

Our experience affirms the initial notion that constructing complex problem-solving tasks with multiple edge cases is a viable strategy for crafting GPT-resistant coding challenges.  

Until evidence to the contrary surfaces, this remains the best way to guard against AI assistance and ensure a fair evaluation of candidates’ coding skills. 

AI-Proof Coding Tests: The Takeaway 

AI’s role in coding tests is complex and constantly evolving. With the potential to solve intricate problems but also some major weaknesses, AI pushes us to construct more effective coding assessments. 

Understanding AI’s limitations helps us create more robust AI-proof tests and surface the real abilities of all candidates. At Codeaid, we embrace the challenge, continually adapt, and strive to shape the future of coding tests. 
