Test flakiness is losing you more hours than you realize

Among the many things that can frustrate a software tester (and there are many, given the complexity of current QA operations), few are as hair-pulling as flaky tests.

Even if you haven’t come across the term, you have almost certainly faced the agony of dealing with flaky tests if you are a tester.  

But what exactly are flaky tests, and why are they such a drain on testers’ time, effort, and resources?

What is a flaky test?

Essentially, a flaky test produces a non-deterministic outcome. You run the same test, with the same code, multiple times in the same environment – and it sometimes passes and sometimes fails.

Sounds a bit ridiculous, doesn’t it? The same test throws up different results when run multiple times in the same environment. You cannot conclude if the test has passed or failed. 

The Exasperating Impact of Flaky Tests

Obviously, flaky tests waste time by triggering build failures. If you cannot declare a test passed or failed, you cannot move forward or backward with that code. This brings the dev pipeline to a screeching halt. 

These scenarios often leave engineers rerunning tests to try to identify a cause. But the time spent hunting for the fault in a single test holds all CI changes hostage and needlessly delays software delivery.

Flaky tests are difficult to debug because they are difficult to reproduce. Additionally, these failures don’t always indicate the presence of a bug. But they can’t be ignored, which leads to wasting dev time. 

On top of that, if your pipeline runs regression tests before each commit, a flaky test leads to further delay, as its failures can appear to be related to the commits even when they are not.

Let’s look at the effect of flaky tests on a macro level with an example.

The success of a test suite depends on all the tests running within it. Even the most positive pass rates may be misleading, given the presence of flaky tests. 

Let’s say your test suite comprises 300 tests, each with a fail rate of 0.5%. The pass rate for each test is 99.5%. While this may sound great, the suite passes only when every test passes. Assuming the tests fail independently of one another, you get the suite pass rate by raising the per-test pass rate to the power of the total number of tests. Therefore the test suite pass rate in this case is:

(99.5%)^300 ≈ 22.23%

Now, we see that the complete test suite is passing less than 25% of the time. When you use the same formula with a test suite comprising 400 tests, the pass rate drops to 13.5%. A suite with 500 tests pushes the pass rate down to 8%. 

Next, imagine that devs & engineers have worked to improve the pass rate of each test to 99.7%. Using the same formula, the pass rate of the test suite with 300 tests is:

(99.7%)^300 ≈ 40.6%

If the suite pass rate nearly doubles with a meager 0.2 percentage point improvement per test, there is good reason to fight against a failure rate of even 0.5%. In particular, Agile and DevOps teams struggle with scalability hurdles and blockers in automation test cycles when dealing with such fractional percentages.
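The arithmetic above is easy to check with a few lines of Python. This is a minimal sketch: the function name suite_pass_rate is made up for illustration, and it assumes, as the example does, that tests fail independently of one another.

```python
def suite_pass_rate(per_test_pass_rate: float, num_tests: int) -> float:
    """Chance that every test in the suite passes in a single run.

    Assumes each test passes or fails independently of the others.
    """
    return per_test_pass_rate ** num_tests


# Per-test pass rates and suite sizes taken from the example in the text.
for rate, count in [(0.995, 300), (0.995, 400), (0.995, 500), (0.997, 300)]:
    print(f"{rate:.1%} per test, {count} tests: "
          f"{suite_pass_rate(rate, count):.1%} suite pass rate")

# 99.5% per test, 300 tests: 22.2% suite pass rate
# 99.5% per test, 400 tests: 13.5% suite pass rate
# 99.5% per test, 500 tests: 8.2% suite pass rate
# 99.7% per test, 300 tests: 40.6% suite pass rate
```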

While not every test will pass 100% of the time (obviously), teams must do as much as possible to reduce failure rates. If the example above is anything to go by, a change of a fraction of a percentage point will boost results significantly, improving scalability and pipeline speed and reducing maintenance costs.

Reduced Confidence in Test Suites

A major consequence of too many flaky tests is the extent to which testers, devs, and engineers lose confidence in the test suite itself. This attacks team culture and corrodes faith in the workflow itself. 

The problem with flaky tests is that they mess with one of the core aspects of automated tests: determinism. If your code and test conditions have not changed, your test results should not change either. If that does occur, it cancels out the benefits of CI/CD (which hinges on automation) and causes the team to lose trust in their codebase and/or toolset. 

When a flaky test shows up, engineers often try to bypass the issue by rerunning it until it passes. In certain teams and organizations, such a rerun is triggered automatically. 

Another option is to execute retests at the build level. This may, again, unblock the CI pipeline by trying to resolve newly emerging flaky tests. But this approach requires meticulous tracking and immediate fixes of flagged tests – translating to extra effort. 

The retry approach raises two hard questions, especially under short deadlines:

  • How many times must the test fail, in consecutive retries, before you declare it an actual failure?
  • After enough tests exhibit flakiness, how confident are you that the test suite has been optimally written in the first place?
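One way to make the first question concrete is to cap the number of retries and record how a test got to green, so that a pass on a retry is reported as flaky rather than silently counted as a pass. The sketch below is only an illustration: run_with_retries, max_attempts, and sometimes_fails are hypothetical names, and the cap of three attempts is an arbitrary assumption, not a recommendation.

```python
from typing import Callable


def run_with_retries(test: Callable[[], None], max_attempts: int = 3) -> str:
    """Run `test` up to `max_attempts` times and classify the outcome.

    Returns "passed" if the first attempt succeeds, "flaky" if a later
    attempt succeeds, and "failed" if every attempt raises AssertionError.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            test()
        except AssertionError:
            continue  # try again until the attempts run out
        return "passed" if attempt == 1 else "flaky"
    return "failed"


# A stand-in for a flaky test: it fails on its first call only.
calls = {"count": 0}


def sometimes_fails() -> None:
    calls["count"] += 1
    assert calls["count"] >= 2


print(run_with_retries(sometimes_fails))  # flaky
```

Reporting "flaky" separately keeps the pipeline unblocked while still leaving a record of which tests need attention.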

Other times, when devs realize they are dealing with a flaky test, they temporarily disable it and perhaps file an issue to remedy it in the future. While this unblocks the pipeline, you end up with reduced test coverage, which has consequences, especially in the long term.  

Additionally, how much time do busy QA professionals have to invest in fixing flaky tests, especially those they did not write? Often, what you get with this approach is an intimidating backlog that no one has the time, effort, or inclination to work on. 

Common Causes of Flaky Tests

Async wait

Often, a test must wait for another action to complete before it can proceed. For example, a test that verifies the functionality of a button must wait for the button to load fully before clicking it to check how it works.

This waiting is often implemented through sleep commands in automated tests. However, such sleep statements are imprecise. If a test waits (“sleeps”) for 20 seconds, it may pass in one environment (for example, when a solid internet connection loads the page fast enough). In other circumstances (like a weak connection), the page may take longer than that to load, which would cause the test to fail.

Flakiness in these cases can be resolved by replacing sleep commands with wait commands. With a wait command, the test waits for a particular condition to become true (like the button loading). It checks for the condition periodically and gives up only when a specified timeout expires.

Additionally, WaitFor statements allow much higher timeout values to be set because, unlike sleep statements, they don’t make the test wait out the entire specified time. For example, if the condition becomes true in 5 seconds, the test will not wait the full 20 seconds to proceed. It will proceed the moment the button shows up and click it.
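The difference between sleeping and waiting on a condition can be sketched in plain Python. Browser automation tools ship their own explicit waits. The wait_for helper below is a library-independent illustration, and its name, its parameters, and the simulated button_is_loaded check are all assumptions made for this example.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool],
             timeout: float = 20.0,
             interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True as soon as the condition holds, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)  # short pause between checks
    return condition()  # one final check at the deadline


# Simulated page: the "button" becomes available after 0.3 seconds.
ready_at = time.monotonic() + 0.3


def button_is_loaded() -> bool:
    return time.monotonic() >= ready_at


start = time.monotonic()
loaded = wait_for(button_is_loaded, timeout=20.0)
elapsed = time.monotonic() - start
print(f"loaded={loaded}, proceeded after {elapsed:.1f}s instead of 20s")
```

Because the helper returns as soon as the condition holds, a generous timeout costs nothing on a fast connection and only comes into play on a slow one.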

Insufficiently Isolated Tests

If tests are being run in parallel, it is common for each test not to get the resources it needs to run effectively. Resource contention can skew tests towards failure, even though they would pass seamlessly in ideal conditions. 

Even if tests are executed serially, one test in the pipeline could change the system state, leaving subsequent tests to run with compromised or altered resources.

Ideally, each test should be isolated and responsible for setting up the system state it needs to run. It should also undo all of its changes after execution so that other tests have a pristine environment to work with.
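As a rough illustration of this set-up-and-clean-up discipline, here is a sketch using Python's built-in unittest module. The test class, the file name, and the use of a temporary directory as the "system state" are assumptions chosen for the example. In a real suite the shared resource might be a database or a service.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


class ReportFileTest(unittest.TestCase):
    def setUp(self) -> None:
        # Each test creates its own scratch directory before it runs.
        self.workdir = Path(tempfile.mkdtemp())

    def tearDown(self) -> None:
        # Each test removes everything it created, pass or fail.
        shutil.rmtree(self.workdir)

    def test_write_report(self) -> None:
        report = self.workdir / "report.txt"
        report.write_text("ok")
        self.assertEqual(report.read_text(), "ok")

    def test_directory_starts_empty(self) -> None:
        # Holds in any execution order because no state is shared.
        self.assertEqual(list(self.workdir.iterdir()), [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReportFileTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Neither test can see files left behind by the other, so the outcome does not depend on which one runs first.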

Unfortunately, in practice, devs often write tests based on certain (perhaps incorrect) assumptions about the shared system (database, memory, files, etc.). Such assumptions must be avoided entirely, or at least validated before tests are built on them.

Poor Test Data

Seamless test execution requires accurate and easily integrable test data. Independent tests should have adequate access to independent datasets, with the latter stored in isolation. 

If a test is dependent on data from another test, then said data must be protected from becoming corrupted during each test run. This is required to ensure that the next test also runs as expected. Test data must remain legitimate throughout the pipeline.
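One simple way to keep test data from being corrupted between tests is to hand each test its own copy of a baseline dataset. The sketch below assumes a small in-memory dataset. BASE_USERS, fresh_users, and deactivate_all are hypothetical names used only for illustration.

```python
import copy
from typing import Any, Dict, List

# Baseline dataset that no test is allowed to modify directly.
BASE_USERS: List[Dict[str, Any]] = [
    {"name": "ada", "active": True},
    {"name": "bob", "active": False},
]


def fresh_users() -> List[Dict[str, Any]]:
    """Return an independent deep copy that a test can mutate freely."""
    return copy.deepcopy(BASE_USERS)


def deactivate_all(users: List[Dict[str, Any]]) -> None:
    """Example operation under test: it mutates the data it is given."""
    for user in users:
        user["active"] = False


first = fresh_users()
deactivate_all(first)      # the first "test" changes its own copy

second = fresh_users()     # the next "test" still sees the baseline
print(second[0]["active"])  # True
```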

Inadequate Number or Variety of Test Environments

Given the wide variety of hardware and software specifications users can choose from when buying desktops or mobile devices, any test will have to be validated in multiple environments to ensure product validity in the real world. 

However, the term “test environment” doesn’t just refer to the UI end-users interact with. It also includes services, specific features/aspects of the software each test interacts with, network conditions, and third-party APIs required for your app to get the job done. 

Without providing for all these variables, tests will deliver inconclusive results. 

The best way to remedy this is to utilize a testing platform with in-built access to real browsers, devices, and operating systems for testing purposes. Use platforms like Testsigma, an open-source testing ecosystem that allows you to execute tests in your local browser/device or run them across 800+ browsers and 2000+ devices on its cloud-hosted test lab.

Closing Notes

Removing flaky tests from your builds is essential for a clean pipeline that doesn’t get blocked unnecessarily. Since writing tests is easier than maintaining them, start with reverse-engineering the causes outlined above. Do as much as possible to prevent these issues from poisoning your test code, and you should see a significant reduction in manifestations of flakiness. 
