Debugging Intermittent Test Failures

What are Intermittents (aka Oranges)?

Intermittents are test failures which happen intermittently, in a seemingly random way. Often you’ll write a test that passes fine locally on your computer, but when ran thousands of times on various CI environments (some of them under heavy load) it may start to fail randomly.

Intermittents are also known as Oranges, because the corresponding test jobs are rendered orange on treeherder.

These intermittent failures are tracked in Bugzilla. When a test starts being intermittent a bug is filed in Bugzilla (usually by a Mozilla code sheriff).

Once the bug exists for a given test failure, all further similar failures of that test will be reported as comments within that bug. These reports are usually posted weekly and look like this:

5 failures in 2740 pushes (0.002 failures/push) were associated with this bug in the last 7 days.

See an example here.

Sometimes, tests start failing more frequently and these reports are then posted daily.

To help with the (unfortunately) ever-growing list of intermittents, the Stockwell project was initiated a while ago (read more about the goals of that project on their wiki).

This project defines a scenario where very frequently failing tests get disabled. Ideally, we should try to avoid this, because this means reducing our test coverage, but sometimes we do not have time to investigate the failure, and disabling it is the only remaining option.

Finding Intermittents

You will have no trouble finding out that a particular test is intermittent, because a bug for it will be filed and you will see it in Bugzilla (watching the Bugzilla component of your choice is a good way to avoid missing the failure reports).

However, it can still be useful to see intermittents in context. The Intermittent Failures View on Treeherder shows intermittents ranked by frequency.

You can also see intermittents in Bugzilla. Go to the settings page and enable “When viewing a bug, show its corresponding Orange Factor page”.

Reproducing Test Failures locally

The first step to fix an intermittent is to reproduce it.

Sometimes reproducing the failure can only be done in automation, but it’s worth trying locally, because this makes it much simpler to debug.

First, try running the test in isolation. You can use the --repeat and --run-until-failure flags to mach mochitest to automate this a bit. It’s nice to do this sort of thing in headless mode (--headless) or in a VM (or using Xnest on Linux) to avoid locking up your machine.

Sometimes, though, a test will only fail if it is run in conjunction with one or more other tests. You can use the --start-at and --end-at flags with mach mochitest to run a group of tests together.

For some jobs, but not all, you can get an interactive shell from TaskCluster.

There’s also a handy page of e10s test debugging tips that is worth a read.

Because intermittents are often caused by race conditions, it’s sometimes useful to enable Chaos Mode. This changes timings and event orderings a bit. The simplest way to do this is to enable it in a specific test, by calling SimpleTest.testInChaosMode. You can also set the MOZ_CHAOSMODE environment variable (use 0 to enable all chaos, or 0xfb for everything except Timer Chaos), or even edit mfbt/ChaosMode.cpp directly.

Some tests leak intermittently. Use ac_add_options --enable-logrefcnt in your mozconfig to potentially find them.

The rr tool has its own chaos mode. This can also sometimes reproduce a failure that isn’t ordinarily reproducible. While it’s difficult to debug JS bugs using rr, often if you can reliably reproduce the failure you can at least experiment (see below) to attempt a fix.

Use the Firefox Profiler

To use it locally, add the parameter --profiler to your test command:

./mach test <path to test> --profiler

To do it in a try build it’s possible to use the --profiler flag:

./mach try fuzzy --profiler <test directory>

This will automatically upload profiles of failing tests.

Hint

If you want to see a profile of a test that doesn’t fail (eg. to compare it with the profile of a failing run), an easy way to make it happen is to add a failure near the end of the test, eg. ok(false, "force profile upload");.

Some test types do not support running the profiler with a parameter (such as wpt tests), use environment variables to use it still:

MOZ_PROFILER_STARTUP=1 MOZ_PROFILER_SHUTDOWN=$(pwd)/test-profile.json ./mach wpt <test list>
samply load test-profile.json

Samply is used for symbolication. Install it with cargo install samply.

Hint

Remember that MOZ_LOGs are recorded by the profiler as well. Use the MOZ_LOG environment variable to enable them. Also add ,profilerstacks to your log string to capture the stacks for each log. For example:

MOZ_LOG="idleService:5,profilerstacks" MOZ_PROFILER_STARTUP_FILTERS="GeckoMain,timer" ./mach test widget/tests/test_bug343416.xhtml --profiler

Slow down your machine

On some operating systems, it’s possible to reduce the CPU speed. For example when using Linux with Intel processors, it’s possible to set /sys/devices/system/cpu/intel_pstate/max_perf_pct to a small value. The program cpufreq-set can also help.

Another useful approach is to load your machine. This has proven to be very effective in reproducing intermittent failures, but requires thinking about what seems to cause the failure. Using stress-ng (available on most Linux distributions and in brew on macOS), various parts of the system can be stressed:

The CPU, loading all cores to 100% for 60 seconds:

stress-ng --cpu 0 --cpu-method all --timeout 60s

The scheduler, attempting to cause uncommon context switch ordering to reproduce races:

stress-ng --sched 0 --sched-policy all --timeout 60s

The memory quantity, allocating 80% of the available memory for 60 seconds:

stress-ng --vm 4 --vm-bytes 80%

The system allocator, spinning on malloc and free:

stress-ng --malloc 0 --malloc-bytes 50%

The syscall interface, randomly exercising all syscall types:

stress-ng --syscall 0 --syscall-method all

The filesystem, running continuous temp-file I/O:

stress-ng --hdd 1 --hdd-bytes 4G --timeout 60s

The network, creating local TCP socket churn:

stress-ng --sock 0 --sock-domain inet --sock-type stream --timeout 60s

This is readily available on Linux and macOS, and can be installed on Windows using WSL.

That Didn’t Work

If you couldn’t reproduce locally, there are other options.

One useful approach is to add additional logging to the test, then push again. Sometimes log buffering makes the output weird; you can add a call to SimpleTest.requestCompleteLog() to fix this.

You can run a single directory of tests on try using mach try DIR. You can also use the --rebuild flag to retrigger test jobs multiple times; or you can also do this easily from treeherder.

Another possibility is to run the test with video recording enabled, especially when the test is doing things related to focus, graphical widget or screen capture:

./mach try fuzzy --env MOZ_RECORD_TEST=1

This will record a video of the test run, and drop the video file in the Artifact directory, accessible on treeherder in the Artifacts and debugging tools tab.

This is unfortunately only available on Linux and macOS for now (Android and Windows are possible but not tested enough, landed or implemented, consider helping out).

Solving

If a test fails at different places for each failure it might be a timeout. The current mochitest timeout is 45 seconds, so if successful runs of an intermittent are ~40 seconds, it might just be a real timeout. This is particularly true if the failure is most often seen on the slower builds, for example Linux 32 debug. In this case you can either split the test or call requestLongerTimeout somewhere at the beginning of the test (here’s an example).

Sometimes the problem is a race at a specific spot in the test. You can test this theory by adding a short wait to see if the failure goes away, like:

yield new Promise(r => setTimeout(r, 100));

You can use a similar trick to “pause” the test at a certain point. This is useful when debugging locally because it will leave Firefox open and responsive, at the specific spot you’ve chosen. Do this using yield new Promise(r => r);.

You can also binary search the test by either commenting out chunks of it, or hacking in early returns. You can do a bunch of these experiments in parallel without waiting for the first to complete.

For the tests in the devtools codebase, see the waitForTick and waitForTime functions in DevToolsUtils. shared-head.js also has some helpers, like once, to bind to events with additional logging.

Verifying

It’s difficult to verify that an intermittent has truly been fixed. One thing you can do is push to try, and then retrigger the job many times in treeherder. Exactly how many times you should retrigger depends on the frequency of the failure.