
September 19, 2012


Llewellyn Falco

While I agree this is an interesting way of optimizing the testing order for a build, I also think it might be a fascinating visual graph to look at.

If I am understanding correctly, you would have a data structure that says:

File[name] = FailTests[(TestA, 2 times), (TestB, 9 times) ......]

and then there are all kinds of interesting questions, like

'which files are associated with the most test failures'
'which tests most often fail'
'which files fail the most different tests'
'which files are the safest'
'which tests touch the most files'
'which visualizations show the most insight'
'which insights are best shown by which visualizations'

I would expect the data for both the file changes and the test failures would be in the CI history. Is a simple map reduce job enough to collect the data?
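Taking the sketch above literally, the aggregation is simple enough to show in a few lines. This is a hypothetical illustration, not anyone's actual tooling; the build-record field names (`changed_files`, `failed_tests`) are assumptions about what a CI history might expose:

```python
from collections import defaultdict

# Hypothetical CI history: each build record lists the files changed
# and the tests that failed in that build.
builds = [
    {"changed_files": ["parser.py", "lexer.py"], "failed_tests": ["TestA"]},
    {"changed_files": ["parser.py"], "failed_tests": ["TestA", "TestB"]},
    {"changed_files": ["render.py"], "failed_tests": ["TestB"]},
]

# "Map" step: emit (file, test) pairs; "reduce" step: count them,
# giving File[name] -> {test: failure count}.
fail_counts = defaultdict(lambda: defaultdict(int))
for build in builds:
    for f in build["changed_files"]:
        for t in build["failed_tests"]:
            fail_counts[f][t] += 1

print(dict(fail_counts["parser.py"]))  # {'TestA': 2, 'TestB': 1}
```

With the counts in hand, each of the questions above becomes a sort or a sum over this structure.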

Michael Feathers

Llewellyn: I suppose it could be. A timestamp is useful too. Decay functions can be applied for tuning the test set of a build. It is not something I've experimented with, though.
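One common choice of decay function is exponential, parameterized as a half-life: recent failures count for more than old ones. A minimal sketch, assuming a 30-day half-life (the constant is arbitrary and would need tuning):

```python
import math

HALF_LIFE_DAYS = 30  # assumed tuning constant

def decayed_weight(age_days):
    """Weight a past failure by how long ago it happened:
    a failure HALF_LIFE_DAYS old counts half as much as one today."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS)

# Ages (in days) of past failures of one test against one file.
failure_ages = [1, 5, 90]
score = sum(decayed_weight(a) for a in failure_ages)
```

The per-(file, test) counts then become decayed scores, and tests are ordered by score instead of raw count.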


Sounds more like a Bayesian build server. You are estimating the probability that a test will fail given a set of changes, and picking the highest scores.
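To make that framing concrete: with per-file change counts and per-(file, test) failure counts, a naive-independence estimate of P(test fails | these files changed) can rank the tests. This is an illustrative sketch with made-up counts, not a description of any existing build server:

```python
def fail_probability(test, changed_files, fail_counts, change_counts):
    """1 minus the product over changed files of P(test passes | file changed).
    Treating files as independent is the naive-Bayes-style simplification."""
    p_pass = 1.0
    for f in changed_files:
        n = change_counts.get(f, 0)
        k = fail_counts.get((f, test), 0)
        # Laplace smoothing so unseen pairs never give exactly 0 or 1.
        p_pass *= 1 - (k + 1) / (n + 2)
    return 1 - p_pass

# Hypothetical counts gathered from CI history.
change_counts = {"parser.py": 10, "render.py": 4}
fail_counts = {("parser.py", "TestA"): 8, ("render.py", "TestB"): 1}

changed = ["parser.py"]
tests = ["TestA", "TestB"]
ranked = sorted(
    tests,
    key=lambda t: -fail_probability(t, changed, fail_counts, change_counts),
)
print(ranked)  # ['TestA', 'TestB']: TestA fails most often when parser.py changes
```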


In the .NET space, Greg and Svein have been working on a tool called MightyMoose (www.continuoustests.com). It has a test minimizer that analyzes MSIL to detect what tests need running and only runs those. Perhaps this is an idea that can be extended to the build servers to speed things up.

Michael Feathers

edwin: Thank you. That does seem like a better way of describing it.

Ashic: Yes, I've seen people do that with coverage. For what it's worth, I think that is the future. Would be nice to have that technology across languages and platforms.

Terence Tuhinanshu

That's an interesting technique, and in no way are you hinting at this being a complete strategy, only an approximate one. But in my experience, most errors in the file I have changed are reported to me when I compile or demo on my machine to see if it is working correctly. The unit test suite assists me because its tests exercise how calls in other files interact with my changes.

Given this, perhaps we can map out the files in a system that refer to the definition in this file and use that list as a basis for prioritization. This mapping can probably be done in a straightforward way by something as simple as grep.

(This won't include service calls, but then we have Integration Testing for those...)
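The scan really can be that simple. The sketch below stands in for grep with a word-boundary search over an in-memory {filename: text} map; the file names and contents are invented for illustration:

```python
import re

def files_referencing(symbol, sources):
    """Grep-style scan: which files mention this symbol at all?
    A crude textual search, as suggested above -- no parsing involved."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    return [name for name, text in sources.items() if pattern.search(text)]

# Stand-in for a source tree.
sources = {
    "parser.py": "from lexer import tokenize\n",
    "render.py": "def render(tree): ...\n",
}

print(files_referencing("tokenize", sources))  # ['parser.py']
```

Changing a definition in `lexer.py` would then prioritize the tests covering `parser.py`, since it references `tokenize`.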

Stefan Radulian

To me it sounds a bit like having http://www.ncrunch.net/ on a build server - which, IMHO, would be great....

Scott Brickey

This approach seems to have a similar goal to code coverage, just at a lower cost (computing resources)... I think it's a worthwhile effort, but may need to be "just a piece" of the code coverage pie... I am forced to wonder what the implications of refactoring would be for this approach... code coverage handles this by knowing the actual code paths, but with a simple file-based approach, refactoring could cause tests to fade out of focus, or could lose history altogether (depending on implementation).

I think that a good goal would be to merge these efforts with other code coverage efforts, to identify what changes within a file (changes to the signature, so to speak), and then relay that info back to the code coverage map. With this, you can realize (some of) the performance gains from your approach, while more easily coping with changes.


Great idea. While not necessarily trivial, seems like it should be an available option for a more modern build server.

One challenge might be for teams that are not used to (or just haven't put time into) making sure there are no order dependencies in their build, as one build would likely run in a different order than the next (which is the point). It's a good practice anyhow.
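One way to get a different-but-reproducible order per build is to seed the shuffle with the build number, so any order-dependent failure can be replayed exactly. A small sketch (the test names are placeholders):

```python
import random

tests = ["test_a", "test_b", "test_c", "test_d"]

def order_for_build(build_number):
    """Shuffle deterministically per build: same build number,
    same order, so an order-dependent failure can be reproduced."""
    order = tests[:]
    random.Random(build_number).shuffle(order)
    return order

print(order_for_build(7))
```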


This sounds like what Limited Red does - http://www.limited-red.com/

"Failing tests are not randomly distributed, Limited Red helps you fail faster by running the tests most likely to fail first. While also helping visualise your test's history. "

Dave W.

It seems like as written, this would prioritize running big, unstable tests for a lot of code. One of the tests in our test suite that fails most frequently is a big test case that runs close to the test harness limits for time and memory. It fails a lot because it runs out of time or memory, but these failures aren't all that useful for us to study. What is more informative is when it fails for some other reason. But I think this test case would get tagged as a "run early for all kinds of source files" by your methodology.

Michael Feathers

Dave W: Your comment makes me feel like you don't see that test as enough of a problem.

Brian Slesinsky

It seems like this technique would tend to run flaky tests more often, but that isn't a bad thing; flaky tests need more test runs to measure their failure rate. Ideally the test runner should use the test's history of flakiness to decide whether the failure rate has changed enough to be worth reporting as an actual change in behavior.

While I do think fixing flaky tests is often worth doing (they're expensive), it's not necessarily higher priority than other tasks, so the test runner should be smart enough to handle them sensibly.
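One rough way for a runner to decide whether a flaky test's failure rate has actually changed is a two-proportion z-test over historical versus recent runs. This is a sketch with an arbitrary significance threshold, not a claim about how any test runner does it:

```python
import math

def rate_changed(hist_fails, hist_runs, recent_fails, recent_runs,
                 z_threshold=2.0):
    """Rough two-proportion z-test: is the recent failure rate
    significantly different from the historical one?
    The threshold of 2.0 is an assumption, roughly p < 0.05."""
    p1 = hist_fails / hist_runs
    p2 = recent_fails / recent_runs
    pooled = (hist_fails + recent_fails) / (hist_runs + recent_runs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / hist_runs + 1 / recent_runs))
    if se == 0:
        return p1 != p2
    return abs(p2 - p1) / se > z_threshold

# A test that historically fails ~5% of the time, now failing 10 of 20 runs.
print(rate_changed(10, 200, 10, 20))  # True: worth reporting as a real change
```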

Dave W.

@Michael Feathers: In other circumstances, a flaky test like that might be the canary in the coal mine that lets us know that we inadvertently increased run time or memory requirements. And certainly, trying to decrease flakiness over time and get more repeatable results is important and useful. Right now, though, there's more urgent stuff to deal with.

Generally speaking, I think it's more efficient to spend limited development time going after consistent failures in short-running tests rather than inconsistent failures in long-running tests (except when the latter represents an important customer bug, which isn't the case here). You get more bugs fixed more quickly with the former, and you might fix the latter in the process.

This particular test case should be moved into a different test suite of known likely failures, which would probably also address the precog strategy concern. My main point was that if the test cases are non-homogenous in size and run time, as here, you probably don't want to prioritize the frequently-failing big stuff over the less-frequently-failing small stuff.

P.S. Michael, if you are willing to share your email address with me, I can be reached as dave at bluepearlsoftware dot com. I discussed your legacy code book at a conference talk this past June, and I'm interested in discussions about setting up complex data structures for unit testing legacy code. (The kind of legacy code that does a bunch of raw data structure accesses like apple[i].pear->grape[j].kiwi = banana[k].cherry, where there may be a bunch of hidden constraints and dependencies between the data structures that need to be observed for the code to function correctly.) If that's not a topic you have time to discuss, I'd be interested in pointers to other forums where those kinds of discussions might be welcomed.


Cucover tried using the idea of coverage-to-test mapping (just with Cucumber).


An interesting experiment, though it feels like the simple test-file-to-code-file association satisfies most people when deciding what to run. I also found the coverage data was not always 100% accurate, and it often had odd quirks. It requires low-level bindings, which meant there were lots of interoperability issues.

I might fire up this experiment again.
