There's an implicit problem with automated test execution in a build pipeline: the number of tests keeps increasing. If tests are written well (i.e., properly decoupled), this shouldn't be a big problem. Each new test should add only a minute fraction of a second to the build. The sad reality, however, is that not all tests are good and, well, even if they were, the duration of a build is going to keep increasing.
Build performance tuning has become a part of daily life on many development teams, and there are many options. Parallelization can lead to big wins, as can all sorts of environmental optimization, but it is a constant battle against growth in a code base. It pays to revisit our assumptions.
Since the early days of Agile software development, there has been an unspoken assumption that a build should run all of the tests. It makes sense. If automated tests are a standard of expected behavior for a project, running them all gives us a solid, easily articulated statement about our code quality. It also assures us that the whole of our system is working to a particular standard, i.e., we are testing the real thing that will be deployed and we are testing it with our full automated arsenal.
Since then, people have tried a variety of half-way options. One common pattern is to run all unit tests at each build along with a series of "smoke tests" to give developers quick feedback on check-in. Integration tests run later. This is never ideal, but it is a pattern that many teams with large test suites often fall into.
Some organizations have experimented with smarter builds. If you run coverage on your code through your test suite, you can build up a map of "tests that can possibly fail" when particular areas of code are touched. It's computationally intensive, but it is doable if you have plenty of compute power around and you rerun the analysis periodically to update the map.
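In rough form, that coverage map amounts to something like the sketch below (Python, purely illustrative; the run_test_with_coverage callable stands in for whatever coverage tooling you have, and file-level granularity is an assumption):

    from collections import defaultdict

    def build_impact_map(tests, run_test_with_coverage):
        """Map each source file to the tests that execute it.
        run_test_with_coverage: a callable you supply that runs one test under a
        coverage tool and returns the source files it executed."""
        impact = defaultdict(set)
        for test in tests:
            for source_file in run_test_with_coverage(test):
                impact[source_file].add(test)
        return impact

    def tests_that_can_possibly_fail(impact, changed_files):
        """Per the coverage map, only these tests can be affected by the change."""
        selected = set()
        for changed in changed_files:
            selected |= impact.get(changed, set())
        return selected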
There is an alternative, though. We can use prior experience to maximize the speed of feedback that we get from a build. Kent Beck experimented in this area years ago with his JUnitMax project. It took recently failed tests and pushed them forward in the build so that they ran very early. I've been experimenting with a variation of that: build a map from each test failure to the files that were modified in the commit where the failure occurred. Then, on every new commit, take the union of the tests that have ever failed when the files in that commit were touched and the set of all recently introduced tests, and run that set as the build. My theory (which I have not been able to verify yet) is that for many projects this process may converge in such a way that nearly all of the failures that would occur in a full build will also occur in this abbreviated build. If that's the case, it could give developers a strong early sense of "done", allowing any errors discovered in a full build to be handled in a bug reporting process.
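In code, the selection rule is roughly the sketch below (Python; the names and the shape of the blame map are illustrative, not a description of an existing tool):

    def precog_test_set(blame_map, changed_files, recently_added_tests):
        """Tests to run for this commit: every test that has ever failed in a
        commit touching one of these files, plus tests new enough to have
        little or no history."""
        suspects = set()
        for changed in changed_files:
            suspects |= blame_map.get(changed, set())
        return suspects | set(recently_added_tests)

    def record_full_build(blame_map, changed_files, failed_tests):
        """After a full build, fold its failures back into the blame map."""
        for changed in changed_files:
            blame_map.setdefault(changed, set()).update(failed_tests)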
If it works, you essentially have a build which foretells failures in the future based on failures in the past: a precognitive build.
While I agree this is an interesting way of optimizing the testing order for a build, I also think it might be a fascinating visual graph to look at.
If I am understanding correctly, then you would have a data structure that says:
File[name] = FailTests[(TestA, 2 times), (TestB, 9 times) ......]
and then there are all kinds of interesting questions, like
'which files are associated with the most test failures'
'which tests most often fail'
'which files fail the most different tests'
'which files are the safest'
'which tests touch the most files'
'which visualizations show the most insight'
'which insights are best shown by which visualizations'
I would expect the data for both the file changes and the test failures would be in the CI history. Is a simple map reduce job enough to collect the data?
Posted by: Llewellyn Falco | September 19, 2012 at 08:20 AM
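For what it's worth, folding CI history into the structure Llewellyn describes could look roughly like this (Python sketch; the (changed_files, failed_tests) record format is an assumption about what a CI server can export):

    from collections import Counter, defaultdict

    def build_failure_map(ci_history):
        """ci_history: iterable of (changed_files, failed_tests) pairs, one per
        build; the shape of real CI history will differ by server."""
        failure_counts = defaultdict(Counter)
        for changed_files, failed_tests in ci_history:
            for changed in changed_files:
                failure_counts[changed].update(failed_tests)
        return failure_counts

    # A couple of the questions above, as queries over that map:
    def most_failure_prone_files(failure_counts):
        return sorted(failure_counts,
                      key=lambda f: sum(failure_counts[f].values()),
                      reverse=True)

    def most_failing_tests(failure_counts):
        totals = Counter()
        for per_file in failure_counts.values():
            totals.update(per_file)
        return totals.most_common()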
Llewellyn: I suppose it could be. A timestamp is useful too. Decay functions can be applied to tune the test set of a build. It is not something I've experimented with, though.
Posted by: Michael Feathers | September 19, 2012 at 08:26 AM
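One way to read "timestamps plus decay" is the hedged sketch below (Python; the half-life and the threshold idea are illustrative choices, not anything specified above):

    import time

    def decayed_failure_score(failure_timestamps, half_life_days=30.0, now=None):
        """Weight each past failure by its age; recent failures count for more.
        The 30-day half-life is an arbitrary knob, not a recommendation."""
        now = time.time() if now is None else now
        half_life_seconds = half_life_days * 86400.0
        return sum(0.5 ** ((now - ts) / half_life_seconds)
                   for ts in failure_timestamps)

    # A test stays in the abbreviated build for a file while its score for that
    # file remains above some threshold; stale associations fade out on their own.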
Sounds more like a Bayesian build server. You are estimating the probability that a test will fail given a set of changes, and picking the highest scores.
Posted by: edwin | September 19, 2012 at 08:52 AM
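Put in those terms, the scoring might look roughly like this (Python sketch; the per-file independence assumption and the count-based estimates are simplifications, not a fitted model):

    def failure_probability(test, changed_files, fail_counts, touch_counts):
        """Crude estimate of P(test fails | these files changed), assuming each
        changed file independently 'trips' the test.
        fail_counts[f][test]: builds where f changed and test failed.
        touch_counts[f]: builds where f changed at all."""
        p_survives_all = 1.0
        for changed in changed_files:
            touched = touch_counts.get(changed, 0)
            if touched:
                p_fail_given_file = fail_counts.get(changed, {}).get(test, 0) / touched
                p_survives_all *= 1.0 - p_fail_given_file
        return 1.0 - p_survives_all

    def order_tests(all_tests, changed_files, fail_counts, touch_counts):
        """Run the most-likely-to-fail tests first."""
        return sorted(all_tests,
                      key=lambda t: failure_probability(t, changed_files,
                                                        fail_counts, touch_counts),
                      reverse=True)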
In the .NET space, Greg and Svein have been working on a tool called MightyMoose (www.continuoustests.com). It has a test minimizer that analyzes MSIL to detect what tests need running and only runs those. Perhaps this is an idea that can be extended to the build servers to speed things up.
Posted by: Ashic | September 19, 2012 at 09:08 AM
edwin: Thank you. That does seem like a better way of describing it.
Ashic: Yes, I've seen people do that with coverage. For what it's worth, I think that is the future. Would be nice to have that technology across languages and platforms.
Posted by: Michael Feathers | September 19, 2012 at 11:08 AM
That's an interesting technique, and in no way are you hinting at this being a complete strategy, only an approximate one. But in my experience, most errors in the file I have changed are reported to me when I compile or demo on my machine to see if it is working correctly. The unit tests assist me because they exercise how calls in other files interact with my changes.
Given this, perhaps we can map out the files in a system that refer to the definition in this file and use that list as a basis for prioritization. This mapping can probably be done in a straightforward way by something as simple as grep.
(This won't include service calls, but then we have Integration Testing for those...)
Posted by: Terence Tuhinanshu | September 19, 2012 at 01:16 PM
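The grep-based mapping Terence suggests could be as small as this (Python sketch; it assumes Python-style definitions and that a plain textual match is close enough for prioritization):

    import re
    import subprocess

    def defined_names(path):
        """Very rough: top-level def/class names in a Python file. Extracting
        definitions is language-specific; this regex is only a stand-in."""
        pattern = re.compile(r"^(?:def|class)\s+(\w+)", re.MULTILINE)
        with open(path) as source:
            return set(pattern.findall(source.read()))

    def referring_files(path, search_root="."):
        """Files that mention any name defined in `path`, found with grep -rl."""
        refs = set()
        for name in defined_names(path):
            result = subprocess.run(["grep", "-rl", name, search_root],
                                    capture_output=True, text=True)
            refs.update(result.stdout.split())
        refs.discard(path)
        return refs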
To me it sounds a bit like having http://www.ncrunch.net/ on a build server - which, IMHO, would be great....
Posted by: Stefan Radulian | September 20, 2012 at 02:55 AM
This approach seems to have a goal similar to code coverage, just at a lower cost (computing resources)... I think it's a worthwhile effort, but it may need to be "just a piece" of the code coverage pie... I am forced to wonder what the implications of refactoring would be for this approach... code coverage handles this by knowing the actual code paths, but with a simple file-based approach, refactoring could cause tests to fade out of focus, or could lose history altogether (depending on implementation).
I think that a good goal would be to merge these efforts with other code coverage efforts, to identify what changes within a file (changes to the signature, so to speak), and then relay that info back to the code coverage map. With this, you can realize (some of) the performance gains from your approach, while more easily coping with changes.
Posted by: Scott Brickey | September 20, 2012 at 09:12 AM
Great idea. While not necessarily trivial, seems like it should be an available option for a more modern build server.
One challenge might be for teams that are not used to (or just haven't put time into) making sure there are no order dependencies in their build, as tests would likely run in a different order from one build to the next (which is the point). It's a good practice anyhow.
Posted by: Noahd1 | September 20, 2012 at 07:26 PM
This sounds like what Limited Red does - http://www.limited-red.com/
"Failing tests are not randomly distributed, Limited Red helps you fail faster by running the tests most likely to fail first. While also helping visualise your test's history. "
Posted by: Andyw8 | September 21, 2012 at 12:43 AM
It seems like, as written, this would prioritize running big, unstable tests for a lot of code. One of the tests in our test suite that fails most frequently is a big test case that runs close to the test harness limits for time and memory. It fails a lot because it runs out of time or memory, but these failures aren't all that useful for us to study. What is more informative is when it fails for some other reason. But I think this test case would get tagged as "run early for all kinds of source files" by your methodology.
Posted by: Dave W. | September 21, 2012 at 11:00 AM
Dave W: Your comment makes me feel like you don't see that test as enough of a problem.
Posted by: Michael Feathers | September 21, 2012 at 12:50 PM
It seems like this technique would tend to run flaky tests more often, but that isn't a bad thing; flaky tests need more test runs to measure their failure rate. Ideally the test runner should use the test's history of flakiness to decide whether the failure rate has changed enough to be worth reporting as an actual change in behavior.
While I do think fixing flaky tests is often worth doing (they're expensive), it's not necessarily higher priority than other tasks, so the test runner should be smart enough to handle them sensibly.
Posted by: Brian Slesinsky | September 22, 2012 at 12:06 PM
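A hedged sketch of that "has the rate actually changed?" check (Python; the two-proportion z-test and the cutoff are stand-ins for whatever statistics you prefer):

    import math

    def failure_rate_changed(hist_fails, hist_runs, recent_fails, recent_runs, z=2.0):
        """Flag a test only when its recent failure rate differs from its
        historical rate by more than the noise would explain (a simple
        two-proportion z-test; the z=2.0 cutoff is an arbitrary choice)."""
        p_hist = hist_fails / hist_runs
        p_recent = recent_fails / recent_runs
        pooled = (hist_fails + recent_fails) / (hist_runs + recent_runs)
        stderr = math.sqrt(pooled * (1 - pooled) * (1 / hist_runs + 1 / recent_runs))
        return stderr > 0 and abs(p_recent - p_hist) / stderr > z

    # e.g. a test whose failure rate jumps from roughly 5% historically to 80% of
    # recent runs gets flagged; fluctuations within the noise do not.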
@Michael Feathers: In other circumstances, a flaky test like that might be the canary in the coal mine that lets us know that we inadvertently increased run time or memory requirements. And certainly, trying to decrease flakiness over time and get more repeatable results is important and useful. Right now, though, there's more urgent stuff to deal with.
Generally speaking, I think it's more efficient to spend limited development time going after consistent failures in short-running tests rather than inconsistent failures in long-running tests (except when the latter represents an important customer bug, which isn't the case here). You get more bugs fixed more quickly with the former, and you might fix the latter in the process.
This particular test case should be moved into a different test suite of known likely failures, which would probably also address the precog strategy concern. My main point was that if the test cases are non-homogeneous in size and run time, as here, you probably don't want to prioritize the frequently failing big stuff over the less frequently failing small stuff.
P.S. Michael, if you are willing to share your email address with me, I can be reached as dave at bluepearlsoftware dot com. I discussed your legacy code book at a conference talk this past June, and I'm interested in discussions about setting up complex data structures for unit testing legacy code. (The kind of legacy code that does a bunch of raw data structure accesses like apple[i].pear->grape[j].kiwi = banana[k].cherry, where there may be a bunch of hidden constraints and dependencies between the data structures that need to be observed for the code to function correctly.) If that's not a topic you have time to discuss, I'd be interested in pointers to other forums where those kinds of discussions might be welcomed.
Posted by: Dave W. | October 05, 2012 at 12:27 PM
Cucover tried the idea of using code coverage for code-to-test mapping (just with Cucumber)
https://github.com/mattwynne/cucover
An interesting experiment, though it feels like a simple test-file-to-code-file association for deciding what to run satisfies most people. I also found the coverage data was not always 100% accurate, and it often had odd quirks. It requires low-level bindings, which meant there were lots of interoperability issues.
I might fire up this experiment again.
Posted by: Josephwilk | November 25, 2012 at 03:47 AM