I have Google’s blogsearch set to give me notifications about unit testing. On an average week, I read dozens of blogs and mailing list discussions about the topic. Occasionally, I read something new, but there’s a lot of repetition out there. The same arguments crop up often. Of all of them, though, there is one argument about unit testing which really bugs me because it rests on a flawed theory about testing and quality. Sadly, it’s an argument that I fell for a long time ago, and I’d like to lay it to rest. Hopefully, this blog will help, but I have to relate a little history first.
Back in the very early 2000s, I had a conversation with Steve Freeman at a conference. We were talking about Test-Driven Development and Steve had the strong feeling that most of the people who were practicing TDD at the time were doing it wrong - they'd missed something.
Steve was and is part of a close-knit community in London who have been practicing XP and TDD from the very beginning. Among the fruits of their labor was the entire notion of mock objects. Steve Freeman and Tim MacKinnon wrote the paper that introduced the idea to the broader community. The rest is history. There are mock object frameworks out there for nearly every language in common use.
Mock Objects, however, are part of a larger, relatively unpublicized approach to TDD. The story I heard was that it was all started by John Nolan, the CTO of a startup named Connextra. John Nolan gave his developers a challenge: write OO code with no getters. Whenever possible, tell another object to do something rather than ask. In the process of doing this, they noticed that their code became supple and easy to change. They also noticed that the fake objects that they were writing were highly repetitive, so they came up with the idea of a mocking framework that would allow them to set expectations on objects - mock objects.
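To make "tell, don't ask" and "setting expectations" concrete, here is a minimal sketch in Python using the standard library's `unittest.mock`. It is not Connextra's code; the `Order` class and the `charge` method on the collaborator are hypothetical names invented for illustration.

```python
from unittest.mock import Mock


class Order:
    """Hypothetical class: it tells its collaborator what to do."""

    def __init__(self, total):
        self._total = total  # deliberately no getter for this value

    def settle(self, payment_gateway):
        # Tell the collaborator to act instead of handing out our state.
        payment_gateway.charge(self._total)


def test_settle_charges_the_total():
    gateway = Mock()  # stands in for the real collaborator
    Order(total=42).settle(gateway)
    # The expectation is on the interaction, not on a value we asked for.
    gateway.charge.assert_called_once_with(42)


test_settle_charges_the_total()
```

Because `Order` exposes no getter, the only way to observe its behavior is through what it tells its collaborator to do, which is exactly what the mock records.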
When Steve told me about this approach, I thought it sounded okay, but there was one thing that I couldn’t wrap my head around – Steve, and Tim, and the people who had been on that team were using mocks extensively. In fact, they used mocks whenever they could. This was a bit different from the way that I was practicing TDD. What I did, in general, was use tests to drive a class and then I extracted new classes from the class I was designing as it became bulky. Some tests would cover just one class, but others would cover several classes working together.
The problem that I saw with the mock object approach was that it only tested individual classes, not their interactions. Sure, the tests I wrote were nominally unit tests, but I liked the fact that they occasionally tested real interactions between a class and its immediate collaborators. Yes, I liked isolation but I felt that this little tiptoe into integration level testing gave my tests a bit more power and a bit more strength. But, there was only one problem. The team at Connextra that was using mocks extensively was reporting extremely low defect rates. I just wasn’t sure how they were getting them. After all, it didn’t seem like there was any integration testing going on. Their application should have been rife with integration errors. Or should it have? Let’s examine our reasoning.
One very common theory about unit testing is that quality comes from removing the errors that your tests catch. Superficially, this makes sense. Tests can pass or fail and when they fail we learn that we have a problem and we correct it. If you subscribe to this theory, you expect to find fewer integration errors when you do integration testing and fewer “unit” errors when you do unit testing. It’s a nice theory, but it’s wrong. The best way to see this is to compare unit testing to another way of improving quality – one that has a very dramatic measurable effect.
Back in the 1980s, there was a movement to use something called Clean Room Software Development. The notion behind Clean Room was that you could increase quality by increasing the rigor of development. In Clean Room, you had to write a logical predicate for every little piece of your code and you had to demonstrate, during a review, that your code did no more or less than the predicate described. It was a very serious approach, and it was a bit more radical than what I just described: another tenet of Clean Room was that there was to be no unit testing. None. Zilch. When you wrote your code it was assumed correct after it was reviewed. The only testing that was done was stochastic testing at the functional level.
Amazingly, Clean Room worked. Clean Room teams demonstrated very high quality numbers. When I read about it, I was stunned, but then I came across a passage in a book that I was reading about the process. The author said that many programmers wrote their predicates after writing a section of code, but that experienced programmers often wrote the predicates first. Gee, that sounds familiar, doesn’t it? In TDD, we write the test first and the test is, essentially, a specification of the behavior of the code we are about to write.
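The parallel can be sketched in a few lines of Python. This is only an illustration: the predicate below is written as an informal docstring and a helper function, not in actual Clean Room notation, and `largest` and `satisfies_predicate` are hypothetical names.

```python
def largest(values):
    """Return the largest element of a non-empty list.

    Predicate, written before the body:
        result is in values, and result >= v for every v in values
    """
    result = values[0]
    for v in values[1:]:
        if v > result:
            result = v
    return result


def satisfies_predicate(values, result):
    # The same predicate, stated as executable code.
    return result in values and all(result >= v for v in values)


def test_largest():
    # TDD style: a test written first pins down the same behavior.
    assert largest([3, 9, 2]) == 9
    assert largest([-5]) == -5
    assert satisfies_predicate([3, 9, 2], largest([3, 9, 2]))


test_largest()
```

Whether the specification is reviewed by people or run by a machine, writing it first forces the same question: what exactly should this code do?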
In the software industry, we’ve been chasing quality for years. The interesting thing is there are a number of things that work. Design by Contract works. Test Driven Development works. So do Clean Room, code inspections and the use of higher-level languages.
All of these techniques have been shown to increase quality. And, if we look closely we can see why: all of them force us to reflect on our code.
That’s the magic, and it’s why unit testing works also. When you write unit tests, TDD-style or after your development, you scrutinize, you think, and often you prevent problems without even encountering a test failure.
Now, as you’re reading this, you might think that I'm saying that we can get away with doing nothing as long as we sit back in our chairs, rest our chins on our hands, and think about our code. I don’t think so. I think that approach may work for short periods for some people, but software development is a long-haul activity. We need practices which help us achieve continuous discipline and a continuous state of reflection. Clean Room and TDD are two practices which, despite their radical differences, force us to think with absolute precision about what we are doing.
I have no doubt that a team could do well with, say, Clean Room, but personally, I like the fact that the tests we end up with using TDD give us additional leverage: they make it easier to change our code and know that it is still working without having to re-reason through the entirety of the code base. If you have to change your code often, re-reasoning through all of it each time is not the best use of your time, especially when you can write tests that have recorded and embodied that reasoning and run them at will. With those tests in place, you free yourself up to reason about other things that you haven’t reasoned about in the past, rather than repeating yourself endlessly.
But, enough about TDD.
My point is that we can't look at testing mechanistically. Unit testing does not improve quality just by catching errors at the unit level. And, integration testing does not improve quality just by catching errors at the integration level. The truth is more subtle than that. Quality is a function of thought and reflection - precise thought and reflection. That’s the magic. Techniques which reinforce that discipline invariably increase quality.
If, by process, you had to write the code first in Lisp, and then later in Java, I would think you'd trigger the same huge improvements in quality from re-examining the basic problem from several different perspectives. That's nice, but it is a very slow and expensive way to develop. For new code it is not that hard, but for existing stuff it is a huge resource drag.
I've always guessed that 'statistically' relative to the code and the programmer we have nearly constant bug rates. In a sense, if you knew that a programmer introduced 20 bugs in the last release, if their work is similar this time, can we assume that there are another 20 bugs?
If you know roughly how many bugs you are looking for, then you can get a good sense of your progress. If you're short, you can schedule a bit more time to look in the darker reaches of the code. It's an odd perspective, but it is far less resource intensive than doing everything twice.
Paul.
Posted by: Paul W. Homer | June 12, 2008 at 02:01 PM
Nice article! I am hoping one day in our software development world, we will use tools like Alloy (http://alloy.mit.edu/) to test our designs like we test our code. As Dijkstra says, program testing can be a very effective way to show the presence of bugs, but is hopelessly inadequate for showing their absence. http://www.cs.utexas.edu/~EWD/transcriptions/EWD03xx/EWD340.html and that was in 1972!
Posted by: Mitch | June 12, 2008 at 02:02 PM
Good post.
I think there's an ancillary point: When you design code for testability, you tend to improve its quality in the process--generally making it clearer, cleaner, and more modular.
Posted by: Andrew Binstock | June 12, 2008 at 02:05 PM
Reminds me of something from the Poppendiecks' Implementing Lean book: "The job of tests, and the people that develop and run tests, is to prevent defects, not to find them"
Posted by: Scott Bellware | June 12, 2008 at 02:31 PM
Nice article. I was all prepared to scoff but no.
However, you missed a big point behind unit testing, and that's that having unit tests dramatically improves the ability of code to be reused.
First, the unit test is an example of how the developer of the code expected it to be used, so the next developer has a simplified example to start with.
Second, having unit tests allows the next developer to make small changes to the old code without fully understanding all the details of the large system in the reasonable expectation that he hasn't broken anything if the tests pass.
"As Dijkstra says, program testing can be a very effective way to show the presence of bugs, but is hopelessly inadequate for showing their absence."
I fail to agree. I have written modules with substantial testing, I felt I proved them correct with the testing, I delivered them, they continue to work to this day. In particular, I've uncovered what would have been subtle real-world bugs by edge case testing.
Demonstrating that a program is bug-free or at least performs to certain specifications is intrinsically extremely difficult, but program testing is one of the best tools we have for this purpose.
Posted by: Tom Ritchford | June 12, 2008 at 03:17 PM
Personally, I find that TDD drove me into a much more focused practice of the Single Responsibility Principle. If you have a method that does one and only one thing, it is really hard to code in errors.
I became an SRP extremist and took the principle down to the method level. You end up with a collection of very robust legos. Calling legos in sequence one after the other is hard to get wrong. This implies that your methods abide by the SRP.
My best designs (most robust, highest quality, easiest to adapt to new situations) have taken it to the extreme: if statement? Not the method's responsibility (ask someone else to look it up for you). Need a branch in the algorithm? Ask a hash to provide you the appropriate algorithm based on a set of conditions. You can test that hash, test the guy that asks the hash, test the behaviors stored in the hash. It is hard to introduce bugs.
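A minimal sketch of that hash-based dispatch, in Python. The shipping example, its rates, and every name in it are hypothetical and chosen only to show the shape of the design.

```python
def ship_standard(weight_kg):
    # Illustrative flat fee plus a per-kilogram rate.
    return 5.0 + 1.0 * weight_kg


def ship_express(weight_kg):
    return 12.0 + 1.5 * weight_kg


# The "hash": conditions map to behaviors, so no method needs an if statement.
SHIPPING_RULES = {
    "standard": ship_standard,
    "express": ship_express,
}


def shipping_cost(method, weight_kg):
    # Ask the table for the algorithm, then run it.
    return SHIPPING_RULES[method](weight_kg)
```

Each piece can then be tested on its own: the behaviors (`ship_standard`, `ship_express`), the table's contents, and the caller that asks the table.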
In fact, I started experimenting with only allowing methods to take a single input: currying as much as necessary. This gives you an even finer granularity.
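One way to picture single-input methods is hand-written currying with nested functions, sketched here in Python. The `linear` function and its parameter names are hypothetical.

```python
def linear(factor):
    # Each function takes exactly one input and returns the next step.
    def then_shift(offset):
        def apply(x):
            return factor * x + offset
        return apply
    return then_shift


# Partially applied pieces become small, separately testable building blocks.
double_plus_three = linear(2)(3)
```

Here `double_plus_three(10)` evaluates `2 * 10 + 3`, and each intermediate function can be exercised in isolation.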
I thought it might cause systems to be more confusing, but you end up with fairly simplistic flows. With modern text editors and IDEs, it is trivial to move around in the code (if you know how to use your chosen editor's advanced navigations).
In the end, the majority of "bugs" or "issues" are related to misunderstandings of the requirements, not unexpected situations in the code.
Posted by: Corey Haines | June 12, 2008 at 03:45 PM
I agree that what's important about unit testing is in a large part that it forces us to think about our design better, in part because we actually need to *use* it.
In the Python world some of us practice a technique called doctesting. It can be used for unit tests and integration tests, and for simply testing code samples in documentation. With doctesting, you write a document as much as you write a test.
A unit test makes you think about the design and APIs as you actually have to use them. A doctest in addition makes you think about these things more, because you have to explain it to some later reader. You need to actually write about it and imagine what someone new to the codebase would want to know.
Of course writing a good doctest is difficult, and it's not always the best documentation. This is the biggest con that people see, and some people familiar with them indeed don't really like doctests. The drawback is that often the documents are worse than non-doctest documentation, and this can be frustrating. On the other hand, some documentation is also better than nothing, which is frequently the alternative.
So, at best, a doctest can result in a nice documentation artifact, or, if nothing else, at least a fairly documented test artifact.
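For readers who have not seen one, a small doctest might look like the sketch below. It uses Python's standard `doctest` module; the `slugify` function is a hypothetical example, not taken from any real project.

```python
import doctest


def slugify(title):
    """Turn a title into a lowercase, hyphen-separated slug.

    A reader new to the code base can see the intended use at a glance:

    >>> slugify("Unit Testing Works")
    'unit-testing-works'

    Extra whitespace is collapsed rather than turned into empty segments:

    >>> slugify("  Clean   Room ")
    'clean-room'
    """
    return "-".join(title.lower().split())


if __name__ == "__main__":
    # Runs every example in the docstrings above and reports mismatches.
    doctest.testmod()
```

The prose between the examples is what sets a doctest apart from a plain unit test: it has to explain the behavior to a later reader as well as check it.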
Here is an example of a doctest I wrote recently:
http://pypi.python.org/pypi/martian
Posted by: Martijn Faassen | June 12, 2008 at 04:21 PM
I love your writing usually, but I really got lost with this one.
What exactly was the flawed theory? I kept waiting for it. Reading between the lines, I can make several guesses -- Mock Objects were the flawed theory? Sounds wrong. Focusing on tests that test the integration cases is the flawed theory? My bet is on that one.
You said you had a problem with the mock object approach because it only tested the individual classes, not their interactions. But earlier you had said that you yourself had, at one time, fallen for a flawed theory. So... was this "problem" you had with the mock object approach only a problem during that earlier era when you had a flawed theory, or is it a problem you have now?
Sorry, I just really need you to spell it out and hit me over the head with it -- reading between the lines leads to too many possible interpretations, for a beginner like me. Could you please clarify?
Also, the part about OO code with no getters and telling another object to do something rather than ask sounded very interesting. What does this mean in practice? Do you (or anyone) have any pointers to where we can read more about this? Thanks.
Posted by: Mark | June 12, 2008 at 05:47 PM
Reading again (a third time) I got it. Now, I don't know how I missed it. Must be a lack of caffeine: "One very common theory about unit testing is that quality comes from removing the errors that your tests catch."
Still very interested in hearing more about writing OO code without getters. I'll Google for more, but pointers welcome if anyone has any.
Posted by: Mark | June 12, 2008 at 05:52 PM
Hmmm, is the theory REALLY flawed?
JT
http://www.FireMe.To/udi
Posted by: John Thomas | June 12, 2008 at 07:31 PM
I recently worked at a company that had a server that would automatically generate unit tests. We had long and heated debates about the value.
My theory is that the tool had 'stone soup' value. We brought along the stone that created the incentive for other people to bring carrots and meat and mushrooms to make a very fine soup.
Most people don't know that good soup is an option. Any process that makes people reflect on the nature of their soup will have value. Kevin's Law of Good Soup.
Posted by: Kevin Lawrence | June 12, 2008 at 07:43 PM
If an unthoughtful coder writes tests to check that his code works the way he thinks it does, what tests that his tests work the way he thinks they do?
The TDD idea appeals to me in a way, but in practice I find too often you have to update the tests along with the code and double the error space along with breaking the DRY principle. However I do not usually work in a well known domain, or on well known problems so perhaps writing tests or specs up front is just the wrong choice for me.
Posted by: Anonymous | June 12, 2008 at 07:44 PM
Fantastic article. I feel like you nailed it right on the head.
Unit testing is a great tool for developers to utilize, but it's just a tool. At the end of the day, a good programmer will end up using both TDD and Clean Room. We are human, we make mistakes, including while creating tests to test our code.
@Corey, I would slightly disagree with your second reason for unit testing. Yes, having a unit test for the next developer is crucial, but that next developer should always try to make sure test cases are checking all possible angles of a problem. Sure you change the code and all tests pass, but does that really mean the code is working as expected?
Posted by: Jay Signorello | June 12, 2008 at 07:45 PM
When you describe thought and reflection about code, I interpret that to describe the process which I use. For each specific function or process imagine the range of possible states (of variables, etc) and also what all the edge cases are before writing the code. The process must be able to handle all of those states without errors. For example, pointers might be null, the user may perform the steps in the "wrong" order. If any input can be "bad", then expect that it will be at some time and code accordingly.
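As a small illustration of coding for those states, here is a Python sketch. The `Upload` class is hypothetical; it treats a missing value and an out-of-order call as states the code must handle explicitly.

```python
class Upload:
    """Hypothetical example: collects chunks, then joins them on finish()."""

    def __init__(self):
        self._chunks = []
        self._closed = False

    def add(self, chunk):
        # "Bad" input: the caller passed nothing at all.
        if chunk is None:
            raise ValueError("chunk must not be None")
        # "Wrong" order: the caller keeps adding after finishing.
        if self._closed:
            raise RuntimeError("cannot add after finish()")
        self._chunks.append(chunk)

    def finish(self):
        self._closed = True
        # Edge case: finishing with no chunks yields empty bytes, not an error.
        return b"".join(self._chunks)
```

Enumerating the states first (empty, open, closed, bad input) is what decides which checks the code needs.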
Posted by: eikonos | June 12, 2008 at 08:49 PM
Good article.
Unit testing is not an end in itself; it is a means to an end. There are lots of ways to get there. Ultimately, testing forces you to look at the software in a different way. It is the switch in perspectives that helps you to improve quality.
Posted by: Bill Smith | June 12, 2008 at 09:03 PM
On most large-scale projects I've worked on, unit tests take a back seat. Requirements change quickly enough to render much of the test work obsolete. Why go through the trouble of creating a virtual environment for your little class, rather than just deploying its parent component to a development environment and exercising it in a running system? Let the architecture folks design and elucidate your component's contracts. Let the army of testers find your edge cases. STOP TRYING TO BE SMART - JUST WRITE GOOD CODE. Concentrate on your code; jeez, your IDE practically writes it for you!
Posted by: oh crap | June 12, 2008 at 09:17 PM
Holistic versus Unit testing.
=============================
I agree with you: In the end, it is people who care about quality who produce quality code.
Yet, *how* you care matters, because it influences your productivity.
Productivity is impaired by the time it takes to find and fix defects, i.e. bugs.
Bugs are easier fixed when detected early.
Conclusion: We need an efficient early error detection system.
Unit tests are an expensive early detection system.
I believe that, to the extent possible, automated integration testing is much less expensive.
It means testing the system as a whole rather than as parts.
Posted by: JeanHuguesRobert | June 13, 2008 at 03:03 AM
Interesting article and I like the thought process you went through to explain your conclusion. However, it could be improved by defining "quality." For example, your statement "All of these techniques have been shown to increase quality." rests on the assumption that there is a definitive metric behind that term. Just the definition of "quality" is a contentious point in software engineering. It's a fuzzy concept that would be better served by perhaps stating that you really mean the number of defects in a given body of code.
Posted by: Matt | June 13, 2008 at 08:40 AM
Finally a thoughtful commentary on TDD that I can agree with! I always thought TDD didn't make sense -- it isn't always brought out that the thinking through your implementation (and writing test cases based on what exactly you are implementing) is the key rather than rote generation of unit tests.
Posted by: Grok2 | June 13, 2008 at 10:35 AM
This is slightly off-topic, but...how do we 'rest our heads on our chins'? =P
Posted by: Luke | June 13, 2008 at 10:56 AM
Can you give some examples of Clean Room developed code? Without some sort of context the comparison might not be valid. Perhaps they were simple features (like adding two numbers together in a microprocessor) that didn't have many inputs and whose results will never change. Compare that to today's environment, where inputs change all the time and there are always new requirements for the system.
Posted by: BlogReader | June 13, 2008 at 01:11 PM
I like to write out formulas describing my code in LaTeX. That helps me figure out what I want and make sure I can describe it exactly at a high level before I jump into the code.
Posted by: Eric Normand | June 14, 2008 at 09:57 AM
@Grok2,
You find yourself a chair which you can rest back in. Then, as you raise your feet up, you continue to look forward towards your monitor. There comes a point when your chin sits on your chest. At this point, relax the muscles in your neck, and now your chin provides the mechanical basis for keeping your head up. Hence, you rest your head on your chin.
Posted by: Samuel A. Falvo II | June 14, 2008 at 03:50 PM
So, um, yeah. And duh. So, my question - where did anyone get the idea that the game is anything other than "tools for reflection?"
Posted by: Jim Bullock | June 14, 2008 at 10:01 PM
So, maybe it was a bar discussion and actually meant "many" rather than "most".
Anyway, I've responded at http://www.m3p.co.uk/blog/2008/06/15/test-driven-development-a-cognitive-justification/
Posted by: Steve Freeman | June 15, 2008 at 03:20 AM