
July 19, 2007


Alberto Savoia

Hi Michael,

I agree 100% with you and I sincerely hope that this blog was not motivated by my Artima entry on the Change Risk Analysis and Prediction metric :-).


I am fully aware of the problems and challenges that come along with the benefits of code coverage metrics. I refer people to that very same Brian Marick article all the time.

My position on code coverage, which has evolved over the past 20 or so years, is the following:

High coverage, by itself, indicates nothing (i.e. I could have 100% coverage and no meaningful assertions).

But 0%, or very low, coverage DOES indicate inadequate testing - something that is a definite problem IMO.

If a segment of the code has 100% coverage, and you introduce a bug during maintenance, you are not guaranteed that your tests will find that bug.

But if that same segment has 0% coverage, and you introduce a bug, you are guaranteed that you will NOT find that bug until you perform some other form of testing (e.g. QA).
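Alberto's first point - that 100% coverage with no meaningful assertions indicates nothing - can be sketched concretely. The class and method names below are invented for illustration; the point is that every line of the buggy method is executed, so coverage is perfect, yet no outcome is ever checked:

```java
// Hypothetical sketch: a "test" that achieves 100% line coverage of
// absoluteValue() yet asserts nothing, so it cannot catch the bug.
public class CoverageWithoutAssertions {

    // Buggy on purpose: negates positive inputs too.
    static int absoluteValue(int x) {
        return -x; // bug: should be (x < 0) ? -x : x
    }

    // Exercises every line of absoluteValue() but checks no result.
    static void testAbsoluteValue() {
        absoluteValue(5);   // covered, bug not detected
        absoluteValue(-5);  // covered, bug not detected
    }

    public static void main(String[] args) {
        testAbsoluteValue();
        System.out.println("all tests passed"); // passes despite the bug
    }
}
```

A single assertion such as checking that `absoluteValue(5)` equals 5 would have exposed the bug immediately, without changing the coverage number at all.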

So, what's the right way to use (and not abuse or misuse) coverage?

I summarized my thinking on the subject in one of my Testivus entries:




I've been reading your blog for a while but this entry really hit home.

Today I was working on writing a test suite for a new data structure. I finished the test suite, got nearly 100% coverage over the data structure, but it almost immediately failed when I swapped it with the older data structure it was meant to replace.

Coverage may be a good place to start when writing a test suite, but it certainly doesn't get you where you want to be.

Alberto Savoia

>Coverage may be a good place to start when writing a test suite, but it certainly doesn't get you where you want to be.

Would you agree with the statement that code coverage is necessary but not sufficient?



All metrics related to code quality can be abused by people who don't understand them. And using a metric as a proxy for a specific quality is a well-known error, even if a frequently repeated one. I am not quite sure, then, that I see the value of introducing a new term for the problem, especially this one.

Michael Feathers

Binstock, I just find it easier to talk about things when I have names for them. It's just like in code. We can repeat three lines of code over and over again, or we can extract them to a method and give them a name.

Prosthetic Goals are something I plan on writing about more. And, it's a general problem in organizations. The metrics example that I offered is just one case of it. If there's a term for it already, I'm not aware of it. And, I wouldn't mind using another term for it if there was one.

Michael Feathers


Hi. No, this blog wasn't motivated by yours. It was motivated by a bunch of recent experiences and anecdotes I've heard from people: cases where organizations have attempted to impose coverage standards and ended up with waste and bad side effects.

I think that coverage is good, but it is good as a piece of information, not as a goal. Coverage is a great piece of information inside a team, but a poor organizational goal. When an organization implies that it will grade teams on coverage (and the implication can be nothing more than highlighting the coverage of various teams in public), it is, in software terms, as if the organization was writing tests like this:

void testMatrixAddition() { assertTrue(matrixSum(result, m1, m2)); }

That test can pass without calculating a matrix sum. People can pass coverage goals without doing the work that can make coverage useful.
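For contrast, here is one way Michael's weak test could be given a real assertion. The matrix representation and the `matrixSum` signature are assumptions made for illustration; the point is that the test now checks the computed values rather than a boolean "it ran":

```java
// Hypothetical contrast to the weak test: assert on the actual result
// of the sum, not just on the call reporting success.
public class MatrixAdditionTest {

    // Adds two same-sized matrices element by element into result.
    static boolean matrixSum(int[][] result, int[][] m1, int[][] m2) {
        for (int i = 0; i < m1.length; i++)
            for (int j = 0; j < m1[i].length; j++)
                result[i][j] = m1[i][j] + m2[i][j];
        return true;
    }

    static void testMatrixAddition() {
        int[][] m1 = {{1, 2}, {3, 4}};
        int[][] m2 = {{5, 6}, {7, 8}};
        int[][] result = new int[2][2];
        matrixSum(result, m1, m2);
        // The assertion that does the real work: check each computed value.
        int[][] expected = {{6, 8}, {10, 12}};
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                if (result[i][j] != expected[i][j])
                    throw new AssertionError("wrong sum at " + i + "," + j);
    }

    public static void main(String[] args) {
        testMatrixAddition();
        System.out.println("matrix addition test passed");
    }
}
```

Both versions produce identical coverage of `matrixSum`; only this one can fail when the arithmetic is wrong.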

Sometimes when I talk about this, people say "Well, we would never 'game' coverage," but problems can occur even if you don't attempt to game it. It's just like the test above; it looks like a point of feedback about quality, but it is deceptive if taken that way.

Michael Feathers

Alberto Savoia

>Hi. No, this blog wasn't motivated by yours.

Hi Michael,

That's a relief :-). I have a lot of respect for you and your work, and I would have been very disappointed if you had misunderstood my goals and intentions.

I think that a big problem with code coverage is that there is no "assertion coverage" - as your example clearly demonstrates.

I have looked at thousands of JUnit tests from dozens of organizations in the past few years; and while I haven't seen much evidence of developers "gaming the system", I've seen plenty of weak tests where the high coverage is pretty much wasted by weak or insufficient assertions. We need some "state/outcome" coverage to complement code coverage. Some of my colleagues are working on this very subject, but it's not a trivial thing to do.

The interesting thing is that getting coverage is typically 80% of the work in testing. It's a shame that once people get to the point where they can make some meaningful assertions that could discover a lot of problems, they don't take full advantage of it.

Since metrics are such a touchy subject and so subject to misuse and abuse, I took the time to write the following Testivus epigram about it - hopefully it will *hammer* the point home:



Alberto Savoia

>I am not quite sure, then, that
>I understand the value of introducing
>a new term for the problem, especially
>this one.

Andrew, Michael,

I think we all agree that coining and using new terms is something that should be done with care and deliberation.

In this case, however, the fact that people continue to misuse and abuse coverage (along with other metrics) is an indication that whatever message is being sent, it's not coming across and, perhaps, a new term might be called for.

While I agree that naming things is helpful, I am not sure that "prosthetic" is the right term to use. Many people might have trouble spelling it, and its original meaning and usage do not map as well as they should to the problem you are describing IMO:

"Prosthetics: The branch of medicine or surgery that deals with the production and application of artificial body parts."

I don't mean to be critical, just helpful. Because if some things deserve to be named, they deserve a good, fitting, memorable, and evocative name. Personally, I think Michael has hit the nail on the head with the term "characterization tests" - a term which I've been using and promoting every chance I get, and which I know Andrew has been using and promoting as well.

I like the term "characterization tests" because it's a perfect description, and one that's needed to differentiate between tests that contain an implied specification and/or intent, and tests that simply "record and check" what the code does. They are two very different types of tests, and we need a term to distinguish them because they often assume the same form (e.g. JUnit) and are used in the same way.
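A characterization test in that "record and check" sense might be sketched as follows. The legacy function here is invented for illustration; what matters is that the test pins down observed behavior without claiming that behavior is what was originally specified:

```java
// Hypothetical sketch of a characterization test: it records what the
// legacy code currently does and pins that behavior down, without
// implying the behavior matches any specification.
public class LegacyFormatterCharacterization {

    // Imagine this is legacy code whose original intent is unclear.
    static String formatName(String first, String last) {
        if (last == null || last.isEmpty()) return first.toUpperCase();
        return last + ", " + first;
    }

    public static void main(String[] args) {
        // We ran the code once, observed these outputs, and now check
        // that they stay the same - whatever "correct" was meant to be.
        if (!formatName("Ada", "Lovelace").equals("Lovelace, Ada"))
            throw new AssertionError("observed behavior changed");
        if (!formatName("Ada", "").equals("ADA"))
            throw new AssertionError("observed behavior changed");
        System.out.println("observed behavior unchanged");
    }
}
```

Structurally this looks like any other unit test; the difference is entirely in intent, which is exactly why a distinguishing name is useful.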

Here, we need to clearly distinguish between a metric and a goal. We don't want to make the metric the goal but, at the same time, we don't want to throw the baby out with the bathwater, because metrics can be very useful, even necessary, to get closer to certain goals.

I'll try to give the naming issue some thought. In the meantime, if you decide to use the term prosthetic, I will support that and use it whenever I can to see how well it "sticks".

Thank you for sparking an interesting discussion. I hope more people chime in.



People who use code coverage metrics as a sign of quality need a refresher on software engineering basics. It's as simple as that. But at the same time (and agreeing with Alberto that the metric and the goal need to be separated), good code coverage does give a good indication of the quantity of code being tested, not the quality of the testing. Test-case coverage, on the other hand, indicates the quality of the testing, but not the quantity of code exercised.

As an example, a library certified with 95% code coverage and 60% test-case coverage gives the user an altogether different picture than the same library with 60% code coverage and 95% test-case coverage. Either way it's kind of scary, and questions can still be raised about the competency of the tester.

Many times I have seen code where the exception handling is substantial, and a big part of that exception handling is very difficult to test functionally. In such cases, the developer can achieve very good code coverage during unit testing; but give the same code to the QA team and they may achieve 95% functional test coverage while the code coverage stays low. I think context is God, and metrics are useless without context.
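The point about hard-to-reach exception handling can be sketched concretely. The class and interface below are invented for illustration: a unit test can cover the catch block by injecting a failure directly, while black-box functional testing may never trigger it:

```java
import java.io.IOException;

// Hypothetical sketch: an error-handling path that unit tests can cover
// by injecting a failure, but that functional/QA testing through the
// real system can rarely reach.
public class ConfigLoader {

    interface Source {
        String read() throws IOException;
    }

    // Falls back to a default when the source fails.
    static String loadConfig(Source source) {
        try {
            return source.read();
        } catch (IOException e) {
            return "default-config"; // hard to reach via black-box testing
        }
    }

    public static void main(String[] args) {
        // Unit test covers the catch block by injecting the failure directly.
        String viaFailure = loadConfig(() -> { throw new IOException("disk error"); });
        if (!viaFailure.equals("default-config")) throw new AssertionError();

        String viaSuccess = loadConfig(() -> "real-config");
        if (!viaSuccess.equals("real-config")) throw new AssertionError();

        System.out.println("both paths covered");
    }
}
```

A QA tester exercising the deployed system would need an actual disk failure to reach that catch block, which is why the same code can show high code coverage from unit tests and low code coverage from functional tests.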


This comment regards your point about improving compile time:

A friend works at a small game developer. They have put substantial effort into profiling and optimizing their compile process. My sense is that it is extremely well-spent effort, enabling them to be far more productive and iterative than most teams working on similar-sized codebases. I'm not sure if compile time is a statistic they continue to monitor, but their experience seems to reinforce your point.


I'd argue that measuring bug count has very bad side effects.

Here are the metrics that matter to me:
1. Net promoter score version over version. (this is the Money metric)

2. Sales version over version. (yeah, it may be stable, but does anyone want to USE this stuff?)

3. Cool factor (try measuring that one?)

I'm glad to come up with other stats, but I always feel foolish, like I'm selling ice to Eskimos. It's the feeling I get when people come up to me: if I can give them an accurate date off the top of my head, they feel confident that I'm doing a great job; if I take a second to think about it, their confidence goes down, even though the pause gets them better information. I wonder, what percentage of my job is really quality marketing? Should I take time away from testing to build confidence for people when it doesn't help the product?

When I think about the word Prosthetic I wonder what else I can do to ease their phantom limb syndrome? You know, they have this new mirror therapy that is working really well for people? Check it out. http://content.nejm.org/cgi/content/full/357/21/2206

If the prosthetic goals do not translate to real goals, why does the myth persist that they always do? Because it seems scientific to have more numbers?

