
January 24, 2011


Tim Ottinger

One problem is that it doesn't take into account that a dozen mods may be for one task or one bug, especially for us in the git world.

We were using Jira for both features and bugs (whether that's a good idea or not, we were). For my heatmap I took the commits and separated out the ticket number from each one. Then I easily built a count of tickets per file.
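A minimal sketch of that kind of count, not the script actually used. It assumes the log text comes from something like `git log --name-only --pretty=format:'@@%s'`, that the `@@` marker is just a hypothetical prefix chosen to flag subject lines, and that ticket numbers look like Jira keys (`PROJ-123`) in the commit subject.

```python
import re
from collections import defaultdict

# Assumption: Jira-style keys such as PROJ-123 appear in the commit subject.
TICKET_RE = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")
# Assumption: subject lines were emitted with this hypothetical marker prefix.
MARKER = "@@"


def tickets_per_file(log_text):
    """Count distinct tickets per file, so a dozen commits for one ticket count once."""
    tickets_by_file = defaultdict(set)
    current_tickets = set()
    for raw in log_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank separator between commits
        if line.startswith(MARKER):
            # New commit: remember which tickets its subject mentions.
            current_tickets = set(TICKET_RE.findall(line))
        else:
            # Any other non-blank line is taken to be a changed file path.
            tickets_by_file[line].update(current_tickets)
    return {path: len(tickets) for path, tickets in tickets_by_file.items()}


sample_log = """@@PROJ-1 fix parser
src/a.py
src/b.py

@@PROJ-1 more parser fixes
src/a.py

@@PROJ-2 add feature
src/a.py
"""
ticket_counts = tickets_per_file(sample_log)
```

Counting distinct tickets instead of raw commits is the design choice that matters here: it keeps people who commit often during one task from inflating a file's score.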

Same result. Some files are hardly touched at all (often due to a rename refactor elsewhere) and others soak up changes like crazy. This 'volatility' stuff matters, and it's easy to measure now. I think we'll see new principles and practices form because it's so visible.

Changes absolutely do clump.

Other note: "Code should grow by addition rather than mutation" -- as long as the additions are not duplications. I've recently seen code that grew by duplication instead of editing, and it was painfully bad.

Michael Feathers

@Tim Ottinger - yes, the way that mods are done definitely does matter. People are all over the map wrt how often they commit during a task. In the book 'Making Software' there's a chapter describing research that used an Eclipse variant which actually recorded what other files you looked at when you were making a change. I haven't seen anyone try to instrument that in vim or emacs yet :-) Could be useful information, though.


http://2011.msrconf.org has links to extracted version control datasets in their mining challenge.


I wonder what the long term effect will be?

I'm assuming a 1:1 correlation between conceptual units and files.

Encapsulating what varies, and what stays the same, in the long term will end up with more files that are modified once. So the proportion will change, but you're not going to get rid of that peak.

Plotting long term busy-ness may be good for cleaning the whole codebase, but once that's done, the plot will only be useful if it considers short term busy-ness.

Michael Feathers

@anonymous Thanks!

Michael Feathers

@Pete Yes, once a file is highly churned it doesn't un-churn, so a time window makes sense. The shape will persist. I think that the take-away is that if we extract code from the hubs, we may reduce the frequency with which we will visit them.

I think it's like many of these sorts of dists in code. The shape is always the same, just different coefficients. Good to figure out where to look.


This could be a very useful way of finding out where to focus first efforts of unit testing in a project.

Phil Anderson

Just as change clumps in this way, so do bugs - especially in a relatively mature codebase. Pretty much every codebase I've ever worked on has had areas of brittle code which see a lot of change due to seeing lots of small tweaks and bug fixes.

So, if you do the same graphing exercise with bug counts the maps are pretty similar in shape and there is plenty of commonality.

The worst thing is that these brittle classes with high churn tend to be notorious areas that people don't like having to work in, so they tend to make "strategic" bug fixes rather than refactorings - thus compounding the issue.

Austin Salonen

(I think this idea is an extension of Pete's)

The best extraction of this data, in my opinion, would be per source code file grouped by a time frame (say a month).

Ideally the first value would be high as you're checking in several features, but then it would trend to zero very quickly. Then you should be able to attribute later blips to bug fixes. If instead you have any other trend, the code is most likely really bad -- lots of bugs over time, or new features that require more commits (the code likely violates the OCP) -- and should be refactored.
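As a rough illustration of the per-file, per-month grouping described above, here is a small sketch. It assumes the history has already been reduced to `(date, path)` pairs with ISO dates (for example from `git log --date=short`); the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date


def monthly_changes(changes):
    """Group (iso_date, path) pairs into per-file counts keyed by 'YYYY-MM'."""
    buckets = defaultdict(Counter)
    for day, path in changes:
        month = date.fromisoformat(day).strftime("%Y-%m")
        buckets[path][month] += 1
    # Sort the months so each file's trend reads left to right.
    return {path: dict(sorted(counts.items())) for path, counts in buckets.items()}


# Hypothetical sample data, for illustration only.
sample_changes = [
    ("2010-11-03", "a.py"),
    ("2010-11-20", "a.py"),
    ("2010-12-01", "a.py"),
    ("2011-01-15", "b.py"),
]
trend = monthly_changes(sample_changes)
```

A file whose counts start high and fall away quickly would match the "ideal" shape; one that stays busy month after month is a candidate for a closer look.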


You may want to check out this tool: http://rise.cs.drexel.edu/~sunny/tools.rhtml
and look at the research of Dr. Yuanfang Cai (http://www.cs.drexel.edu/~yfcai/). Essentially Wong and Cai have done a more advanced version of what you are talking about above. They can also determine ownership (who you are likely to need to talk to in order to refactor that code) and discover modularity violations. This is done by looking at what files were checked in together, not just the frequency of checkins, which as someone pointed out above may or may not be important.
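To make "what files were checked in together" concrete, here is a very simple co-change count. This is not Wong and Cai's technique or their tool, only a basic pair tally; it assumes each commit is given as a list of file paths, and the names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each pair of files appears in the same commit."""
    pairs = Counter()
    for files in commits:
        # Deduplicate and sort so (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(set(files)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical sample data, for illustration only.
sample_commits = [
    ["a.py", "b.py"],
    ["a.py", "b.py", "c.py"],
    ["c.py"],
]
pair_counts = co_change_counts(sample_commits)
```

Pairs with high counts that live in supposedly independent modules are the ones worth inspecting first.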

Aaron Evans

Another perspective is "Why do you have so many files that don't change?" If your product isn't changing, that's one answer. Otherwise, they might not be necessary. Chances are more likely that they contain configuration code and empty structures full of boilerplate.

The idea of extension over modification is a good one for shielding cruft, which is its most likely use. Good working code should be easy to change; the strategy of encapsulation is for code you're scared to change. Besides, the original intent of the policy was to build a marketplace for commercial, closed source software.

But have you ever extended any closed source software?

Nirav Thaker

This is exactly my experience: http://blog.nirav.name/2008/03/unusual-tip-to-find-refactoring-hot.html I thought it was unusual at the time but now that I think about it, it only makes sense to look for source files with high revisions.

Luke Schubert

Definitely useful information that can guide refactoring. However, it seems to me that previous refactoring might also skew the results: a file that has been refactored would have more commits (the refactorings) than a file that hasn't yet been refactored ...

Heikki Naski

@Luke Schubert,
I wouldn't say refactorings downright skew the results. Yes, probably an often refactored file is of better quality than a file of similar complexity not refactored, but I'd think that refactoring efforts spread quite evenly among the files.

Simple logic and good quality code is less frequently refactored. A more often refactored file is probably one people had more problems with getting right, and it is more probable that there still are bugs there than in the less changed ones. But even though a large number of refactorings in a file, in relation to other files, should be considered a smell, I of course agree that a file with lots of refactorings is of lesser risk than one with lots of bug fixes and new features.

