One of the most interesting maxims of OO (to me) is the Open/Closed Principle. It states that modules should be open for extension but closed for modification. Bertrand Meyer coined the term and, although he was thinking of inheritance as a reuse mechanism, he hit upon something central in software development. It turns out that regardless of whether we are doing OO or not, we know we are programming well when we find ourselves not modifying our existing code as much. Code should grow by addition rather than mutation. Need a new feature? Introduce some new classes. If you find yourself adding all of the code inline in existing methods and classes, know that down that path madness lies. If you keep your classes and methods small and focused and you avoid tangling responsibilities, closure is often an emergent effect.
The Open/Closed Principle has been around for a while. Many developers attempt to use it consciously, and others do things which tend to promote closure without thinking about it. What is the net effect on code?
One of the nice things about having a version control history is that we can query it.
Here is a graph of the per-file commit frequencies of source files on a random open source project. Every point on the x axis is a file, and its height on y is the number of times it's been modified. The frequencies have been sorted in ascending order to make the picture sane:
One thing that's obvious is that there are a lot of files which have only been modified a very small number of times. There are also some files which get an extremely large number of commits. It's worth taking a look at the latter files and wondering why. In some cases, it's because the files are configurational - they consist of things like data-based settings in code that change frequently. At other times, they are havens of run-away complexity - snarls of logic that seem to accrete more and more code. Clearly the right side of the curve is where you want to expend some refactoring effort.
Another thing that is nice about this view of code is that you can get a sense of what your payback will be for your refactoring. The area under the curve is the total number of commits for the project. If we inch backward from the right, we can calculate something like the 5% mark - the files which you are likely to touch in one out of every twenty modifications. This may not seem like much, but it is over the duration of a project, especially when you factor in the effort of continually trying to understand that code. I have no doubt that there is a decent formula that we can use to calculate the likely savings from refactoring based on the ratio of adds to modifications, and the commits in a code base.
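To make that "5% mark" concrete, here is a rough sketch. It assumes you already have per-file commit counts in the "count filename" format that the `uniq -c` pipeline in the side note below produces; the counts and file names here are made up for illustration and fed in via printf.

```shell
# Walk in from the high end of the sorted curve until the running total
# crosses 5% of all commits; the files printed are your hot spots.
# (Faked counts; in practice, pipe in your real per-file commit counts.)
printf '1 docs/readme\n2 src/util.c\n3 src/io.c\n14 src/core.c\n' \
  | sort -n \
  | awk '{ count[NR] = $1; name[NR] = $2; total += $1 }
         END { for (i = NR; i >= 1; i--) {
                 running += count[i]
                 print name[i]
                 if (running >= total * 0.05) break } }'
```

With these made-up numbers, a single file absorbs well over 5% of all commits, which is exactly the kind of skew the graph shows.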
Side note:
It's relatively easy to make these diagrams yourself. grep the log of your VCS for lines that depict adds and modifications. Strip everything except the file names, sort it, run it through 'uniq -c', sort it again, and then plot the first column.
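A minimal sketch of that pipeline, run here over a faked log excerpt so it is self-contained. With a real git repository you would replace the printf with something like `git log --name-status --diff-filter=AM --pretty=format:`, which emits one "A" (added) or "M" (modified) line per file per commit.

```shell
# Strip to file names, count occurrences, sort ascending.
# (Faked --name-status excerpt; see the git log invocation above.)
printf 'M\tsrc/core.c\nM\tsrc/core.c\nA\tsrc/util.c\nM\tsrc/core.c\nM\tsrc/util.c\n' \
  | cut -f2 \
  | sort \
  | uniq -c \
  | sort -n
# Each output line is "<count> <file>", ascending; plot the first column.
```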
One problem is that it doesn't take into account that a dozen mods may be for one task or one bug, especially for us in the git world.
We were using Jira for both features and bugs (whether that's a good idea or not, we were). For my heatmap I took the commits and separated out the ticket number. Then I easily built a count of tickets per file.
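That tickets-per-file count might be sketched like this. The log excerpt and ticket ids here are hypothetical; with git you could generate the raw lines via something like `git log --name-only --pretty=format:'SUBJECT %s'`, assuming Jira-style ids (e.g. PROJ-123) appear in commit subjects.

```shell
# Pair each file with the ticket id from its commit subject, then count
# distinct tickets per file (so a dozen commits on one ticket count once).
printf 'SUBJECT PROJ-1 fix parser\nsrc/parser.c\nSUBJECT PROJ-2 tweak parser\nsrc/parser.c\nsrc/util.c\nSUBJECT PROJ-1 more fixes\nsrc/parser.c\n' \
  | awk '/^SUBJECT/ { match($0, /[A-Z]+-[0-9]+/)
                      ticket = substr($0, RSTART, RLENGTH); next }
         NF { pair = $0 SUBSEP ticket
              if (!(pair in seen)) { seen[pair] = 1; tickets[$0]++ } }
         END { for (f in tickets) print tickets[f], f }' \
  | sort -n
```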
Same result. Some files are hardly touched at all (often due to a rename refactor elsewhere) and others soak up changes like crazy. This 'volatility' stuff matters, and it's easy to measure now. I think we'll see new principles and practices form because it's so visible.
Changes absolutely do clump.
Other note: "Code should grow by addition rather than mutation" -- as long as the additions are not duplications. I've recently seen code that grew by duplication instead of editing, and it was painfully bad.
Posted by: Tim Ottinger | January 24, 2011 at 06:18 PM
@Tim Ottinger - yes, the way that mods are done definitely does matter. People are all over the map wrt how often they commit during a task. In the book 'Making Software' there's a chapter describing research that used an Eclipse variant which actually recorded what other files you looked at when you were making a change. I haven't seen anyone try to instrument that in vim or emacs yet :-) Could be useful information, though.
Posted by: Michael Feathers | January 24, 2011 at 06:29 PM
http://2011.msrconf.org has links to extracted version control datasets in their mining challenge.
Posted by: anonymous | January 24, 2011 at 08:32 PM
I wonder what the long term effect will be?
I'm assuming a 1:1 correlation between conceptual units and files.
In the long term, encapsulating what varies and what stays the same will leave you with more files that are modified only once. So the proportion will change, but you're not going to get rid of that peak.
Plotting long term busy-ness may be good for cleaning the whole codebase, but once that's done, the plot will only be useful if it considers short term busy-ness.
Posted by: Pete | January 25, 2011 at 01:28 AM
@anonymous Thanks!
Posted by: Michael Feathers | January 25, 2011 at 03:48 AM
@Pete Yes, once a file is highly churned it doesn't un-churn, so a time window makes sense. The shape will persist. I think that the take-away is that if we extract code from the hubs, we may reduce the frequency with which we will visit them.
I think it's like many of these sorts of dists in code. The shape is always the same, just different coefficients. Good to figure out where to look.
Posted by: Michael Feathers | January 25, 2011 at 03:54 AM
This could be a very useful way of finding out where to focus first efforts of unit testing in a project.
Posted by: Raph | January 25, 2011 at 07:38 AM
Just as change clumps in this way, so do bugs - especially in a relatively mature codebase. Pretty much every codebase I've ever worked on has had areas of brittle code which see a lot of change due to seeing lots of small tweaks and bug fixes.
So, if you do the same graphing exercise with bug counts the maps are pretty similar in shape and there is plenty of commonality.
The worst thing is that these brittle classes with high churn tend to be notorious areas that people don't like having to work in, so they tend to make "strategic" bug fixes rather than refactorings - thus compounding the issue.
Posted by: Phil Anderson | January 26, 2011 at 02:44 AM
(I think this idea is an extension of Pete's)
The best extraction of this data, in my opinion, would be per source code file grouped by a time frame (say a month).
Ideally the first value would be high as you're checking in several features, but then it would trend to zero very quickly. Then you should be able to attribute later blips to bug fixes. If instead you see any other trend, the code is most likely really bad -- lots of bugs over time, or new features requiring more and more commits (the code has likely violated the OCP) -- and should be refactored.
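That per-month bucketing might be sketched like this. The log lines here are faked with printf; with git, something like `git log --name-only --date=format:'%Y-%m' --pretty=format:'D %ad'` would produce the raw input (one date marker per commit, followed by its files).

```shell
# Count commits per file per month, so you can see whether a file's
# churn trends to zero or keeps absorbing changes over time.
printf 'D 2011-01\nsrc/core.c\nD 2011-01\nsrc/core.c\nD 2011-02\nsrc/core.c\nD 2011-02\nsrc/util.c\n' \
  | awk '/^D / { month = $2; next }
         NF { count[$0 " " month]++ }
         END { for (k in count) print count[k], k }' \
  | sort -k2,2 -k3,3
# Output: "<count> <file> <month>", grouped by file then month.
```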
Posted by: Austin Salonen | January 27, 2011 at 10:20 AM
you may want to check out this tool: http://rise.cs.drexel.edu/~sunny/tools.rhtml
and look at the research of Dr. Yuanfang Cai (http://www.cs.drexel.edu/~yfcai/). Essentially Wong and Cai have done a more advanced version of what you are talking about above; in addition, they can determine ownership (who you are likely to need to talk to in order to refactor that code) and discover modularity violations. This is done by looking at what files were checked in together, not just the frequency of checkins - which, as someone pointed out above, may or may not be important.
Posted by: wb | January 28, 2011 at 06:37 AM
Another perspective is "Why do you have so many files that don't change?" If your product isn't changing, that's one answer. Otherwise, they might not be necessary. Chances are more likely that they contain configuration code and empty structures full of boilerplate.
The idea of extension over modification is a good one for shielding cruft, which is its most likely use. Good working code should be easy to change; the strategy of encapsulation is for code you're scared to change. And besides, the original intent of the policy was to build a marketplace for commercial, closed source software.
But have you ever extended any closed source software?
Posted by: Aaron Evans | February 03, 2011 at 03:41 PM
This is exactly my experience: http://blog.nirav.name/2008/03/unusual-tip-to-find-refactoring-hot.html I thought it was unusual at the time but now that I think about it, it only makes sense to look for source files with high revisions.
Posted by: Nirav Thaker | February 05, 2011 at 07:26 AM
Definitely useful information that can guide refactoring. However, it seems to me that previous refactoring might also skew the results: a file that has been refactored would have more commits (the refactorings) than a file that hasn't yet been refactored ...
Posted by: Luke Schubert | February 06, 2011 at 09:38 PM
@Luke Schubert,
I wouldn't say refactorings downright skew the results. Yes, probably an often refactored file is of better quality than a file of similar complexity not refactored, but I'd think that refactoring efforts spread quite evenly among the files.
Simple logic and good quality code is less frequently refactored. A more often refactored file is probably one people had more problems getting right, and it is more probable that there are still bugs there than in the less changed ones. But even though a large number of refactorings in a file, relative to other files, should be considered a smell, I of course agree that a file with lots of refactorings is of lesser risk than one with lots of bug fixes and new features.
Posted by: Heikki Naski | February 10, 2011 at 09:44 AM