How do you know that you have the right architecture for your application? First of all, we have to define what "right" is, and there are many opinions.
One worthwhile goal for an application is to have classes that are reasonably independent. If we have some functional change to make, we should be able to go to a single class and make the change. If we have to modify several classes it could be because we are introducing a large feature, or it could be because of the way that our application is decomposed. Adding a field to a model class might require changes in a view, and that's normal. It's a side effect of separating representation and presentation.
In many cases, though, we end up making changes in several places because we have faulty abstractions. If many changes to one class necessitate changes to another, that is an important piece of information. It could be an indication of the code smell that Martin Fowler named 'Shotgun Surgery' in his book 'Refactoring: Improving the Design of Existing Code' [Addison-Wesley 1999]. 'Shotgun Surgery' is, essentially, the code smell you have when you find that adding features requires you to make changes spread across wide areas of the code base.
The sad thing about 'Shotgun Surgery' is that we're pretty much left to our day-to-day experience to detect it. We might get the sense that we are touching too many areas of our code, but that is only a general feeling. Do we really know which classes are tied together in our problem space? Do we know what other classes we are likely to touch when we touch the one we are working on? It turns out that we have all of the information we need to figure this out in our source code repositories.
Representing Code Change
Over the past few months, I've settled on an intermediate form that I use for data I gather from source code repositories. It makes change analysis and many other forms of analysis rather easy. The central data structure is a method-change-event. Each method change event has the following fields:
- type (method add, change, or delete)
- method name (fully qualified with class/modules)
- method body length
- file name
- sha1
- commit date
- committer
With this information, I can track the changes of individual methods across their entire lifetimes. I can also do higher level analysis of qualities associated with files and classes.
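As a rough illustration of the event structure described above, here is a minimal sketch in Ruby. The Struct name, the field names, the sample values, and the method_histories helper are all assumptions made for the example, not the actual implementation behind this analysis.

```ruby
require 'date'

# Hypothetical representation of a method change event.
# The field names mirror the list above but are assumptions.
MethodChangeEvent = Struct.new(
  :type, :method_name, :body_length, :file_name, :sha1, :commit_date, :committer,
  keyword_init: true
)

# Made-up sample data, purely for illustration.
events = [
  MethodChangeEvent.new(type: :change, method_name: 'Account#deposit', body_length: 7,
                        file_name: 'account.rb', sha1: 'b2',
                        commit_date: Date.new(2011, 9, 3), committer: 'bob'),
  MethodChangeEvent.new(type: :add, method_name: 'Account#deposit', body_length: 4,
                        file_name: 'account.rb', sha1: 'a1',
                        commit_date: Date.new(2011, 9, 1), committer: 'alice'),
  MethodChangeEvent.new(type: :add, method_name: 'Ledger#post', body_length: 10,
                        file_name: 'ledger.rb', sha1: 'a1',
                        commit_date: Date.new(2011, 9, 1), committer: 'alice')
]

# Group events by method name and order each group by commit date,
# giving the recorded lifetime of every method.
def method_histories(events)
  events.group_by(&:method_name)
        .transform_values { |history| history.sort_by(&:commit_date) }
end

histories = method_histories(events)

# Body length of one method over time.
deposit_lengths = histories['Account#deposit'].map(&:body_length)
```

With events in this shape, questions such as "how has this method grown?" reduce to a group_by followed by a map over one field.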
Approach
When methods from two or more classes are changed on the same day by the same committer, we can say that the classes are somewhat correlated in time. When this happens often, their correlation is high. For our purposes, it is probably sufficient to compute the pairwise correlation of classes.
Given that we have a set of events which represents the state of every method each time it is changed in our project's history, we should be able to find all of the classes that have been touched by each committer on each day. We can then take all pairs of those classes and collect them. When we do this for each committer and day and total the number of times each pair occurs across all of those groups, we should be able to see the relative frequencies of the pairs. When we sort them, we'll see which classes change most often with each other.
Fortunately, this sort of analysis is rather brief in Ruby. Here is the whole of the computation after a few extensions to Array and my method change event class:
    events.group_by {|e| [e.day,e.committer]}.values
          .map {|e| e.map(&:class_name).uniq.combination(2).to_a }
          .flatten(1).norm_pairs.freq_by {|e| e }.sort_by {|p| p[1] }
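The Array extensions and the event class are not shown here, so the following is only a guess at what they might look like. It assumes that norm_pairs orders each pair so that [B, A] and [A, B] count as the same pair, that freq_by returns a Hash of counts, and that an event's class name can be taken from the part of the method name before a '#'. The Event struct and the sample data are hypothetical.

```ruby
require 'date'

class Array
  # Assumed behavior: sort each pair so that order does not matter.
  def norm_pairs
    map(&:sort)
  end

  # Assumed behavior: count elements by the key the block returns.
  def freq_by
    each_with_object(Hash.new(0)) { |item, counts| counts[yield(item)] += 1 }
  end
end

# Hypothetical stand-in for the method change event class.
Event = Struct.new(:method_name, :commit_date, :committer) do
  def day
    commit_date.to_date
  end

  # Assumes names of the form "ClassName#method".
  def class_name
    method_name.split('#').first
  end
end

# Made-up events, purely for illustration.
events = [
  Event.new('Order#total',    Date.new(2011, 9, 1), 'alice'),
  Event.new('Invoice#render', Date.new(2011, 9, 1), 'alice'),
  Event.new('Invoice#render', Date.new(2011, 9, 2), 'alice'),
  Event.new('Order#add_item', Date.new(2011, 9, 2), 'alice'),
  Event.new('Order#total',    Date.new(2011, 9, 2), 'bob'),
  Event.new('Customer#name',  Date.new(2011, 9, 2), 'bob'),
  Event.new('Order#total',    Date.new(2011, 9, 3), 'alice')
]

# The same pipeline as above, applied to the sample events.
ranked = events.group_by { |e| [e.day, e.committer] }.values
               .map { |group| group.map(&:class_name).uniq.combination(2).to_a }
               .flatten(1).norm_pairs.freq_by { |pair| pair }.sort_by { |p| p[1] }

most_coupled = ranked.last
```

Under these assumptions the sort is ascending, so the most frequently co-changed pair ends up last. In the sample data that is Invoice and Order, which change together on two separate committer-days.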
When you examine these sorts of frequencies, they typically have this sort of shape:
Once you have a sorted list of the highest frequency class pairs, you can march through them and see if there are any surprises. Temporal correlation of change across domain objects is particularly informative. Altering the correlation range from a day to a week can be useful too.
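One possible way to vary the correlation range is to make the time bucket a parameter. The sketch below is an assumption-laden illustration rather than the code used for this analysis: the Change struct, the pair_counts helper, and the sample data are hypothetical, and a "week" is taken to mean an ISO calendar week.

```ruby
require 'date'

# Hypothetical, simplified change record: one class touched on one date.
Change = Struct.new(:class_name, :date, :committer)

# Count class pairs that share a committer and a time bucket.
# The block decides the bucket, so the window can be a day, a week, and so on.
def pair_counts(changes)
  changes.group_by { |c| [yield(c), c.committer] }
         .values
         .flat_map { |group| group.map(&:class_name).uniq.sort.combination(2).to_a }
         .each_with_object(Hash.new(0)) { |pair, counts| counts[pair] += 1 }
end

# Made-up data: two classes changed two days apart within the same ISO week.
changes = [
  Change.new('Order',   Date.new(2011, 9, 5), 'alice'),
  Change.new('Invoice', Date.new(2011, 9, 7), 'alice')
]

by_day  = pair_counts(changes) { |c| c.date }
by_week = pair_counts(changes) { |c| [c.date.cwyear, c.date.cweek] }
```

Here the daily window finds no pairs, while the weekly window links Invoice and Order once. Wider windows catch slower-moving coupling at the cost of more coincidental pairs.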
It would be great to plot that curve on a log scale. If it's close to straight, you know you have a power law!
Posted by: Dean Wampler | September 26, 2011 at 09:42 AM
Yet another power law popping out in software! There is something amazing going on here... probably a byproduct of the way the human mind is organizing things.
The concept of Temporal Correlation reminds me of artifact (source code) entanglement, as defined by Carlo Pescio:
"Two clusters of information are entangled when performing a change on one immediately requires a change on the other"
the concept is explored here in relation to change:
http://www.carlopescio.com/2011/01/notes-on-software-design-chapter-13-on.html
but the real meaning of the term was introduced earlier:
http://www.carlopescio.com/2010/11/notes-on-software-design-chapter-12.html
As I see it, your chart is a sampling of source code entanglement (where samples correspond to changes that actually happened to the code). It would be interesting to see the same chart for different code bases (languages, developers, programming styles...), to compare shape and values.
Posted by: Vic | September 26, 2011 at 11:55 AM
Read Cosma Shalizi regarding the difference between a power law and a lognormal distribution. In particular, "looks straight on a log plot" is definitely not the same thing as a power law.
Also, if there seems to be an obvious pattern in your data, Tukey suggests transforming it to remove the pattern and look at the residuals; it's almost always more informative than trying to look past the obvious pattern in the original plot.
For example, this curve slopes up. Taking differences is easy (e.g. convolve with a kernel of [-1, 1]), and allows you to see the next effect.
Posted by: Johnicholas | September 26, 2011 at 12:40 PM
How do you get the change event information? Some of it should obviously be tracked by any SCM, but identifying actual changes to a method seems quite a bit harder without a parser or ndepend-like tool to identify changes at the method level.
Posted by: Kalebpederson | September 27, 2011 at 07:44 AM
Nice stuff, Michael. I really like the code analysis stuff you've been doing lately.
Posted by: John Goodsen | October 09, 2011 at 09:48 AM