Bram Cohen speaks up about patience diff

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
Report Content as Inappropriate

Bram Cohen speaks up about patience diff

Johannes Schindelin

Bram sent me two mails explaining a bit more of the details behind
patience diff, and invited me to forward the content.  So here you are:

-- snipsnap --
From: Bram Cohen <[hidden email]>
To: Johannes Schindelin <[hidden email]>
Sent: Wednesday, February 4, 2009 3:52:54 PM
Subject: patience diff advantages/disadvantages

Hey Johannes, I just did some web searching for patience search and came
across the recent discussion of it. I'm not on whatever mailing list that
discussion was on, but I have some thoughts on it which you might wish to

The reverse engineered intuition that non-unique lines are 'less
important' than unique lines is a correct one, and specifically the case
of lots of curly bracket lines in C and END lines in Pascal and similar
cases are the main motivating factor.

For most common cases, any reasonable diff algorithm will give the same
result. The cases where the results are quite dramatic are ones where the
amount of change between the two versions is very large. The test cases
being tried in recent discussion are from developers familiar with version
control, who are being very judicious in their development process and
intentionally trying to keep things from diverging too badly. In those
cases, outputs will generally speaking always be the same and where they
aren't the differences are frequently a judgement call.

The really bad cases are ones where two versions have diverged
dramatically and the developer isn't being careful to keep patch sizes
under control. Under those circumstances a diff algorithm can occasionally
become 'misaligned' in that it matches long sections of curly brackets
together, but it winds up correlating the curly brackets of functions in
one version with the curly brackets of the next later function in the
other version. This situation is *very ugly*, and can result in a totally
unusable conflict file in the situation where you need such things to be
presented coherently the most. I have in fact seen it happen (in fact,
it's happened to me). I don't have a sample of it, but as you can imagine
such a sample case must by its nature be rather large and uninsightful -
it basically just looks ugly.

So 'unique lines' is a simple cross-language proxy for 'unimportant
lines'. 'short lines' of course also can work, but might give false
positives and it depends what you mean by 'short'. There's also an
implementation and code maintainability advantage to unique lines. The
patience algorithm is far simpler to understand and implement than the
fancy-shmancy recent LCS algorithms. While patience sorting is far from
obvious, LCS algorithms with good asymptotics are extremely opaque. I
suspect that a properly implemented patience diff could be just as fast if
not faster than LCS, but that would require quite a bit more algorithmic
magic than I for one am prepared to design, test and debug. In very real
terms the amount of time effort, and trickery the different
implementations have had put into them is very different, so current speed
differences aren't exactly a fair comparison (although far from
irrelevant, because nobody with the necessary expertise is volunteering to
spend that much time on patience diff as of yet).

Some other things I forgot to mention - it isn't clear from the discussion
whether you're doing the heuristic of matching beginning/ending lines when
everything else doesn't match (this heuristic could also be applied first,
which would be faster and almost always yield the same results). That's
important so that long sections of white space get matched to each other.
Also, I disagree that LCS is necessarily a good idea when matching, say
12121 to 212. What was meant by that case is highly ambiguous, and the
most conservative thing is to simply give a conflict for the whole
section. (If matching, say, 12121 to 1212 then the first lines match
heuristic will make them mostly match with an extra 1 at the end of the
first one).

The most difficult real-world diff cases have to do with lots of blank
lines, if we represent a blank line by 'S', then the diff from 'SSSSS' to
'SSSXSSS' is highly ambiguous - an extra S was added, but there are four
reasonable places for assuming it was put in, each of which a reasonable
case can be made for, and I suspect that there's an analogue to Arrow's
theorem that trying to do the right thing in particular edge cases must
necessarily result in intermediary cases where the criteria are in direct
conflict. (My patience implementation, by the way, will match the first
three lines of the five Ss to the the first three of the second set of Ss,
and the last two to the last two, with the string XS inserted into the
middle, which is arguably the best behavior, but could under some
circumstances result in very confused three-way merges dependending on
what happened on the other side).

To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html