Quantcast

diff'ing files ...

classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

diff'ing files ...

Albretch Mueller
 I research on collaborative editing/revision control utilities
(related mostly to NLP) and at the corpora list
~
 http://listserv.linguistlist.org/cgi-bin/wa?A1=ind1106&L=CORPORA#18
~
 ([Corpora-List] Managing texts and their edition history ...) they
told me they use git. However at:
~
 http://git-scm.com/documentation
~
 I could not find specific information on what diff'ing algorithm does
git use. Any white papers about git internals, mostly about its
diff'ing and synching algorithm? (and I am not scared at all of
"Directed Acyclic Graph" or any kind of Math terminology; in fact, I
love it ;-))
~
 Also, of the wave of git related or general books coming out, which
one actually explains the general concepts behind distributed version
Control? Again, my ultimate interest is not computer
programming/compiler-fed languages, but there is much that can be
learned from them
~
 Thank you
 lbrtchx
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Jeff King
On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:

>  I could not find specific information on what diff'ing algorithm does
> git use. Any white papers about git internals, mostly about its
> diff'ing and synching algorithm? (and I am not scared at all of
> "Directed Acyclic Graph" or any kind of Math terminology; in fact, I
> love it ;-))

It depends on what level you want to talk about. Each commit has a tree
state represented by a single "tree object". Each tree object is a list
of paths mapping to more objects, each of which may be a binary stream
of data ("blob") or another tree object (i.e., a subtree).

Diffing commits means diffing their trees. Diffing trees means looking
for blobs at the same path in each commit and diffing them[1]. How we
diff the content in the blobs depends on:

  1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the
     environment, or a per-file diff driver; see "git help attributes").
     In that case, we just show whatever output the external diff
     produces.

  2. We use xdiff by default, which does a line-oriented diff using the
     Myers algorithm (just like most "diff" implementations). See
     xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference
     Algorith and its Variations".

  3. If you give the --patience option, we use the patience algorithm
     instead:

       http://bramcohen.livejournal.com/73318.html

     That's the inventor's original description, but you can find other
     discussion by googling for "patience diff".

  4. If you use --word-diff (or --color-words), we find differences by
     line, and then break lines down by configurable delimiters into
     words, and then do a Myers diff on the resulting list of words.

  5. If files are not text, we usually just say "Binary files differ"
     (like most diff implementations). We can also generate binary
     diffs, though I don't know offhand the details of the algorithm.

I think that more or less covers 2-way diffing. There's also a 3-way
comparison when merging, both at the tree level and at the blob level.
Syncing actually has very little to do with our regular diffing. We
generate compact binary deltas (just like in (5) above) between objects,
regardless of their path, and send the other side a set of needed
objects. This is the same format used for compact "packfile" storage.
If the last bit of that paragraph didn't make any sense, then read up on
git's object model in something like "Pro Git":

  http://progit.org/book/ch9-0.html

-Peff

[1] Actually, tree comparisons are a little more complex than I made
them out to be. For instance, a path might exist on only one side (which
we usually show as a comparison to an empty blob), or might exist as a
tree on one side and a blob on the other. We also look for renames of
paths at the blob level by looking for similar blobs.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Albretch Mueller
> ... binary diffs, though I don't know offhand the details of the algorithm.
~
 this is the part that I need ;-)
~
 Reading the source code without knowing the basic underlying
ideas/algorithm (just an outline if possible) won't help much
~
 Thank you
 lbrtchx
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Jeff King
On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:

> > ... binary diffs, though I don't know offhand the details of the algorithm.
> ~
>  this is the part that I need ;-)
> ~
>  Reading the source code without knowing the basic underlying
> ideas/algorithm (just an outline if possible) won't help much

You could read the comments in the source:

  $ head -n 7 diff-delta.c
  /*
   * diff-delta.c: generate a delta between two buffers
   *
   * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
   * http://www.xmailserver.org/xdiff-lib.html
   *
   * Rewritten for GIT by Nicolas Pitre <[hidden email]>, (C) 2005-2007

According to the xdiff page linked:

  For binary files, LibXDiff implements both (with some modification)
  the algorithm described in File System Support for Delta Compression
  by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
  By Random Polynomials by Michael O. Rabin.

Nicolas (cc'd) might be able to say what, if any, substantive changes
were made from those works.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Nicolas Pitre-2
On Tue, 7 Jun 2011, Jeff King wrote:

> On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:
>
> > > ... binary diffs, though I don't know offhand the details of the algorithm.
> > ~
> >  this is the part that I need ;-)
> > ~
> >  Reading the source code without knowing the basic underlying
> > ideas/algorithm (just an outline if possible) won't help much
>
> You could read the comments in the source:
>
>   $ head -n 7 diff-delta.c
>   /*
>    * diff-delta.c: generate a delta between two buffers
>    *
>    * This code was greatly inspired by parts of LibXDiff from Davide Libenzi
>    * http://www.xmailserver.org/xdiff-lib.html
>    *
>    * Rewritten for GIT by Nicolas Pitre <[hidden email]>, (C) 2005-2007
>
> According to the xdiff page linked:
>
>   For binary files, LibXDiff implements both (with some modification)
>   the algorithm described in File System Support for Delta Compression
>   by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
>   By Random Polynomials by Michael O. Rabin.
>
> Nicolas (cc'd) might be able to say what, if any, substantive changes
> were made from those works.

The libxdiff code was pretty generic so to be highly portable and usable
for many application types.  What I did is to get rid of everything that
git strictly didn't need in order to make the code as simple as
possible, and most importantly as fast as possible.  And then the code
was optimized even further, sacrificing on clarity a bit, to make it
even faster.  Since this code is the very inner loop of every delta
search for best delta matches, every bit of optimization counts.

And then further modifications were made to avoid pathological corner
cases which were taking too much time for little gain in the Git
context.

I also changed the output encoding to make it tighter.

So, strictly speaking, the current code in Git doesn't bear any
resemblance with the libxdiff code at all.  However the basic algorithm
behind both implementations is the same.  Studying the libxdiff version
is probably easier in order to gain an understanding of how this works.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Jakub Narębski
Nicolas Pitre <[hidden email]> writes:
> On Tue, 7 Jun 2011, Jeff King wrote:
> > On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:
> >
> > > > ... binary diffs, though I don't know offhand the details of the algorithm.
> > > ~
> > >  this is the part that I need ;-)

[...]

> > According to the xdiff page linked:
> >
> >   For binary files, LibXDiff implements both (with some modification)
> >   the algorithm described in File System Support for Delta Compression
> >   by Joshua P.  MacDonald, and the algorithm described in Fingerprinting
> >   By Random Polynomials by Michael O. Rabin.
> >
> > Nicolas (cc'd) might be able to say what, if any, substantive changes
> > were made from those works.
>
> The libxdiff code was pretty generic so to be highly portable and usable
> for many application types.  What I did is to get rid of everything that
> git strictly didn't need in order to make the code as simple as
> possible, and most importantly as fast as possible. [...]
>
> And then further modifications were made to avoid pathological corner
> cases which were taking too much time for little gain in the Git
> context.
>
> I also changed the output encoding to make it tighter.

Nicolas, do you know how binary diff used by git compares with respect
to performance and compression with other binary diff algorithms:

  * original LibXDiff
  * bsdiff
  * xdelta (vcdif algorithm)
  * vbindiff

--
Jakub Narebski
Poland
ShadeHawk on #git
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Nicolas Pitre-2
On Fri, 10 Jun 2011, Jakub Narebski wrote:

> Nicolas Pitre <[hidden email]> writes:
> > The libxdiff code was pretty generic so to be highly portable and usable
> > for many application types.  What I did is to get rid of everything that
> > git strictly didn't need in order to make the code as simple as
> > possible, and most importantly as fast as possible. [...]
> >
> > And then further modifications were made to avoid pathological corner
> > cases which were taking too much time for little gain in the Git
> > context.
> >
> > I also changed the output encoding to make it tighter.
>
> Nicolas, do you know how binary diff used by git compares with respect
> to performance and compression with other binary diff algorithms:
>
>   * original LibXDiff
>   * bsdiff
>   * xdelta (vcdif algorithm)
>   * vbindiff

No idea.  But you can test that pretty easily if you wish.  I would be
interested in the results of course. Just do:

        make test-delta

and then, to compress something:

        ./test-delta -d <input_file> <reference_file> <output_file>

Of course this will produce <output_file> which is only the bare binary
diff annotation data against the reference file.  In Git that output is
also deflated with zlib, before it is stored in a pack.  The other
binary diff tools are usually doing a similar post-deflation pass as
well.

It should be noted that the algorithm that Git uses won't produce the
absolute smallest output.  When I tried that, computation time went up
of course, but surprizingly the final deflated result was slightly
_larger_ as well, probably due to the fact that zlib was less efficient
with the increased randomness in the tighter delta output to deflate.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: diff'ing files ...

Albretch Mueller
> further modifications were made to avoid pathological corner cases ...
~
 Nicolas, did you keep test samples of those pathological corner cases
you would share?
~
 lbrtchx
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Loading...