I research on collaborative editing/revision control utilities
(related mostly to NLP) and at the corpora list ~ http://listserv.linguistlist.org/cgi-bin/wa?A1=ind1106&L=CORPORA#18 ~ ([Corpora-List] Managing texts and their edition history ...) they told me they use git. However at: ~ http://git-scm.com/documentation ~ I could not find specific information on what diff'ing algorithm does git use. Any white papers about git internals, mostly about its diff'ing and synching algorithm? (and I am not scared at all of "Directed Acyclic Graph" or any kind of Math terminology; in fact, I love it ;-)) ~ Also, of the wave of git related or general books coming out, which one actually explains the general concepts behind distributed version Control? Again, my ultimate interest is not computer programming/compiler-fed languages, but there is much that can be learned from them ~ Thank you lbrtchx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
On Mon, Jun 06, 2011 at 06:49:04PM +0000, Albretch Mueller wrote:
> I could not find specific information on what diff'ing algorithm does > git use. Any white papers about git internals, mostly about its > diff'ing and synching algorithm? (and I am not scared at all of > "Directed Acyclic Graph" or any kind of Math terminology; in fact, I > love it ;-)) It depends on what level you want to talk about. Each commit has a tree state represented by a single "tree object". Each tree object is a list of paths mapping to more objects, each of which may be a binary stream of data ("blob") or another tree object (i.e., a subtree). Diffing commits means diffing their trees. Diffing trees means looking for blobs at the same path in each commit and diffing them[1]. How we diff the content in the blobs depends on: 1. any external diff configuration (e.g., GIT_EXTERNAL_DIFF in the environment, or a per-file diff driver; see "git help attributes"). In that case, we just show whatever output the external diff produces. 2. We use xdiff by default, which does a line-oriented diff using the Myers algorithm (just like most "diff" implementations). See xdiff/xdiffi.c in the git source, or Myers "An O(ND) Difference Algorith and its Variations". 3. If you give the --patience option, we use the patience algorithm instead: http://bramcohen.livejournal.com/73318.html That's the inventor's original description, but you can find other discussion by googling for "patience diff". 4. If you use --word-diff (or --color-words), we find differences by line, and then break lines down by configurable delimiters into words, and then do a Myers diff on the resulting list of words. 5. If files are not text, we usually just say "Binary files differ" (like most diff implementations). We can also generate binary diffs, though I don't know offhand the details of the algorithm. I think that more or less covers 2-way diffing. There's also a 3-way comparison when merging, both at the tree level and at the blob level. Syncing actually has very little to do with our regular diffing. We generate compact binary deltas (just like in (5) above) between objects, regardless of their path, and send the other side a set of needed objects. This is the same format used for compact "packfile" storage. If the last bit of that paragraph didn't make any sense, then read up on git's object model in something like "Pro Git": http://progit.org/book/ch9-0.html -Peff [1] Actually, tree comparisons are a little more complex than I made them out to be. For instance, a path might exist on only one side (which we usually show as a comparison to an empty blob), or might exist as a tree on one side and a blob on the other. We also look for renames of paths at the blob level by looking for similar blobs. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
> ... binary diffs, though I don't know offhand the details of the algorithm.
~ this is the part that I need ;-) ~ Reading the source code without knowing the basic underlying ideas/algorithm (just an outline if possible) won't help much ~ Thank you lbrtchx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote:
> > ... binary diffs, though I don't know offhand the details of the algorithm. > ~ > this is the part that I need ;-) > ~ > Reading the source code without knowing the basic underlying > ideas/algorithm (just an outline if possible) won't help much You could read the comments in the source: $ head -n 7 diff-delta.c /* * diff-delta.c: generate a delta between two buffers * * This code was greatly inspired by parts of LibXDiff from Davide Libenzi * http://www.xmailserver.org/xdiff-lib.html * * Rewritten for GIT by Nicolas Pitre <[hidden email]>, (C) 2005-2007 According to the xdiff page linked: For binary files, LibXDiff implements both (with some modification) the algorithm described in File System Support for Delta Compression by Joshua P. MacDonald, and the algorithm described in Fingerprinting By Random Polynomials by Michael O. Rabin. Nicolas (cc'd) might be able to say what, if any, substantive changes were made from those works. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
On Tue, 7 Jun 2011, Jeff King wrote:
> On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote: > > > > ... binary diffs, though I don't know offhand the details of the algorithm. > > ~ > > this is the part that I need ;-) > > ~ > > Reading the source code without knowing the basic underlying > > ideas/algorithm (just an outline if possible) won't help much > > You could read the comments in the source: > > $ head -n 7 diff-delta.c > /* > * diff-delta.c: generate a delta between two buffers > * > * This code was greatly inspired by parts of LibXDiff from Davide Libenzi > * http://www.xmailserver.org/xdiff-lib.html > * > * Rewritten for GIT by Nicolas Pitre <[hidden email]>, (C) 2005-2007 > > According to the xdiff page linked: > > For binary files, LibXDiff implements both (with some modification) > the algorithm described in File System Support for Delta Compression > by Joshua P. MacDonald, and the algorithm described in Fingerprinting > By Random Polynomials by Michael O. Rabin. > > Nicolas (cc'd) might be able to say what, if any, substantive changes > were made from those works. The libxdiff code was pretty generic so to be highly portable and usable for many application types. What I did is to get rid of everything that git strictly didn't need in order to make the code as simple as possible, and most importantly as fast as possible. And then the code was optimized even further, sacrificing on clarity a bit, to make it even faster. Since this code is the very inner loop of every delta search for best delta matches, every bit of optimization counts. And then further modifications were made to avoid pathological corner cases which were taking too much time for little gain in the Git context. I also changed the output encoding to make it tighter. So, strictly speaking, the current code in Git doesn't bear any resemblance with the libxdiff code at all. However the basic algorithm behind both implementations is the same. Studying the libxdiff version is probably easier in order to gain an understanding of how this works. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
Nicolas Pitre <[hidden email]> writes:
> On Tue, 7 Jun 2011, Jeff King wrote: > > On Tue, Jun 07, 2011 at 10:12:35PM +0000, Albretch Mueller wrote: > > > > > > ... binary diffs, though I don't know offhand the details of the algorithm. > > > ~ > > > this is the part that I need ;-) [...] > > According to the xdiff page linked: > > > > For binary files, LibXDiff implements both (with some modification) > > the algorithm described in File System Support for Delta Compression > > by Joshua P. MacDonald, and the algorithm described in Fingerprinting > > By Random Polynomials by Michael O. Rabin. > > > > Nicolas (cc'd) might be able to say what, if any, substantive changes > > were made from those works. > > The libxdiff code was pretty generic so to be highly portable and usable > for many application types. What I did is to get rid of everything that > git strictly didn't need in order to make the code as simple as > possible, and most importantly as fast as possible. [...] > > And then further modifications were made to avoid pathological corner > cases which were taking too much time for little gain in the Git > context. > > I also changed the output encoding to make it tighter. Nicolas, do you know how binary diff used by git compares with respect to performance and compression with other binary diff algorithms: * original LibXDiff * bsdiff * xdelta (vcdif algorithm) * vbindiff -- Jakub Narebski Poland ShadeHawk on #git -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
On Fri, 10 Jun 2011, Jakub Narebski wrote:
> Nicolas Pitre <[hidden email]> writes: > > The libxdiff code was pretty generic so to be highly portable and usable > > for many application types. What I did is to get rid of everything that > > git strictly didn't need in order to make the code as simple as > > possible, and most importantly as fast as possible. [...] > > > > And then further modifications were made to avoid pathological corner > > cases which were taking too much time for little gain in the Git > > context. > > > > I also changed the output encoding to make it tighter. > > Nicolas, do you know how binary diff used by git compares with respect > to performance and compression with other binary diff algorithms: > > * original LibXDiff > * bsdiff > * xdelta (vcdif algorithm) > * vbindiff No idea. But you can test that pretty easily if you wish. I would be interested in the results of course. Just do: make test-delta and then, to compress something: ./test-delta -d <input_file> <reference_file> <output_file> Of course this will produce <output_file> which is only the bare binary diff annotation data against the reference file. In Git that output is also deflated with zlib, before it is stored in a pack. The other binary diff tools are usually doing a similar post-deflation pass as well. It should be noted that the algorithm that Git uses won't produce the absolute smallest output. When I tried that, computation time went up of course, but surprizingly the final deflated result was slightly _larger_ as well, probably due to the fact that zlib was less efficient with the increased randomness in the tighter delta output to deflate. Nicolas -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
> further modifications were made to avoid pathological corner cases ...
~ Nicolas, did you keep test samples of those pathological corner cases you would share? ~ lbrtchx -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to [hidden email] More majordomo info at http://vger.kernel.org/majordomo-info.html |
Free forum by Nabble | Edit this page |