Narrow clone implementation difficulty estimate

Narrow clone implementation difficulty estimate

Alexander Gavrilov
Hello,

We are considering using Git to manage a large set of mostly binary
files (large images, pdf files, open-office documents, etc). The
amount of data is such that it is infeasible to force every user
to download all of it, so it is necessary to implement a partial
retrieval scheme.

In particular, we need to decide whether it is better to invest
effort into implementing Narrow Clone, or partitioning and
reorganizing the data set into submodules (the latter may prove
to be almost impossible for this data set). We will most likely
develop a new, very simplified GUI for non-technical users,
so the details of both possible approaches will be hidden
under the hood.


After some looking around, I think that Narrow clone would probably involve:

1. Modifying the revision walk engine used by the pack generator to
allow filtering blobs using a set of path masks. (Handling the same
tree object appearing at different paths may be tricky.)

2. Modifying the fetch protocol to allow sending such filter
expressions to the server.

3. Adding the necessary configuration entries and command-line
parameters to expose the new functionality (a purely hypothetical
sketch of such an interface follows this list).

4. Resurrecting the sparse checkout series and merging it with the
new filtering logic. A narrow clone must imply a sparse checkout that
is a subset of the cloned paths.

5. Fixing all breakage that may be caused by missing blobs.
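
To make points 3 and 4 more concrete, here is a purely hypothetical
sketch of the kind of interface we have in mind. None of these
options, commands or configuration keys exist in git today; they only
illustrate the intended shape of the feature:

    # clone only two subtrees of a large asset repository (hypothetical)
    git clone --narrow=images/catalog --narrow=docs \
        git://example.org/assets.git

    # the filter would be recorded in the repository configuration,
    # and the implied sparse checkout would be limited to these paths
    [narrow]
        path = images/catalog
        path = docs

    # widening the area later would rerun the filtered fetch with a
    # larger path set (hypothetical command)
    git narrow --add images/new-season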

I feel that the last point involves the most uncertainty, and may also
prove the most difficult one to implement. However, I cannot judge the
actual difficulty due to an incomplete understanding of Git internals.


I currently see the following additional problems with this approach:

1. Merge conflicts outside the filtered area cannot be handled.
However, in the case of this project they are estimated to be
extremely unlikely.

2. Changing the filter set is tricky, because extending the watched
area requires connecting to the server and requesting the missing
blobs. This action appears to be mostly identical to an initial clone
with a more complex filter. On the other hand, shrinking the area
would leave unnecessary data in the repository, which is difficult to
reuse safely if the area is later extended again. Finally, editing the
set without downloading the missing data essentially corrupts the
repository.

3. One of the goals of using git is building a distributed mirroring
system, similar to the gittorrent or mirror-sync proposals. Narrow
clone significantly complicates this because of the incomplete data
sets. A simple solution may be to restrict downloads to peers whose
set is a superset of what is needed, but that may cause the system to
degrade into a fully centralized one.


In relation to the last point, namely building a mirroring network, I
also had the idea that, in the current state of things, bundles may be
better suited to it, because they can be directly reused by many
peers, and deciding what to put into a bundle is not much of a problem
for this particular project. I expect that implementing narrow bundle
support would not be much different from narrow clone.
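
For illustration, with plain (non-narrow) bundles as they exist today,
the distribution step would look roughly like this; the file and
branch names are made up:

    # on the publishing side: pack everything reachable from master
    git bundle create assets.bundle master

    # on a peer: a bundle file can be cloned from like a remote
    git clone assets.bundle assets
    cd assets
    git bundle verify ../assets.bundle

A narrow bundle would presumably apply the same path filter when
deciding which objects go into the bundle.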


We are currently evaluating possible approaches to this problem, and
would like to know whether this analysis makes sense. We are willing
to contribute the results to the Git community if/when we implement
them.

Alexander

Re: Narrow clone implementation difficulty estimate

Jakub Narębski
Alexander Gavrilov <[hidden email]> writes:

> We are considering using Git to manage a large set of mostly binary
> files (large images, pdf files, open-office documents, etc). The
> amount of data is such that it is infeasible to force every user
> to download all of it, so it is necessary to implement a partial
> retrieval scheme.
>
> In particular, we need to decide whether it is better to invest
> effort into implementing Narrow Clone, or partitioning and
> reorganizing the data set into submodules (the latter may prove
> to be almost impossible for this data set). We will most likely
> develop a new, very simplified GUI for non-technical users,
> so the details of both possible approaches will be hidden
> under the hood.

First, there was fairly complete work on narrow / sparse / subtree /
partial *checkout*, although as far as I know it was never accepted
into git.  IIRC the general idea of extending or (ab)using the
assume-unchanged mechanism was accepted, but the problem was in the
user interface details (I think the porcelain part was quite well
received, except for some hesitation over whether to use/extend the
existing flag or create a new one for the purpose of narrow checkout).
You can search the archive for that;
  http://article.gmane.org/gmane.comp.version-control.git/89900
  http://article.gmane.org/gmane.comp.version-control.git/90016
  http://article.gmane.org/gmane.comp.version-control.git/77046
  http://article.gmane.org/gmane.comp.version-control.git/50256
  ...
should give you some idea what to search for. This is of course only
part of the solution.
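
For reference, the assume-unchanged bit those series build on can
already be toggled with plumbing; a minimal example (the path is made
up):

    # mark a path outside the area of interest as assume-unchanged
    git update-index --assume-unchanged big-assets/video.avi

    # a lowercase status letter means the bit is set
    git ls-files -v big-assets/video.avi

    # undo it
    git update-index --no-assume-unchanged big-assets/video.avi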

Second, there was an idea to use the new "replace" mechanism for this
(currently in 'pu' only, I think, merged as the 'cc/replace' branch).
This mechanism was created for better bisecting with non-bisectable
commits, and is meant to be a transferable extension of the 'graft'
mechanism. The "replace" mechanism also allows replacing blob objects
(the contents of a file), so you could have two repositories: a
baseline repository with stub files in place of the large binary
files, and an extended repository with the replacement refs and
replacement blobs in its object database carrying the 'proper' (and
large) contents of those binary files. But that is just an idea,
without an implementation.
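
A rough sketch of how that could look with replace refs (just an
illustration of the idea; the file names and workflow are made up):

    # in the extended repository: store the real contents of a file
    big=$(git hash-object -w real-data/huge-image.psd)

    # blob id of the stub as recorded in the baseline history
    stub=$(git rev-parse HEAD:images/huge-image.psd)

    # from now on the stub blob is transparently shown as the big one
    git update-ref refs/replace/$stub $big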

Third, there was work (a year ago, perhaps?) by Dana How on better
support for large objects. Some of those patches got accepted, some
didn't. You can set a maximum size for objects placed in a pack, IIRC,
and you can use gitattributes to mark (binary) files that should not
be delta-compressed. If all of your repositories are on a networked
filesystem, you can create a separate optimized pack containing only
those large binary files, mark it as "kept" (using a *.keep file, see
the documentation) to avoid repacking those large binary files, and
distribute this pack either using a symlink or using alternates
(keeping only one copy of the pack, and accessing it via the networked
filesystem when it is required).
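
For example, the pieces that already exist could be combined roughly
like this (the paths are made up; see the gitattributes, git-repack
and alternates documentation for the details):

    # .gitattributes: never try to delta-compress these
    *.pdf  -delta
    *.png  -delta
    *.odt  -delta

    # keep a hand-built pack of large files from being repacked
    touch .git/objects/pack/pack-<sha1>.keep

    # let several repositories borrow objects from one shared store
    echo /net/shared/assets.git/objects >> .git/objects/info/alternates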

Fourth, a long time ago a patch was sent that supposedly added
support for 'lazy' clone, where you download blob objects from the
remote repository only as required.  But it was sent as a single
large, fairly intrusive patch.  I don't think it got a good review,
never mind being accepted into git.


Some further reading:
* "large(25G) repository in git"
  http://article.gmane.org/gmane.comp.version-control.git/114351
* "Re: Appropriateness of git for digital video production versioning"
  http://article.gmane.org/gmane.comp.version-control.git/107696
* http://git.or.cz/gitwiki/GitTogether08 had a presentation
  about media files in git, and a thread on the git mailing list
  about that issue resulted from it (which I didn't bookmark).

HTH
--
Jakub Narebski
Poland
ShadeHawk on #git

Re: Narrow clone implementation difficulty estimate

Duy Nguyen
On Thu, May 14, 2009 at 8:39 PM, Jakub Narebski <[hidden email]> wrote:

> Alexander Gavrilov <[hidden email]> writes:
>
>> We are considering using Git to manage a large set of mostly binary
>> files (large images, pdf files, open-office documents, etc). The
>> amount of data is such that it is infeasible to force every user
>> to download all of it, so it is necessary to implement a partial
>> retrieval scheme.
>>
>> In particular, we need to decide whether it is better to invest
>> effort into implementing Narrow Clone, or partitioning and
>> reorganizing the data set into submodules (the latter may prove
>> to be almost impossible for this data set). We will most likely
>> develop a new, very simplified GUI for non-technical users,
>> so the details of both possible approaches will be hidden
>> under the hood.
>
> First, there was fairly complete work on narrow / sparse / subtree /
> partial *checkout*, although as far as I know it was never accepted
> into git.  IIRC the general idea of extending or (ab)using the
> assume-unchanged mechanism was accepted, but the problem was in the
> user interface details (I think the porcelain part was quite well
> received, except for some hesitation over whether to use/extend the
> existing flag or create a new one for the purpose of narrow checkout).
> You can search the archive for that;
>  http://article.gmane.org/gmane.comp.version-control.git/89900
>  http://article.gmane.org/gmane.comp.version-control.git/90016
>  http://article.gmane.org/gmane.comp.version-control.git/77046
>  http://article.gmane.org/gmane.comp.version-control.git/50256
>  ...
> should give you some idea what to search for. This is of course only
> part of the solution.

FWIW I still maintain the patch series as a merged branch "tp/sco"
under my branch "inst" here

http://repo.or.cz/w/git/pclouds.git?a=shortlog;h=refs/heads/inst
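
For anyone who wants to look at it, fetching that branch should be
something like this (the clone URL is assumed from the usual
repo.or.cz layout and the web URL above):

    git clone git://repo.or.cz/git/pclouds.git
    cd pclouds
    git checkout -b inst origin/inst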
--
Duy