Continue git clone after interruption

Tomasz Kontusz
Hi,
is anybody working on making it possible to continue git clone after
interruption? It would be quite useful for people with bad internet
connection (I was downloading a big repo lately, and it was a bit
frustrating to start it over every time git stopped at ~90%).

Tomasz Kontusz


Re: Continue git clone after interruption

Johannes Schindelin
Hi,

On Mon, 17 Aug 2009, Tomasz Kontusz wrote:

> is anybody working on making it possible to continue git clone after
> interruption? It would be quite useful for people with bad internet
> connection (I was downloading a big repo lately, and it was a bit
> frustrating to start it over every time git stopped at ~90%).

Unfortunately, we did not have enough GSoC slots for the project to allow
restartable clones.

There were discussions about how to implement this on the list, though.

Ciao,
Dscho


Re: Continue git clone after interruption

Shawn Pearce
Johannes Schindelin <[hidden email]> wrote:

> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
>
> > is anybody working on making it possible to continue git clone after
> > interruption? It would be quite useful for people with bad internet
> > connection (I was downloading a big repo lately, and it was a bit
> > frustrating to start it over every time git stopped at ~90%).
>
> Unfortunately, we did not have enough GSoC slots for the project to allow
> restartable clones.
>
> There were discussions about how to implement this on the list, though.

Unfortunately, those of us who know how the native protocol works
can't come to an agreement on how it might be restartable.  If you
really read the archives on this topic, you'll see that Nico and I
disagree about how to do this.  IIRC Nico's position is, it isn't
really possible to implement a restart.

--
Shawn.

Re: Continue git clone after interruption

Matthieu Moy
In reply to this post by Johannes Schindelin
Johannes Schindelin <[hidden email]> writes:

> Hi,
>
> On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
>
>> is anybody working on making it possible to continue git clone after
>> interruption? It would be quite useful for people with bad internet
>> connection (I was downloading a big repo lately, and it was a bit
>> frustrating to start it over every time git stopped at ~90%).
>
> Unfortunately, we did not have enough GSoC slots for the project to allow
> restartable clones.
>
> There were discussions about how to implement this on the list,
> though.

And a paragraph on the wiki:

http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone

--
Matthieu

Re: Continue git clone after interruption

Tomasz Kontusz
On Tue, 2009-08-18 at 07:43 +0200, Matthieu Moy wrote:

> Johannes Schindelin <[hidden email]> writes:
>
> > Hi,
> >
> > On Mon, 17 Aug 2009, Tomasz Kontusz wrote:
> >
> >> is anybody working on making it possible to continue git clone after
> >> interruption? It would be quite useful for people with bad internet
> >> connection (I was downloading a big repo lately, and it was a bit
> >> frustrating to start it over every time git stopped at ~90%).
> >
> > Unfortunately, we did not have enough GSoC slots for the project to allow
> > restartable clones.
> >
> > There were discussions about how to implement this on the list,
> > though.
>
> And a paragraph on the wiki:
>
> http://git.or.cz/gitwiki/SoC2009Ideas#RestartableClone

Ok, so it looks like it's not implementable without some kind of cache
server-side, so the server would know what the pack it was sending
looked like.
But here's my idea: make the server send objects in a different order
(the newest commit + whatever it points to first, then the next one,
then another...). Then it would be possible to look at what we got, tell
the server we have nothing, and want [the newest commit that was not
complete]. I know the reason why it is sorted the way it is, but I think
that the way data is stored after a clone is the client's problem, so the
client should reorganize packs the way it wants.

Tomasz K.


Re: Continue git clone after interruption

Nicolas Pitre
On Tue, 18 Aug 2009, Tomasz Kontusz wrote:

> Ok, so it looks like it's not implementable without some kind of cache
> server-side, so the server would know what the pack it was sending
> looked like.
> But here's my idea: make server send objects in different order (the
> newest commit + whatever it points to first, then next one,then
> another...). Then it would be possible to look at what we got, tell
> server we have nothing, and want [the newest commit that was not
> complete]. I know the reason why it is sorted the way it is, but I think
> that the way data is stored after clone is clients problem, so the
> client should reorganize packs the way it wants.

That won't buy you much.  You should realize that a pack is made of:

1) Commit objects.  Yes they're all put together at the front of the pack,
   but they roughly are the equivalent of:

        git log --pretty=raw | gzip | wc -c

   For the Linux repo as of now that is around 32 MB.

2) Tree and blob objects.  Those are the bulk of the content for the top
   commit.  The top commit is usually not delta compressed because we
   want fast access to the top commit, and that is used as the base for
   further delta compression for older commits.  So the very first
   commit is whole at the front of the pack right after the commit
   objects.  You can estimate the size of this data with:

        git archive --format=tar HEAD | gzip | wc -c

   On the same Linux repo this is currently 75 MB.

3) Delta objects.  Those are making the rest of the pack, plus a couple
   tree/blob objects that were not found in the top commit and are
   different enough from any object in that top commit not to be
   represented as deltas.  Still, the majority of objects for all the
   remaining commits are delta objects.

So... if we reorder objects, all that we can do is to spread commit
objects around so that the objects referenced by one commit are all seen
before another commit object is included.  That would cut on that
initial 32 MB.

However you still have to get that 75 MB in order to at least be able to
look at _one_ commit.  So you've only reduced your critical download
size from 107 MB to 75 MB.  This is some improvement, of course, but not
worth the bother IMHO.  If we're to have restartable clone, it has to
work for any size.

And that's where the real problem is.  I don't think having servers
cache pack results for every fetch request is sensible as that would be
an immediate DoS attack vector.

And because the object order in a pack is not defined by the protocol,
we cannot expect the server to necessarily always provide the same
object order either.  For example, it is already undefined in which
order you'll receive objects, as threaded delta search is
non-deterministic and two identical fetch requests may end up with slightly
different packing.  Or load balancing may redirect your fetch requests
to different git servers which might have different versions of zlib, or
even git itself, affecting the object packing order and/or size.

Now... What _could_ be done, though, is some extension to the
git-archive command.  One thing that is well and strictly defined in git
is the file path sort order.  So given a commit SHA1, you should always
get the same files in the same order from git-archive.  For an initial
clone, git could attempt fetching the top commit using the remote
git-archive service and locally reconstruct that top commit that way.  
If the transfer is interrupted in the middle, then the remote
git-archive could be told how to resume the transfer by telling it how
many files and how many bytes in the current file to skip.  This way the
server doesn't need to perform any sort of caching and remains
stateless.

You then end up with a pretty shallow repository.  The clone process
could then fall back to the traditional native git transfer protocol to
deepen the history of that shallow repository.  And then that special
packing sort order to distribute commit objects would make sense since
each commit would then have a fairly small set of new objects, and most
of them would be deltas anyway, making the data size per commit really
small and any interrupted transfer much less of an issue.
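
As a rough sketch of the "prime with the top commit, then deepen" flow (the
URL is a placeholder, and the --deepen/--unshallow options only appeared in
later git releases), the client side would look something like:

    # shallow clone: transfers roughly the top commit only
    git clone --depth 1 git://example.com/repo.git
    cd repo

    # then deepen the history incrementally...
    git fetch --deepen 1000

    # ...or fetch the rest of it in one go
    git fetch --unshallow

The missing piece is making that first step restartable, which is what the
git-archive based priming above is meant to provide.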


Nicolas

Re: Continue git clone after interruption

Jakub Narębski
Nicolas Pitre <[hidden email]> writes:

> On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
>
> > Ok, so it looks like it's not implementable without some kind of cache
> > server-side, so the server would know what the pack it was sending
> > looked like.
> > But here's my idea: make server send objects in different order (the
> > newest commit + whatever it points to first, then next one,then
> > another...). Then it would be possible to look at what we got, tell
> > server we have nothing, and want [the newest commit that was not
> > complete]. I know the reason why it is sorted the way it is, but I think
> > that the way data is stored after clone is clients problem, so the
> > client should reorganize packs the way it wants.
>
> That won't buy you much.  You should realize that a pack is made of:
>
> 1) Commit objects.  Yes they're all put together at the front of the pack,
>    but they roughly are the equivalent of:
>
> git log --pretty=raw | gzip | wc -c
>
>    For the Linux repo as of now that is around 32 MB.

For my clone of Git repository this gives 3.8 MB
 

> 2) Tree and blob objects.  Those are the bulk of the content for the top
>    commit.  The top commit is usually not delta compressed because we
>    want fast access to the top commit, and that is used as the base for
>    further delta compression for older commits.  So the very first
>    commit is whole at the front of the pack right after the commit
>    objects.  you can estimate the size of this data with:
>
> git archive --format=tar HEAD | gzip | wc -c
>
>    On the same Linux repo this is currently 75 MB.

On the same Git repository this gives 2.5 MB

>
> 3) Delta objects.  Those are making the rest of the pack, plus a couple
>    tree/blob objects that were not found in the top commit and are
>    different enough from any object in that top commit not to be
>    represented as deltas.  Still, the majority of objects for all the
>    remaining commits are delta objects.

You forgot that delta chains are bound by pack.depth limit, which
defaults to 50.  You would have then additional full objects.

The single packfile for this (just gc'ed) Git repository is 37 MB.
Much more than 3.8 MB + 2.5 MB = 6.3 MB.

[cut]

There is another way we could go to implement resumable clone.
Let git first try to clone the whole repository (single pack; BTW what
happens if this pack is larger than the file size limit for a given
filesystem?).  If it fails, the client first asks for the first half of
the repository (half as in bisect, but it is the server that has to
calculate it).  If that downloads, it will ask the server for the rest of
the repository.  If it fails, it would reduce the size by half again, and
ask for 1/4 of the repository in a packfile first.

The only extension required is for the server to support an additional
capability, which enables the client to ask for an appropriate 1/2^n part
of the repository (approximately), or 1/2^n between have and want.

--
Jakub Narebski
Poland
ShadeHawk on #git

Re: Continue git clone after interruption

Nicolas Pitre
On Tue, 18 Aug 2009, Jakub Narebski wrote:

> Nicolas Pitre <[hidden email]> writes:
>
> > On Tue, 18 Aug 2009, Tomasz Kontusz wrote:
> >
> > > Ok, so it looks like it's not implementable without some kind of cache
> > > server-side, so the server would know what the pack it was sending
> > > looked like.
> > > But here's my idea: make server send objects in different order (the
> > > newest commit + whatever it points to first, then next one,then
> > > another...). Then it would be possible to look at what we got, tell
> > > server we have nothing, and want [the newest commit that was not
> > > complete]. I know the reason why it is sorted the way it is, but I think
> > > that the way data is stored after clone is clients problem, so the
> > > client should reorganize packs the way it wants.
> >
> > That won't buy you much.  You should realize that a pack is made of:
> >
> > 1) Commit objects.  Yes they're all put together at the front of the pack,
> >    but they roughly are the equivalent of:
> >
> > git log --pretty=raw | gzip | wc -c
> >
> >    For the Linux repo as of now that is around 32 MB.
>
> For my clone of Git repository this gives 3.8 MB
>  
> > 2) Tree and blob objects.  Those are the bulk of the content for the top
> >    commit.  The top commit is usually not delta compressed because we
> >    want fast access to the top commit, and that is used as the base for
> >    further delta compression for older commits.  So the very first
> >    commit is whole at the front of the pack right after the commit
> >    objects.  you can estimate the size of this data with:
> >
> > git archive --format=tar HEAD | gzip | wc -c
> >
> >    On the same Linux repo this is currently 75 MB.
>
> On the same Git repository this gives 2.5 MB

Interesting to see that the commit history is larger than the latest
source tree.  Probably that would be the same with the Linux kernel as
well if all versions since the beginning with adequate commit logs were
included in the repo.

> > 3) Delta objects.  Those are making the rest of the pack, plus a couple
> >    tree/blob objects that were not found in the top commit and are
> >    different enough from any object in that top commit not to be
> >    represented as deltas.  Still, the majority of objects for all the
> >    remaining commits are delta objects.
>
> You forgot that delta chains are bound by pack.depth limit, which
> defaults to 50.  You would have then additional full objects.

Sure, but that's probably not significant.  The delta chain depth is
limited, but not the width.  A given base object can have unlimited
delta "children", and so on at each depth level.

> The single packfile for this (just gc'ed) Git repository is 37 MB.
> Much more than 3.8 MB + 2.5 MB = 6.3 MB.

What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to
be occupied by deltas.

> [cut]
>
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).

We currently fail.  Seems that no one ever had a problem with that so
far. We'd have to split the pack stream into multiple packs on the
receiving end.  But frankly, if you have a repository large enough to
bust your filesystem's file size limit then maybe you should seriously
reconsider your choice of development environment.

> If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it).  If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.

The problem people with slow links have won't be helped at all by this.
What if the network connection gets broken only after 49% of the
transfer and that took 3 hours to download?  You'll attempt a 25% size
transfer which would take 1.5 hours despite the fact that you already
spent that much time downloading that first 1/4 of the repository.
And yet what if you're unlucky and now the network craps on
you after 23% of that second attempt?

I think it is better to "prime" the repository with the content of the
top commit in the most straight forward manner using git-archive which
has the potential to be fully restartable at any point with little
complexity on the server side.


Nicolas

Re: Continue git clone after interruption

Jakub Narębski
On Tue, 18 Aug 2009, Nicolas Pitre wrote:
> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>> Nicolas Pitre <[hidden email]> writes:

>>> That won't buy you much.  You should realize that a pack is made of:
>>>
>>> 1) Commit objects.  Yes they're all put together at the front of the pack,
>>>    but they roughly are the equivalent of:
>>>
>>> git log --pretty=raw | gzip | wc -c
>>>
>>>    For the Linux repo as of now that is around 32 MB.
>>
>> For my clone of Git repository this gives 3.8 MB
>>  
>>> 2) Tree and blob objects.  Those are the bulk of the content for the top
>>>    commit. [...]  You can estimate the size of this data with:
>>>
>>> git archive --format=tar HEAD | gzip | wc -c
>>>
>>>    On the same Linux repo this is currently 75 MB.
>>
>> On the same Git repository this gives 2.5 MB
>
> Interesting to see that the commit history is larger than the latest
> source tree.  Probably that would be the same with the Linux kernel as
> well if all versions since the beginning with adequate commit logs were
> included in the repo.

Note that having a reflog and/or a patch management interface like StGit,
and frequently reworking commits (e.g. using rebase), means more commit
objects in the repository.

Also, the Git repository has 3 independent branches: 'man', 'html' and
'todo', whose objects are not included in "git archive HEAD".

>
>>> 3) Delta objects.  Those are making the rest of the pack, plus a couple
>>>    tree/blob objects that were not found in the top commit and are
>>>    different enough from any object in that top commit not to be
>>>    represented as deltas.  Still, the majority of objects for all the
>>>    remaining commits are delta objects.
>>
>> You forgot that delta chains are bound by pack.depth limit, which
>> defaults to 50.  You would have then additional full objects.
>
> Sure, but that's probably not significant.  the delta chain depth is
> limited, but not the width.  A given base object can have unlimited
> delta "children", and so on at each depth level.

You can probably get number and size taken by delta and non-delta (base)
objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
nor contrib/stats/packinfo.pl did help me arrive at this data.

>> The single packfile for this (just gc'ed) Git repository is 37 MB.
>> Much more than 3.8 MB + 2.5 MB = 6.3 MB.
>
> What I'm saying is that most of that 37 MB - 6.3 MB = 31 MB is likely to
> be occupied by deltas.

True.
 

>> [cut]
>>
>> There is another way which we can go to implement resumable clone.
>> Let's git first try to clone whole repository (single pack; BTW what
>> happens if this pack is larger than file size limit for given
>> filesystem?).
>
> We currently fail.  Seems that no one ever had a problem with that so
> far. We'd have to split the pack stream into multiple packs on the
> receiving end.  But frankly, if you have a repository large enough to
> bust your filesystem's file size limit then maybe you should seriously
> reconsider your choice of development environment.

Do we fail gracefully (with an error message), or does git crash then?

If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
FAT is often used on SSD, on USB drive.  Although if you have  2 GB
packfile, you are doing something wrong, or UGFWIINI (Using Git For
What It Is Not Intended).
 

>> If it fails, client ask first for first half of of
>> repository (half as in bisect, but it is server that has to calculate
>> it).  If it downloads, it will ask server for the rest of repository.
>> If it fails, it would reduce size in half again, and ask about 1/4 of
>> repository in packfile first.
>
> Problem people with slow links have won't be helped at all with this.  
> What if the network connection gets broken only after 49% of the
> transfer and that took 3 hours to download?  You'll attempt a 25% size
> transfer which would take 1.5 hour despite the fact that you already
> spent that much time downloading that first 1/4 of the repository
> already.  And yet what if you're unlucky and now the network craps on
> you after 23% of that second attempt?

A modification then.

First try an ordinary clone.  If it fails because the network is
unreliable, check how much we downloaded, and ask the server for a
packfile of slightly smaller size; this means that we are asking the
server for an approximate pack size limit, not for bisect-like
partitioning of the revision list.

> I think it is better to "prime" the repository with the content of the
> top commit in the most straight forward manner using git-archive which
> has the potential to be fully restartable at any point with little
> complexity on the server side.

But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?

A question about pack protocol negotiation.  If the client presents some
objects as "have", the server can and does assume that the client has all
prerequisites for such objects, e.g. for tree objects that it has
all objects for files and directories inside the tree; for a commit it means
all ancestors and all objects in its snapshot (the top tree, and its
prerequisites).  Do I understand this correctly?

If we have a partial packfile whose download was interrupted, can we
extract some full objects (including blobs) from it?  Can we pass
tree and blob objects as "have" to the server, and is that taken into
account?  Perhaps instead of a separate step of resumable downloading of
the top commit's objects (its snapshot), we could pass to the server what
we did download in full?


BTW. because of compression it might be more difficult to resume
archive creation in the middle, I think...

--
Jakub Narebski
Poland

Re: Continue git clone after interruption

Nicolas Pitre
On Tue, 18 Aug 2009, Jakub Narebski wrote:

> You can probably get number and size taken by delta and non-delta (base)
> objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
> nor contrib/stats/packinfo.pl did help me arrive at this data.

Documentation for verify-pack says:

|When specifying the -v option the format used is:
|
|        SHA1 type size size-in-pack-file offset-in-packfile
|
|for objects that are not deltified in the pack, and
|
|        SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
|
|for objects that are deltified.

So a simple script should be able to give you the answer.
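
For example, a minimal sketch (not from the thread, and assuming a single
pack under .git/objects/pack) that splits size-in-packfile between delta
and non-delta objects:

    git verify-pack -v .git/objects/pack/pack-*.idx |
    awk '$2 ~ /^(commit|tree|blob|tag)$/ {
           # deltified entries have 7 fields, non-delta entries have 5
           if (NF >= 7) { nd++; ds += $4 } else { nb++; bs += $4 }
         }
         END {
           printf "non-delta: %d objects, %d bytes in pack\n", nb, bs
           printf "delta:     %d objects, %d bytes in pack\n", nd, ds
         }'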

> >> (BTW what happens if this pack is larger than file size limit for
> >> given filesystem?).
> >
> > We currently fail.  Seems that no one ever had a problem with that so
> > far. We'd have to split the pack stream into multiple packs on the
> > receiving end.  But frankly, if you have a repository large enough to
> > bust your filesystem's file size limit then maybe you should seriously
> > reconsider your choice of development environment.
>
> Do we fail gracefully (with an error message), or does git crash then?

If the filesystem is imposing the limit, it will likely return an error
on the write() call and we'll die().  If the machine has a too small
off_t for the received pack then we also die("pack too large for current
definition of off_t").

> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> What It Is Not Intended).

Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
For physical transport you may repack with the appropriate switches.

> >> If it fails, client ask first for first half of of
> >> repository (half as in bisect, but it is server that has to calculate
> >> it).  If it downloads, it will ask server for the rest of repository.
> >> If it fails, it would reduce size in half again, and ask about 1/4 of
> >> repository in packfile first.
> >
> > Problem people with slow links have won't be helped at all with this.  
> > What if the network connection gets broken only after 49% of the
> > transfer and that took 3 hours to download?  You'll attempt a 25% size
> > transfer which would take 1.5 hour despite the fact that you already
> > spent that much time downloading that first 1/4 of the repository
> > already.  And yet what if you're unlucky and now the network craps on
> > you after 23% of that second attempt?
>
> A modification then.
>
> First try ordinary clone.  If it fails because network is unreliable,
> check how much we did download, and ask server for packfile of slightly
> smaller size; this means that we are asking server for approximate pack
> size limit, not for bisect-like partitioning revision list.

If the download didn't reach past the critical point (75 MB in my linux
repo example) then you cannot validate the received data and you've
wasted that much bandwidth.

> > I think it is better to "prime" the repository with the content of the
> > top commit in the most straight forward manner using git-archive which
> > has the potential to be fully restartable at any point with little
> > complexity on the server side.
>
> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?

The front of the pack is the critical point.  If you get enough to
create the top commit then further transfers can be done incrementally
with only the deltas between each commits.

> A question about pack protocol negotiation.  If clients presents some
> objects as "have", server can and does assume that client has all
> prerequisites for such objects, e.g. for tree objects that it has
> all objects for files and directories inside tree; for commit it means
> all ancestors and all objects in snapshot (have top tree, and its
> prerequisites).  Do I understand this correctly?

That works only for commits.

> If we have partial packfile which crashed during downloading, can we
> extract from it some full objects (including blobs)?  Can we pass
> tree and blob objects as "have" to server, and is it taken into account?

No.

> Perhaps instead of separate step of resumable-downloading of top commit
> objects (in snapshot), we can pass to server what we did download in
> full?

See above.

> BTW. because of compression it might be more difficult to resume
> archive creation in the middle, I think...

Why so?  the tar+gzip format is streamable.


Nicolas

Re: Continue git clone after interruption

Johannes Schindelin
In reply to this post by Nicolas Pitre
Hi,

On Tue, 18 Aug 2009, Nicolas Pitre wrote:

> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>
> > There is another way which we can go to implement resumable clone.
> > Let's git first try to clone whole repository (single pack; BTW what
> > happens if this pack is larger than file size limit for given
> > filesystem?).
>
> We currently fail.  Seems that no one ever had a problem with that so
> far.

They just went away, most probably.

But seriously, I miss a very important idea in this discussion: we control
the Git source code.  So we _can_ add an upload_pack feature that a client
can ask for after the first failed attempt.

Ciao,
Dscho


Re: Continue git clone after interruption

Nicolas Pitre
On Wed, 19 Aug 2009, Johannes Schindelin wrote:

> Hi,
>
> On Tue, 18 Aug 2009, Nicolas Pitre wrote:
>
> > On Tue, 18 Aug 2009, Jakub Narebski wrote:
> >
> > > There is another way which we can go to implement resumable clone.
> > > Let's git first try to clone whole repository (single pack; BTW what
> > > happens if this pack is larger than file size limit for given
> > > filesystem?).
> >
> > We currently fail.  Seems that no one ever had a problem with that so
> > far.
>
> They just went away, most probably.

Most probably they simply don't exist.  I would be highly surprised
otherwise.

> But seriously, I miss a very important idea in this discussion: we control
> the Git source code.  So we _can_ add a upload_pack feature that a client
> can ask for after the first failed attempt.

Indeed.  So what do you think about my proposal?  It was included in my
first reply to this thread.


Nicolas

Re: Continue git clone after interruption

Sitaram Chamarty
In reply to this post by Jakub Narębski
On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski<[hidden email]> wrote:
> There is another way which we can go to implement resumable clone.
> Let's git first try to clone whole repository (single pack; BTW what
> happens if this pack is larger than file size limit for given
> filesystem?).  If it fails, client ask first for first half of of
> repository (half as in bisect, but it is server that has to calculate
> it).  If it downloads, it will ask server for the rest of repository.
> If it fails, it would reduce size in half again, and ask about 1/4 of
> repository in packfile first.

How about an extension where the user can *ask* for a clone of a
particular HEAD to be sent to him as a git bundle?  Or particular
revisions (say once a week) were kept as a single file git-bundle,
made available over HTTP -- easily restartable with byte-range -- and
anyone who has bandwidth problems first gets that, then changes the
origin remote URL and does a "pull" to get up to date?

I've done this manually a few times when sneakernet bandwidth was
better than the normal kind, heh, but it seems to me the lowest impact
solution.

Yes you'd need some extra space on the server, but you keep only one
bundle, and maybe replace it every week by cron.  Should work fine
right now, as is, with a wee bit of manual work by the user, and a
quick cron entry on the server.
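
For concreteness, the manual workflow above, with placeholder URLs, is
roughly:

    # grab the published bundle over HTTP; -c resumes via byte ranges
    wget -c http://example.com/project/latest.bundle

    # clone from the bundle, then re-point origin at the live repository
    # and pull to pick up whatever happened after the bundle was made
    git clone latest.bundle project
    cd project
    git remote set-url origin git://example.com/project.git
    git pull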

Re: Continue git clone after interruption

Nguyễn Thái Ngọc Duy
In reply to this post by Nicolas Pitre
On Wed, Aug 19, 2009 at 2:35 PM, Johannes
Schindelin<[hidden email]> wrote:
> But here comes an idea: together with Nguy要's sparse series, it is

FWIW, you can write "Nguyen" instead. It might save you one copy/paste
(I take it you don't have a Vietnamese IM ;-)

> conceivable that we support a shallow & narrow clone via the upload-pack
> protocol (also making mithro happy).  The problem with narrow clones was
> not the pack generation side, that is done by a rev-list that can be
> limited to certain paths.  The problem was that we end up with missing
> tree objects.  However, if we can make a sparse checkout, we can avoid
> the problem.

But then git-fsck, git-archive... will die?
--
Duy

Re: Continue git clone after interruption

Johannes Schindelin
Hi,

On Wed, 19 Aug 2009, Nguyen Thai Ngoc Duy wrote:

> On Wed, Aug 19, 2009 at 2:35 PM, Johannes
> Schindelin<[hidden email]> wrote:
> > But here comes an idea: together with Nguy要's sparse series, it is
>
> FWIW, you can write "Nguyen" instead. It might save you one copy/paste
> (I take it you don't have a Vietnamese IM ;-)

FWIW I originally wrote Nguyễn (not that Chinese(?) character)... I look
it up every time I want to write your name by searching my address book for
"pclouds". ;-)

> > conceivable that we support a shallow & narrow clone via the
> > upload-pack protocol (also making mithro happy).  The problem with
> > narrow clones was not the pack generation side, that is done by a
> > rev-list that can be limited to certain paths.  The problem was that
> > we end up with missing tree objects.  However, if we can make a sparse
> > checkout, we can avoid the problem.
>
> But then git-fsck, git-archive... will die?

Oh, but they should be made aware of the narrow clone, just like for
shallow clones.

Ciao,
Dscho

Re: Continue git clone after interruption

Jakub Narębski
In reply to this post by Sitaram Chamarty
On Wed, Aug 19, 2009, Sitaram Chamarty wrote:
> On Wed, Aug 19, 2009 at 12:15 AM, Jakub Narebski<[hidden email]> wrote:

> > There is another way which we can go to implement resumable clone.
> > Let's git first try to clone whole repository (single pack; BTW what
> > happens if this pack is larger than file size limit for given
> > filesystem?).  If it fails, client ask first for first half of of
> > repository (half as in bisect, but it is server that has to calculate
> > it).  If it downloads, it will ask server for the rest of repository.
> > If it fails, it would reduce size in half again, and ask about 1/4 of
> > repository in packfile first.
>
> How about an extension where the user can *ask* for a clone of a
> particular HEAD to be sent to him as a git bundle?  Or particular
> revisions (say once a week) were kept as a single file git-bundle,
> made available over HTTP -- easily restartable with byte-range -- and
> anyone who has bandwidth problems first gets that, then changes the
> origin remote URL and does a "pull" to get uptodate?
>
> I've done this manually a few times when sneakernet bandwidth was
> better than the normal kind, heh, but it seems to me the lowest impact
> solution.
>
> Yes you'd need some extra space on the server, but you keep only one
> bundle, and maybe replace it every week by cron.  Should work fine
> right now, as is, with a wee bit of manual work by the user, and a
> quick cron entry on the server

This is a good idea, I think, and it can be implemented with varying
amounts of effort and changes to git, and varying degrees of seamless
integration.

1. Simplest solution: social (homepage).  Not integrated at all.

   On the project's homepage, the one that describes where the project
   repository is and how to get it, you add a link to the most recent bundle
   (perhaps in addition to the most recent snapshot).  This bundle would be
   served as a static file via HTTP (and perhaps also FTP) by (any) web
   server that supports resuming (range requests).  Or you can make the
   server generate bundles on demand, only when they are first requested.

   Most recent might mean the latest tagged release, or it might mean a
   daily snapshot^W bundle.

   This solution could be integrated into gitweb, either by a generic
   'latest bundle' link in the project's README.html (or in the site's
   GITWEB_HOMETEXT, default indextext.html), or by having gitweb
   generate those links (and perhaps the bundles as well) by itself.

2. Seamless solution: a 'bundle' or 'bundles' capability.  Requires
   changes to both server and client.

   If the server supports (advertises) the 'bundle' capability, it can serve
   a list of bundles (as HTTP / FTP / rsync URLs), either at the client's
   request or after (or before) the list of refs if the client requests the
   'bundle' capability.

   If the client has support for the 'bundles' capability, it terminates the
   connection to sshd or git-daemon and does an ordinary resumable HTTP
   fetch using libcurl.  After the bundle is fully downloaded, the client
   clones from the bundle and does a git-fetch against the same server as
   before, which would then have less to transfer.  The client also has to
   handle the situation where the bundle download is interrupted, and not
   clean up, allowing for "git clone --continue".

3. Seamless solution: GitTorrent or its simplification, git mirror-sync.

   I think that GitTorrent (see http://git.or.cz/gitwiki/SoC2009Ideas)
   or even its simplification git-mirror-sync would include restartable
   cloning.  It is even among its intended features.  This would also
   help to download faster via mirrors, which can have faster and better
   network connections.

   But this would be the most work.

You can implement solution 1 even now...
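
For example, solution 1 can be as little as a weekly cron job on the server
(paths here are only an example):

    #!/bin/sh
    # /etc/cron.weekly/git-bundle: refresh the published bundle so the web
    # server can serve it as a plain, range-resumable static file
    cd /srv/git/project.git &&
    git bundle create /var/www/downloads/project.bundle.tmp --all &&
    mv /var/www/downloads/project.bundle.tmp /var/www/downloads/project.bundle

Note that each refresh invalidates any half-downloaded copy of the previous
bundle, so a client that resumes across the refresh has to start over.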
--
Jakub Narebski
Poland

Re: Continue git clone after interruption

Jakub Narębski
In reply to this post by Nicolas Pitre
On Tue, 18 Aug 2009, Nicolas Pitre wrote:

> On Tue, 18 Aug 2009, Jakub Narebski wrote:
>
>> You can probably get number and size taken by delta and non-delta (base)
>> objects in the packfile somehow.  Neither "git verify-pack -v <packfile>"
>> nor contrib/stats/packinfo.pl did help me arrive at this data.
>
> Documentation for verify-pack says:
>
> |When specifying the -v option the format used is:
> |
> |        SHA1 type size size-in-pack-file offset-in-packfile
> |
> |for objects that are not deltified in the pack, and
> |
> |        SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
> |
> |for objects that are deltified.
>
> So a simple script should be able to give you the answer.

Thanks.

There are 114937 objects in this packfile, including 56249 objects
used as base (can be deltified or not).  git-verify-pack -v shows
that all objects have total size-in-packfile of 33 MB (which agrees
with packfile size of 33 MB), with 17 MB size-in-packfile taken by
deltaified objects, and 16 MB taken by base objects.

  git verify-pack -v |
    grep -v "^chain" |
    grep -v "objects/pack/pack-" > verify-pack.out

  sum=0; bsum=0; dsum=0;
  while read sha1 type size packsize off depth base; do
    echo "$sha1" >> verify-pack.sha1.out
    sum=$(( $sum + $packsize ))
    if [ -n "$base" ]; then
       echo "$sha1" >> verify-pack.delta.out
       dsum=$(( $dsum + $packsize ))
    else
       echo "$sha1" >> verify-pack.base.out
       bsum=$(( $bsum + $packsize ))
    fi
  done < verify-pack.out
  echo "sum=$sum; bsum=$bsum; dsum=$dsum"
 
>>>> (BTW what happens if this pack is larger than file size limit for
>>>> given filesystem?).
[...]

>> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
>> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
>> packfile, you are doing something wrong, or UGFWIINI (Using Git For
>> What It Is Not Intended).
>
> Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
> For physical transport you may repack with the appropriate switches.

Not off a FAT filesystem, but into a FAT filesystem.
 
[...]

>>> I think it is better to "prime" the repository with the content of the
>>> top commit in the most straight forward manner using git-archive which
>>> has the potential to be fully restartable at any point with little
>>> complexity on the server side.
>>
>> But didn't it make fully restartable 2.5 MB part out of 37 MB packfile?
>
> The front of the pack is the critical point.  If you get enough to
> create the top commit then further transfers can be done incrementally
> with only the deltas between each commits.

How?  You have some objects that can be used as base; how to tell
git-daemon that we have them (but not their prerequisites), and how
to generate incrementals?

>> A question about pack protocol negotiation.  If clients presents some
>> objects as "have", server can and does assume that client has all
>> prerequisites for such objects, e.g. for tree objects that it has
>> all objects for files and directories inside tree; for commit it means
>> all ancestors and all objects in snapshot (have top tree, and its
>> prerequisites).  Do I understand this correctly?
>
> That works only for commits.

Hmmmm... how do you intend for "prefetch top objects restartable-y first"
to work, then?
 
>> BTW. because of compression it might be more difficult to resume
>> archive creation in the middle, I think...
>
> Why so?  the tar+gzip format is streamable.

The gzip format uses a sliding window in compression: "cat a b | gzip"
is different from "cat <(gzip a) <(gzip b)".

But that doesn't matter.  If we are interrupted in the middle, we can
uncompress what we have to check how far we got, and tell the server
to send the rest; this way the server wouldn't even have to generate
(let alone send) the part we already received.

P.S. What do you think about 'bundle' capability extension mentioned
     in a side sub-thread?
--
Jakub Narebski
Poland

Re: Continue git clone after interruption

Nicolas Pitre
On Wed, 19 Aug 2009, Jakub Narebski wrote:

> There are 114937 objects in this packfile, including 56249 objects
> used as base (can be deltified or not).  git-verify-pack -v shows
> that all objects have total size-in-packfile of 33 MB (which agrees
> with packfile size of 33 MB), with 17 MB size-in-packfile taken by
> deltaified objects, and 16 MB taken by base objects.
>
>   git verify-pack -v |
>     grep -v "^chain" |
>     grep -v "objects/pack/pack-" > verify-pack.out
>
>   sum=0; bsum=0; dsum=0;
>   while read sha1 type size packsize off depth base; do
>     echo "$sha1" >> verify-pack.sha1.out
>     sum=$(( $sum + $packsize ))
>     if [ -n "$base" ]; then
>        echo "$sha1" >> verify-pack.delta.out
>        dsum=$(( $dsum + $packsize ))
>     else
>        echo "$sha1" >> verify-pack.base.out
>        bsum=$(( $bsum + $packsize ))
>     fi
>   done < verify-pack.out
>   echo "sum=$sum; bsum=$bsum; dsum=$dsum"

Your object classification is misleading.  Because an object has no
base, that doesn't mean it is necessarily a base itself.  You'd have to
store $base into a separate file and then sort it and remove duplicates
to know the actual number of base objects.  What you have right now is
strictly delta objects and non-delta objects. And base objects can
themselves be delta objects already of course.

Also... my git repo after 'git gc --aggressive' contains a pack which
size is 22 MB.  Your script tells me:

sum=22930254; bsum=14142012; dsum=8788242

and:

   29558 verify-pack.base.out
   82043 verify-pack.delta.out
  111601 verify-pack.out
  111601 verify-pack.sha1.out

meaning that I have 111601 total objects, of which 29558 are non-deltas
occupying 14 MB and 82043 are deltas occupying 8 MB.  That certainly
shows how deltas are space efficient.  And with a minor modification to
your script, I know that 44985 objects are actually used as a delta
base.  So, on average, each base is responsible for nearly 2 deltas.
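
(The modification could be as simple as counting distinct base SHA-1s,
i.e. field 7 of the deltified entries in verify-pack.out, e.g.

    awk 'NF >= 7 { print $7 }' verify-pack.out | sort -u | wc -l

though that is only a guess at the exact change.)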

> >>>> (BTW what happens if this pack is larger than file size limit for
> >>>> given filesystem?).
> [...]
>
> >> If I remember correctly FAT28^W FAT32 has maximum file size of 2 GB.
> >> FAT is often used on SSD, on USB drive.  Although if you have  2 GB
> >> packfile, you are doing something wrong, or UGFWIINI (Using Git For
> >> What It Is Not Intended).
> >
> > Hopefully you're not performing a 'git clone' off of a FAT filesystem.  
> > For physical transport you may repack with the appropriate switches.
>
> Not off a FAT filesystem, but into a FAT filesystem.

That's what I meant, sorry.  My point still stands.

> > The front of the pack is the critical point.  If you get enough to
> > create the top commit then further transfers can be done incrementally
> > with only the deltas between each commits.
>
> How?  You have some objects that can be used as base; how to tell
> git-daemon that we have them (but not theirs prerequisites), and how
> to generate incrementals?

Just the same as when you perform a fetch to update your local copy of a
remote branch: you tell the remote about the commit you have and the one
you want, and git-repack will create delta objects for the commit you
want against similar objects from the commit you already have, and skip
those objects from the commit you want that are already included in the
commit you have.

> >> A question about pack protocol negotiation.  If clients presents some
> >> objects as "have", server can and does assume that client has all
> >> prerequisites for such objects, e.g. for tree objects that it has
> >> all objects for files and directories inside tree; for commit it means
> >> all ancestors and all objects in snapshot (have top tree, and its
> >> prerequisites).  Do I understand this correctly?
> >
> > That works only for commits.
>
> Hmmmm... how do you intent for "prefetch top objects restartable-y first"
> to work, then?

See my latest reply to dscho (you were in CC already).

> >> BTW. because of compression it might be more difficult to resume
> >> archive creation in the middle, I think...
> >
> > Why so?  the tar+gzip format is streamable.
>
> gzip format uses sliding window in compression.  "cat a b | gzip"
> is different from "cat <(gzip a) <(gzip b)".
>
> But that doesn't matter.  If we are interrupted in the middle, we can
> uncompress what we have to check how far did we get, and tell server
> to send the rest; this way server wouldn't have to even generate
> (but not send) what we get as partial transfer.

You got it.

> P.S. What do you think about 'bundle' capability extension mentioned
>      in a side sub-thread?

I don't like it.  Reason is that it forces the server to be (somewhat)
stateful by having to keep track of those bundles and cycle them, and it
doubles the disk usage by having one copy of the repository in the form
of the original pack(s) and another copy as a bundle.

Of course, the idea of having a cron job generating a bundle and
offering it for download through HTTP or the like is fine if people are
OK with that, and that requires zero modifications to git.  But I don't
think that is a solution that scales.

If you think about git.kernel.org which has maybe hundreds of
repositories where the big majority of them are actually forks of Linus'
own repository, then having all those forks reference Linus' repository
is a big disk space saver (and IO too as the referenced repository is
likely to remain cached in memory).  Having a bundle ready for each of
them will simply kill that space advantage, unless they all share the
same bundle.

Now sharing that common bundle could be done of course, but that makes
things yet more complex while still wasting IO because some requests
will hit the common pack and some others will hit the bundle, making
less efficient usage of the disk cache on the server.

Yet, that bundle would probably not contain the latest revision if it is
only periodically updated, even less so if it is shared between multiple
repositories as outlined above.  And what people with slow/unreliable
network links are probably most interested in is the latest revision and
maybe a few older revisions, but probably not the whole repository as
that is simply too long to wait for.  Hence having a big bundle is not
flexible either with regards to the actual data transfer size.

Hence having a restartable git-archive service to create the top
revision with the ability to cheaply (in terms of network bandwidth)
deepen the history afterwards is probably the most straightforward way
to achieve that.  The server need not be aware of separate bundles, etc.
And the shared object store still works as usual with the same cached IO
whether the data is needed for a traditional fetch or a "git archive"
operation.

Why "git archive"?  Because its content is well defined.  So if you give
it a commit SHA1 you will always get the same stream of bytes (after
decompression) since the way git sort files is strictly defined.  It is
therefore easy to tell a remote "git archive" instance that we want the
content for commit xyz but that we already got n files already, and that
the last file we've got has m bytes.  There is simply no confusion about
what we've got already, unlike with a partial pack which might need
yet-to-be-received objects in order to make sense of what has been
already received.  The server simply has to skip that many files and
resume the transfer at that point, independently of the compression or
even the archive format.
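
That determinism is easy to check: for a given commit, two runs produce a
byte-for-byte identical (uncompressed) stream, e.g.

    git archive --format=tar HEAD | sha1sum
    git archive --format=tar HEAD | sha1sum   # prints the same hash

which is what makes "skip n files and m bytes" a well-defined resume point.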


Nicolas

Re: Continue git clone after interruption

Jakub Narębski
Cc-ed Dscho, so he can participate in this subthread more easily.

On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> On Wed, 19 Aug 2009, Jakub Narebski wrote:

> > P.S. What do you think about 'bundle' capability extension mentioned
> >      in a side sub-thread?
>
> I don't like it.  Reason is that it forces the server to be (somewhat)
> stateful by having to keep track of those bundles and cycle them, and it
> doubles the disk usage by having one copy of the repository in the form
> of the original pack(s) and another copy as a bundle.

I agree about the problems with disk usage, but I disagree about the
server having to be stateful; the server can simply scan for bundles and
offer links to them if the client requests the 'bundles' capability,
somewhere around the initial git-ls-remote list of refs.

> Of course, the idea of having a cron job generating a bundle and
> offering it for download through HTTP or the like is fine if people are
> OK with that, and that requires zero modifications to git.  But I don't
> think that is a solution that scales.

Well, offering daily bundle in addition to daily snapshot could be
a good practice, at least until git acquires resumable fetch (resumable
clone).

>
> If you think about git.kernel.org which has maybe hundreds of
> repositories where the big majority of them are actually forks of Linus'
> own repository, then having all those forks reference Linus' repository
> is a big disk space saver (and IO too as the referenced repository is
> likely to remain cached in memory).  Having a bundle ready for each of
> them will simply kill that space advantage, unless they all share the
> same bundle.

I am thinking about sharing the same bundle for related projects.

>
> Now sharing that common bundle could be done of course, but that makes
> things yet more complex while still wasting IO because some requests
> will hit the common pack and some others will hit the bundle, making
> less efficient usage of the disk cache on the server.

Hmmm... true (unless bundles are on separate server).

>
> Yet, that bundle would probably not contain the latest revision if it is
> only periodically updated, even less so if it is shared between multiple
> repositories as outlined above.  And what people with slow/unreliable
> network links are probably most interested in is the latest revision and
> maybe a few older revisions, but probably not the whole repository as
> that is simply too long to wait for.  Hence having a big bundle is not
> flexible either with regards to the actual data transfer size.

I agree that bundle would be useful for restartable clone, and not
useful for restartable fetch.  Well, unless you count (non-existing)
GitTorrent / git-mirror-sync as this solution... ;-)

>
> Hence having a restartable git-archive service to create the top
> revision with the ability to cheaply (in terms of network bandwidth)
> deepen the history afterwards is probably the most straight forward way
> to achieve that.  The server needs no be aware of separate bundles, etc.  
> And the shared object store still works as usual with the same cached IO
> whether the data is needed for a traditional fetch or a "git archive"
> operation.

It's the "cheaply deepen history" that I doubt would be easy.  This is
the most difficult part, I think (see also below).

>
> Why "git archive"?  Because its content is well defined.  So if you give
> it a commit SHA1 you will always get the same stream of bytes (after
> decompression) since the way git sort files is strictly defined.  It is
> therefore easy to tell a remote "git archive" instance that we want the
> content for commit xyz but that we already got n files already, and that
> the last file we've got has m bytes.  There is simply no confusion about
> what we've got already, unlike with a partial pack which might need
> yet-to-be-received objects in order to make sense of what has been
> already received.  The server simply has to skip that many files and
> resume the transfer at that point, independently of the compression or
> even the archive format.

Let's reiterate it to check if I understand it correctly:


Any "restartable clone" / "resumable fetch" solution must begin with
a file which is rock-solid stable wrt. reproductability given the same
parameters.  git-archive has this feature, packfile doesn't (so I guess
that bundle also doesn't, unless it was cached / saved on disk).

It would be useful if it were possible to generate part of this rock-solid
file for a partial (range, resume) request, without the need to generate
(calculate) the parts that the client already downloaded.  Otherwise the
server has to either waste disk space and IO on caching, or waste CPU
(and IO) on generating a part which is not needed and dropping it to
/dev/null.  git-archive, you say, has this feature.

Next you need to tell the server that you have the objects obtained via the
resumable download part ("git archive HEAD" in your proposal), and
that it can use them and not include them in the prepared file/pack.
"have" is limited to commits, and "have <sha1>" tells the server that
you have <sha1> and all its prerequisites (dependencies).  You can't
use "have <sha1>" with the git-archive solution.  I don't know enough
about the 'shallow' capability (and what it enables) to know whether
it can be used for that.  Can you elaborate?

Then you have to finish the clone / fetch.  All solutions so far include
some kind of incremental improvement.  My first proposal of bisect-style
fetching of a 1/nth or predefined-size pack is a bottom-up solution, where
we build the full clone from the root commits up.  You propose, from what
I understand, building the full clone from the top commit down, using
deepening from a shallow clone.  In this step you either get a full
incremental or not; downloading incrementals (from what I understand) is
not resumable / they do not support partial fetch.

Do I understand this correctly?
--
Jakub Narebski
Poland

Re: Continue git clone after interruption

Nicolas Pitre
On Wed, 19 Aug 2009, Jakub Narebski wrote:

> Cc-ed Dscho, so he can easier participate in this subthread.
>
> On Wed, 19 Aug 2009, Nicolas Pitre wrote:
> > On Wed, 19 Aug 2009, Jakub Narebski wrote:
>
> > > P.S. What do you think about 'bundle' capability extension mentioned
> > >      in a side sub-thread?
> >
> > I don't like it.  Reason is that it forces the server to be (somewhat)
> > stateful by having to keep track of those bundles and cycle them, and it
> > doubles the disk usage by having one copy of the repository in the form
> > of the original pack(s) and another copy as a bundle.
>
> I agree about problems with disk usage, but I disagree about the server
> having to be stateful; the server can simply scan for bundles, and
> offer links to them if the client requests the 'bundles' capability,
> somewhere around the initial git-ls-remote list of refs.

But then it is the client that has to deal with whatever the server wants
to offer, instead of the server actually serving data the way the client wants.

> Well, offering a daily bundle in addition to the daily snapshot could be
> a good practice, at least until git acquires resumable fetch (resumable
> clone).

Outside of Git: maybe.  Through the git protocol: no.  And what would
that bundle contain over the daily snapshot?  The whole history?  If so,
that goes against the idea that people concerned by all this have slow
links and probably aren't willing to spend the time downloading it all.  If
the bundle contains only the top revision, then it has no advantage over
the snapshot.  Somewhere in the middle?  Sure, but then where to draw
the line?  That's for the client to decide, not the server
administrator.

And what if you start your slow transfer and it breaks in the middle?
The next morning you want to restart it in the hope that you might
resume the transfer of the incomplete bundle.  But crap, the
server has updated its bundle and your half-bundle is now useless.
You've wasted your bandwidth for nothing.
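
For concreteness, the kind of out-of-band setup being debated here would be
something like the following sketch (the paths and URL are made up):

        # server side, e.g. from a daily cron job: publish a full bundle
        git --git-dir=/srv/git/project.git bundle create /srv/www/project.bundle --all
        # client side: a bundle is a plain file, so the download itself is resumable
        wget -c http://example.org/project.bundle
        git clone project.bundle project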

> > If you think about git.kernel.org, which has maybe hundreds of
> > repositories, the big majority of which are actually forks of Linus'
> > own repository, then having all those forks reference Linus' repository
> > is a big disk space saver (and IO too, as the referenced repository is
> > likely to remain cached in memory).  Having a bundle ready for each of
> > them will simply kill that space advantage, unless they all share the
> > same bundle.
>
> I am thinking about sharing the same bundle for related projects.

... meaning more administrative burden.

> > Now sharing that common bundle could be done of course, but that makes
> > things yet more complex while still wasting IO because some requests
> > will hit the common pack and some others will hit the bundle, making
> > less efficient usage of the disk cache on the server.
>
> Hmmm... true (unless the bundles are on a separate server).

... meaning additional but avoidable costs.

> > Yet, that bundle would probably not contain the latest revision if it is
> > only periodically updated, even less so if it is shared between multiple
> > repositories as outlined above.  And what people with slow/unreliable
> > network links are probably most interested in is the latest revision and
> > maybe a few older revisions, but probably not the whole repository as
> > that is simply too long to wait for.  Hence having a big bundle is not
> > flexible either with regards to the actual data transfer size.
>
> I agree that a bundle would be useful for a restartable clone, and not
> useful for a restartable fetch.  Well, unless you count the (non-existing)
> GitTorrent / git-mirror-sync as a solution... ;-)

I don't think fetches after a clone are such an issue.  They are
typically transfers orders of magnitude smaller than the initial
clone.  The same goes for fetches to deepen a shallow clone, which are in
fact fetches going back in history instead of forward.  I still stand
by my assertion that bundles are suboptimal for a restartable clone.

As for GitTorrent / git-mirror-sync... those are still vaporware to me
and I therefore have doubts about their actual feasibility.  So no, I
don't count on them.

> > Hence having a restartable git-archive service to create the top
> > revision with the ability to cheaply (in terms of network bandwidth)
> > deepen the history afterwards is probably the most straightforward way
> > to achieve that.  The server need not be aware of separate bundles, etc.
> > And the shared object store still works as usual with the same cached IO
> > whether the data is needed for a traditional fetch or a "git archive"
> > operation.
>
> It's the "cheaply deepen history" that I doubt would be easy.  This is
> the most difficult part, I think (see also below).

Don't think so.  Try this:

        mkdir test
        cd test
        git init
        git fetch --depth=1 git://git.kernel.org/pub/scm/git/git.git

Result:

remote: Counting objects: 1824, done.
remote: Compressing objects: 100% (1575/1575), done.
Receiving objects: 100% (1824/1824), 3.01 MiB | 975 KiB/s, done.
remote: Total 1824 (delta 299), reused 1165 (delta 180)
Resolving deltas: 100% (299/299), done.
From git://git.kernel.org/pub/scm/git/git
 * branch            HEAD       -> FETCH_HEAD

You'll get the very latest revision for HEAD, and only that.  The size
of the transfer will be roughly the size of a daily snapshot, except it
is fully up to date.  It is, however, not resumable in the event of a
network outage.  My proposal is to replace this with a "git archive"
call.  It won't get all branches, but for the purpose of initialising
one's repository that should be good enough.  And the "git archive" can
be fully resumable as I explained.
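
To make that concrete, the moral equivalent today is resuming the download
of a plain snapshot over HTTP (sketch only, the URL is made up); the
proposed "git archive" service would provide the same resume-at-a-byte-offset
behaviour natively over the git protocol:

        # -C - makes curl resume from the size of the partial file on disk
        curl -C - -o git-HEAD.tar http://example.org/git/git.git/snapshot/HEAD.tar
        tar -xf git-HEAD.tar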

Now to deepen that history.  Let's say you want 10 more revisions going
back; then you simply perform the fetch again with --depth=10.  Right
now it doesn't seem to work optimally, but the pack that is then being
sent could be made of deltas against objects found in the commits we
already have.  Currently it seems that the created pack also includes
objects we already have in addition to those we want, which
is IMHO a flaw in the shallow support that shouldn't be too hard to fix.
Each level of deepening should then be as small as the standard fetches
going forward when updating the repository with new revisions.
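
Continuing the example above, the deepening step is simply a repeat of the
fetch with a larger depth:

        # ask for 10 commits of history below the tip instead of just 1
        git fetch --depth=10 git://git.kernel.org/pub/scm/git/git.git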

> > Why "git archive"?  Because its content is well defined.  So if you give
> > it a commit SHA1 you will always get the same stream of bytes (after
> > decompression) since the way git sorts files is strictly defined.  It is
> > therefore easy to tell a remote "git archive" instance that we want the
> > content for commit xyz but that we already got n files, and that
> > the last file we've got has m bytes.  There is simply no confusion about
> > what we've got already, unlike with a partial pack which might need
> > yet-to-be-received objects in order to make sense of what has been
> > already received.  The server simply has to skip that many files and
> > resume the transfer at that point, independently of the compression or
> > even the archive format.
>
> Let's reiterate it to check if I understand it correctly:
>
> Any "restartable clone" / "resumable fetch" solution must begin with
> a file which is rock-solid stable wrt. reproducibility given the same
> parameters.  git-archive has this feature, a packfile doesn't (so I guess
> a bundle doesn't either, unless it was cached / saved on disk).

Right.

> It would be useful if it were possible to generate part of this rock-solid
> file for a partial (range, resume) request, without the need to generate
> (calculate) parts the client has already downloaded.  Otherwise the server
> has to either waste disk space and IO on caching, or waste CPU (and IO)
> on generating a part which is not needed and dropping it to /dev/null.
> git-archive, you say, has this feature.

"Could easily have" is more appropriate.

> Next you need to tell the server that you already have the objects obtained
> via the resumable download part ("git archive HEAD" in your proposal), so
> that it can use them and not include them in the prepared file/pack.
> "have" is limited to commits, and "have <sha1>" tells the server that
> you have <sha1> and all its prerequisites (dependencies).  You can't
> use "have <sha1>" with the git-archive solution.  I don't know enough
> about the 'shallow' capability (and what it enables) to know whether
> it can be used for that.  Can you elaborate?

See above, or Documentation/technical/shallow.txt.
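
Roughly (a simplified sketch from memory, with pkt-line framing and most
capabilities omitted, and placeholder SHA-1s), this is where "shallow",
"deepen" and "have" fit on the wire:

        # client -> server
        want 74730d410fcb6603ace96f1dc55ea6196122532d side-band-64k shallow
        deepen 10
        0000                                    # flush-pkt
        # server -> client: the commits at which history gets cut off
        shallow 1c8c554f11c9ef5b1a1dab2a093f72980a4f7c0f
        0000
        # client -> server: negotiation rounds (empty on an initial clone);
        # "have" only ever names commits
        have 7e47fe2bd8d01d481f44d7ab0531b1439d56e7f4
        done

On the client side the resulting cut-off points simply end up recorded in
.git/shallow, one commit per line, and deepening moves that boundary
further back in history.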

> Then you have to finish the clone / fetch.  All solutions so far include
> some kind of incremental improvement.  My first proposal of bisect-fetching
> 1/nth or a predefined-size pack is a bottom-up solution, where
> we build the full clone from the root commits up.  You propose, from what
> I understand, building the full clone from the top commit down, using
> deepening from a shallow clone.  In this step you either get a full
> increment or not; downloading an increment (from what I understand) is not
> resumable / it does not support partial fetch.

Right.  However, like I said, the incremental part should be much
smaller and therefore less susceptible to network troubles.


Nicolas