RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Shawn Pearce
Sorry, no code, only words today. Previously people have proposed a
few different options for resumable clone, like "clone.bundle"
(currently used by Android as a hack) or "skip start of regenerated
pack". Here is another option.

We could implement resumable clone by making a bit of a hybrid of the
smart and dumb HTTP protocol.


Overview
--------
During `git gc` break out the refs/heads/* namespace into its own
packfile, and pack the remaining references into an "other" pack file.
This is already done by JGit's DfsGarbageCollector and normal
filesystem based GC, we have many years of experience with this
approach working well.

The GC process saves the roots used for the "head" pack to a new
pack-$HASH.info file, which is written as a sibling to the existing
.pack, .idx, and .bitmap files produced by GC. The .info file contains
metadata about the pack in a text format such as:

  pack-name objects/pack/pack-$HASH.pack
  pack-trailer $FOOTER_HASH
  pack-size $LENGTH_IN_BYTES

  ff4ea6004fb48146330d663d64a71e7774f059f9 refs/heads/master
  ced85554d3d871f66dc4442631320a96a5d702f5 refs/heads/next

Ideally $HASH = $FOOTER_HASH here so that any repack with the same
object set but different compression settings yields a different
pack-name in the info file.


During `git clone` HTTP clients first fetch:

  https://host/repo/info/clone

This returns the most recent pack-*.info file available in the
repository, assuming the client has read permission on all branches
contained in the info file. If there is no info file present or the
client has read restrictions that requires eliding content, the URL
404s. On 404 the client tries a normal smart HTTP ref advertisement
(/info/refs?service=git-upload-pack).

With the info file in hand the client writes this to a temporary file
under $GIT_DIR and then uses dumb HTTP byte range requests to download
the pack identified by pack-name:

  https://host/repo/$pack-name
aka
  https://host/repo/objects/pack/pack-$HASH.pack

Smart HTTP servers that have info/clone support will need to return
this pack as-is to clients, if the client still has read permission
over all branches in the associated pack-*.info file.

Clients store the file data locally and check the last 20 bytes
matches $FOOTER_HASH from the info file before running git-index-pack
to verify the contents and build the local .idx file. If the footer
hash does not match or any resume attempt for the pack-name URL fails,
the client deletes everything and starts over from the beginning.


Post Clone Fetch
----------------
After the pack file is downloaded and indexed the client creates local
tracking references matching the info file it obtained at the start.
Once the references are created the client issues a standard `git
fetch` with the origin server to become current.

Since the client is working off the server’s last GC pack, the client
will be very close to the server, and the final fetch will be very
small, however this depends on the frequency of the server’s GCs. With
a volume based GC trigger such as `git gc` in a post-receive hook,
clients will be very close to the server after this clone.


Redirects
---------
Clients should follow HTTP redirects, allowing servers to position
both the info/clone file and the pack-name on CDNs. Clients should not
remember the redirected locations and instead always use the base
repository URL during resume. This allows servers to embed short-lived
authentication tokens into the redirect URL for processing at the edge
CDN.


Parallel Download
-----------------
Servers may optionally include a header indicating they are willing to
perform parallel download of a file iff the client is well connected:

  parallel-download 5 8388608

in the header of the info/clone file indicates the server will allow
clients to send up to 5 parallel byte-range requests on 8 MiB
alignments.

Clients should start 1 HTTP download, track incoming bandwidth, then
add a parallel download for a later segment of the file. If bandwidth
increases with the additional stream, clients may try adding a 3rd
stream, etc. until the maximum of 5 streams is reached.


Racing with GC
--------------
Servers may choose to keep a pack file around several hours after GC
to permit clients to continue to resume and download a repository they
started cloning before the GC began.

By sending the most recent pack-*.info file to new clients, servers
can choose to keep older packs on disk for a time window (e.g. 4
hours) that older sessions can continue to retrieve before the file is
unlinked from the filesystem and its disk is reclaimed.

This is an optional service that is subject to disk constraints on the
server. Unlike clone.bundle, the server does not need 2x disk capacity
for all repositories; the additional disk required depends on GC rate
and size of repositories.


End user experience
-------------------
From the end-user side experience, I think we want the following to happen:

1. When initiating a `git clone`, the user does not have to do anything special.

2. A `git clone` eventually calls into the transport layer, and
git-remote-curl will probe for the info/clone URL; if the resource
fails to load, everything goes through the traditional codepath.

3. When git-remote-curl detects the support of dumb clone, it does the
"retry until successfully download the pack data fully" dance
internally, tentatively updates the remote tracking refs, and then
pretends as if it was asked to do an incremental fetch. If this
succeeds without any die(), everybody is happy.

4. If the above step 3. has to die() for some reason (including
impatience hitting ^C), leave the $GIT_DIR, downloaded .info file and
partially downloaded .pack file. Tell the user that the cloning can be
resumed and how.

5. `git clone --resume` would take the leftover $GIT_DIR and friends
from step 4. `git clone --resume`, when it is given a $GIT_DIR that
lacks .info, will refuse to start. When `--resume` is given, there is
no other option acceptable (perhaps other than naming which $GIT_DIR
the partial result lives).


... Thoughts?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Shawn Pearce
On Wed, Feb 10, 2016 at 10:59 AM, Shawn Pearce <[hidden email]> wrote:
>
> ... Thoughts?

Several of us at $DAY_JOB talked about this more today and thought a
variation makes more sense:

1. Clients attempting clone ask for /info/refs?service=git-upload-pack
like they do today.

2. Servers that support resumable clone include a "resumable"
capability in the advertisement.

3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.

4. The server may return a 302 Redirect to its current "mostly whole"
pack file. This can be more flexible than "refs/heads/*", it just
needs to be a mostly complete pack file that contains a complete graph
from any arbitrary roots.

5. Clients fetch the file using standard HTTP GET, possibly with
byte-ranges to resume.

6. Once stored and indexed with .idx, clients run `git fsck
--lost-found` to discover the roots of the pack it downloaded. These
are saved as temporary references.

7. Client runs incremental fetch, and then deletes the temporary
references from 6.


An advantage to this process is its much more flexible for the server.
There is no additional pack-*.info file required. GC can organize
packs anyway it wants, etc.

To make step 4 really resume well, clients may need to save the first
Location header it gets back from
/info/refs?service=git-resumable-clone and use that on resume. Servers
are likely to embed the pack SHA-1 in the Location header, and the
client wants to use this on subsequent GET attempts to abort early if
the server has deleted the pack the client is trying to obtain.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Stefan Beller-4
On Wed, Feb 10, 2016 at 12:11 PM, Shawn Pearce <[hidden email]> wrote:

> On Wed, Feb 10, 2016 at 10:59 AM, Shawn Pearce <[hidden email]> wrote:
>>
>> ... Thoughts?
>
> Several of us at $DAY_JOB talked about this more today and thought a
> variation makes more sense:
>
> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
> like they do today.
>
> 2. Servers that support resumable clone include a "resumable"
> capability in the advertisement.

like "resumable-token=hash" similar to a push cert advertisement?

>
> 3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.

Or just in the non-http case, they would terminate after the ls-remote
(including capability advertisement) was done and connect again to
a different service such as git-upload-stale-pack with the resumable
token to identify the pack.

>
> 4. The server may return a 302 Redirect to its current "mostly whole"
> pack file. This can be more flexible than "refs/heads/*", it just
> needs to be a mostly complete pack file that contains a complete graph
> from any arbitrary roots.
>
> 5. Clients fetch the file using standard HTTP GET, possibly with
> byte-ranges to resume.

In the non-http case the git-upload-stale-pack would be rsync with the
resume token to determine the file name of the pack,
such that we have resumeability.

>
> 6. Once stored and indexed with .idx, clients run `git fsck
> --lost-found` to discover the roots of the pack it downloaded. These
> are saved as temporary references.

jrn:
> I suspect we can do even faster by making index-pack do the work

    index-pack --check-self-contained-and-connected

>
> 7. Client runs incremental fetch, and then deletes the temporary
> references from 6.
>
>
> An advantage to this process is its much more flexible for the server.
> There is no additional pack-*.info file required. GC can organize
> packs anyway it wants, etc.
>
> To make step 4 really resume well, clients may need to save the first
> Location header it gets back from
> /info/refs?service=git-resumable-clone and use that on resume. Servers
> are likely to embed the pack SHA-1 in the Location header, and the
> client wants to use this on subsequent GET attempts to abort early if
> the server has deleted the pack the client is trying to obtain.
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
Stefan Beller <[hidden email]> writes:

>> 6. Once stored and indexed with .idx, clients run `git fsck
>> --lost-found` to discover the roots of the pack it downloaded. These
>> are saved as temporary references.
>
> jrn:
>> I suspect we can do even faster by making index-pack do the work

I somehow doubt it.  Both index-pack and lost-found need to trace
"object A depends on object B", but the similarity ends there.
index-pack traces the delta dependency, which does not have any
relation with the reachability, which is what lost-found needs to
compute.

>> 7. Client runs incremental fetch, and then deletes the temporary
>> references from 6.
>>
>>
>> An advantage to this process is its much more flexible for the server.
>> There is no additional pack-*.info file required. GC can organize
>> packs anyway it wants, etc.

I am not quite sure if that is an advantage, though.  The second
message proposes that the lost-found computation to be done by the
client using *.pack, but any client, given the same *.pack, will
compute the same result, so if the result is computed on the server
side just once when the *.pack is prepared and downloaded to the
client, it would give us a better overall resource utilization.  And
in essence, that was what the *.info file in the first message was.

The original proposal explicitly specified that the initial bulk
transfer pack must cover all heads and no tags.  For the purpose of
offloading the clone traffic and making it resumable, there is no
need for such a restriction, and I do like the fact that the second
one loosened the requirement.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jonathan Nieder-2
In reply to this post by Stefan Beller-4
Stefan Beller wrote:
> On Wed, Feb 10, 2016 at 12:11 PM, Shawn Pearce <[hidden email]> wrote:

>> Several of us at $DAY_JOB talked about this more today and thought a
>> variation makes more sense:
>>
>> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
>> like they do today.
>>
>> 2. Servers that support resumable clone include a "resumable"
>> capability in the advertisement.
>
> like "resumable-token=hash" similar to a push cert advertisement?

It could just be the string 'resumable'.

But I wonder if it would be possible to save a round-trip by getting the
302 response in the initial request.  If the client requests

        /info/refs?service=git-upload-pack&want_resumable=true

then allow the server to make a 302 in response to its current mostly
whole pack.  Current clients would never send such a request because the
current protocol requires that for smart clients

        The request MUST contain exactly one query parameter,
        `service=$servicename`, where `$servicename` MUST be the service
        name the client wishes to contact to complete the operation.
        The request MUST NOT contain additional query parameters.

Current http-backend ignores extra query parameters.  I haven't
checked other smart http server implementations, though.

>> 3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.
>
> Or just in the non-http case, they would terminate after the ls-remote
> (including capability advertisement) was done and connect again to
> a different service such as git-upload-stale-pack with the resumable
> token to identify the pack.

HTTP supports range requests and existing CDNs speak HTTP, so I
suspect it would work better if the git-resumable-clone service
printed an HTTP URL from which to grab the packfile.

I think the details are something that could be figured out after
trying out the idea with http first, though.

[...]
>> 5. Clients fetch the file using standard HTTP GET, possibly with
>> byte-ranges to resume.
>
> In the non-http case the git-upload-stale-pack would be rsync with the
> resume token to determine the file name of the pack,
> such that we have resumeability.

How do I tunnel rsync over git protocol?

So I think in the non-http case the git-resumable-clone service would
have to print a URL to be served using a possibly different protocol
(e.g., a signed https URL for getting the file from a service like S3,
or an rsync URL for getting the file using the same ssh creds that
were used for the initial request).

[...]
>> 6. Once stored and indexed with .idx, clients run `git fsck
>> --lost-found` to discover the roots of the pack it downloaded. These
>> are saved as temporary references.
>
> jrn:
> > I suspect we can do even faster by making index-pack do the work
>
>     index-pack --check-self-contained-and-connected

--strict + --check-self-contained-and-connected check that the pack
is self-contained.  In the process they mark each object that is
reachable from another object in the pack with FLAG_LINK.

The objects not marked with FLAG_LINK are the roots.

[...]
>> To make step 4 really resume well, clients may need to save the first
>> Location header it gets back from
>> /info/refs?service=git-resumable-clone and use that on resume. Servers
>> are likely to embed the pack SHA-1 in the Location header, and the
>> client wants to use this on subsequent GET attempts to abort early if
>> the server has deleted the pack the client is trying to obtain.

Yes.

I really like this design.  I'm tempted to implement it (since it
lacks a bunch of the downsides of clone.bundle).

Thanks,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
Jonathan Nieder <[hidden email]> writes:

> --strict + --check-self-contained-and-connected check that the pack
> is self-contained.  In the process they mark each object that is
> reachable from another object in the pack with FLAG_LINK.
>
> The objects not marked with FLAG_LINK are the roots.

Ahh, OK, if we already have the code that does it, I have no
objection.  At some point before finalizing the pack on the disk we
would want to run fsck on all objects anyway, so having to pass
"--strict" to find objects that lack FLAG_LINK does not sound too
big a downside.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jonathan Nieder-2
In reply to this post by Junio C Hamano
Junio C Hamano wrote:

> I somehow doubt it.  Both index-pack and lost-found need to trace
> "object A depends on object B", but the similarity ends there.

You already responded about this in another side-thread, so the
following is mostly redundant. :)  I think this message can still be
useful to future readers since it more explicitly goes through the
trade-offs.

The relevant similarity is that both index-pack and lost-found need to
inflate all commit, tag, and tree objects.

This is the same idea that led to the introduction of "git index-pack
--strict".

[...]
> I am not quite sure if that is an advantage, though.  The second
> message proposes that the lost-found computation to be done by the
> client using *.pack, but any client, given the same *.pack, will
> compute the same result, so if the result is computed on the server
> side just once when the *.pack is prepared and downloaded to the
> client, it would give us a better overall resource utilization.  And
> in essence, that was what the *.info file in the first message was.

Advantages of not providing the list of roots:
 1. only need one round-trip to serve the packfile as-is
 2. less data sent over the wire (not important unless the list of roots
    is long)
 3. can be enabled on the server for existing repositories without an
    extra step of generating .info files

Advantage of providing the list of roots:
- speedup because the client does not have to compute the list of roots

For a client that is already iterating over all objects and inspecting
FLAG_LINK, the advantage (3) seems compelling enough to prefer the
protocol that doesn't sent a list of roots.

Except when people pass --depth, "git clone" sets
'check_self_contained_and_connected = 1'.  That means clients that
already iterate over all objects and inspect FLAG_LINK are the usual
case.

Thanks for your thoughtfulness,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King
In reply to this post by Shawn Pearce
On Wed, Feb 10, 2016 at 12:11:46PM -0800, Shawn Pearce wrote:

> On Wed, Feb 10, 2016 at 10:59 AM, Shawn Pearce <[hidden email]> wrote:
> >
> > ... Thoughts?
>
> Several of us at $DAY_JOB talked about this more today and thought a
> variation makes more sense:
>
> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
> like they do today.
>
> 2. Servers that support resumable clone include a "resumable"
> capability in the advertisement.

Because the magic happens in the git protocol, that would mean this does
not have to be limited to git-over-http. It could be "resumable=<url>"
to point the client anywhere (the same server over a different protocol,
another server, etc).

> 3. Updated clients on clone request GET /info/refs?service=git-resumable-clone.
>
> 4. The server may return a 302 Redirect to its current "mostly whole"
> pack file. This can be more flexible than "refs/heads/*", it just
> needs to be a mostly complete pack file that contains a complete graph
> from any arbitrary roots.

And with "resumable=<url>", the client does not have to hit the server
to do a redirect; it can go straight to the final URL, saving a
round-trip.

> 5. Clients fetch the file using standard HTTP GET, possibly with
> byte-ranges to resume.
>
> 6. Once stored and indexed with .idx, clients run `git fsck
> --lost-found` to discover the roots of the pack it downloaded. These
> are saved as temporary references.

Clients do not have to _just_ fetch a packfile. They could get a bundle
file that contains the roots along with the packfile. I know that one of
your goals is not duplicating the storage of the packfile on the server,
but it would not be hard for the server to store the packfile and the
bundle header separately, and concatenate them on the fly.

Right now the clients can't clone from bundles directly via HTTP. I
wrote patches for that ages ago, but got stuck on this very issue
(basically that I had to spool the bundle and then clone from it, which
temporarily doubled the client's disk space requirements). One
alternative would be to amend the bundle format so that rather than a
single file, you get a bundle header whose end says "...and my matching
packfile is 1234-abcd". And then the client knows that they can fetch
that separately from the same source.

It's an extra HTTP request, but it makes the code for client _and_
server way simpler. So the whole thing is basically then:

  0. During gc, server generates pack-1234abcd.pack. It writes matching
     tips into pack-1234abcd.info, which is essentially a bundle file
     whose final line says "pack-1234abcd.pack".

  1. Client contacts server via any git protocol. Server says
     "resumable=<url>". Let's says that <url> is
     https://example.com/repo/clones/1234abcd.bundle.

  2. Client goes to <url>. They see that they are fetching a bundle,
     and know not to do the usual smart-http or dumb-http protocols.
     They can fetch the bundle header resumably (though it's tiny, so it
     doesn't really matter).

  3. After finishing the bundle header, they see they need to grab the
     packfile. Based on the bundle header's URL and the filename
     contained within it, they know to get
     https://example.com/repo/clones/pack-1234abcd.pack". This is
     resumable, too.

  4. Client clones from bundled pack as normal; no root-finding magic
     required.

  5. Client runs incremental fetch against original repo from step 1.

And you'll notice, too, that all of the bundle-http magic kicks in
during step 2 because the client sees they're grabbing a bundle. Which
means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
"go fetch from kernel.org, then come back to me".

> An advantage to this process is its much more flexible for the server.
> There is no additional pack-*.info file required. GC can organize
> packs anyway it wants, etc.

Yes, it's much better than your original email, at least for GitHub
servers. We're not very flexible with GC tricks, because we need bitmaps
to work, and because we get a lot of benefit from sharing the object
storage for forks of a single repository.

> To make step 4 really resume well, clients may need to save the first
> Location header it gets back from
> /info/refs?service=git-resumable-clone and use that on resume. Servers
> are likely to embed the pack SHA-1 in the Location header, and the
> client wants to use this on subsequent GET attempts to abort early if
> the server has deleted the pack the client is trying to obtain.

You could possibly do away with this trick if the server hands out a
unique URL in its "resumable" header. Though I imagine it might be
convenient for server admins to always point to a generic url, and
put the logic in the HTTP layer.

OTOH, if you do the "split bundle" thing I mentioned above, then this
happens for free. The client caches the bundle header it grabs in my
step 2, and then that contains the unique pack name to fetch in step 3.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King
In reply to this post by Jonathan Nieder-2
On Wed, Feb 10, 2016 at 01:22:07PM -0800, Jonathan Nieder wrote:

> > I am not quite sure if that is an advantage, though.  The second
> > message proposes that the lost-found computation to be done by the
> > client using *.pack, but any client, given the same *.pack, will
> > compute the same result, so if the result is computed on the server
> > side just once when the *.pack is prepared and downloaded to the
> > client, it would give us a better overall resource utilization.  And
> > in essence, that was what the *.info file in the first message was.
>
> Advantages of not providing the list of roots:
>  1. only need one round-trip to serve the packfile as-is
>  2. less data sent over the wire (not important unless the list of roots
>     is long)
>  3. can be enabled on the server for existing repositories without an
>     extra step of generating .info files
>
> Advantage of providing the list of roots:
> - speedup because the client does not have to compute the list of roots
>
> For a client that is already iterating over all objects and inspecting
> FLAG_LINK, the advantage (3) seems compelling enough to prefer the
> protocol that doesn't sent a list of roots.

I'm not sure how compelling (3) is, since we are relying on the server
to make certain packing choices. I guess a stock "git repack -ad" would
do in a pinch; it should at least contain all needed objects, but it's
going to potentially have extra cruft objects (from reflogs, for
example).

I outlined some alternatives to Shawn's proposal elsewhere in the
thread. I think it's a useful feature for this redirect to not just be
"go fetch this packfile", but "go clone from here and come back to me".
That opens up a lot of flexibility.

It does make "go fetch this packfile without roots" a little harder, but
I think it's still do-able. Right now when git hits an http URL, we pass
the smart-http "?service=" magic, and we look at the response to figure
out whether we got:

  1. A smart-http server.

  2. A dumb-http server.

  N. Something else, in which case we die.

The alternative I outlined elsewhere (and the patches I posted long
ago) basically adds:

  3. If it's a bundle, fetch the bundle and then clone from that.

But we could also do:

  4. If it's a packfile, fetch the packfile and then do the
     find-the-roots magic.

> Except when people pass --depth, "git clone" sets
> 'check_self_contained_and_connected = 1'.  That means clients that
> already iterate over all objects and inspect FLAG_LINK are the usual
> case.

Somewhat related, but I've wondered if we could do something similar
even for non-clone cases. That is, `index-pack` could tell us the set of
referenced but missing objects, and we could verify that each of those
is reachable (we _could_ just have it verify that we have the object at
all; traditionally we only guaranteed that reachable objects were kept,
but these days we keep anything reachable from another object we are
keeping, so if you have X, you should always have X^, etc).

Anyway, that's quite a tangent from this topic.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jonathan Nieder-2
In reply to this post by Jeff King
Jeff King wrote:
> On Wed, Feb 10, 2016 at 12:11:46PM -0800, Shawn Pearce wrote:

>> Several of us at $DAY_JOB talked about this more today and thought a
>> variation makes more sense:
>>
>> 1. Clients attempting clone ask for /info/refs?service=git-upload-pack
>> like they do today.
>>
>> 2. Servers that support resumable clone include a "resumable"
>> capability in the advertisement.
>
> Because the magic happens in the git protocol, that would mean this does
> not have to be limited to git-over-http. It could be "resumable=<url>"
> to point the client anywhere (the same server over a different protocol,
> another server, etc).

Thanks for bringing this up.  A worry with putting the URL in the
capabilities line is that it makes it easy to run into the 1000-byte
limit.  It's been a while since v1.8.3-rc0~148^2~6 (pkt-line: provide
a LARGE_PACKET_MAX static buffer, 2013-02-20) but we still can't
rely on clients having that applied.

(I also haven't checked whether current versions of git are able to
handle longer capability strings with that patch applied.)

Another nice thing about using a 302 is that you can set cookies
during the redirect, which might make authenticated access easier.
(That said, authenticated access through e.g. signed URLs can work
fine without that.)

[...]
> Clients do not have to _just_ fetch a packfile. They could get a bundle
> file that contains the roots along with the packfile. I know that one of
> your goals is not duplicating the storage of the packfile on the server,
> but it would not be hard for the server to store the packfile and the
> bundle header separately, and concatenate them on the fly.

Doesn't that prevent using a git-unaware file transfer service to
serve the files?

It also means the client can't use the downloaded file as-is --- they
need to separate the root list from the packfile (that's not a big
deal; just some added complication to switch files at the appropriate
moment during download if you want to avoid temporarily using twice
the space).

That said, both these problems are avoided by serving the 'split
bundle' you described as-is instead of concatenating.

[...]
> And you'll notice, too, that all of the bundle-http magic kicks in
> during step 2 because the client sees they're grabbing a bundle. Which
> means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
> "go fetch from kernel.org, then come back to me".

I think that use case brings in complications that make it not
necessarily worth it.  In this example, if kernel.org is serving pack
files, why shouldn't I point directly at the advertised pack CDN URL
instead of adding an extra hop that puts added load on kernel.org
servers?

Allowing an arbitrary "fetch from here first" capability is very
flexible.  I guess my fear comes from not knowing what the flexibility
buys beyond aesthetics.  (My motivation comes from the example of
alternates: it is pretty and very flexible and ended up as a support
and maintenance headache instead of being widely useful.  I think what
you are proposing is more harmless but I'd still want to have an
example of what it's used for before going in that direction.)

Thanks,
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
In reply to this post by Jeff King
Jeff King <[hidden email]> writes:

> Clients do not have to _just_ fetch a packfile. They could get a bundle
> file that contains the roots along with the packfile. I know that one of
> your goals is not duplicating the storage of the packfile on the server,
> but it would not be hard for the server to store the packfile and the
> bundle header separately, and concatenate them on the fly.

Yeah, I was wondering about that.  Just storing them as separate
two-file pair on the server side and serving it as if it is a single
file to the client does not sound like a rocket science, and a
reasonable webserver should be able to serve even a byte-range
request out of such a thing.

Of course, updating the clients to understand such a two-file pair
as a valid bundle is a good thing to do, and I like that step 0. you
outlined below.

>   0. During gc, server generates pack-1234abcd.pack. It writes matching
>      tips into pack-1234abcd.info, which is essentially a bundle file
>      whose final line says "pack-1234abcd.pack".
>
>   1. Client contacts server via any git protocol. Server says
>      "resumable=<url>". Let's says that <url> is
>      https://example.com/repo/clones/1234abcd.bundle.
>
>   2. Client goes to <url>. They see that they are fetching a bundle,
>      and know not to do the usual smart-http or dumb-http protocols.
>      They can fetch the bundle header resumably (though it's tiny, so it
>      doesn't really matter).
> ...

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King
In reply to this post by Jonathan Nieder-2
On Wed, Feb 10, 2016 at 02:17:58PM -0800, Jonathan Nieder wrote:

> > Because the magic happens in the git protocol, that would mean this does
> > not have to be limited to git-over-http. It could be "resumable=<url>"
> > to point the client anywhere (the same server over a different protocol,
> > another server, etc).
>
> Thanks for bringing this up.  A worry with putting the URL in the
> capabilities line is that it makes it easy to run into the 1000-byte
> limit.  It's been a while since v1.8.3-rc0~148^2~6 (pkt-line: provide
> a LARGE_PACKET_MAX static buffer, 2013-02-20) but we still can't
> rely on clients having that applied.

I hadn't considered that, but I'm not sure how much of a problem it is
in practice. The first line is 40-hex sha1, space, HEAD, NUL, then
capabilities. The current capabilities string from github.com on a
sample repo (which has a rather large agent string, and a normal-sized
symref pointer for HEAD) is 188 bytes. So that's still over 750 bytes
available for a URL. The space isn't infinite, and we may want to add
more capabilities later. But I'd think that devoting even 256 bytes to a
url would be reasonable, and leave us with fair room to grow.

> Another nice thing about using a 302 is that you can set cookies
> during the redirect, which might make authenticated access easier.
> (That said, authenticated access through e.g. signed URLs can work
> fine without that.)

Yeah, I can see there are advantages to assuming all the world is HTTP,
but it just doesn't seem that practical to me.

> > Clients do not have to _just_ fetch a packfile. They could get a bundle
> > file that contains the roots along with the packfile. I know that one of
> > your goals is not duplicating the storage of the packfile on the server,
> > but it would not be hard for the server to store the packfile and the
> > bundle header separately, and concatenate them on the fly.
>
> Doesn't that prevent using a git-unaware file transfer service to
> serve the files?

Sort of. They would just need to serve the combined file. But _if_ you
have a git-unaware service (rsync?) _and_ its hitting the same storage
(so you could in theory not store the packfile twice), _and_ it cannot
be taught to do any kind of concatenation, then yes, it would be a
problem.

I do think I favor the "split bundle" anyway, though, just for
simplicity on both ends.

> > And you'll notice, too, that all of the bundle-http magic kicks in
> > during step 2 because the client sees they're grabbing a bundle. Which
> > means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
> > "go fetch from kernel.org, then come back to me".
>
> I think that use case brings in complications that make it not
> necessarily worth it.  In this example, if kernel.org is serving pack
> files, why shouldn't I point directly at the advertised pack CDN URL
> instead of adding an extra hop that puts added load on kernel.org
> servers?

Sure, that would be more efficient if kernel.org is providing such a CDN
URL. But they aren't now, because it doesn't exist yet, and this feature
can be used by you without having to coordinate their use of the
feature. And you can replace kernel.org in my example with any other
server that happens to be preferable to fetching from you, for whatever
reason.

> My motivation comes from the example of
> alternates: it is pretty and very flexible and ended up as a support
> and maintenance headache instead of being widely useful.

I guess one man's trash is another's treasure. Alternates are in wide
use at GitHub, and are pretty much what make our data model feasible.
I'm pretty sure that git.or.cz uses them for similar purposes, too.

> I think what
> you are proposing is more harmless but I'd still want to have an
> example of what it's used for before going in that direction.

Unless there is a big immediate downside, I'd rather err in the opposite
direction: keep orthogonal concerns separate (e.g., redirecting to X
versus protocol details of X).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
In reply to this post by Jonathan Nieder-2
Jonathan Nieder <[hidden email]> writes:

> I really like this design.  I'm tempted to implement it (since it
> lacks a bunch of the downsides of clone.bundle).

Just to see people are not stepping on each others toe, implementing
slightly different components in parallel based on the same theme to
solve the same problem, it may be beneficial to have a list of a bit
more detailed breakdown of the necessary parts.

I think possible small first steps include:

 * A new "--split-header" option to "git bundle" that allows the
   command to write out two files;

 * An update to the "bundle" transport so that it understand the new
   split bundle format (i.e. when you have base.bndl that refers to
   pack-deadbeef.pack, you can still say "git clone base.bndl" and
   "git ls-remote base.bndl");

 * A new "--bundle-header" option to "git index-pack", which makes
   it to write out the bundle header file that references the
   resulting packfile in addition to the pack .idx file (this should
   also be able to "fix" thin pack, and also by keeping track of the
   actual "foreign" objects not just number of them, compute and
   record the list of prerequisite objects in the resulting bundle
   header file).

Am I on the right track?  Assuming I am, anything I missed?

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Shawn Pearce
On Wed, Feb 10, 2016 at 7:43 PM, Junio C Hamano <[hidden email]> wrote:

> Jonathan Nieder <[hidden email]> writes:
>
>> I really like this design.  I'm tempted to implement it (since it
>> lacks a bunch of the downsides of clone.bundle).
>
> Just to see people are not stepping on each others toe, implementing
> slightly different components in parallel based on the same theme to
> solve the same problem, it may be beneficial to have a list of a bit
> more detailed breakdown of the necessary parts.
>
> I think possible small first steps include:
>
>  * A new "--split-header" option to "git bundle" that allows the
>    command to write out two files;
>
>  * An update to the "bundle" transport so that it understand the new
>    split bundle format (i.e. when you have base.bndl that refers to
>    pack-deadbeef.pack, you can still say "git clone base.bndl" and
>    "git ls-remote base.bndl");
>
>  * A new "--bundle-header" option to "git index-pack", which makes
>    it to write out the bundle header file that references the
>    resulting packfile in addition to the pack .idx file (this should
>    also be able to "fix" thin pack, and also by keeping track of the
>    actual "foreign" objects not just number of them, compute and
>    record the list of prerequisite objects in the resulting bundle
>    header file).
>
> Am I on the right track?  Assuming I am, anything I missed?

--bundle-header option for pack-objects so that repack/gc can create
the matching .info file?

I waffled on the URL in the capabilities line, but I think Peff
convinced me earlier in the thread that its reasonably safe to do. So
this split bundle with header URL in the capabilities sounds like a
good approach, its just a matter of implementation. :)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
In reply to this post by Jeff King
Jeff King <[hidden email]> writes:

> ... One
> alternative would be to amend the bundle format so that rather than a
> single file, you get a bundle header whose end says "...and my matching
> packfile is 1234-abcd". And then the client knows that they can fetch
> that separately from the same source.

I would imagine that we would introduce bundle v3 format for this.

It may want to say "my matching packfiles are these" to accomodate a
set of packs split at max-pack-size, but I am perfectly fine to say
you must create a single pack when you use a bundle with separate
header to keep things simpler.

> It's an extra HTTP request, but it makes the code for client _and_
> server way simpler. So the whole thing is basically then:
>
>   0. During gc, server generates pack-1234abcd.pack. It writes matching
>      tips into pack-1234abcd.info, which is essentially a bundle file
>      whose final line says "pack-1234abcd.pack".

OK.

>   1. Client contacts server via any git protocol. Server says
>      "resumable=<url>". Let's says that <url> is
>      https://example.com/repo/clones/1234abcd.bundle.
>
>   2. Client goes to <url>. They see that they are fetching a bundle,
>      and know not to do the usual smart-http or dumb-http protocols.
>      They can fetch the bundle header resumably (though it's tiny, so it
>      doesn't really matter).

Might be in megabytes range, though, with many refs.  It still is
tiny, though ;-).

>   3. After finishing the bundle header, they see they need to grab the
>      packfile. Based on the bundle header's URL and the filename
>      contained within it, they know to get
>      https://example.com/repo/clones/pack-1234abcd.pack". This is
>      resumable, too.

OK.

>   4. Client clones from bundled pack as normal; no root-finding magic
>      required.

I like this part the most.

>   5. Client runs incremental fetch against original repo from step 1.
>
> And you'll notice, too, that all of the bundle-http magic kicks in
> during step 2 because the client sees they're grabbing a bundle. Which
> means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
> "go fetch from kernel.org, then come back to me".

Or it could be a packfile (and the client discovers roots), as you
mentioned in a separate message.  I personally do not think it buys
us much, as long as we do a bundle represented as a header and a
separate pack.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King
On Thu, Feb 11, 2016 at 01:32:22PM -0800, Junio C Hamano wrote:

> > ... One
> > alternative would be to amend the bundle format so that rather than a
> > single file, you get a bundle header whose end says "...and my matching
> > packfile is 1234-abcd". And then the client knows that they can fetch
> > that separately from the same source.
>
> I would imagine that we would introduce bundle v3 format for this.

Yeah, I think so. And in fact, the "here are my packfiles..." bit should
probably be in the v3 header.

> It may want to say "my matching packfiles are these" to accomodate a
> set of packs split at max-pack-size, but I am perfectly fine to say
> you must create a single pack when you use a bundle with separate
> header to keep things simpler.

Interesting. My initial thought is that one could replace "git bundle
create foo.bundle --all && split foo.bundle" with this (for storing or
transferring a bundle somewhere that cannot handle the whole thing in
one go).  It has the advantage that you do not need to recreate the full
bundle to extract the data.

But I think the negatives of splitting across packs would outweigh that.
You cannot have cross-pack deltas, so your total size would be much
larger (in general, I have yet to see a case where max-pack-size is
beneficial, beyond the obvious "your filesystem cannot store files
larger than N bytes").

So I don't think it would be helpful for normal bundle use.

It _could_ be helpful in the context we're talking about here, though.
If I create the split-bundle so that people can resumable-clone from me,
they can only clone up to that bundle's creation point (and
incrementally fetch the rest). But with a single pack, I can't update
the split-bundle without doing a full repack. With multiple packs, I
could regenerate the split-bundle header and just mention the new pack.

It wouldn't be as _efficient_ as a full repack of course, but it may be
a good, cheap interim solution between repacks.

> >   2. Client goes to <url>. They see that they are fetching a bundle,
> >      and know not to do the usual smart-http or dumb-http protocols.
> >      They can fetch the bundle header resumably (though it's tiny, so it
> >      doesn't really matter).
>
> Might be in megabytes range, though, with many refs.  It still is
> tiny, though ;-).

Yes, but it's the same amount we already spew for a ref advertisement,
which isn't resumable, either. :)

I think I'd probably make this a straight fetch in the first iteration,
and we can worry about making it resumable later on if people actually
care.

> > And you'll notice, too, that all of the bundle-http magic kicks in
> > during step 2 because the client sees they're grabbing a bundle. Which
> > means that the <url> in step 1 doesn't _have_ to be a bundle. It can be
> > "go fetch from kernel.org, then come back to me".
>
> Or it could be a packfile (and the client discovers roots), as you
> mentioned in a separate message.  I personally do not think it buys
> us much, as long as we do a bundle represented as a header and a
> separate pack.

Yeah, I think I agree, and if it were just me, I'd implement the bundle
part and call it done. But the important thing to me is that we haven't
eliminated the possibility of doing the pure-pack thing on top, if we
choose to.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Duy Nguyen
In reply to this post by Jonathan Nieder-2
On Thu, Feb 11, 2016 at 4:01 AM, Jonathan Nieder <[hidden email]> wrote:
> ....
>
> I really like this design.  I'm tempted to implement it (since it
> lacks a bunch of the downsides of clone.bundle).

A bit disappointed we still do not address resumable fetch. But I'm
happy that it makes people excited and one of them even tempted to
implement. This thread sounds like we'll see resumable clone before
2017 :)
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Blake Burkhart
In reply to this post by Jeff King
On Wed, Feb 10, 2016 at 3:49 PM, Jeff King <[hidden email]> wrote:
>> 2. Servers that support resumable clone include a "resumable"
>> capability in the advertisement.
>
> Because the magic happens in the git protocol, that would mean this does
> not have to be limited to git-over-http. It could be "resumable=<url>"
> to point the client anywhere (the same server over a different protocol,
> another server, etc).

I'd like to call this out as a possible security issue before it gets
implemented. Allowing the server to instruct the client what protocol
to use is a security risk. This sounds like a fine feature, just do it
carefully.

I reported a similar issue was discussed off list which eventually
became CVE-2015-7545. Basically, git-submodule allowed a repository to
specify any protocol via .gitmodules, causing the client to fetch an
arbitrary URL using a protocol of the attacker's choice. Sadly, the
existence of git-remote-ext allows easily executing arbitrary shell
commands if the server can tell the client to use it. Furthermore,
it's possible the client has some insecure or sensitive custom git
remote helpers installed.

To address this GIT_ALLOW_PROTOCOL was introduced, and git-submodule
now uses it as of 33cfccb. This environment variable specifies a
default whitelist of protocols. Whoever implements this should
probably make use of GIT_ALLOW_PROTOCOL to limit resumable clones to
the same default whitelist that git-submodule now uses.

--
Blake Burkhart
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Junio C Hamano
In reply to this post by Duy Nguyen
Duy Nguyen <[hidden email]> writes:

> On Thu, Feb 11, 2016 at 4:01 AM, Jonathan Nieder <[hidden email]> wrote:
>> ....
>>
>> I really like this design.  I'm tempted to implement it (since it
>> lacks a bunch of the downsides of clone.bundle).
>
> A bit disappointed we still do not address resumable fetch.

As you already said, the initial "clone" and subsequent incremental
"fetch" are orthogonal issues.

Because the proposed update to "clone" has much larger payoff than
the proposed change to "fetch", i.e.

 * The amount of data that gets transferred is much larger, hence
   the chance of network timing out in a poor network environment is
   much higher, need for resuming much larger.

 * Not only the approach makes "clone" resumable and helping
   clients, it helps the server offload bulk transfer out to CDN.

and it has much smaller damage to the existing code, i.e.

 * We do not have to pessimize the packing process, only to discard
   the bulk of bytes that were generated, like the proposed approach
   for "fetch".

 * The area new code is needed is well isolated and the switch to
   new protocol happens very early in the exchange without sharing
   code to existing codepath; these properties make it less risky to
   introduce regression.

it is simply a good taste to give "clone" higher priority between
the two.

So I do not think there is much reason to feel disappointed.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Resumable clone based on hybrid "smart" and "dumb" HTTP

Jeff King
In reply to this post by Blake Burkhart
On Fri, Feb 12, 2016 at 07:40:43PM -0600, Blake Burkhart wrote:

> On Wed, Feb 10, 2016 at 3:49 PM, Jeff King <[hidden email]> wrote:
> >> 2. Servers that support resumable clone include a "resumable"
> >> capability in the advertisement.
> >
> > Because the magic happens in the git protocol, that would mean this does
> > not have to be limited to git-over-http. It could be "resumable=<url>"
> > to point the client anywhere (the same server over a different protocol,
> > another server, etc).
>
> I'd like to call this out as a possible security issue before it gets
> implemented. Allowing the server to instruct the client what protocol
> to use is a security risk. This sounds like a fine feature, just do it
> carefully.

Thanks for mentioning this. I agree it's a potential issue, and we
should use the same solution as submodules, as you pointed out:

> To address this GIT_ALLOW_PROTOCOL was introduced, and git-submodule
> now uses it as of 33cfccb. This environment variable specifies a
> default whitelist of protocols. Whoever implements this should
> probably make use of GIT_ALLOW_PROTOCOL to limit resumable clones to
> the same default whitelist that git-submodule now uses.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12