Re: Resumable clone/Gittorrent (again) - stable packs?

Zenaan Harkness
Bittorrent requires some stability around torrent files.

Can packs be generated deterministically?
If not by two separate repos, what about by one particular repo?

For Linus' linux-2.6.git, that repo is considered 'canonical' by many.

Pack-torrents could be ~1MiB, ~10MiB, ~100MiB, ~1GiB, or as configured
in a particular repo, which repo is the canonical location for
pack-torrents for all who consider that particular repo as canonical.

Perhaps a heuristic/ algorithm: once ten 10MiB (sequentially
generated) pack-torrents are floating around,
they could be simply concatenated to create a 100MiB pack-torrent,
with a deterministic name and SHA etc,
so that all those 10MiB pack-torrent files that torrent clients have,
can be re-used and locally combined into the 100MiB torrent as needed,
on demand.

Same for 100MiB -> 1GiB pack-torrents.

Individual extra commits:
While "small" number of additional commits go into a repo, clients
fall back to git-fetch, _after .

If Linus' linux-2.6.git (currently configured "canonical" repo) goes
offline, simply configure a new remote canonical repo.

Branches:
Other "branches" repos of linux-2.6.git could create their own
consistent 50MiB (or as configured) pack-torrents which are
commits-only-missing-from-linux-2.6 pack-torrents (ie, those missing
from that repo's "canonical" upstream).

This would require clients have a recursive torrent locator (I start
at linux-net.git, which requires linux-2.6.git, so I go get those
packs as well as the linux-net.git packs).

Perhaps have a system-wide or user-wide git repo/ torrent config, or
check with user running git-clone linux-net.git "Do you have an
existing git.vger.kernel.org/linux-2.6.git archive?".

Zen
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Resumable clone/Gittorrent (again) - stable packs?

Shawn Pearce
On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness <[hidden email]> wrote:
> Bittorrent requires some stability around torrent files.
>
> Can packs be generated deterministically?

No.  We have been trying to avoid doing that, because it ties us into
one particular compression scheme.  We can't tune the algorithm and
get better compression later, because it would generate a different
pack.  We also rely on the system's libz to generate the compressed
data.  A version change to libz may generate a different encoding for
the same uncompressed data, simply because they made a tweak to how
the compression was performed.  Likewise our own delta compression
code can be tweaked to produce a different (but logically identical)
delta between the same two objects.

Right now packs aren't deterministic because they use multiple threads
to generate the deltas, the thread scheduling impacts which base
objects deltas are tried against because threads can steal work from
each other if one finishes before the other one.  Disabling threading
entirely slows down delta compression considerably on multi-core
machines, but does remove this work-stealing, making the pack
deterministic... but only for this exact Git binary, with this same
shared libz.  If the system libz or Git changes, all bets are off.
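
The libz point can be illustrated with a small sketch. It is not Git's
code: Python's zlib module stands in for the system libz, and the sample
payload and the two compression levels are arbitrary assumptions. The
same input yields two different compressed streams (so any hash over the
pack bytes changes), yet both inflate to identical data.

```python
import hashlib
import zlib

# Arbitrary sample payload standing in for an uncompressed Git object.
payload = b"int main(void) { return 0; }\n" * 200

# Two "encoders": same format, different tuning.
fast = zlib.compress(payload, 1)
best = zlib.compress(payload, 9)

# The compressed bytes differ, so a torrent built over them would differ too...
streams_differ = fast != best
digests_differ = hashlib.sha1(fast).digest() != hashlib.sha1(best).digest()

# ...yet both streams decode to exactly the same object data.
same_content = zlib.decompress(fast) == zlib.decompress(best) == payload

print(streams_differ, digests_differ, same_content)
```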

We've been down this road before; we don't want to box ourselves into
a tight corner by setting for all time these tunable portions of the
compression algorithms.

--
Shawn.

Re: Resumable clone/Gittorrent (again) - stable packs?

Nicolas Pitre-2
In reply to this post by Zenaan Harkness
On Thu, 6 Jan 2011, Zenaan Harkness wrote:

> Bittorrent requires some stability around torrent files.
>
> Can packs be generated deterministically?

They _could_, but we do _not_ want to do that.

The only thing which is stable in Git is the canonical representation of
objects, and the objects they depend on, expressed by their SHA1
signature.  Any BitTorrent-alike design for Git must be based on that
property and not the packed representation of those objects which is not
meant to be stable.
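
As a minimal sketch of that stable property: a Git object name is the
SHA-1 of a short header ("<type> <size>" plus a NUL byte) followed by the
uncompressed content. The helper name git_object_id is an assumption made
for this illustration; the hashing rule itself is what "git hash-object"
computes for a blob.

```python
import hashlib
import zlib

def git_object_id(kind: str, payload: bytes) -> str:
    # Canonical form: "<kind> <size>", a NUL byte, then the raw payload.
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

blob = b"hello\n"
oid = git_object_id("blob", blob)
print(oid)

# However the object happens to be stored, its name is computed over the
# inflated bytes, so the compression settings never affect it.
for level in (1, 9):
    stored = zlib.compress(blob, level)
    assert git_object_id("blob", zlib.decompress(stored)) == oid
```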

If you don't want to design anything and simply reuse current BitTorrent
codebase then simply create a Git bundle from some release version and
seed that bundle for a sufficiently long period to be worth it.  Then
falling back to git fetch in order to bring the repo up to date with the
very latest commits should be small and quick.  When that clone gets too
big then it's time to start seeding another more up-to-date bundle.


Nicolas

Re: Resumable clone/Gittorrent (again) - stable packs?

Zenaan Harkness
On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <[hidden email]> wrote:

> On Thu, 6 Jan 2011, Zenaan Harkness wrote:
>
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?
>
> They _could_, but we do _not_ want to do that.
>
> The only thing which is stable in Git is the canonical representation of
> objects, and the objects they depend on, expressed by their SHA1
> signature.  Any BitTorrent-alike design for Git must be based on that
> property and not the packed representation of those objects which is not
> meant to be stable.
>
> If you don't want to design anything and simply reuse current BitTorrent
> codebase then simply create a Git bundle from some release version and
> seed that bundle for a sufficiently long period to be worth it.  Then
> falling back to git fetch in order to bring the repo up to date with the
> very latest commits should be small and quick.  When that clone gets too
> big then it's time to start seeding another more up-to-date bundle.

Thanks guys for the explanations.

So, we don't _want_ to generate packs deterministically.
BUT, we _can_ reliably unpack a pack (duh).

So if my configured "canonical upstream" decides on a particular
compression etc, I (my git client) doesn't care what has been chosen
by my upstream.

What is important for torrent-able packs though is stability over some
time period, no matter what the format.

There's been much talk of caching, invalidating of caches, overlapping
torrent-packs etc.

In every case, for torrents to work, the P2P'd files must have some
stability over some time period.
(If this assumption is incorrect, please clarify, not counting
every-file-is-a-torrent and every-commit-is-a-torrent.)

So, torrentable options:
- torrent per commit
- torrent per pack
- torrent per torrent-archive - new file format

Torrent per commit - too small, too many torrents; we need larger
p2p-able sizes in general.

Torrent per pack - packs non-deterministically created, both between
hosts and even intra-host (libz upgrade, nr_threads change, git pack
algorithm optimization).

A new torrent format, if "close enough" to current git pack
performance (cpu load, threadability, size) is potential for new
version of git pack file format - we don't want to store two sets of
pack files on disk, if sensible to not do so; unlikely to happen - I
can't conceive that a torrentable format would be anything but worse
than pack files and therefore would be rejected from git master.

Can we relax the perceived requirement to deterministically create
pack files?
Well, over what time period are pack files stable in a particular git?
Over what time period do we require stable files for torrenting?

Can we simply configure our local git to keep specified pack files for
specified time period?
And use those for torrent-packs?
Perhaps the torrent file could have a UseBy date?

Zen

Re: Resumable clone/Gittorrent (again) - stable packs?

Nicolas Pitre-2
On Fri, 7 Jan 2011, Zenaan Harkness wrote:

> On Fri, Jan 7, 2011 at 08:09, Nicolas Pitre <[hidden email]> wrote:
> > On Thu, 6 Jan 2011, Zenaan Harkness wrote:
> >
> >> Bittorrent requires some stability around torrent files.
> >>
> >> Can packs be generated deterministically?
> >
> > They _could_, but we do _not_ want to do that.
> >
> > The only thing which is stable in Git is the canonical representation of
> > objects, and the objects they depend on, expressed by their SHA1
> > signature.  Any BitTorrent-alike design for Git must be based on that
> > property and not the packed representation of those objects which is not
> > meant to be stable.
> >
> > If you don't want to design anything and simply reuse current BitTorrent
> > codebase then simply create a Git bundle from some release version and
> > seed that bundle for a sufficiently long period to be worth it.  Then
> > falling back to git fetch in order to bring the repo up to date with the
> > very latest commits should be small and quick.  When that clone gets too
> > big then it's time to start seeding another more up-to-date bundle.
>
> Thanks guys for the explanations.
>
> So, we don't _want_ to generate packs deterministically.
> BUT, we _can_ reliably unpack a pack (duh).
Of course.

> So if my configured "canonical upstream" decides on a particular
> compression etc, I (my git client) doesn't care what has been chosen
> by my upstream.

Indeed.  This is like saying: I'm sending you the value 52, but I chose
to use the representation "24 + 28", while someone else might decide to
encode that value as "13 * 4" instead.  You still are able to decode it
to the same result in both cases.
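
The same analogy can be sketched for deltas. The instruction format below
(copy/insert tuples) is a simplified model invented for this illustration
and is not Git's binary delta encoding; the sample strings are arbitrary.
Two different encoders describe the same target against the same base,
and the decoder reproduces identical bytes from either.

```python
from typing import List, Tuple, Union

# A delta is a list of instructions against a base object:
#   ("copy", offset, length)  -> reuse bytes from the base
#   ("insert", data)          -> literal new bytes
Op = Union[Tuple[str, int, int], Tuple[str, bytes]]

def apply_delta(base: bytes, ops: List[Op]) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown instruction: {op[0]!r}")
    return bytes(out)

base = b"The quick brown fox jumps over the lazy dog\n"
target = b"The quick red fox jumps over the lazy dog\n"

# Encoder A: copy the prefix, insert the changed word, copy the suffix.
delta_a = [("copy", 0, 10), ("insert", b"red"), ("copy", 15, 29)]
# Encoder B: a shorter copy, a longer literal insert, then the tail.
delta_b = [("copy", 0, 4), ("insert", b"quick red fox"), ("copy", 19, 25)]

print(apply_delta(base, delta_a) == apply_delta(base, delta_b))
```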

> What is important for torrent-able packs though is stability over some
> time period, no matter what the format.

Hence my suggestion to simply seed a Git bundle over BitTorrent. Bundles
are files which are designed to be used by completely ad hoc transports
and you can fetch from them just like if they were a remote repository.

> There's been much talk of caching, invalidating of caches, overlapping
> torrent-packs etc.

And in my humble opinion this is just all crap.  All those suggestions
are fragile, create administrative issues, eat up server resources, and
they all are suboptimal in the end. No one ever implemented a working
prototype so far either.

We don't want caches.  Fundamentally, we do not need any cache.  Caches
are a pain to administrate on a busy server anyway as they eat disk
space and they also represent a much bigger security risk compared to a
read-only operation.

Furthermore, a cache is good only for the common case that everyone
wants.  But with Git, you cannot presume that everyone is at the same
version locally.  So either you do a custom transfer for each client to
minimize transfers and caching the result in that case might not benefit
that many people, or you make the cached data bigger so to cover more
cases while making the transfer suboptimal.

Finally, we do have a cache already, and that's the existing packs
themselves.  During a clone, the vast majority of the transferred data
is streamed without further processing straight off those existing packs
as we try to reuse as much data as possible from those packs so not to
recompute/recompress that data all the time.

> In every case, for torrents to work, the P2P'd files must have some
> stability over some time period.
> (If this assumption is incorrect, please clarify, not counting
> every-file-is-a-torrent and every-commit-is-a-torrent.)
>
> So, torrentable options:
> - torrent per commit
> - torrent per pack
> - torrent per torrent-archive - new file format
>
> Torrent per commit - too small, too many torrents; we need larger
> p2p-able sizes in general.
>
> Torrent per pack - packs non-deterministically created, both between
> hosts and even intra-host (libz upgrade, nr_threads change, git pack
> algorithm optimization).
>
> A new torrent format, if "close enough" to current git pack
> performance (cpu load, threadability, size) is potential for new
> version of git pack file format - we don't want to store two sets of
> pack files on disk, if sensible to not do so; unlikely to happen - I
> can't conceive that a torrentable format would be anything but worse
> than pack files and therefore would be rejected from git master.
>
> Can we relax the perceived requirement to deterministically create
> pack files?
> Well, over what time period are pack files stable in a particular git?
> Over what time period do we require stable files for torrenting?
>
> Can we simply configure our local git to keep specified pack files for
> specified time period?
> And use those for torrent-packs?
> Perhaps the torrent file could have a UseBy date?
Again, this is just too much complexity for so little gain.

Here's what I suggest:

        cd my_project
        BUNDLENAME=my_project_$(date "+%s").gitbundle
        git bundle create $BUNDLENAME --all
        maketorrent-console your_favorite_tracker $BUNDLENAME

Then start seeding that bundle, and upload $BUNDLENAME.torrent as
bundle.torrent inside my_project.git on your server.

Now... Git clients could be improved to first check for the availability
of the file "bundle.torrent" on the remote side, either directly in
my_project.git, or through some Git protocol extension.  Or even better,
the torrent hash could be stored in a Git ref, such as
refs/bittorrent/bundle and the client could use that to retrieve the
bundle.torrent file through some other means.

When the bundle.torrent file is retrieved, then just pull the torrent
content (and seed it some more to be nice).  Then simply run "git clone"
using the original arguments but with the obtained bundle instead of the
original URL.  Then replace the remote URL in .git/config with the
actual remote URL instead of the bundle file path.  And finally perform
a "git pull" to bring the new commits that were added to the remote
repository since the bundle was created.  That final pull will be small
and quick.
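
The client-side steps above can be sketched as a command plan. This is
only an illustration: the function names, the bundle file name and the
example URL are hypothetical, and the torrent download itself is assumed
to have already happened. The git subcommands used (clone from a bundle
file, "remote set-url", pull) are existing ones.

```python
import subprocess
from typing import List

def bundle_clone_plan(bundle_path: str, remote_url: str, workdir: str) -> List[List[str]]:
    """Return the git commands for the clone-from-bundle procedure, in order."""
    return [
        # 1. Clone from the downloaded bundle as if it were a remote.
        ["git", "clone", bundle_path, workdir],
        # 2. Point 'origin' at the real repository instead of the bundle file.
        ["git", "-C", workdir, "remote", "set-url", "origin", remote_url],
        # 3. Bring in whatever was committed after the bundle was created.
        ["git", "-C", workdir, "pull"],
    ]

def run_plan(plan: List[List[str]]) -> None:
    # Execute each step, stopping at the first failure.
    for argv in plan:
        subprocess.run(argv, check=True)

plan = bundle_clone_plan(
    "my_project_1294375000.gitbundle",       # hypothetical file name
    "git://git.example.org/my_project.git",  # hypothetical URL
    "my_project",
)
for argv in plan:
    print(" ".join(argv))
```

Calling run_plan(plan) would execute the three steps; it is kept separate
so the plan can be inspected without touching the network.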

After a while, that final pull will get bigger as the difference between
the bundled version and the current tip in the remote repository will
grow.  So every so often, say 3 months, it might be a good idea to
create a new bundle so that the latest commits are included into it in
order to make that final pull small and quick again.
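
A possible way to automate that refresh decision, assuming the bundle is
named as in the commands above (my_project_<epoch seconds>.gitbundle).
The 90-day threshold approximates the "3 months" figure, and both helper
names are hypothetical.

```python
import re
import time
from typing import Optional

MAX_AGE_SECONDS = 90 * 24 * 3600  # roughly three months

def bundle_timestamp(name: str) -> Optional[int]:
    """Extract the epoch seconds from a name like 'my_project_1294375000.gitbundle'."""
    match = re.search(r"_(\d+)\.gitbundle$", name)
    return int(match.group(1)) if match else None

def needs_new_bundle(name: str, now: Optional[float] = None) -> bool:
    created = bundle_timestamp(name)
    if created is None:
        return True  # unknown age: safer to regenerate
    current = time.time() if now is None else now
    return current - created > MAX_AGE_SECONDS

print(needs_new_bundle("my_project_1294375000.gitbundle"))
```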

Isn't that sufficient?


Nicolas

Re: Resumable clone/Gittorrent (again) - stable packs?

Jeff King
On Thu, Jan 06, 2011 at 11:33:51PM -0500, Nicolas Pitre wrote:

> Here's what I suggest:
>
> cd my_project
> BUNDLENAME=my_project_$(date "+%s").gitbundle
> git bundle create $BUNDLENAME --all
> maketorrent-console your_favorite_tracker $BUNDLENAME
>
> Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> bundle.torrent inside my_project.git on your server.
>
> Now... Git clients could be improved to first check for the availability
> of the file "bundle.torrent" on the remote side, either directly in
> my_project.git, or through some Git protocol extension.  Or even better,
> the torrent hash could be stored in a Git ref, such as
> refs/bittorrent/bundle and the client could use that to retrieve the
> bundle.torrent file through some other means.

I really like the simplicity of this idea. It could even be generalized
to handle more traditional mirrors, too. Just slice up the refs/mirrors
namespace to provide different methods of fetching some initial set of
objects. For example, upload-pack might advertise (in addition to the
usual refs):

  refs/mirrors/bundle/torrent
  refs/mirrors/bundle/http
  refs/mirrors/fetch/git
  refs/mirrors/fetch/http

and the client can decide its preferred way of getting data: a bundle by
http or by torrent, or connecting directly to some other git repository
by git protocol or http. It would fetch the appropriate ref, which would
contain a blob in some method-specific format. For torrent, it would be
a torrent file. For the others, probably a newline-delimited set of
URLs. You could also provide a torrent-magnet ref if you didn't even
want to distribute the torrent file.
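
A rough sketch of how a client might act on such an advertisement. No
such protocol exists in Git; the refs/mirrors names come from the list
above, while the function names, the example URLs, and the simplification
of mapping each ref name directly to its blob content are assumptions
made for illustration.

```python
from typing import Dict, List, Optional, Tuple

def choose_mirror(advertised: Dict[str, bytes],
                  preference: List[str]) -> Optional[Tuple[str, bytes]]:
    """Pick the first preferred method the server advertises under refs/mirrors/."""
    for method in preference:
        ref = f"refs/mirrors/{method}"
        if ref in advertised:
            return ref, advertised[ref]
    return None

def parse_url_list(blob: bytes) -> List[str]:
    """Newline-delimited URLs; blank lines are ignored."""
    return [line.strip() for line in blob.decode("utf-8").splitlines() if line.strip()]

# Hypothetical advertisement: ref name -> content of the blob it points at.
advertised = {
    "refs/mirrors/bundle/http": (
        b"http://mirror1.example.org/proj.bundle\n"
        b"http://mirror2.example.org/proj.bundle\n"
    ),
    "refs/mirrors/fetch/git": b"git://mirror.example.org/proj.git\n",
}

# This client would like a torrent first, then an HTTP bundle, then git protocol.
choice = choose_mirror(advertised, ["bundle/torrent", "bundle/http", "fetch/git"])
if choice is not None:
    ref, blob = choice
    print(ref, parse_url_list(blob))
```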

And no matter what the method used, at the end you have some set of refs
and objects, and you can re-try your (now much smaller) fetch. And there
are a few obvious optimizations:

  1. When you get the initial set of refs from the master, remember
     them. If the mirror actually satisfies everything you were going to
     fetch, then you don't even have to reconnect for the final fetch.

  2. You can optionally cache the mirror list, and go straight to a
     mirror for future fetches instead of checking the master. This is
     only a reasonable thing to do if the mirrors are kept up to date,
     and provide good incremental access (i.e., only actual git-protocol
     mirrors, not torrent or http file).
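
Optimization 1 amounts to a set comparison, sketched below. The function
name and the shortened object IDs are hypothetical, and the sketch assumes
that having a ref's tip object implies having its full history, which
holds for a complete bundle or mirror fetch.

```python
from typing import Dict, Set

def refs_still_missing(master_refs: Dict[str, str], have: Set[str]) -> Dict[str, str]:
    """Refs advertised by the master whose tip object the mirror did not supply."""
    return {name: oid for name, oid in master_refs.items() if oid not in have}

# Refs remembered from the initial contact with the master (placeholder IDs).
master_refs = {"refs/heads/master": "a1", "refs/heads/next": "b2"}

# Case 1: the mirror supplied everything, so no final fetch is needed.
print(refs_still_missing(master_refs, {"a1", "b2", "c3"}))

# Case 2: 'next' moved after the mirror was made, so reconnect for it only.
print(refs_still_missing(master_refs, {"a1", "c3"}))
```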

-Peff

Re: Resumable clone/Gittorrent (again) - stable packs?

Jeff King
On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:

>   refs/mirrors/bundle/torrent
>   refs/mirrors/bundle/http
>   refs/mirrors/fetch/git
>   refs/mirrors/fetch/http
>
> and the client can decide its preferred way of getting data: a bundle by
> http or by torrent, or connecting directly to some other git repository
> by git protocol or http. It would fetch the appropriate ref, which would
> contain a blob in some method-specific format. For torrent, it would be
> a torrent file. For the others, probably a newline-delimited set of
> URLs. You could also provide a torrent-magnet ref if you didn't even
> want to distribute the torrent file.
>
> And no matter what the method used, at the end you have some set of refs
> and objects, and you can re-try your (now much smaller) fetch.

And I think it is probably obvious to you, Nicolas, since these are
problems you have been thinking about for some time, but the reason I am
interested in this expanded definition of mirroring is for a few
features people have been asking for:

  1. restartable clone; any bundle format is easily restartable using
     standard protocols

  2. avoid too-big clones; I remember the gentoo folks wanting to
     disallow full clones from their actual dev machines and push people
     off to some more static method of pulling. I think not just because
     of restartability, but because of the load on the dev machines

  3. people on low-bandwidth servers who fork major projects; if I write
     three kernel patches and host a git server, I would really like
     people to only fetch my patches from me and get the rest of it from
     kernel.org

-Peff

Re: Resumable clone/Gittorrent (again) - stable packs?

Zenaan Harkness
On Fri, Jan 7, 2011 at 16:31, Jeff King <[hidden email]> wrote:
> On Fri, Jan 07, 2011 at 12:22:07AM -0500, Jeff King wrote:
> the reason I am
> interested in this expanded definition of mirroring is for a few
> features people have been asking for:
>
>  1. restartable clone; any bundle format is easily restartable using
>     standard protocols

This is very important to me. I have failed to establish an initial
repo for a few larger projects, some apache projects and opentaps most
recently. It is getting _really_ frustrating.


>  2. avoid too-big clones; I remember the gentoo folks wanting to
>     disallow full clones from their actual dev machines and push people
>     off to some more static method of pulling. I think not just because
>     of restartability, but because of the load on the dev machines

And of course the lack of restartability causes an ongoing increase in
the load on the machines delivering those large clones.


>  3. people on low-bandwidth servers who fork major projects; if I write
>     three kernel patches and host a git server, I would really like
>     people to only fetch my patches from me and get the rest of it from
>     kernel.org

This is not so much of a problem - can already be handled by cloning
your linux-full.git to a private dir, and only publishing your shallow
"personal patches only" clone, or better still, just a tar-ball of
your 3 patches, or email them, or etc.


So I agree with the big issues being restartable large clones and
lowering server loads.

Zen

Re: Resumable clone/Gittorrent (again) - stable packs?

Ilari Liusvaara
In reply to this post by Jeff King
On Fri, Jan 07, 2011 at 12:31:19AM -0500, Jeff King wrote:
>
>   3. people on low-bandwidth servers who fork major projects; if I write
>      three kernel patches and host a git server, I would really like
>      people to only fetch my patches from me and get the rest of it from
>      kernel.org

One client-side-only feature that could be useful:

Ability to contact multiple servers in sequence, each time advertising
everything obtained so far. Then treat the new repo as clone of the last
address.

This would e.g. be very handy if you happen to have local mirror of say, Linux
kernel and want to fetch some related project without messing with alternates
or downloading everything again:

git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo

This would first fetch everything from local source and then update that
from remote, likely being vastly faster.
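
The effect can be modelled with plain sets. Note that --use-mirror is only
a proposal in this thread, not an existing git option; the function name
and the object names below are made up for the illustration, and each
repository is reduced to the set of object IDs it holds.

```python
from typing import Dict, List, Set, Tuple

def fetch_in_sequence(sources: List[Tuple[str, Set[str]]]) -> Tuple[Set[str], Dict[str, int]]:
    """Contact each source in order, transferring only objects not yet held."""
    have: Set[str] = set()
    transferred: Dict[str, int] = {}
    for name, objects in sources:
        missing = objects - have   # everything already obtained is advertised as "have"
        transferred[name] = len(missing)
        have |= missing
    return have, transferred

local_mirror = {f"obj{i}" for i in range(1000)}   # stands in for linux-2.6
remote = local_mirror | {"foo1", "foo2", "foo3"}  # linux-foo adds three objects

have, transferred = fetch_in_sequence([
    ("~/repositories/linux-2.6", local_mirror),
    ("git://foo.example/linux-foo", remote),
])
print(transferred)
```

In this model only the three extra objects cross the slow link; the bulk
comes from the local source.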

-Ilari

Re: Resumable clone/Gittorrent (again) - stable packs?

Jeff King
On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:

> On Fri, Jan 07, 2011 at 12:31:19AM -0500, Jeff King wrote:
> >
> >   3. people on low-bandwidth servers who fork major projects; if I write
> >      three kernel patches and host a git server, I would really like
> >      people to only fetch my patches from me and get the rest of it from
> >      kernel.org
>
> One client-side-only feature that could be useful:
>
> Ability to contact multiple servers in sequence, each time advertising
> everything obtained so far. Then treat the new repo as clone of the last
> address.
>
> This would e.g. be very handy if you happen to have local mirror of say, Linux
> kernel and want to fetch some related project without messing with alternates
> or downloading everything again:
>
> git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
>
> This would first fetch everything from local source and then update that
> from remote, likely being vastly faster.

I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
repo? In that case, isn't that basically the same as --reference? Or is
it a local mirror list?

If the latter, then yeah, I think it is a good idea. Clients should
definitely be able to ignore, override, or add to mirror lists provided
by servers. The server can provide hints about useful mirrors, but it is
up to the client to decide which methods are useful to it and which
mirrors are closest.

Of course there are some servers who will want to do more than hint
(e.g., the gentoo case where they really don't want people cloning from
the main machine). For those cases, though, I think it is best to
provide the hint and to reject clients who don't follow it (e.g., by
barfing on somebody who tries to do a full clone). You have to implement
that rejection layer anyway for older clients.

-Peff

Re: Resumable clone/Gittorrent (again) - stable packs?

Ilari Liusvaara
On Fri, Jan 07, 2011 at 02:17:19PM -0500, Jeff King wrote:

> On Fri, Jan 07, 2011 at 08:52:18PM +0200, Ilari Liusvaara wrote:
>
> >
> > git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
> >
> > This would first fetch everything from local source and then update that
> > from remote, likely being vastly faster.
>
> I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
> repo? In that case, isn't that basically the same as --reference? Or is
> it a local mirror list?

Yes, it is a repo. No, it isn't the same as --reference. It is list
of mirrors to try first before connecting to final repository and can
be any type of repository URL (local, true smart transport, smart HTTP,
dumb HTTP, etc...)

Idea is that you have list of mirrors that are faster than the final
repository, but not necessarily complete. You want to download most of
the stuff from there.

> If the latter, then yeah, I think it is a good idea. Clients should
> definitely be able to ignore, override, or add to mirror lists provided
> by servers. The server can provide hints about useful mirrors, but it is
> up to the client to decide which methods are useful to it and which
> mirrors are closest.

This is essentially adding mirrors to mirror list (modulo that mirrors
are not assumed to be complete).

Security:

Confidentiality: The connection to mirror must traverse only trusted
links or be encrypted if material from mirror is sensitive.

Integrity: The same integrity as the connection to final repo (assuming
SHA-1 can't be collided) due to fact that git object naming is securely
unique.
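
A sketch of why an untrusted mirror cannot slip in altered content: the
receiver recomputes each object's name from its bytes and rejects any
mismatch. The helper names are hypothetical and only blobs are handled
here; the hashing rule is Git's usual SHA-1 over "<type> <size>", a NUL
byte, and the content.

```python
import hashlib
from typing import Dict, List

def object_id(kind: str, payload: bytes) -> str:
    header = f"{kind} {len(payload)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

def reject_tampered(claimed: Dict[str, bytes]) -> List[str]:
    """Return the IDs whose blob content does not hash to the claimed name."""
    return [oid for oid, payload in claimed.items()
            if object_id("blob", payload) != oid]

# What a mirror claims to be sending: object ID -> blob content.
from_mirror = {
    "ce013625030ba8dba906f756967f9e9ca394464a": b"hello\n",     # genuine
    "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391": b"not empty",   # altered content
}
bad = reject_tampered(from_mirror)
print(bad)
```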

> Of course there are some servers who will want to do more than hint
> (e.g., the gentoo case where they really don't want people cloning from
> the main machine). For those cases, though, I think it is best to
> provide the hint and to reject clients who don't follow it (e.g., by
> barfing on somebody who tries to do a full clone). You have to implement
> that rejection layer anyway for older clients.

With option like this, a client could do:

git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo

To first grab stuff via HTTP (well-packed dumb HTTP is very light on the
server) and then continue via git:// (now much cheaper because client is
relatively up to date).

-Ilari

Re: Resumable clone/Gittorrent (again) - stable packs?

Jeff King
On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:

> > I'm not clear in your example what ~/repositories/linux-2.6 is. Is it a
> > repo? In that case, isn't that basically the same as --reference? Or is
> > it a local mirror list?
>
> Yes, it is a repo. No, it isn't the same as --reference. It is list
> of mirrors to try first before connecting to final repository and can
> be any type of repository URL (local, true smart transport, smart HTTP,
> dumb HTTP, etc...)

OK, I understand what you mean. I was thrown off by your example using a
local repository (in which case you probably would want --reference to
save disk space, unless the burden of alternates management is too
much).

So yeah, I think we are on the same page, except that you were proposing
to pass the mirror directly, and I was proposing passing a mirror file
which would contain a list of mirrors. Yours is much simpler and would
probably be what people want most of the time.

> > If the latter, then yeah, I think it is a good idea. Clients should
> > definitely be able to ignore, override, or add to mirror lists provided
> > by servers. The server can provide hints about useful mirrors, but it is
> > up to the client to decide which methods are useful to it and which
> > mirrors are closest.
>
> This is essentially adding mirrors to mirror list (modulo that mirrors
> are not assumed to be complete).

I think there should always be an assumption that mirrors are not
necessarily complete. That is necessary for bundle-like mirrors to be
feasible, since updating the bundle for every commit defeats the
purpose.

It would be nice for there to be a way for some mirrors to be marked as
"should be considered complete and authoritative", since we can optimize
out the final check of the master in that case (as well as for future
fetches). But that's a future feature. My plan was to leave space in the
mirror list for arbitrary metadata of that sort.

> > Of course there are some servers who will want to do more than hint
> > (e.g., the gentoo case where they really don't want people cloning from
> > the main machine). For those cases, though, I think it is best to
> > provide the hint and to reject clients who don't follow it (e.g., by
> > barfing on somebody who tries to do a full clone). You have to implement
> > that rejection layer anyway for older clients.
>
> With option like this, a client could do:
>
> git clone --use-mirror=http://git.example.org/base/foo git://git.example.org/foo
>
> To first grab stuff via HTTP (well-packed dumb HTTP is very light on the
> server) and then continue via git:// (now much cheaper because client is
> relatively up to date).

Yes, exactly.

-Peff

Re: Resumable clone/Gittorrent (again) - stable packs?

Ilari Liusvaara
On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote:
> On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:
>
>
> I think there should always be an assumption that mirrors are not
> necessarily complete. That is necessary for bundle-like mirrors to be
> feasible, since updating the bundle for every commit defeats the
> purpose.

Also add protocol that grabs a bundle from HTTP and then opens that
up? :-)

> It would be nice for there to be a way for some mirrors to be marked as
> "should be considered complete and authoritative", since we can optimize
> out the final check of the master in that case (as well as for future
> fetches). But that's a future feature. My plan was to leave space in the
> mirror list for arbitrary metadata of that sort.

The first thing one should fetch when connecting to another repository
is its list of references. From that list one can tell whether what one
already has is complete (with --use-mirror this only allows skipping
commit negotiation and fetch, not the whole connection, because the
repositories are contacted in order)...

-Ilari
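
The completeness check described above can be sketched with `git ls-remote` and `git cat-file -e`. This is an illustration only: `missing_refs` is a hypothetical helper, the two local repositories are stand-ins for a master and a stale copy, and having a ref's tip object is treated as "complete", which only holds for a repository that is neither shallow nor corrupt.

```shell
#!/bin/sh
# Sketch: use the remote's ref advertisement to see what is still missing.
set -e
work=$(mktemp -d)
cd "$work"

git init -q master
git -C master -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "first"
git clone -q master copy          # complete at this point
git -C master -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "second"   # copy is now out of date

# Print every advertised ref whose tip object is absent locally.
# usage: missing_refs <local-repo> <remote>
missing_refs() {
    git ls-remote "$2" | while read -r sha ref; do
        git -C "$1" cat-file -e "$sha" 2>/dev/null || echo "$ref"
    done
}

missing=$(missing_refs copy "$work/master")
echo "missing before fetch: $missing"

git -C copy fetch -q origin
after=$(missing_refs copy "$work/master")
echo "missing after fetch: $after"
```

An empty result means negotiation and fetch can be skipped; the connection to list the refs is still needed.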

Re: Resumable clone/Gittorrent (again) - stable packs?

Jeff King
On Sat, Jan 08, 2011 at 12:21:33AM +0200, Ilari Liusvaara wrote:

> On Fri, Jan 07, 2011 at 04:56:31PM -0500, Jeff King wrote:
> > On Fri, Jan 07, 2011 at 11:45:01PM +0200, Ilari Liusvaara wrote:
> >
> >
> > I think there should always be an assumption that mirrors are not
> > necessarily complete. That is necessary for bundle-like mirrors to be
> > feasible, since updating the bundle for every commit defeats the
> > purpose.
>
> Also add protocol that grabs a bundle from HTTP and then opens that
> up? :-)

Well, yes, that still needs to be implemented. But it's all client-side,
so the server just has to provide the bundle somewhere.

> > It would be nice for there to be a way for some mirrors to be marked as
> > "should be considered complete and authoritative", since we can optimize
> > out the final check of the master in that case (as well as for future
> > fetches). But that's a future feature. My plan was to leave space in the
> > mirror list for arbitrary metadata of that sort.
>
> The first thing one should get/do when connecting to another repository
> is its list of references. One can see from there if what one has got
> is complete or not (with --use-mirror that only allows skipping commit
> negotiation and fetch, not the whole connection due to the fact that the
> repositories are contacted in order)...

Yes, but it would be cool to be able to skip even that connect in some
cases (e.g., mirrors can be useful not just to take load off the master,
but also when the master isn't available, either for downtime or because
the client is behind a firewall). But the default should definitely be
to double-check that the master is right, and we can leave more advanced
cases for later (we just need to be aware of leaving room for them now).

I'm going to start working on a patch series for this, so hopefully
we'll see how it's shaping up in a day or two.

-Peff

Re: Resumable clone/Gittorrent (again) - stable packs?

Nguyễn Thái Ngọc Duy
In reply to this post by Nicolas Pitre-2
On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <[hidden email]> wrote:
> Here's what I suggest:
>
>        cd my_project
>        BUNDLENAME=my_project_$(date "+%s").gitbundle
>        git bundle create $BUNDLENAME --all
>        maketorrent-console your_favorite_tracker $BUNDLENAME
>
> Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> bundle.torrent inside my_project.git on your server.

I was about to ask if we could put more "trailer" sha-1 checksums into
the bundle, so we can verify which part is corrupt without
redownloading the whole thing (this is over http/ftp, not torrent).

But I realize it's just easier to split the bundle into multiple
packs, so we can verify and redownload only the corrupt packs. Logically
it is still a single pack. Splitting helps put more sha-1 checksums in
without changing the pack format. The packs will be merged back into one
with the "index-pack --pack-stream" patch I sent elsewhere.
--
Duy
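
The "verify and redownload only the corrupt part" idea can be illustrated with standard tools. Note that this sketch swaps in plain byte-level splitting (`split`, `sha1sum`, `cat`) rather than the pack-level splitting proposed above, and it does not use the "index-pack --pack-stream" patch; the file names and the 256 KiB part size are arbitrary assumptions, and random bytes stand in for a real bundle.

```shell
#!/bin/sh
# Sketch: one checksum per part, so only damaged parts are fetched again.
set -e
work=$(mktemp -d)
cd "$work"

# Stand-in for a large bundle: 1 MiB of random bytes.
head -c 1048576 /dev/urandom > big.bundle

# Publisher side: cut into 256 KiB parts and record one SHA-1 per part.
split -b 262144 big.bundle part.
sha1sum part.* > parts.sha1

# Simulate a download that damaged exactly one part.
printf 'garbage' > part.ab

# Client side: list the parts whose checksum does not match.
bad=$(sha1sum -c parts.sha1 2>/dev/null | grep -v ': OK$' | cut -d: -f1)
echo "corrupt parts: $bad"

# "Re-download" only the bad parts (here: regenerate them locally).
split -b 262144 big.bundle fresh.
for p in $bad; do
    cp "fresh.${p#part.}" "$p"
done

# Reassemble; the glob expands in sorted order, matching split's order.
cat part.* > rebuilt.bundle
cmp big.bundle rebuilt.bundle && echo "rebuilt bundle matches the original"
```

The trade-off is the one discussed in the thread: more checksums make corruption cheap to localize, at the cost of publishing and tracking a manifest next to the data.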

Re: Resumable clone/Gittorrent (again) - stable packs?

Nicolas Pitre-2
On Mon, 10 Jan 2011, Nguyen Thai Ngoc Duy wrote:

> On Fri, Jan 7, 2011 at 11:33 AM, Nicolas Pitre <[hidden email]> wrote:
> > Here's what I suggest:
> >
> >        cd my_project
> >        BUNDLENAME=my_project_$(date "+%s").gitbundle
> >        git bundle create $BUNDLENAME --all
> >        maketorrent-console your_favorite_tracker $BUNDLENAME
> >
> > Then start seeding that bundle, and upload $BUNDLENAME.torrent as
> > bundle.torrent inside my_project.git on your server.
>
> I was about to ask if we could put more "trailer" sha-1 checksums to
> the bundle, so we can verify which part is corrupt without
> redownloading the whole thing (this is over http/ftp.. not torrent).

Aren't HTTP and FTP based on TCP, which is meant to be a reliable
transport protocol already?  In this case, isn't the final SHA1 embedded
in the bundle/pack sufficient?  Normally, your HTTP/FTP client
should get you all data or partial data, but not wrong data.


Nicolas
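
As a small illustration of the point about the trailing checksum: a bundle that arrived only partially is rejected when git unpacks it, because the pack inside it ends with a SHA-1 over its contents. The sketch below assumes a throwaway repository and simulates the partial transfer by cutting the last 10 bytes off the file; all names are hypothetical.

```shell
#!/bin/sh
# Sketch: a truncated bundle fails to unpack, a complete one succeeds.
set -e
work=$(mktemp -d)
cd "$work"

git init -q src
git -C src -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "first"
git -C src bundle create "$work/full.bundle" --all 2>/dev/null

# Simulate an interrupted download: drop the final 10 bytes.
size=$(wc -c < full.bundle)
head -c $((size - 10)) full.bundle > partial.bundle

# Cloning from the partial file should fail while indexing the pack.
if git clone -q partial.bundle from-partial 2>/dev/null; then
    partial=accepted
else
    partial=rejected
fi
echo "partial bundle: $partial"

# The complete file clones normally.
git clone -q full.bundle from-full
echo "full bundle: accepted"
```

This detects an incomplete file but does not say which part is missing, so a resumable HTTP/FTP download is still what makes the retry cheap.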

Re: Resumable clone/Gittorrent (again) - stable packs?

John Wyzer
In reply to this post by Shawn Pearce
On 06/01/11 18:05, Shawn Pearce wrote:
> On Wed, Jan 5, 2011 at 18:29, Zenaan Harkness<[hidden email]>  wrote:
>> Bittorrent requires some stability around torrent files.
>>
>> Can packs be generated deterministically?
>

I hope that I don't get something technically wrong (I did not read any
code, only skimmed the docs) and that this question is not redundant:

Why not provide an alternative mode for the git:// protocol that,
instead of retrieving one big packed blob, breaks the transfer down to
the smallest atomic objects in the repository? Those never change and
should be able to survive partial transfers.
While this might not be as efficient network traffic-wise, it would
provide a solution for those behind unreliable connections.


Re: Resumable clone/Gittorrent (again) - stable packs?

Sam Vilain
In reply to this post by Ilari Liusvaara
On 08/01/11 07:52, Ilari Liusvaara wrote:

> Ability to contact multiple servers in sequence, each time advertising
> everything obtained so far. Then treat the new repo as clone of the last
> address.
>
> This would e.g. be very handy if you happen to have local mirror of say, Linux
> kernel and want to fetch some related project without messing with alternates
> or downloading everything again:
>
> git clone --use-mirror=~/repositories/linux-2.6 git://foo.example/linux-foo
>
> This would first fetch everything from local source and then update that
> from remote, likely being vastly faster.

Coming to this discussion a little late, I'll summarise the previous
research.

First, the idea of applying the straight BitTorrent protocol to the pack
files was raised, but as Nicolas mentions, this is not useful because
the pack files are not deterministic.  The protocol was revisited based
around the part which is stable, object manifests.  The RFC is at
http://utsl.gen.nz/gittorrent/rfc.html and the prototype code (an
unsuccessful GSoC project) is at http://repo.or.cz/w/VCS-Git-Torrent.git

After some thought, I decided that the BitTorrent protocol itself is all
cruft and that trying to cut it down to be useful was a waste of time.
So, this is where the idea of "automatic mirroring" came from.  With
Automatic Mirroring, the two main functions of P2P operation - peer
discovery and partial transfer - are broken into discrete features.

I wrote this patch series so far, for "client-side mirroring":

http://thread.gmane.org/gmane.comp.version-control.git/133626/focus=133628

The later levels are roughly discussed on this page:

http://code.google.com/p/gittorrent/wiki/MirrorSync

The "mirror sync" part is the complicated one, and as others have noted
no truly successful prototype has yet been built.  Actually the Perl
gittorrent implementation did manage to perform an incremental clone; it
just didn't wrap it up nicely.  But I won't go into that too much.
There was also another GSoC project that looked at caching the object
list generation, the most expensive part of the process in the Perl
implementation.  This was a generic mechanism for accelerating object
graph traversal and showed promise, but unfortunately it was never merged.

The client-side mirroring patch, in its current form, already supports
out-of-date mirrors.  It saves refs first into
'refs/mirrors/hostname/...' and finally contacts the main server to
check what objects it is still missing.  So, if there were a regular
bittorrent+bundle transport available, it would be a useful way to
support an incremental clone: the client would first clone the (static)
bittorrent bundle and unpack it with its refs into the 'refs/mirrors/xxx/'
namespace, which makes the subsequent 'git fetch' for the most recent
objects a much more efficient operation.

Hope that helps!

Cheers,
Sam
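
The bundle-then-fetch flow described above can be approximated with stock git. This is a sketch under assumptions, not the patch series itself: the namespace `refs/mirrors/bundle/` and the file name `snapshot.bundle` are made up for illustration, a local file stands in for a bundle obtained over BitTorrent, and a local repository stands in for the main server.

```shell
#!/bin/sh
# Sketch: seed a new repository from a static bundle, keep its refs in a
# mirror namespace, then fetch the remainder from the main server.
set -e
work=$(mktemp -d)
cd "$work"

# "master" stands in for the main server.
git init -q master
git -C master -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "first"

# A static snapshot, as it might be distributed out of band.
git -C master bundle create "$work/snapshot.bundle" --all 2>/dev/null

# The main server moves on, so the snapshot is out of date.
git -C master -c user.name=demo -c user.email=demo@example.org \
    commit -q --allow-empty -m "second"

# Step 1: unpack the bundle, storing its branches under refs/mirrors/.
git init -q clone
git -C clone fetch -q "$work/snapshot.bundle" \
    'refs/heads/*:refs/mirrors/bundle/heads/*'

# Step 2: ordinary fetch; objects already obtained from the bundle are
# advertised as "have" and need not be sent again.
git -C clone remote add origin "$work/master"
git -C clone fetch -q origin
```

Keeping the bundle's refs outside `refs/heads/` and `refs/remotes/` means a stale snapshot never masquerades as the state of the real remote.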

Re: Resumable clone/Gittorrent (again) - stable packs?

Sam Vilain
In reply to this post by John Wyzer
On 11/01/11 05:39, John Wyzer wrote:
> Why not provide an alternative mode for the git:// protocol that
> instead of retrieving a big packaged blob breaks this down to the
> smallest atomic objects from the repository? Those are not changing
> and should be able to survive partial transfers.
> While this might not be as efficient network traffic-wise it would
> provide a solution for those behind breaking connections.

To put this into numbers, for perl.git that might mean transferring 2GB
of data instead of 70MB of pack.

Sam

Re: Resumable clone/Gittorrent (again) - stable packs?

Nguyễn Thái Ngọc Duy
In reply to this post by John Wyzer
On Mon, Jan 10, 2011 at 11:39 PM, John Wyzer <[hidden email]> wrote:
> Why not provide an alternative mode for the git:// protocol that instead of
> retrieving a big packaged blob breaks this down to the smallest atomic
> objects from the repository? Those are not changing and should be able to
> survive partial transfers.
> While this might not be as efficient network traffic-wise it would provide a
> solution for those behind breaking connections.

That's what I'm getting to, except that I'll send deltas as much as I can.
--
Duy