optimising a push by fetching objects from nearby repos


optimising a push by fetching objects from nearby repos

Sitaram Chamarty
Hi,

Is there a trick to optimising a push by telling the receiver to pick up
missing objects from some other repo on its own server, to cut down even
more on network traffic?

So, hypothetically,

     git push user@host:repo1 --look-for-objects-in=repo2

I'm aware of the alternates mechanism, but that makes the dependency on
the other repo sort-of permanent.  I'm looking for a temporary
dependence, just for the duration of the push.  Naturally, the objects
should be brought into the target repo for that to happen, except that
this would be doing more from disk and less from the network.

My gut says this isn't possible, and I've searched enough to almost be
sure, but before I give up, I wanted to ask.

thanks
sitaram

Milki: I'm sure you won't mind the cc, since you know the context :-)
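For comparison, git does offer one built-in, strictly temporary form of borrowing: the GIT_ALTERNATE_OBJECT_DIRECTORIES environment variable applies an alternate for a single invocation only, with nothing recorded on disk. A minimal sketch with throwaway repos standing in for repo1/repo2:

```shell
#!/bin/sh
# Sketch: GIT_ALTERNATE_OBJECT_DIRECTORIES borrows another repo's
# objects for one command only -- unlike objects/info/alternates,
# no permanent dependency is recorded.
set -e
tmp=$(mktemp -d)

# a "lender" repo with one commit
git init -q "$tmp/repo2"
git -C "$tmp/repo2" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m seed
sha=$(git -C "$tmp/repo2" rev-parse HEAD)

# an empty "borrower" repo
git init -q --bare "$tmp/repo1.git"

# with the env var set, repo1 can see repo2's commit...
GIT_ALTERNATE_OBJECT_DIRECTORIES="$tmp/repo2/.git/objects" \
    git --git-dir="$tmp/repo1.git" cat-file -t "$sha"

# ...but no alternates file was created, so the dependence ends here
test ! -e "$tmp/repo1.git/objects/info/alternates" && echo temporary
```

This only helps a single local command, though; it does not by itself make a push cheaper, which is the question below.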
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: optimising a push by fetching objects from nearby repos

Duy Nguyen
On Sat, May 10, 2014 at 8:39 PM, Sitaram Chamarty <[hidden email]> wrote:

> Hi,
>
> Is there a trick to optimising a push by telling the receiver to pick up
> missing objects from some other repo on its own server, to cut down even
> more on network traffic?
>
> So, hypothetically,
>
>     git push user@host:repo1 --look-for-objects-in=repo2
>
> I'm aware of the alternates mechanism, but that makes the dependency on
> the other repo sort-of permanent.  I'm looking for a temporary
> dependence, just for the duration of the push.  Naturally, the objects
> should be brought into the target repo for that to happen, except that
> this would be doing more from disk and less from the network.
>
> My gut says this isn't possible, and I've searched enough to almost be
> sure, but before I give up, I wanted to ask.

My feeling is it is possible, assuming that the target sees and reuses
objects from repo2 already. Injecting an alternate repo at runtime
should be possible. We exclude objects from sending at commit level,
not object level. So after the initial exclusion, we may need to run
the to-be-sent objects against the alternate repo to skip some more,
but that should not cost much if repo2 is fully packed. The receiver
always does the connectivity test. So if you make a mistake and
specify repo3 instead, the receiver will reject the push and the
target repo won't be corrupted.
--
Duy

Re: optimising a push by fetching objects from nearby repos

brian m. carlson
In reply to this post by Sitaram Chamarty
On Sat, May 10, 2014 at 07:09:37PM +0530, Sitaram Chamarty wrote:

> Hi,
>
> Is there a trick to optimising a push by telling the receiver to pick up
> missing objects from some other repo on its own server, to cut down even
> more on network traffic?
>
> So, hypothetically,
>
>     git push user@host:repo1 --look-for-objects-in=repo2
>
> I'm aware of the alternates mechanism, but that makes the dependency on
> the other repo sort-of permanent.  I'm looking for a temporary
> dependence, just for the duration of the push.  Naturally, the objects
> should be brought into the target repo for that to happen, except that
> this would be doing more from disk and less from the network.
>
> My gut says this isn't possible, and I've searched enough to almost be
> sure, but before I give up, I wanted to ask.
I don't believe this is possible.  There has been some discussion on
related matters at least fairly recently, though.

Part of the reason nobody has implemented this is because it exposes
additional security concerns.  If I create a commit that references an
object I don't own, but is in someone else's repository, this feature
could allow me to gain access to objects which I shouldn't have access
to unless the authentication and permissions layer is very, very
careful.  This would make many very simple HTTPS and SSH setups much
more complex.  Alternates don't have this problem because they're done
server-side.

I definitely understand the desire for this, though.  I would probably
use it myself if it were available.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


Re: optimising a push by fetching objects from nearby repos

milki
On 17:23 Sat 10 May, brian m. carlson wrote:

> I don't believe this is possible.  There has been some discussion on
> related matters at least fairly recently, though.
>
> Part of the reason nobody has implemented this is because it exposes
> additional security concerns.  If I create a commit that references an
> object I don't own, but is in someone else's repository, this feature
> could allow me to gain access to objects which I shouldn't have access
> to unless the authentication and permissions layer is very, very
> careful.  This would make many very simple HTTPS and SSH setups much
> more complex.  Alternates don't have this problem because they're done
> server-side.

If this were implemented server side and specified with, say, a config
option, would this security concern go away?

--
milki

Re: optimising a push by fetching objects from nearby repos

brian m. carlson
On Sat, May 10, 2014 at 10:32:26AM -0700, milki wrote:

> On 17:23 Sat 10 May, brian m. carlson wrote:
> > I don't believe this is possible.  There has been some discussion on
> > related matters at least fairly recently, though.
> >
> > Part of the reason nobody has implemented this is because it exposes
> > additional security concerns.  If I create a commit that references an
> > object I don't own, but is in someone else's repository, this feature
> > could allow me to gain access to objects which I shouldn't have access
> > to unless the authentication and permissions layer is very, very
> > careful.  This would make many very simple HTTPS and SSH setups much
> > more complex.  Alternates don't have this problem because they're done
> > server-side.
>
> If this were implemented server side and specified with, say, a config
> option, would this security concern go away?
It would probably be fine if it were a config option.  I'd prefer it be
off by default, though, to prevent surprises.

The attack scenario I'm thinking of is where you have several different
users, but the web server runs as one system user.  So /git/bmc/foo.git
is owned by bmc, and /git/alice/bar.git is owned by alice.  The web
server will check authentication based on the path, and approve or deny
it.  If it's approved, it will invoke the git daemon as a CGI script.

But the git daemon itself only knows that it was authenticated as a
given user, and knows nothing about what the permissions scheme is.  So
it will blithely let me refer to any other repository and import its
data if the option is enabled.  The web server only considered the path
it was fed, so it couldn't have blocked this.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187


Re: optimising a push by fetching objects from nearby repos

Junio C Hamano
In reply to this post by Sitaram Chamarty
Sitaram Chamarty <[hidden email]> writes:

> Is there a trick to optimising a push by telling the receiver to pick up
> missing objects from some other repo on its own server, to cut down even
> more on network traffic?
>
> So, hypothetically,
>
>     git push user@host:repo1 --look-for-objects-in=repo2
>
> I'm aware of the alternates mechanism, but that makes the dependency on
> the other repo sort-of permanent.

In the direction of fetching, this may be a good starting point.

    http://thread.gmane.org/gmane.comp.version-control.git/243918/focus=245397

In the direction of pushing, theoretically you could:

 - define a new capability "look-for-objects-in" to pass the name of
   the repository from "git push" to the "receive-pack";

 - have "receive-pack" temporarily borrow from the named repository
   (if the policy on the server side allows it), and accept the push;

 - repack in order to dissociate the receiving repository from the
   other repository it temporarily borrowed from.

which would be the natural inverse of the approach suggested in the
"Can I borrow just temporarily while cloning?" thread.
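With today's plumbing, the borrow-then-dissociate sequence above can be sketched server-side roughly as follows (throwaway repos stand in for repo1/repo2; the "look-for-objects-in" capability itself does not exist):

```shell
#!/bin/sh
# Rough server-side sketch of the three steps above using existing
# plumbing; the capability to trigger this from "git push" is
# hypothetical.
set -e
tmp=$(mktemp -d)

git init -q "$tmp/repo2"
git -C "$tmp/repo2" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m seed
sha=$(git -C "$tmp/repo2" rev-parse HEAD)

git init -q --bare "$tmp/repo1.git"

# step 2: temporarily borrow from repo2, so an incoming thin pack
# could be completed against its objects
echo "$tmp/repo2/.git/objects" > "$tmp/repo1.git/objects/info/alternates"
git --git-dir="$tmp/repo1.git" update-ref refs/heads/master "$sha"

# (receive-pack would accept the push here)

# step 3: repack copies every reachable borrowed object into repo1's
# own store, after which the alternates file can be removed
git --git-dir="$tmp/repo1.git" repack -a -d -q
rm "$tmp/repo1.git/objects/info/alternates"
git --git-dir="$tmp/repo1.git" cat-file -t "$sha"
```

This is essentially what "git clone --dissociate" does on the fetch side: repack without --local, then drop the alternate.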

But I haven't thought things through with respect to what else needs
to be modified to make sure this does not have adverse interactions
with simultaneous pushes into the same repository, which would make
it harder to solve for "receive-pack" than for "clone/fetch".



Re: optimising a push by fetching objects from nearby repos

Sitaram Chamarty
On 05/11/2014 02:32 AM, Junio C Hamano wrote:

> Sitaram Chamarty <[hidden email]> writes:
>
>> Is there a trick to optimising a push by telling the receiver to pick up
>> missing objects from some other repo on its own server, to cut down even
>> more on network traffic?
>>
>> So, hypothetically,
>>
>>      git push user@host:repo1 --look-for-objects-in=repo2
>>
>> I'm aware of the alternates mechanism, but that makes the dependency on
>> the other repo sort-of permanent.
>
> In the direction of fetching, this may be a good starting point.
>
>      http://thread.gmane.org/gmane.comp.version-control.git/243918/focus=245397

That's an interesting thread and it's recent too.  However, it's about
clone (though the intro email mentions other commands also).

I'm specifically interested in push efficiency right now.  When you
"fork" someone's repo to your own space, and you push your fork to the
same server, it ought to be able to get most of the common objects from
disk (specifically, from the repo you forked), and only what extra you
did from the network.

Clones do have a workaround (clone with --reference, then repack, as you
said in that thread), but no such workaround exists for push.

> In the direction of pushing, theoretically you could:
>
>   - define a new capability "look-for-objects-in" to pass the name of
>     the repository from "git push" to the "receive-pack";
>
>   - have "receive-pack" temporarily borrow from the named repository
>     (if the policy on the server side allows it), and accept the push;
>
>   - repack in order to dissociate the receiving repository from the
>     other repository it temporarily borrowed from.
>
> which would be the natural inverse of the approach suggested in the
> "Can I borrow just temporarily while cloning?" thread.
>
> But I haven't thought things through with respect to what else needs
> to be modified to make sure this does not have adverse interactions
> with simultaneous pushes into the same repository, which would make
> it harder to solve for "receive-pack" than for "clone/fetch".

I'll leave it in your capable hands :-)  My C coding days are long gone!

I do have a way to do this in gitolite (haven't coded it yet; just
thinking).  Gitolite lets you specify something to do before git-*-pack
runs, and I was planning something like this:

terminology: borrow, borrower repo, reference repo

"borrow = relaxed" mode

     1.  check if the user has read access to the reference repo; skip
         the rest of this if he doesn't

     2.  from reference repo's "objects", find all directories and
         "mkdir" them into borrower's objects directory, then find all
         files and "ln" (hardlink) them. This is presumably what "clone
         -l" does.

     This method is close to constant time since we're not copying
     objects.

     It has the potential issue that if an object existed in the
     reference repo that was subsequently *deleted* (say, a commit that
     contained a password, which was quickly overwritten when
     discovered), and the attacker knows the SHA, he can get the commit
     out by sending a commit that depends on it, then fetching it back.

     (He could do that to the reference repo directly if he had write
     access, but we'll assume he doesn't, so this *is* a possible
     attack).
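     The two steps above can be sketched like so (throwaway repos stand
     in for the borrower and reference repos; this mimics the
     object-store part of what "git clone -l" does):

```shell
#!/bin/sh
# Sketch of "relaxed" borrowing: recreate the reference repo's object
# directory layout in the borrower, then hardlink the object files.
# No object data is copied, so this is close to constant time.
set -e
tmp=$(mktemp -d)

git init -q "$tmp/reference"
git -C "$tmp/reference" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m seed
sha=$(git -C "$tmp/reference" rev-parse HEAD)

git init -q --bare "$tmp/borrower.git"
src="$tmp/reference/.git/objects"
dst="$tmp/borrower.git/objects"

# mkdir every directory, then hardlink every file
(cd "$src" && find . -type d -exec mkdir -p "$dst/{}" \;)
(cd "$src" && find . -type f -exec ln -f "{}" "$dst/{}" \;)

# the borrower now has the object -- dangling, since no ref points at it
git --git-dir="$tmp/borrower.git" cat-file -t "$sha"
```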

"borrow = strict" mode

     1.  (same as for "relaxed" mode)

     2.  actually *fetch* all refs from the reference repo to the
         borrower (into, say, 'refs/borrowed'), then delete all those
         refs so you just have the objects now.

     Unlike the previous method, this takes time proportional to the
     delta between borrower and reference, and may load the system a bit,
     but unless the reference repo is highly volatile, this will settle
     down. The point is that it cannot be used to get anything that the
     user doesn't already have access to anyway.
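     Step 2 of the strict mode, sketched the same way ("update-ref
     --stdin" batches the ref deletions; names are made up):

```shell
#!/bin/sh
# Sketch of "strict" borrowing: fetch every ref into a throwaway
# refs/borrowed/ namespace, then delete those refs so that only the
# objects remain.
set -e
tmp=$(mktemp -d)

git init -q "$tmp/reference"
git -C "$tmp/reference" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m seed
sha=$(git -C "$tmp/reference" rev-parse HEAD)

git init -q --bare "$tmp/borrower.git"

# fetch all refs under refs/borrowed/*...
git --git-dir="$tmp/borrower.git" fetch -q "$tmp/reference" \
    '+refs/*:refs/borrowed/*'

# ...then delete them in one batch, leaving the objects dangling
git --git-dir="$tmp/borrower.git" \
    for-each-ref --format='delete %(refname)' refs/borrowed |
  git --git-dir="$tmp/borrower.git" update-ref --stdin

git --git-dir="$tmp/borrower.git" cat-file -t "$sha"
```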

I still have to try it, but it sounds like both these would work.

I'd appreciate any comments though...

regards
sitaram

Re: optimising a push by fetching objects from nearby repos

Storm-Olsen, Marius
On 5/10/2014 8:04 PM, Sitaram Chamarty wrote:

> On 05/11/2014 02:32 AM, Junio C Hamano wrote:
>
> That's an interesting thread and it's recent too.  However, it's about
> clone (though the intro email mentions other commands also).
>
> I'm specifically interested in push efficiency right now.  When you
> "fork" someone's repo to your own space, and you push your fork to
> the same server, it ought to be able to get most of the common
> objects from disk (specifically, from the repo you forked), and only
> what extra you did from the network.
>
...
>
> I do have a way to do this in gitolite (haven't coded it yet; just
> thinking).  Gitolite lets you specify something to do before
> git-*-pack runs, and I was planning something like this:

And here you're poking the stick at the real solution to your problem.

Many of the Git repo managers will neatly set up a server-side repo
clone for you, with alternates into the original repo saving both
network and disk I/O.

So your work flow would instead be:
   1. Fork repo on server
   2. Remotely clone your own forked repo

I think it's more appropriate to handle this higher level operation
within the security context of a git repo manager, rather than directly
in git.

--
.marius


Re: optimising a push by fetching objects from nearby repos

Sitaram Chamarty
On 05/11/2014 07:04 AM, Storm-Olsen, Marius wrote:

> On 5/10/2014 8:04 PM, Sitaram Chamarty wrote:
>> On 05/11/2014 02:32 AM, Junio C Hamano wrote:
>>
>> That's an interesting thread and it's recent too.  However, it's about
>> clone (though the intro email mentions other commands also).
>>
>> I'm specifically interested in push efficiency right now.  When you
>> "fork" someone's repo to your own space, and you push your fork to
>> the same server, it ought to be able to get most of the common
>> objects from disk (specifically, from the repo you forked), and only
>> what extra you did from the network.
>>
> ...
>>
>> I do have a way to do this in gitolite (haven't coded it yet; just
>> thinking).  Gitolite lets you specify something to do before
>> git-*-pack runs, and I was planning something like this:
>
> And here you're poking the stick at the real solution to your problem.
>
> Many of the Git repo managers will neatly set up a server-side repo
> clone for you, with alternates into the original repo saving both
> network and disk I/O.

Gitolite already has a "fork" command that does that (though it uses
"-l", not alternates).  I specifically don't want to use alternates, and
I also specifically am looking for something that activates on a push --
in the situations I am looking to optimise, the clone already happened.

> So your work flow would instead be:
>     1. Fork repo on server
>     2. Remotely clone your own forked repo
>
> I think it's more appropriate to handle this higher level operation
> within the security context of a git repo manager, rather than directly
> in git.

Yes, because of the "read access" check in my suggested procedure to
handle this.  (Otherwise this is as valid as the plan suggested for
clone in Junio's email in [1]).

[1]: http://thread.gmane.org/gmane.comp.version-control.git/243918/focus=245397

I will certainly be doing this in gitolite.  The point of my post was to
validate the flow with the *git* experts in case they catch something I
missed, not to say "this should be done *in* git".

Re: optimising a push by fetching objects from nearby repos

Storm-Olsen, Marius
On 5/10/2014 9:10 PM, Sitaram Chamarty wrote:

> On 05/11/2014 07:04 AM, Storm-Olsen, Marius wrote:
>> On 5/10/2014 8:04 PM, Sitaram Chamarty wrote:
>>
>> Many of the Git repo managers will neatly set up a server-side repo
>> clone for you, with alternates into the original repo saving both
>> network and disk I/O.
>
> Gitolite already has a "fork" command that does that (though it uses
> "-l", not alternates).  I specifically don't want to use alternates,
> and I also specifically am looking for something that activates on a
> push -- in the situations I am looking to optimise, the clone already
> happened.

You can probably get the managers to do a fork without alternates too.

Also, it doesn't matter if you have already cloned from the original
repo remotely. If you use the git manager to clone the original repo on
the server, and you push to your new repo, only your changes will go
back over the wire. The git protocol will figure out which objects
are missing to complete the new HEAD, and send only those.

So
    1. Clone remote repo
    2. Hack hack hack
    3. Fork repo on server
    4. Push changes to your own remote repo
is equally efficient.


>> So your work flow would instead be:
>>     1. Fork repo on server
>>     2. Remotely clone your own forked repo
>>
>> I think it's more appropriate to handle this higher level operation
>> within the security context of a git repo manager, rather than directly
>> in git.
>
> Yes, because of the "read access" check in my suggested procedure to
> handle this.  (Otherwise this is as valid as the plan suggested for
> clone in Junio's email in [1]).

It's similar, but security issues come into play due to the swapped
direction, which is why I think it's wrong to place it in the push
command. Now, having the 'borrow' complement to 'reference' in Git seems
like a good idea, and should work for your case too, but IMO it should
be configured within the security context of the repo manager, and not
on an individual push. *shrug*


> [1]:
> http://thread.gmane.org/gmane.comp.version-control.git/243918/focus=245397
>
> I will certainly be doing this in gitolite.  The point of my post was to
> validate the flow with the *git* experts in case they catch something I
> missed, not to say "this should be done *in* git".

Absolutely, and I think that's how everyone perceived it :) It's a good
idea, with some tweaks, I think.


--
.marius


Re: optimising a push by fetching objects from nearby repos

Sitaram Chamarty
On 05/11/2014 08:41 AM, Storm-Olsen, Marius wrote:
> On 5/10/2014 9:10 PM, Sitaram Chamarty wrote:

>      1. Clone remote repo
>      2. Hack hack hack
>      3. Fork repo on server
>      4. Push changes to your own remote repo
> is equally efficient.

Your suggestions are good for a manual setup where the target repo
doesn't already exist.

But what I was looking for was validation from git.git folks of the idea
of replicating what "git clone -l" does, for an *existing* repo.

For example, I'm assuming that bringing in only the objects -- without
any of the refs pointing to them, making them all dangling objects --
will still allow the optimisation to occur (i.e., git will still say "oh
yeah I have these objects, even if they're dangling so I won't ask for
them from the pusher" and not "oh these are dangling objects; so I don't
recognise them from this perspective -- you'll have to send me those
again").

[1]: for any gitolite-aware folks reading this: this involves mirroring,
bringing a new mirror into play, normal repos, wild repos, and on and
on...

Re: optimising a push by fetching objects from nearby repos

Junio C Hamano
Sitaram Chamarty <[hidden email]> writes:

> But what I was looking for was validation from git.git folks of the idea
> of replicating what "git clone -l" does, for an *existing* repo.
>
> For example, I'm assuming that bringing in only the objects -- without
> any of the refs pointing to them, making them all dangling objects --
> will still allow the optimisation to occur (i.e., git will still say "oh
> yeah I have these objects, even if they're dangling so I won't ask for
> them from the pusher" and not "oh these are dangling objects; so I don't
> recognise them from this perspective -- you'll have to send me those
> again").

So here is an educated guess by a git.git folk.  I haven't read the
codepath for some time, so I may be missing some details:

 - The set of objects sent over the wire in "push" direction is
   determined by the receiving end listing what it has to the
   sending end, and then the sending end excluding what the
   receiving end told that it already has.

 - The receiving end tells the sending end what it has by showing
   the names of its refs and their values.

Having otherwise dangling objects in your object store alone will
not make them reachable from the refs shown to the sending end.  But
there is another trick the receiving end employs.

 - The receiving end also includes the refs and their values that
   appear in the repositories it borrows objects from (its alternate
   repositories) when it tells the sending end what objects it
   already has.

So what you "assumed" is not entirely correct---bringing in only the
objects will not give you any optimization.

But because we infer, from the location of the borrowed object store
(i.e. its "objects" directory), where the refs that point at these
borrowed objects live (i.e. in "../refs" relative to that "objects"
directory), the receiving end can advertise those refs as well,
instead of treating the borrowed objects as dangling, and we still
get the same optimisation.

At least, that is the theory ;-)
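The advertised entries can be observed directly with receive-pack's --advertise-refs option: with an alternate configured, the advertisement grows ".have" lines for the alternate's tips. A sketch with throwaway repos (the output is pkt-line framed, so we just count the ".have" entries):

```shell
#!/bin/sh
# Sketch: with an alternate configured, receive-pack advertises the
# alternate's tips as ".have" lines -- this is what lets the sender
# omit those objects from the pack.
set -e
tmp=$(mktemp -d)

git init -q "$tmp/repo2"
git -C "$tmp/repo2" -c user.name=t -c user.email=t@example.com \
    commit -q --allow-empty -m seed

git init -q --bare "$tmp/repo1.git"
echo "$tmp/repo2/.git/objects" > "$tmp/repo1.git/objects/info/alternates"

# repo1 has no refs of its own, yet the advertisement carries repo2's
# tip as a ".have" entry
git receive-pack --advertise-refs "$tmp/repo1.git" | grep -ac '\.have'
```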

Re: optimising a push by fetching objects from nearby repos

Sitaram Chamarty
On 05/11/2014 11:34 PM, Junio C Hamano wrote:

> Sitaram Chamarty <[hidden email]> writes:
>
>> But what I was looking for was validation from git.git folks of the idea
>> of replicating what "git clone -l" does, for an *existing* repo.
>>
>> For example, I'm assuming that bringing in only the objects -- without
>> any of the refs pointing to them, making them all dangling objects --
>> will still allow the optimisation to occur (i.e., git will still say "oh
>> yeah I have these objects, even if they're dangling so I won't ask for
>> them from the pusher" and not "oh these are dangling objects; so I don't
>> recognise them from this perspective -- you'll have to send me those
>> again").
>
> So here is an educated guess by a git.git folk.  I haven't read the
> codepath for some time, so I may be missing some details:
>
>   - The set of objects sent over the wire in "push" direction is
>     determined by the receiving end listing what it has to the
>     sending end, and then the sending end excluding what the
>     receiving end told that it already has.
>
>   - The receiving end tells the sending end what it has by showing
>     the names of its refs and their values.
>
> Having otherwise dangling objects in your object store alone will
> not make them reachable from the refs shown to the sending end.  But
> there is another trick the receiving end employs.
>
>   - The receiving end also includes the refs and their values that
>     appear in the repositories it borrows objects from (its alternate
>     repositories) when it tells the sending end what objects it
>     already has.
>
> So what you "assumed" is not entirely correct---bringing in only the
> objects will not give you any optimization.
>
> But because we infer, from the location of the borrowed object store
> (i.e. its "objects" directory), where the refs that point at these
> borrowed objects live (i.e. in "../refs" relative to that "objects"
> directory), the receiving end can advertise those refs as well,
> instead of treating the borrowed objects as dangling, and we still
> get the same optimisation.

Thanks!

Everything makes sense.  However, I'm not using the alternates
mechanism.

Since gitolite has the advantage of allowing me to do something before
and something after the git-receive-pack, I'm fetching all the refs into
a temporary namespace before, and deleting all of them after.  So, just
for the duration of the push, the refs do exist, and optimisation (of
network traffic) therefore happens.

In addition, since I check that the user has read access to the lender
repo (and don't do this optimisation if he does not), there is -- by
definition -- no security issue, in the sense that he cannot get
anything from the lender repo that he could not have got directly.

Thanks for all your help again, especially the very clear explanation!

regards
sitaram