Git commit generation numbers

Re: Git commit generation numbers

Shawn Pearce
On Wed, Jul 20, 2011 at 17:18,  <[hidden email]> wrote:
>
> if it's just locally generated, then I could easily see generation numbers
> being different on different people's systems, depending on the order that
> they see commits (either locally generated or pulled from others)

But this should only happen if the user fudges with their Git sources
and makes Git produce a different generation number.

If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
+ 1" then it doesn't matter who merged what commits, the same commit
appears at the same part of the graph relative to all of its
ancestors, and therefore always has the same generation number. This
is true whether or not the commit contains the generation number.
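
To make the rule concrete, here is a minimal Python sketch (not Git's code). The `parents` dict, the function name, and the choice of generation 0 for parentless commits are assumptions made for illustration. It computes the same numbers no matter what order the commits are listed in:

```python
def generation_numbers(parents):
    """Return {commit: generation} for a DAG given as {commit: [parent, ...]}.

    Assumption: a commit with no parents gets generation 0.
    """
    gen = {}
    for start in parents:
        stack = [start]                      # iterative walk, no recursion limit
        while stack:
            commit = stack[-1]
            if commit in gen:                # already numbered via another path
                stack.pop()
                continue
            pending = [p for p in parents[commit] if p not in gen]
            if pending:
                stack.extend(pending)        # number the parents first
            else:
                gen[commit] = max((gen[p] + 1 for p in parents[commit]), default=0)
                stack.pop()
    return gen


# The same four commits, seen in two different orders.
order_a = {"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"]}
order_b = {"M": ["B", "C"], "C": ["A"], "B": ["A"], "A": []}
```

Both `generation_numbers(order_a)` and `generation_numbers(order_b)` yield identical numbers, since each value depends only on the commit's own ancestry.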

> If it's part of the commit, then as that commit gets propagated the
> generation number gets propagated as well, and every repository will agree
> on what the generation number is for any commit that's shared.

This isn't really as beneficial as you are making it out to be. We
already can agree on what the generation number should be for any
given commit, if you topo-sort the commit DAG, you get the same
result.

> I agree that this consistency guarantee seems to be valuable.

It's valuable, but it's consistent either with a cache, or not.

--
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Git commit generation numbers

Phil Hord (hordp)
In reply to this post by David Lang
On 07/20/2011 08:18 PM, [hidden email] wrote:

> On Wed, 20 Jul 2011, Phil Hord wrote:
>
>> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>>> On Wed, 20 Jul 2011, [hidden email] wrote:
>>>
>>>> If the generation number is part of the repository then it's going to
>>>> be the same for everyone.
>>> The actual generation number will be, and has to be, the same for
>>> everyone with the same repository content, regardless of the cache
>>> used.
>>> It is a well defined number with no room for interpretation.
>>
>> Nonsense.
>>
>> Even if the generation number is well-defined and shared by all
>> clients, the only quasi-essential definition is "for each A in
>> ancestors_of(B), gen(A) < gen(B)".
>>
>> In practice, the actual generation number *will be the same* for
>> everyone with the same repository content, unless and until someone
>> develops a different calculation method.  But there is no reason to
>> require that the number *has to be* the same for everyone unless you
>> expect (or require) everyone to share their gen-caches.
>
> and I think this is why Linus is not happy with a cache. He is seeing
> this as something that has significantly more value if it is going to
> be consistent in a distributed manner than if it's just something
> calculated locally that can be different from other systems.

It will only be used locally, so it needn't be consistent with anyone
else's.

>
> if it's just locally generated, then I could easily see generation
> numbers being different on different people's systems, depending on the
> order that they see commits (either locally generated or pulled from
> others)
>
> If it's part of the commit, then as that commit gets propagated the
> generation number gets propagated as well, and every repository will
> agree on what the generation number is for any commit that's shared.
>
> I agree that this consistency guarantee seems to be valuable.

I can't see why.

>> Surely there will be a competent and efficient gen-cache API.  But
>> most code can just ask if B --contains A or even just use rev-list
>> and benefit from the increased speed of the answer.  Because most
>> code doesn't really care about the gen numbers themselves, but only
>> the speed of determining ancestry.
>
> in that case, why bother with generation numbers at all? the improved
> data based heuristic seems to solve that problem.

Does it?  Surely the ruckus would've died down in that case.  But I
haven't been reading pu.

It seems to me that the main drawback to a gen-cache is that it slows
down the first operation after even a local clone (with just hardlinks).

On the other hand, I see too many nails in the distributed-gen-numbers
coffin:  legacy commits can't catch up (and therefore suffer), and
legacy clients can trash or corrupt even "new-style" commits.

Phil


Re: Git commit generation numbers

Phil Hord (hordp)
In reply to this post by Shawn Pearce

On 07/20/2011 08:37 PM, Shawn Pearce wrote:

> On Wed, Jul 20, 2011 at 17:18,<[hidden email]>  wrote:
>> if it's just locally generated, then I could easily see generation numbers
>> being different on different people's systems, depending on the order that
>> they see commits (either locally generated or pulled from others)
> But this should only happen if the user fudges with their Git sources
> and makes Git produce a different generation number.
>
> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
> + 1" then it doesn't matter who merged what commits, the same commit
> appears at the same part of the graph relative to all of its
> ancestors, and therefore always has the same generation number. This
> is true whether or not the commit contains the generation number.

Interesting.  I was going to disagree with the latter part of your
statement, but then I realized you're right.

And that your algorithm allows duplicate generation numbers.

And that there's nothing wrong with that.

Because it meets the one quasi-essential need, "for each A in
ancestors_of(B), gen(A) < gen(B)".
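
A small Python sketch of that requirement, for illustration only. The helper names and the toy graph are made up here. It accepts a numbering with duplicates, accepts one with holes, and rejects one where a child is not above its parent:

```python
def ancestors(parents, commit):
    """All commits reachable from `commit` through parent links (excluding itself)."""
    seen, stack = set(), list(parents[commit])
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen


def is_valid_numbering(parents, gen):
    """True if gen(A) < gen(B) for every ancestor A of every commit B."""
    return all(gen[a] < gen[b] for b in parents for a in ancestors(parents, b))


parents = {"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"]}
shared = {"A": 0, "B": 1, "C": 1, "M": 2}    # B and C share a number
gappy = {"A": 0, "B": 10, "C": 7, "M": 40}   # holes in the sequence
broken = {"A": 0, "B": 1, "C": 2, "M": 2}    # M is not above its parent C
```

Here `shared` and `gappy` both satisfy the condition, while `broken` does not.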

>> If it's part of the commit, then as that commit gets propagated the
>> generation number gets propagated as well, and every repository will agree
>> on what the generation number is for any commit that's shared.
> This isn't really as beneficial as you are making it out to be. We
> already can agree on what the generation number should be for any
> given commit, if you topo-sort the commit DAG, you get the same
> result.
>
>> I agree that this consistency guarantee seems to be valuable.
> It's valuable, but it's consistent either with a cache, or not.

I still fail to see the value.

Phil


Re: Git commit generation numbers

Nicolas Pitre-2
In reply to this post by Phil Hord (hordp)
On Wed, 20 Jul 2011, Phil Hord wrote:

> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
> > On Wed, 20 Jul 2011, [hidden email] wrote:
> >
> > > If the generation number is part of the repository then it's going to
> > > be the same for everyone.
> > The actual generation number will be, and has to be, the same for
> > everyone with the same repository content, regardless of the cache used.
> > It is a well defined number with no room for interpretation.
>
> Nonsense.
>
> Even if the generation number is well-defined and shared by all clients, the
> only quasi-essential definition is "for each A in ancestors_of(B), gen(A) <
> gen(B)".

Sure.  But what do you gain by making holes in the sequence?

> In practice, the actual generation number *will be the same* for everyone with
> the same repository content, unless and until someone develops a different
> calculation method.  But there is no reason to require that the number *has to
> be* the same for everyone unless you expect (or require) everyone to share
> their gen-caches.

And with the above you clearly reinforced the argument _against_ storing
the generation number in the commit object.  If you can imagine a
different calculation method already, and if it is actually useful, then
who knows if something even better could be done eventually.


Nicolas

Re: Git commit generation numbers

Phil Hord (hordp)
On 07/20/2011 08:58 PM, Nicolas Pitre wrote:

> On Wed, 20 Jul 2011, Phil Hord wrote:
>
>> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>>> On Wed, 20 Jul 2011, [hidden email] wrote:
>>>
>>>> If the generation number is part of the repository then it's going to
>>>> be the same for everyone.
>>> The actual generation number will be, and has to be, the same for
>>> everyone with the same repository content, regardless of the cache used.
>>> It is a well defined number with no room for interpretation.
>> Nonsense.
>>
>> Even if the generation number is well-defined and shared by all clients, the
>> only quasi-essential definition is "for each A in ancestors_of(B), gen(A)<
>> gen(B)".
> Sure.  But what do you gain by making holes in the sequence?

Depends on the algorithm.  Probably speed.  Possibly more efficient
limited-cache building (jit-style discovery in reverse, as-needed, for
example).

What do you gain by enforcing contiguousness?  Why not require all gen
numbers to be even?  Or prime?  ;)

>> In practice, the actual generation number *will be the same* for everyone with
>> the same repository content, unless and until someone develops a different
>> calculation method.  But there is no reason to require that the number *has to
>> be* the same for everyone unless you expect (or require) everyone to share
>> their gen-caches.
> And with the above you clearly reinforced the argument _against_ storing
> the generation number in the commit object.  If you can imagine a
> different calculation method already, and if it is actually useful, then
> who knows if something even better could be done eventually.

Good.  Nice to see I'm being self-consistent, then.

Phil


Re: Git commit generation numbers

David Lang
In reply to this post by Shawn Pearce
On Wed, 20 Jul 2011, Shawn Pearce wrote:

> On Wed, Jul 20, 2011 at 17:18,  <[hidden email]> wrote:
>>
>> if it's just locally generated, then I could easily see generation numbers
>> being different on different people's systems, depending on the order that
>> they see commits (either locally generated or pulled from others)
>
> But this should only happen if the user fudges with their Git sources
> and makes Git produce a different generation number.
>
> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
> + 1" then it doesn't matter who merged what commits, the same commit
> appears at the same part of the graph relative to all of its
> ancestors, and therefore always has the same generation number. This
> is true whether or not the commit contains the generation number.

I have to think about this more, but I'm wondering about cases where the
same result is achieved via different methods, something along the lines
of one person developing something with _many_ commits (creating a large
generation number) that one person merges far sooner than another, causing
the commits that they do after the merge to have much larger generation
numbers than someone making the same changes, but doing the merge later

something like

   C9
    \
C2 - C10 - C11 - C12

vs
                 C9
                   \
C2 - C3 - C4 - C5 - C10

where the C10-12 in the first set and C3-5 in the second set are
completely unrelated to what's done in C9, and C12 in the first set and C10
in the second set are identical trees.

now I know that part of a commit is what its parents are, so that is
different (and that may be enough to say that generations don't matter
and this entire issue is moot), but I haven't thought about it long enough
to convince myself what would (or should) happen in these cases.

David Lang

>> If it's part of the commit, then as that commit gets propagated the
>> generation number gets propagated as well, and every repository will agree
>> on what the generation number is for any commit that's shared.
>
> This isn't really as beneficial as you are making it out to be. We
> already can agree on what the generation number should be for any
> given commit, if you topo-sort the commit DAG, you get the same
> result.
>
>> I agree that this consistency guarantee seems to be valuable.
>
> It's valuable, but it's consistent either with a cache, or not.
>
>

Re: Git commit generation numbers

Christian Couder-2
In reply to this post by Jeff King
On Tue, Jul 19, 2011 at 10:00 PM, Jeff King <[hidden email]> wrote:

> On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote:
>
>> Perhaps but with "git replace" you can choose to create new replace refs and
>> deprecate the old replace refs to fix this where you got it wrong.
>>
>> It would be easier to do that if "git replace" supported sub directories like
>> "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs
>> more easily.
>
> I think all of the arguments I cut from your email are reasonable, but
> the crux of the issue comes down to this point.
>
> If you are interested in actually correcting the skew, then yes, replace
> refs are a good solution. But doing so is going to involve somebody
> looking at the commits and deciding which ones are wrong, and what they
> should be.

I think that we can help the user a lot to find the skew, and then to
decide which commits are wrong, and then to fix the skew even if the
fix we suggest is far from being perfect.

> And maybe that's a good thing to do for people who really
> care about cleaning history.

Yeah, so maybe at one point we will want to help these people even if
we have implemented automatic generation numbers. Then this means that
automated generation numbers are useful only if:

1) there are commits with skews
2) the heuristics to deal with some skew don't work
3) the user is too lazy to use the help we (can) provide to fix the skews

I think that we can probably find heuristics that will deal with at
least 95% of the cases. For example we could perhaps decide that we
don't cut off a traversal until the date difference is greater than 5
days.
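
A rough Python sketch of such a heuristic, only to show the shape of the idea. It is not how Git implements its traversal cut-off. The function name and the `parents` and `date` dicts are hypothetical, and timestamps are plain seconds:

```python
import heapq

DAY = 24 * 60 * 60
FIVE_DAYS = 5 * DAY   # the slop figure suggested above


def reachable_with_slop(parents, date, target, tip, slop=FIVE_DAYS):
    """Walk back from `tip`, newest first, looking for `target`.

    A commit is treated as a cut-off point only when its timestamp is more
    than `slop` seconds older than the target's timestamp.
    """
    heap = [(-date[tip], tip)]
    seen = {tip}
    while heap:
        _, c = heapq.heappop(heap)
        if c == target:
            return True
        if date[c] < date[target] - slop:
            continue                       # too old: do not look past this commit
        for p in parents[c]:
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, (-date[p], p))
    return False


# A <- B <- C, where B was committed with a clock running two days slow.
parents = {"A": [], "B": ["A"], "C": ["B"]}
date = {"A": 10 * DAY, "B": 8 * DAY, "C": 12 * DAY}
```

With `slop=0` the walk stops at the skewed commit B and wrongly reports that A is not reachable from C; with the five-day slop it finds A. A skew larger than the slop would still defeat it, which is the case where the user would have to be involved.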

Then in the hopefully few cases where there are really big skews that
won't be caught by our heuristics, (but that we can automatically
detect when fetching or committing,) we can perhaps afford to ask the
user to do a small analysis to properly fix the skew.

I mean that at one point when things are too weird it is ok and
perhaps even a good thing to involve the user.

> But for something like "speed up revision traversal by assuming commit
> timestamps are roughly increasing", we want something very automated,
> and what it needs to say is much weaker (not "this is what this commit
> _should_ say", but rather "this commit might be right, but it is not a
> good point for cutting off a traversal"). So that's a much easier
> problem, and it's easy to do in an automated way.

Yeah, generation numbers look like an easy thing to do. And yeah,
being automated is great too. But it does not mean it is the right
thing to do. (Or perhaps we could have them but not save them in any
cache, nor in the commit object.)

> So I think while you could use replace refs to handle this issue, it is
> not always going to be the right solution, and there is room for
> something simpler (and weaker).

You know, replace refs can be used to fix or improve a lot of things
like bad authors, clock skews, bisecting on a fixed up history,
working on a larger or smaller repository than the original, and so
on. And of course for each of these problems you may find another
solution tailored to the problem at hand that will seem simpler or
easier. But in the end if you develop all these other solutions you
will have developed a lot of stuff that will be harder to maintain,
less generic, more complex and so on, than properly developed replace
refs.

Thanks,
Christian.

Re: Git commit generation numbers

Drew Northup
In reply to this post by David Lang

On Wed, 2011-07-20 at 16:26 -0700, [hidden email] wrote:

> On Wed, 20 Jul 2011, George Spelvin wrote:
>
> >> The alternative of having to sometimes use the generation number,
> >> sometimes use the possibly broken commit date, makes for much more
> >> complicated code that has to be maintained forever.  Having a solution
> >> that starts working only after a certain point in history doesn't look
> >> elegant to me at all.  It is not like having different pack formats
> >> where back and forth conversions can be made for the _entire_ history.
> >
> > It seemed like a pretty strong argument to me, too.
>
> except that you then have different caches on different systems. If the
> generation number is part of the repository then it's going to be the same
> for everyone.

I keep hearing (reading) people stating this utterly unfounded argument.
The fact is that for any work not yet integrated back into a shared
repository it just isn't true--and even after upstream integration the
truth of such a statement may be limited.

I have not read yet one discussion about how generation numbers [baked
into a commit] deal with rebasing, for instance. Do we assign one more
than the revision prior to the base of the rebase operation or do we
start with the revision one after the highest of those original commits
included in the rebase? Depending on how that is done
_drastically_different_ numbers can come out of different repository
instances for the same _final_ DAG. This is one major reason why, as I
see it, local storage is good for generation numbers and putting them in
the commit is bad.

I have no problem with putting an _advisory_ "revision number" in the
commit. It would not be expected to have a proper "1-to-1 and onto"
functional association with the _final_ DAG, but it could potentially
get us some nice benefits. We would still need to answer questions like
the one I ask above, but it would hurt less to change if we need to.

One other sane option that was mentioned at least once in passing was to
store the generation number in some Git "filesystem-level" object. This
could then be reconciled with each "git gc" or "git fsck" operation if
not more often. This is less ad-hoc and messy than a separate cache,
becomes amenable to the standard tool-set, and always gets updated (no
invalid cache). If an _advisory_ revision number is available in commits
that are sent along those could conceivably be used to help build up the
local git-fs generation numbers more quickly. (If a "git pull" is issued
to our repo, or we push to another, we don't send the generation numbers
locally stored--we expect the git-fs machinery to regenerate those on
the fly.)

I may not be one of the "resident rocket scientists," but that's how I
see it.

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59


Re: Git commit generation numbers

George Spelvin
In reply to this post by David Lang
On <[hidden email]> wrote:
> On Wed, 20 Jul 2011, Shawn Pearce wrote:
>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
>> + 1" then it doesn't matter who merged what commits, the same commit
>> appears at the same part of the graph relative to all of its
>> ancestors, and therefore always has the same generation number. This
>> is true whether or not the commit contains the generation number.

> I have to think about this more, but I'm wondering about cases where the
> same result is achieved via different methods, something along the lines
> of one person developing something with _many_ commits (creating a large
> generation number) that one person merges far sooner than another, causing
> the commits that they do after the merge to have much larger generation
> numbers than someone making the same changes, but doing the merge later

Can't happen.  Using the basic algorithm as Shawn described, the
generation number is defined uniquely by the ancestor DAG.

The generation number is the length of the longest path to a
root (zero-ancestor) commit through the DAG.

If you look at past discussion, several people have thought it was
okay to bake into the commit precisely because it can be computed
once and will never change.

However, git does have some ability to amend the history DAG after
it's been written, using grafts and replace objects.  These can
change generation numbers, precisely because they change the DAG.

> something like
>
>    C9
>     \
> C2 - C10 - C11 - C12
>
> vs
>                  C9
>                    \
> C2 - C3 - C4 - C5 - C10
>
> where the C10-12 in the first set and C3-5 in the second set are
> completely unrelated to what's done in C9 and C12 in the first set
> and C10 in the second set are identical trees.

The generation numbers in the above are as follows:
First example:
        C2 = C9 = 0
        C10 = 1 = max(C2, C9) + 1
        C11 = 2 = C10 + 1
        C12 = 3 = C11 + 1

Second example:
        C2 = C9 = 0
        C3 = 1 = C2 + 1
        C4 = 2 = C3 + 1
        C5 = 3 = C4 + 1
        C10 = 4 = max(C5, C9) + 1

Now, the history pruning works fine if the "+1" is replaced by any other
positive increment, but it's not clear why you'd bother.
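
For anyone who wants to check the arithmetic, here is a short Python sketch of the longest-path definition applied to the two graphs. The function name and dict layout are illustrative assumptions, and the recursion is only suitable for small examples like these:

```python
from functools import lru_cache


def longest_path_generations(parents):
    """Generation = length of the longest path to a parentless commit."""
    @lru_cache(maxsize=None)
    def gen(commit):
        # A root has no parents, so max() falls back to the default of 0.
        return max((gen(p) + 1 for p in parents[commit]), default=0)
    return {c: gen(c) for c in parents}


# First example: C9 is merged early, at C10.
first = {"C2": [], "C9": [], "C10": ["C2", "C9"], "C11": ["C10"], "C12": ["C11"]}

# Second example: C9 is merged late, at C10 (a different commit from the
# C10 above, since its parents differ).
second = {"C2": [], "C9": [], "C3": ["C2"], "C4": ["C3"], "C5": ["C4"],
          "C10": ["C5", "C9"]}
```

Running `longest_path_generations` on `first` gives C12 a generation of 3, and on `second` gives C10 a generation of 4. The two tips are distinct commits with distinct ancestries, so there is nothing to reconcile.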

Re: Git commit generation numbers

George Spelvin
In reply to this post by Drew Northup
> I have not read yet one discussion about how generation numbers [baked
> into a commit] deal with rebasing, for instance. Do we assign one more
> than the revision prior to the base of the rebase operation or do we
> start with the revision one after the highest of those original commits
> included in the rebase? Depending on how that is done
> _drastically_different_ numbers can come out of different repository
> instances for the same _final_ DAG. This is one major reason why, as I
> see it, local storage is good for generation numbers and putting them in
> the commit is bad.

Er, no.  Whenever a new commit object is generated (as the result
of a rebase or not), its commit number is computed based on its
parent commits.  It is NEVER copied.

Just like the parent pointers themselves.  Remember, even though we talk
about "the same commit" after rebasing, it's really just an EQUIVALENT
commit according to some higher-level concept of similarity.  As far
as the core git engine is concerned, it's always a DIFFERENT commit,
with different parent hashes and a different hash itself.

This point hasn't been mentioned explicitly precisely because it's
so obvious; the history-walking code that the generation numbers are
for requires this property to function.

Re: Git commit generation numbers

Drew Northup

On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:

> > I have not read yet one discussion about how generation numbers [baked
> > into a commit] deal with rebasing, for instance. Do we assign one more
> > than the revision prior to the base of the rebase operation or do we
> > start with the revision one after the highest of those original commits
> > included in the rebase? Depending on how that is done
> > _drastically_different_ numbers can come out of different repository
> > instances for the same _final_ DAG. This is one major reason why, as I
> > see it, local storage is good for generation numbers and putting them in
> > the commit is bad.
>
> Er, no.  Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits.  It is NEVER copied.

I don't see the word "copy" in my original.

B-O1-O2-O3-O4-O5-O6
 \
  R1----R2-------R3

What's the correct generation number for R3? I would say gen(B)+3. My
reading of the posts made by some others was that they thought gen(O6)
was the correct answer. Still others seemed to indicate gen(O6)+1 was
the correct answer. I don't think everybody MEANT to be saying such
different things--that's just how they appeared on this end.

Now, did you mean something different by "commit number?"

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59


Re: Git commit generation numbers

Phil Hord (hordp)
On 07/21/2011 11:57 AM, Drew Northup wrote:

> On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
>>> I have not read yet one discussion about how generation numbers [baked
>>> into a commit] deal with rebasing, for instance. Do we assign one more
>>> than the revision prior to the base of the rebase operation or do we
>>> start with the revision one after the highest of those original commits
>>> included in the rebase? Depending on how that is done
>>> _drastically_different_ numbers can come out of different repository
>>> instances for the same _final_ DAG. This is one major reason why, as I
>>> see it, local storage is good for generation numbers and putting them in
>>> the commit is bad.
>> Er, no.  Whenever a new commit object is generated (as the result
>> of a rebase or not), its commit number is computed based on its
>> parent commits.  It is NEVER copied.
> I don't see the word "copy" in my original.
>
> B-O1-O2-O3-O4-O5-O6
>   \
>    R1----R2-------R3
>
> What's the correct generation number for R3? I would say gen(B)+3.
And you would be correct if you follow the SoP algorithm.

> My
> reading of the posts made by some others was that they thought gen(O6)
> was the correct answer. Still others seemed to indicate gen(O6)+1 was
> the correct answer.
Maybe the confusion comes from the different storage mechanisms being
discussed.  If the generation numbers are in a local cache and used by a
single client, the determinism of the specific numbers doesn't much
matter.  If they are part of the commit, it still doesn't need to be
completely deterministic. However, interoperability requires standards,
and standards favor determinism, so dogmatic determinism may triumph in
that case.

1. gen(O6) might make sense if you mean to implement --date-order using
gen-numbers, for example.  But I don't think it's practical in any case.

2. gen(O6)+1 might make sense if you mean to require that gen-numbers
are unique per repo.  But this is both unsupportable and unnecessary, so
it's a non-starter.

3. gen(B)+3 is what you'd get from the algorithm I saw proposed.

All three of these are provably correct by my definition of "correct":
"for each A in ancestors_of(B), gen(A) < gen(B)".

However, [1] and [2] have some extra features of dubious value.  Simpler
is better for interoperability, so I like [3] for this purpose.

Even [3] has an extra feature I think is unnecessary: determinism.  If
that "requirement" is dropped, I think all three of these algorithms are
(functionally) roughly equivalent.

> I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.
>
> Now, did you mean something different by "commit number?"

I remain unconvinced that there is value in gen-number distribution, so
to my mind, the specific algorithm and whether or not it is
deterministic are unimportant.

Phil ~ who wasn't really being asked, but felt like answering


Re: Git commit generation numbers

George Spelvin
In reply to this post by Drew Northup
Drew Northup wrote:

> On Thu, 2011-07-21 at 08:55 -0400, George Spelvin wrote:
>> I have not read yet one discussion about how generation numbers [baked
>> into a commit] deal with rebasing, for instance. Do we assign one more
>> than the revision prior to the base of the rebase operation or do we
>> start with the revision one after the highest of those original commits
>> included in the rebase? Depending on how that is done
>> _drastically_different_ numbers can come out of different repository
>> instances for the same _final_ DAG. This is one major reason why, as I
>> see it, local storage is good for generation numbers and putting them in
>> the commit is bad.
>
> Er, no.  Whenever a new commit object is generated (as the result
> of a rebase or not), its commit number is computed based on its
> parent commits.  It is NEVER copied.

> I don't see the word "copy" in my original.

Indeed, you didn't use it; it was my simplified mental model of your
suggestion that the rebased commits would have generation numbers that
somehow depended on the generation numbers before rebasing.

Although you suggested something different, the mistake is the same:
the rebased commits' generation numbers have simply no relationship to
those of the original pre-rebase commits.  The generation numbers depend
only on the commits explicitly listed as parents in the commit objects.

That's why I went on to explain that the equivalence of the commits
produced by a rebase operation is a higher-level concept; the core git
object database just knows that they aren't identical, and therefore
are different.

Thus, they would retain the same relative order as before the rebase
(unless you permuted them with rebase -i), but start with the generation
number of the rebase target.

> B-O1-O2-O3-O4-O5-O6
>  \
>   R1----R2-------R3

> What's the correct generation number for R3? I would say gen(B)+3. My
> reading of the posts made by some others was that they thought gen(O6)
> was the correct answer. Still others seemed to indicate gen(O6)+1 was
> the correct answer. I don't think everybody MEANT to be saying such
> different things--that's just how they appeared on this end.

According to the canonical algorithm, it's gen(B)+3 = gen(R2)+1.
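
As an illustration, here is a toy Python model of that rule. The `History` class and its methods are invented for this sketch and have nothing to do with Git's internals. Each new commit takes its number from its parents at creation time, and a rebase simply creates new commits:

```python
class History:
    """Toy commit store: a generation is fixed when the commit is created."""

    def __init__(self):
        self.parents = {}
        self.gen = {}

    def commit(self, name, *parent_names):
        self.parents[name] = list(parent_names)
        # Computed from the listed parents only; parentless commits get 0.
        self.gen[name] = max((self.gen[p] + 1 for p in parent_names), default=0)
        return name

    def rebase(self, commits, onto, suffix="'"):
        """Replay `commits` (oldest first) on top of `onto` as NEW commits."""
        tip = onto
        for old in commits:
            tip = self.commit(old + suffix, tip)
        return tip


h = History()
h.commit("B")
prev = "B"
for i in range(1, 7):                 # B-O1-O2-O3-O4-O5-O6
    prev = h.commit(f"O{i}", prev)
h.commit("R1", "B")                   # B-R1-R2-R3
h.commit("R2", "R1")
h.commit("R3", "R2")

new_tip = h.rebase(["R1", "R2", "R3"], onto="O6")
```

In this model `h.gen["R3"]` is `h.gen["B"] + 3`, unaffected by how long the O branch is. After the rebase, the new tip `R3'` has generation `h.gen["O6"] + 3`, while the original `R3` keeps its number because it is a different object.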

However, any strictly increasing series is equally permissible for
optimizing history walking, so you could add jumps to (for example)
make the numbers unique if that simplified anything.

I don't think it does simplify anything, so the issue hasn't been
discussed much.

For the purpose of the optimization enabled by the generation
numbers, however, it doesn't actually matter.

What matters is that if I am listing commits down multiple branches,
once I have walked back on each branch to commits of generation N or
less, I know that I have found all possible descendants of all commits
of generation N or more.

This lets me display the recent part of the commit DAG (back to generation
N) without exploring the entire commit tree, or worrying that I'll have to
"back up" to insert a commit in its proper order.  Without precomputed
generation numbers, the only way to be sure of this is to explore back
to generation 0 (parentless commits) or to use date-based heuristics.
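
A rough sketch of that walk, assuming the generation numbers are already
available in a lookup table.  The function name recent_commits, the hand-typed
GEN table and the choice of a heap are all illustrative, not how git's
revision walker is actually written:

```python
import heapq

# Hypothetical data: the B / O1..O6 / R1..R3 example, with canonical numbers.
PARENTS = {
    "B": [],
    "O1": ["B"], "O2": ["O1"], "O3": ["O2"],
    "O4": ["O3"], "O5": ["O4"], "O6": ["O5"],
    "R1": ["B"], "R2": ["R1"], "R3": ["R2"],
}
GEN = {"B": 0, "O1": 1, "O2": 2, "O3": 3, "O4": 4, "O5": 5, "O6": 6,
       "R1": 1, "R2": 2, "R3": 3}


def recent_commits(tips, parents, gen, cutoff):
    """List commits reachable from tips with gen >= cutoff, newest first.

    Commits are popped in order of decreasing generation number, so a commit
    is always emitted after all of its descendants.  Once the best remaining
    candidate is below the cutoff, nothing at or above it can still turn up.
    """
    seen = set(tips)
    heap = [(-gen[t], t) for t in seen]
    heapq.heapify(heap)
    out = []
    while heap:
        neg_gen, commit = heapq.heappop(heap)
        if -neg_gen < cutoff:
            break  # everything still queued, and all its ancestors, is older
        out.append(commit)
        for p in parents[commit]:
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, (-gen[p], p))
    return out


print(recent_commits(["O6", "R3"], PARENTS, GEN, cutoff=3))
```

With a cutoff of 3 the walk stops after looking at O2 and R2; it never has to
visit O1, R1 or B.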

> Now, did you mean something different by "commit number?"

No, just a brain fart I didn't catch before posting.
I meant "generation number".

Re: Git commit generation numbers

Jakub Narębski
In reply to this post by George Spelvin
George Spelvin, could you please try not to mangle the CC list so that it
includes only emails, stripping names (e.g. "[hidden email]" instead of
"Shawn Pearce <[hidden email]>")?

"George Spelvin" <[hidden email]> writes:
> On <[hidden email]> wrote:
>> On Wed, 20 Jul 2011, Shawn Pearce wrote:

>>> If the algorithm is always "gen(A) = max(gen(P) for each parent_of(A))
>>> + 1" then it doesn't matter who merged what commits, the same commit
>>> appears at the same part of the graph relative to all of its
>>> ancestors, and therefore always has the same generation number. This
>>> is true whether or not the commit contains the generation number.
>
>> I have to think about this more, but I'm wondering about cases where the
>> same result is achieved via different methods, something along the lines
>> of one person developing something with _many_ commits (creating a large
>> generation number) that one person merges far sooner than another, causing
>> the commits that they do after the merge to have much larger generation
>> numbers than someone making the same changes, but doing the merge later
>
> Can't happen.  Using the basic algorithm as Shawn described, the
> generation number is defined uniquely by the ancestor DAG.
>
> The generation number is the length of the longest path to a
> root (zero-ancestor) commit through the DAG.
>
> If you look at past discussion, several people have thought it was
> okay to bake into the commit precisely because it can be computed
> once and will never change.
>
> However, git does have some ability to amend the history DAG after
> it's been written, using grafts and replace objects.  These can
> change generation numbers, precisely because they change the DAG.

There is also another issue that I have mentioned, namely incomplete
clones - which currently means shallow clone, without access to full
history.


NB: grafts are such a horrible hack that I would not be against turning
off generation numbers if they are used.

In the case of replace objects you need both non-replaced and replaced
DAG generation numbers.

--
Jakub Narębski


Re: Git commit generation numbers

George Spelvin
> There is also another issue that I have mentioned, namely incomplete
> clones - which currently means shallow clone, without access to full
> history.

As far as history walking is concerned, you can just consider "missing
parent" the same as "no parent" and start the generation numbers at 0.
As long as you recompute

> NB: grafts are such a horrible hack that I would not be against turning
> off generation numbers if they are used.

Yeah, but it's not too miserable to add support (the logic is very similar
to replace objects), and then you would be able to have the history walking
code depend on the presence of generation numbers.  (The "load the cache"
function would regenerate it if necessary.)

Only do this if you already have support for "no generation numbers" in
the history walking code for (say) loose objects.

> In the case of replace objects you need both non-replaced and replaced
> DAG generation numbers.

Yes, the cache validity/invalidation criteria are the tricky bit.
Honestly, this is where the code gets ugly, not computing and storing
the generation numbers.


One thought on an expanded generation number cache:

There are many git operations that use ONLY the commit DAG, and do not
actually use any information from the commits other than their hashes
and parent pointers.  The ones that come to mind are rev-parse, rev-list,
describe, name-rev, and merge-base.

These could be sped up if, instead of just generation numbers, we kept
a complete cached copy of the commit DAG, so the commit objects didn't
have to be uncompressed and parsed.

This could be provided by an extended form of generation number cache.
In addition to listing the generation number of each commit, it
would list all the ancestors (by file offset rather than hash, for
compactness).  Then simple commit walking could load this cache and
avoid unpacking commit objects from packs.

A compact implementation would abuse the flexibility of generation numbers
to make them serve double duty.  They would be used as offsets into a
table of parent pointers.  By keeping the table topologically sorted,
the offsets would satisfy the requirements for generation numbers, but
would be unique, and there would be additional gaps when a commit had
multiple parents.

The parent pointers would themselves be 31-bit offsets into the table of
SHA-1 hashes, with the msbit meaning "this commit has multiple parents,
also look at the following table entry".  (If we use offset 0 to mean
"no parents", it might be more convenient to have the offset point to
the *end* of the run of parents rather than the beginning, so "following"
would be earlier in the file, but that's an implementation detail.)

I'm assuming that 2^31 commits having (in aggregate) 2^32 parents would
be enough for the time being.  As a local cache, it can be extended
with a software upgrade.  There's no need to ever have support for two
formats in any given release; just notice that the cache format is wrong,
blow it away, and regenerate it.
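
To make the layout concrete, here is a small in-memory sketch of the two
tables.  It is only an illustration under some assumptions: commit ids stand
in for SHA-1 hashes, slot 0 of the parent table is reserved so that offset 0
can mean "no parents", a run of parents is stored first-to-last, and the
names build_cache and parents_of are made up for this example.

```python
from graphlib import TopologicalSorter  # Python 3.9+

MORE = 0x80000000        # msbit: another parent follows in the next slot
INDEX_MASK = 0x7FFFFFFF  # low 31 bits: index into the id table


def build_cache(parents):
    """Build (ids, gen, table) from {commit_id: [parent_id, ...]}.

    ids   - sorted commit ids, standing in for the table of SHA-1 hashes
    gen   - gen[i] is the offset of ids[i]'s first parent slot (0 = no parents)
    table - 32-bit words: index of a parent in ids, plus the MORE flag
    """
    ids = sorted(parents)
    index = {c: i for i, c in enumerate(ids)}
    gen = [0] * len(ids)
    table = [0]  # reserved slot, so that offset 0 can mean "no parents"
    # static_order() yields every commit after all of its parents.
    for commit in TopologicalSorter(parents).static_order():
        ps = parents[commit]
        if not ps:
            continue
        gen[index[commit]] = len(table)
        for k, p in enumerate(ps):
            word = index[p]
            if k < len(ps) - 1:
                word |= MORE
            table.append(word)
    return ids, gen, table


def parents_of(i, ids, gen, table):
    """Decode the parents of commit ids[i] without touching commit objects."""
    offset = gen[i]
    out = []
    if offset == 0:
        return out
    while True:
        word = table[offset]
        out.append(ids[word & INDEX_MASK])
        if not word & MORE:
            return out
        offset += 1


ids, gen, table = build_cache({"A": [], "B": ["A"], "C": ["A"], "M": ["B", "C"]})
print(parents_of(ids.index("M"), ids, gen, table))  # ['B', 'C']
```

Because the table is filled in topological order, each commit's offset is
larger than the offsets of all its parents, which is all the history walking
code needs from a generation number.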

Re: Git commit generation numbers

Shawn Pearce
On Thu, Jul 21, 2011 at 13:27, George Spelvin <[hidden email]> wrote:
>
> be enough for the time being.  As a local cache, it can be extended
> with a software upgrade.  There's no need to ever have support for two
> formats in any given release; just notice that the cache format is wrong,
> blow it away, and regenerate it.

Don't assume that. Consider a repository stored on NFS that is
read-only to you. The NFS server has one version of Git installed, and
is using cache format A. You have a newer version of Git installed on
your workstation, using cache format B. Now you cannot use this
repository as a local filesystem... it's only available to you over the
Git protocols. This breaks a number of people's environments.  :-)

It's better if we can avoid having to change file formats very often,
even if they are a local "cache".

--
Shawn.

Re: Git commit generation numbers

Pēteris Kļaviņš
In reply to this post by Phil Hord (hordp)
On 21/07/2011 5:24 PM, Phil Hord wrote:

> Maybe the confusion comes from the different storage mechanisms being
> discussed. If the generation numbers are in a local cache and used by a
> single client, the determinism of the specific numbers doesn't much
> matter. If they are part of the commit, it still doesn't need to be
> completely deterministic. However, interoperability requires standards,
> and standards favor determinism, so dogmatic determinism may triumph in
> that case.
>
> 1. gen(O6) might make sense if you mean to implement --date-order using
> gen-numbers, for example. But I don't think it's practical in any case.
>
> 2. gen(O6)+1 might make sense if you mean to require that gen-numbers
> are unique per repo. But this is both unsupportable and unnecessary, so
> it's a non-starter.
>
> 3. gen(B)+1 is what you'd get from the algorithm I saw proposed.
>
> All three of these are provably correct by my definition of "correct":
> "for each A in ancestors_of(B), gen(A) < gen(B)".
>
> However, [1] and [2] have some extra features of dubious value. Simpler
> is better for interoperability, so I like [3] for this purpose.
>
> Even [3] has an extra feature I think is unnecessary: determinism. If
> that "requirement" is dropped, I think all three of these algorithms are
> (functionally) roughly equivalent.
>
>> I don't think everybody MEANT to be saying such
>> different things--that's just how they appeared on this end.
>>
>> Now, did you mean something different by "commit number?"
>
> I remain unconvinced that there is value in gen-number distribution, so
> to my mind, the specific algorithm and whether or not it is
> deterministic are unimportant.
>
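
The quoted definition of "correct" can be checked mechanically.  A small
sketch, assuming the B / O1..O6 / R1..R3 example from earlier in the thread
with B numbered 0; the name is_valid_numbering and the BASE table are
illustrative only:

```python
PARENTS = {
    "B": [],
    "O1": ["B"], "O2": ["O1"], "O3": ["O2"],
    "O4": ["O3"], "O5": ["O4"], "O6": ["O5"],
    "R1": ["B"], "R2": ["R1"], "R3": ["R2"],
}
# Numbers for everything except R3, which is varied below.
BASE = {"B": 0, "O1": 1, "O2": 2, "O3": 3, "O4": 4, "O5": 5, "O6": 6,
        "R1": 1, "R2": 2}


def is_valid_numbering(parents, gen):
    """True if every commit's number exceeds each of its parents' numbers.

    Checking direct parents is enough: the inequality then chains along any
    path, so it holds for every ancestor as well.
    """
    return all(gen[p] < gen[c] for c, ps in parents.items() for p in ps)


# Canonical value, gen(O6), and gen(O6)+1 for R3, then one that is too small.
for r3 in (3, 6, 7, 2):
    print(r3, is_valid_numbering(PARENTS, {**BASE, "R3": r3}))
```

Any value for R3 greater than gen(R2) passes the check; only a value at or
below gen(R2) fails it.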

The beauty of Git is that no two copies of a Git repository as a whole
are the same:  some people make shallow copies;  others prune away all
branches except for the one they are interested in;  yet others graft
together multiple original repositories.  The upshot is that two copies
of the same repository may end up having different commits as their root
commits, and so the generation numbers computed for their repositories
would be different.  Indeed, the shallow repository copy could later be
filled out with additional underlying commits, and so on.

Given this context, I can't see the value in fixing generation numbers
within commits.  In my mind generation numbers are extremely useful
transient helper objects in every Git repository but they have no
meaning outside that repository, sort of like GIT_WORK_TREE.

Peter


Re: Git commit generation numbers

Christian Couder-2
On Fri, Jul 22, 2011 at 12:40 AM, Pēteris Kļaviņš
<[hidden email]> wrote:
>
> The beauty of Git is that no two copies of a Git repository as a whole are
> the same:  some people make shallow copies;  others prune away all branches
> except for the one they are interested in;  yet others graft together
> multiple original repositories.  The upshot is that two copies of the same
> repository may end up having different commits as their root commits, and so
> the generation numbers computed for their repositories would be different.
>  Indeed, the shallow repository copy could later be filled out with
> additional underlying commits, and so on.

Not only do people want different repos, but within their own repo they
want different "views" (or "virtual graphs") of it.

> Given this context, I can't see the value in fixing generation numbers
> within commits.  In my mind generation numbers are extremely useful
> transient helper objects in every Git repository but they have no meaning
> outside that repository, sort of like GIT_WORK_TREE.

It's not even per repository that they have a meaning, it's per "view"
of the commit graph.

Thanks,
Christian.

Re: Git commit generation numbers

Jakub Narębski
In reply to this post by George Spelvin
On Thu, 21 Jul 2011, George Spelvin wrote:

> > There is also another issue that I have mentioned, namely incomplete
> > clones - which currently means shallow clone, without access to full
> > history.
>
> As far as history walking is concerned, you can just consider "missing
> parent" the same as "no parent" and start the generation numbers at 0.
> As long as you recompute.

Well, the shallow clone case can be taken as an argument both for putting
'true' generation numbers in the commit header, and against it.

For, because with generation numbers in commits you can use true
generation numbers.

Against, because if there are commits without generation numbers in the
header, you cannot assign a true generation number, and you can only use
a "shallow" generation number in the generation number cache.
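
To illustrate the difference between "true" and "shallow" numbers, here is a
small sketch that treats a missing parent as no parent.  The five-commit
chain, the function name local_generations and the depth-2 clone are all
made up for the example:

```python
from graphlib import TopologicalSorter  # Python 3.9+


def local_generations(parents):
    """Generation numbers over the commits that are present locally.

    A parent that is not itself a key of `parents` is treated as absent,
    i.e. "missing parent" is handled the same as "no parent".
    """
    known = {c: [p for p in ps if p in parents] for c, ps in parents.items()}
    gen = {}
    # static_order() yields every commit after all of its known parents.
    for commit in TopologicalSorter(known).static_order():
        gen[commit] = max((gen[p] for p in known[commit]), default=-1) + 1
    return gen


full = {"C0": [], "C1": ["C0"], "C2": ["C1"], "C3": ["C2"], "C4": ["C3"]}
shallow = {c: full[c] for c in ("C3", "C4")}  # depth-2 clone: C2 is missing

print(local_generations(full)["C4"])     # 4
print(local_generations(shallow)["C4"])  # 1
```

The shallow numbers still order C3 before C4, so they remain usable for
history walking inside that clone, but they no longer match the numbers
computed from the full history.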

> > NB: grafts are such a horrible hack that I would not be against turning
> > off generation numbers if they are used.
>
> Yeah, but it's not too miserable to add support (the logic is very similar
> to replace objects), and then you would be able to have the history walking
> code depend on the presence of generation numbers.  (The "load the cache"
> function would regenerate it if necessary.)
>
> Only do this if you already have support for "no generation numbers" in
> the history walking code for (say) loose objects.

Grafts are non-transferable, and if you use them to cull rather than add
history they are unsafe against garbage collection... I think.
 
> > In the case of replace objects you need both non-replaced and replaced
> > DAG generation numbers.
>
> Yes, the cache validity/invalidation criteria are the tricky bit.
> Honestly, this is where the code gets ugly, not computing and storing
> the generation numbers.

BTW, storing the generation number in the commit header raises a problem:
what would an old version of git, one which does not understand said header,
do during a rebase?  Would it strip unknown headers, or would it copy the
generation number verbatim - which means that it can be incorrect?

BTW2, a code size comparison of the in-commit and external cache cases must
take into account the yet-to-be-written fsck for in-commit generation numbers.

--
Jakub Narebski
Poland

Re: Git commit generation numbers

Nicolas Pitre-2
On Fri, 22 Jul 2011, Jakub Narebski wrote:

> BTW, storing the generation number in the commit header raises a problem:
> what would an old version of git, one which does not understand said header,
> do during a rebase?  Would it strip unknown headers, or would it copy the
> generation number verbatim - which means that it can be incorrect?

They would indeed be copied verbatim and become incorrect.


Nicolas