Git commit generation numbers

classic Classic list List threaded Threaded
89 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Git commit generation numbers

Linus Torvalds-3
Ok, so I see that the old discussion about generation numbers has resurfaced.

And I have to say, with six years of git use, I think it's not a
coincidence that the notion of generation numbers has come up several
times over the years: I think the lack of them is literally the only
real design mistake we have.

And I absolutely *detest* the generation number cache thing I see on the list.

Maybe I missed the discussion that actually added them to the commits
(I don't read the git mailing list regularly any more) but I think
it's a mistake to add an external cache to work around the fact that I
didn't add the generation numbers originally.

So I think we should just add the generation numbers now. We can make
the rule be that if a commit doesn't have a generation number, we end
up having to compute it (with no real need for caching). Yes, it's
expensive. But it's going to be a *lot* less expensive over time as
people start using a git version that adds the generation numbers to
commits.

And we can easily mix this - there's no "flag-day" issues. Old
versions of git will ignore the generation number and generate new
commits that doesn't have it. New versions of git will generate them,
and use them. And once the project starts having generation numbers in
some commits, the "generating them" part will get cheaper over time.

I'll send out a patch that admittedly does not have much testing as a
reply to this one. It ends up being really simple. Of course, maybe
it's simple because I did something incredibly stupid, but please take
a look.

                                 Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
On Thu, Jul 14, 2011 at 11:24:27AM -0700, Linus Torvalds wrote:

> And I have to say, with six years of git use, I think it's not a
> coincidence that the notion of generation numbers has come up several
> times over the years: I think the lack of them is literally the only
> real design mistake we have.

Agreed.

> And I absolutely *detest* the generation number cache thing I see on
> the list.

I'd love to have in-commit generation numbers. I'm just not sure we can
get the speeds we want without caching them for existing commits.

> Maybe I missed the discussion that actually added them to the commits
> (I don't read the git mailing list regularly any more) but I think
> it's a mistake to add an external cache to work around the fact that I
> didn't add the generation numbers originally.
>
> So I think we should just add the generation numbers now. We can make
> the rule be that if a commit doesn't have a generation number, we end
> up having to compute it (with no real need for caching). Yes, it's
> expensive. But it's going to be a *lot* less expensive over time as
> people start using a git version that adds the generation numbers to
> commits.

I'm not sure that is the best plan. Calculating generation numbers
involves going to all roots. So once you have to find any generation
number, it's going to be expensive, no matter how many recent commits
have generation numbers already in them (but it won't get _more_
expensive as more commits are added; you'll always be traversing from
the commit in question down to the roots).

As we add new commits with generation numbers, we won't need to do a
calculation to get their numbers. But if you are doing something like
"tag --contains", you are going to want to know the generation number of
old tags (otherwise, you can't know whether your cutoff might hit them
or not). IOW, even if we add generation numbers _today_, every "tag
--contains" in linux-2.6 is going to end up traversing from v3.0-rc7
down to the roots to get its generation number (v3.0-rc8 would get an
embedded generation, of course).

So if you aren't going to cache generation numbers, then you might as
well write your traversal algorithm to assume you don't know them for
old commits. Because calculating them needs to touch every ancestor, and
that's probably equivalent to the worst-case for your algorithm.


There's also one other issue with generation numbers. How do you handle
grafts and object-replacement refs?  If you graft history, your embedded
generation numbers will all be junk, and you can't trust them.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <[hidden email]> wrote:
>
> I'd love to have in-commit generation numbers. I'm just not sure we can
> get the speeds we want without caching them for existing commits.

So my argument would be that we'd simply be much better off fixing the
fundamental data structure (which we can), and let it become the
long-term solution.

Now, if *may* turn out that we'd want to have some cache for
generation numbers in commits that don't have them, but I absolutely
think that that should be a "add-on" rather than anything fundamental.
For example, if we just merge the "add generation numbers to the
commit object" logic first, then the "cache" case never really needs
to care about us generating new commits. They simply won't need the
cache.

Also, I suspect that the cache could easily be done as a *small* and
*incomplete* cache, ie you don't need to cache all commits, it would
be sufficient to cache a few hundred spread-out commits, and just know
that "from any commit, the cached commit will be quickly reachable".

> I'm not sure that is the best plan. Calculating generation numbers
> involves going to all roots. So once you have to find any generation
> number, it's going to be expensive, no matter how many recent commits
> have generation numbers already in them (but it won't get _more_
> expensive as more commits are added; you'll always be traversing from
> the commit in question down to the roots).

It only ends up being expensive if the commit has parents that don't
have generation numbers.

That's a fairly short-term problem. For the kernel, for example,
basically no development happens on a base that is older than one or
two releases. So if I (and Greg, with the stable tree) start using my
patch, within a couple of weeks, pretty much all development would
have a generation number in its history.

Sure, sometimes I'd merge from people who based their tree on
something old, and I'd end up calculating it all. But it would get
progressively rarer.

> As we add new commits with generation numbers, we won't need to do a
> calculation to get their numbers. But if you are doing something like
> "tag --contains", you are going to want to know the generation number of
> old tags (otherwise, you can't know whether your cutoff might hit them
> or not). IOW, even if we add generation numbers _today_, every "tag
> --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> down to the roots to get its generation number (v3.0-rc8 would get an
> embedded generation, of course).

So that could easily be handled by caching. In fact, I suspect that
you could make the cache no associate with a commit ID, but be
associated with the tags and heads. But again, then the cache would be
a "secondary" issue, not something fundamental.

> So if you aren't going to cache generation numbers, then you might as
> well write your traversal algorithm to assume you don't know them for
> old commits.

But that's how our algorithms are *already* written.

So why not have that as the fallback? You get the advantage of
generation numbers only with modern things, but those are the ones you
actually tend to use.

Merge bases are *very* seldom historical, for example.

                     Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King
On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <[hidden email]> wrote:
>
> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs?  If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

So I don't think this is a real problem in practice.

Grafts are already unreliable. You cannot sanely merge over a graft,
and it has nothing to do with generation numbers.

I'm actually sorry that we ever did grafting. It's fundamentally
broken, and can actually destroy your repository (by hiding real
parents and then causing the commits to get garbage collected). So I
don't think grafting should be used as an argument for or against
anything - it's a hack that breaks some fundamental git database
constraints.

                             Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds
<[hidden email]> wrote:
>
> Also, I suspect that the cache could easily be done as a *small* and
> *incomplete* cache, ie you don't need to cache all commits, it would
> be sufficient to cache a few hundred spread-out commits, and just know
> that "from any commit, the cached commit will be quickly reachable".

Put another way: we could do the cache not as a real dynamic entity,
but as something that gets generated at "git clone" time or when
re-packing.

I'm actually much more nervous about a cache being inconsistent than I
would be about having generation numbers in the tree. The latter we
can (and should - but my patch didn't) add a fsck test for, and then
you would never get into some situation where there's some really
subtle issue with merge base calculation due to a corrupt cache.

                    Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jakub Narębski
In reply to this post by Linus Torvalds-3
Linus Torvalds <[hidden email]> writes:

> On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <[hidden email]> wrote:
> >
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs?  If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
>
> So I don't think this is a real problem in practice.
>
> Grafts are already unreliable. You cannot sanely merge over a graft,
> and it has nothing to do with generation numbers.
>
> I'm actually sorry that we ever did grafting. It's fundamentally
> broken, and can actually destroy your repository (by hiding real
> parents and then causing the commits to get garbage collected). So I
> don't think grafting should be used as an argument for or against
> anything - it's a hack that breaks some fundamental git database
> constraints.

What about object-replacement refs (i.e. "git replace" and refs/replace/)?

This is modern replacement for grafts mechanism, which is safe against
garbage collecting, and contrary to grafts it is transferable (as a ref).

With replacement objects (e.g. to repair some fragment of history to
make it bisectable - I think that was original idea behind introducing
git-replace, or instead of grafts to join with historical repository -
IIRC the reason why grafts mechanism was created) you can also have
invalid generation numbers if they are stored in commit headers.  With
generation cache we can simply invaliate it if grafts or replacements
change...

P.S. grafts are quite useful when doing history surgery.  Create
grafts, check history, use git-filter-branch to make new DAG
permanent, remove grafts.

P.P.S. What about "grafts lite", i.e. shallow clone?  With generation
cache we can invalidate it when depth changes...

--
Jakub Narębski
Poland


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 11:47:45AM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 11:37 AM, Jeff King <[hidden email]> wrote:
> >
> > I'd love to have in-commit generation numbers. I'm just not sure we can
> > get the speeds we want without caching them for existing commits.
>
> So my argument would be that we'd simply be much better off fixing the
> fundamental data structure (which we can), and let it become the
> long-term solution.
>
> Now, if *may* turn out that we'd want to have some cache for
> generation numbers in commits that don't have them, but I absolutely
> think that that should be a "add-on" rather than anything fundamental.
> For example, if we just merge the "add generation numbers to the
> commit object" logic first, then the "cache" case never really needs
> to care about us generating new commits. They simply won't need the
> cache.

Sure, I'd be fine with that (modulo the graft issue, which you don't
seem to care about). I half-toyed with making an extra "add generation
numbers to commit header" on top of my series, but I wanted to first
prove that generation numbers actually could yield speedups.

> Also, I suspect that the cache could easily be done as a *small* and
> *incomplete* cache, ie you don't need to cache all commits, it would
> be sufficient to cache a few hundred spread-out commits, and just know
> that "from any commit, the cached commit will be quickly reachable".

Yeah, that would work. Is it worth the trouble? Your cache size is still
O(n). And you still have the complexity of _having_ a cache.  Yes, the
size is 1/100th of what it was (dropping from 6M to 600K on linux-2.6).
But you're also going to spend more time calculating. I think you'd have
to measure to see how it performs in practice.

> It only ends up being expensive if the commit has parents that don't
> have generation numbers.
>
> That's a fairly short-term problem. For the kernel, for example,
> basically no development happens on a base that is older than one or
> two releases. So if I (and Greg, with the stable tree) start using my
> patch, within a couple of weeks, pretty much all development would
> have a generation number in its history.

Sure, that makes generation during commit-time cheaper, and eventually
the cost just goes away. I'm more concerned that it won't actually speed
up algorithms where you look at old commits, which was the whole point
in the first place.

> > As we add new commits with generation numbers, we won't need to do a
> > calculation to get their numbers. But if you are doing something like
> > "tag --contains", you are going to want to know the generation number of
> > old tags (otherwise, you can't know whether your cutoff might hit them
> > or not). IOW, even if we add generation numbers _today_, every "tag
> > --contains" in linux-2.6 is going to end up traversing from v3.0-rc7
> > down to the roots to get its generation number (v3.0-rc8 would get an
> > embedded generation, of course).
>
> So that could easily be handled by caching. In fact, I suspect that
> you could make the cache no associate with a commit ID, but be
> associated with the tags and heads. But again, then the cache would be
> a "secondary" issue, not something fundamental.

Yeah, you could do that. And it would handle "tag --contains" and
"branch --contains" (the latter doesn't even really need a cache; as the
branch tips move, they will get new commits with generation numbers). I
suspect we could get faster topo-sorting and possibly faster merge-base
calculation out of generation numbers, too.  But that won't happen if we
only have generation numbers for a handful of specific commits.

> > So if you aren't going to cache generation numbers, then you might as
> > well write your traversal algorithm to assume you don't know them for
> > old commits.
>
> But that's how our algorithms are *already* written.

Sort of. We tend to rely on commit timestamps as a proxy for generation
numbers. But in the face of clock skew, git will give wrong answers
(e.g., Ted posted some examples of name-rev giving wrong answers near
some skew in linux-2.6).

If we aren't going to go whole-hog on generation numbers, I'm much more
tempted to simply keep using commit timestamps. It's easy to build a
cache of commits with bogus timestamps (which I've already posted a
patch for) if you want to better accuracy at the cost of more
complexity. And as time progresses, you tend to ask about commits near
the skewed ones less often (and hopefully lessons learned from seeing
how the skew occurred will help us prevent them from reocurring in new
commits).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote:

> I'm actually much more nervous about a cache being inconsistent than I
> would be about having generation numbers in the tree. The latter we
> can (and should - but my patch didn't) add a fsck test for, and then
> you would never get into some situation where there's some really
> subtle issue with merge base calculation due to a corrupt cache.

Interesting. I'm nervous about that, too, which is why I _favor_ the
cache. Because we calculate the cache ourselves, we know its accurate
according to the parent pointers. If we find a bug, we fix it and bump
the cache version, which forces it to regenerate.

Contrast that with a bogus generation number that makes its way into an
actual commit object. That's there for eternity, just like the commit
timestamp skew we already have. I find it much less likely to happen
than skew in the commit timestamp, if only because generations are a
dirt-simple concept. But it is a case where there is duplicated
information in the actual DAG, and if that information doesn't match up
we are screwed.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King
On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <[hidden email]> wrote:
>
> If we aren't going to go whole-hog on generation numbers, I'm much more
> tempted to simply keep using commit timestamps.

Sure. I think it's entirely reasonable to say that the issue basically
boils down to one git question: "can commit X be an ancestor of commit
Y" (as a way to basically limit certain algorithms from having to walk
all the way down). We've used commit dates for it, and realistically
it really has worked very well. But it was always a broken heuristic.

So yes, I personally see generation counters as a way to do the commit
date comparisons right. And it would be perfectly fine to just say "if
there are no generation numbers, we'll use the datestamps instead, and
know that they could be incorrect".

That "use the datestamps" fallback thing may well involve all the
heuristics we already do (ie check for the stamps looking sane, and
not trusting just one individual one).

                           Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Ted Ts'o
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 11:55:39AM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 11:47 AM, Linus Torvalds
> <[hidden email]> wrote:
> >
> > Also, I suspect that the cache could easily be done as a *small* and
> > *incomplete* cache, ie you don't need to cache all commits, it would
> > be sufficient to cache a few hundred spread-out commits, and just know
> > that "from any commit, the cached commit will be quickly reachable".
>
> Put another way: we could do the cache not as a real dynamic entity,
> but as something that gets generated at "git clone" time or when
> re-packing.

Would it be considered evil if we put the generation number in the
pack, but not consider it part of the formal object (i.e., it would be
just a cache, but one that wouldn't change once the pack was created)?

                            - Ted
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <[hidden email]> wrote:
>
> Would it be considered evil if we put the generation number in the
> pack, but not consider it part of the formal object (i.e., it would be
> just a cache, but one that wouldn't change once the pack was created)?

That would actually be a major change to data structures, and would
require some serious surgery and be hard to support in a
backwards-compatible way (think different git versions accessing the
same repository).

Much bigger patch than the one I did.

So it sounds like it would work - and it would probably be a simple
matter of just incrementing the pack version number if you just say
"cannot access the pack with old versions" - but I think it's a really
fragile approach.

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 12:23:31PM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 12:08 PM, Jeff King <[hidden email]> wrote:
> >
> > If we aren't going to go whole-hog on generation numbers, I'm much more
> > tempted to simply keep using commit timestamps.
>
> Sure. I think it's entirely reasonable to say that the issue basically
> boils down to one git question: "can commit X be an ancestor of commit
> Y" (as a way to basically limit certain algorithms from having to walk
> all the way down). We've used commit dates for it, and realistically
> it really has worked very well. But it was always a broken heuristic.

Yeah, I agree with that.

> So yes, I personally see generation counters as a way to do the commit
> date comparisons right. And it would be perfectly fine to just say "if
> there are no generation numbers, we'll use the datestamps instead, and
> know that they could be incorrect".

In that case, is it really worth adding generation numbers to the cache?
Because they _can_ be wrong, too. I suspect they will be wrong less
often than commit timestamps, if only because they're dirt simple to
calculate. But all it takes is some crappy porcelain doing:

  git cat-file commit $foo |
  munge_the_parents |
  git hash-object -t commit --stdin -w

to give us a bogus object. Sure, we can catch it via fsck. But we could
also catch commit timestamp skew via fsck just as easily.

> That "use the datestamps" fallback thing may well involve all the
> heuristics we already do (ie check for the stamps looking sane, and
> not trusting just one individual one).

Those aren't foolproof, of course. I asked people a few months ago to
run my skew-detection program on various repos, and some repos have long
runs of skew (think somebody with a bad clock or a bogus program doing a
whole series). But they're fast and work OK in practice. We should apply
them more consistently (name-rev, for example, will tolerate a day of
skew, but will not look past a single commit).

And if people really want to be thorough, we can mark the skewed commits
in a cache during "git gc" for them (or they can just say "for this
traversal, I want to be thorough; turn off timestamp cutoffs").


Out of curiosity, what don't you like about the generation cache? The
idea of using external storage? Generating it on the fly? The particular
implementation is too slow or crappy?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote:

> On Thu, Jul 14, 2011 at 12:46 PM, Ted Ts'o <[hidden email]> wrote:
> >
> > Would it be considered evil if we put the generation number in the
> > pack, but not consider it part of the formal object (i.e., it would be
> > just a cache, but one that wouldn't change once the pack was created)?
>
> That would actually be a major change to data structures, and would
> require some serious surgery and be hard to support in a
> backwards-compatible way (think different git versions accessing the
> same repository).

If we put it in the index, but not the pack, then it wouldn't be any
more painful than pack index v2. I don't recall there being huge fallout
from that; we just gave a reasonable deprecation period before switching
it on as the default.

I'm not sure it is much less crappy than having the cache in a separate
file. It does take less space, since the pack index already contains all
of the sha1s. But if we don't like the on-the-fly writing of what was in
my series, it would not be hard to generate the same cache during
pack-index time. Not having it in a separate file makes it hard to
invalidate the cache when the graph changes (due to grafts or replace
refs). But maybe we don't care about that. Or maybe it's OK to tell the
user to manually rebuild the pack index if they tweak those features.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Ted Ts'o
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 12:51:39PM -0700, Linus Torvalds wrote:
>
> So it sounds like it would work - and it would probably be a simple
> matter of just incrementing the pack version number if you just say
> "cannot access the pack with old versions" - but I think it's a really
> fragile approach.

So if we ever change the pack format again, it's something to think
about adding, but probably not worth it on its own...

What if we simply have a cache file per pack, which again is generated
when the pack is first received or generated, but is otherwise not
dynamic?  It's an extra file which is icky, but it would keep things
simpler.

                                                        - Ted
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King


Jeff King <[hidden email]> wrote:
>
>Out of curiosity, what don't you like about the generation cache?

The thing I hate about it is very fundamental: I think it's a hack around a basic git design mistake. And it's a mistake we have known about for a long time.

Now, I don't think it's a *fatal* mistake, but I do find it very broken to basically say "we made a mistake in the original commit design, and instead of fixing it we create a separate workaround for it".

THAT I find distasteful. My reaction is that if we're going to add generation numbers, then were should just do it the way we should have done them originally, rather than as some separate hack.

See? That's why I wouldn't have any problem with adding a separate cache on top of it, if it's really required, but I would hope that it isn't really needed.

So a cache in itself is not necessarily wrong. But leaving the original design mistake in place IS.

And fixing it really ended up being a very tiny patch, no?

     Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Junio C Hamano
In reply to this post by Jeff King
Jeff King <[hidden email]> writes:

> There's also one other issue with generation numbers. How do you handle
> grafts and object-replacement refs?  If you graft history, your embedded
> generation numbers will all be junk, and you can't trust them.

By the way, I doubt your "invalidate and recompute generation cache when
replacement changes" would really work when we consider object transfer
(which is the whole point of deprecating graft with object replacement
mechanism). For the purpose of connectivity check during object transfer,
we deliberately _ignore_ the object replacements, so you would at least
want to have an ability to show the generation number according to the
"true" history recorded in commits (which can come from Linus's in-commit
generation number once everybody migrates) and the generation number that
takes grafts and replacements into account (for which we cannot depend on
in-commit record).
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 01:19:51PM -0700, Linus Torvalds wrote:

> >Out of curiosity, what don't you like about the generation cache?
>
> The thing I hate about it is very fundamental: I think it's a hack
> around a basic git design mistake. And it's a mistake we have known
> about for a long time.
>
> Now, I don't think it's a *fatal* mistake, but I do find it very
> broken to basically say "we made a mistake in the original commit
> design, and instead of fixing it we create a separate workaround for
> it".
>
> THAT I find distasteful. My reaction is that if we're going to add
> generation numbers, then were should just do it the way we should have
> done them originally, rather than as some separate hack.
>
> See? That's why I wouldn't have any problem with adding a separate
> cache on top of it, if it's really required, but I would hope that it
> isn't really needed.
>
> So a cache in itself is not necessarily wrong. But leaving the
> original design mistake in place IS.

Thanks, that makes some sense to me.

However, I'm not 100% convinced leaving generation numbers out was a
mistake. The git philosophy seems always to have been to keep the
minimal required information in the DAG. And I think that has served us
well, because we're not saddled with cruft that seemed like a good idea
early on, but isn't.

Generation numbers are _completely_ redundant with the actual structure
of history represented by the parent pointers. Having them in there is
not about giving git more information that it doesn't have, but about
being a cheap place to stuff a value that is a little expensive to
calculate.

And so that seems a bit hack-ish to me.

I liken it somewhat to the "don't store renames" debate. We don't want
to crystallize forever in the history whatever crappy rename-detection
algorithm is done at the time of commit. We put the minimum amount of
information in the DAG, and it's the runtime's responsibility to get the
answer.

I think the decision is a little more gray with generation numbers,
because it's not about "you got this information with a wrong and crappy
algorithm" like it might be with rename detection, but rather "we're
sticking this redundant number in the commit object, and we assume that
it will always be useful enough to future algorithms to merit being
here".

> And fixing it really ended up being a very tiny patch, no?

Well, yes. But it also doesn't yield a 100-fold speedup in "git tag
--contains" for existing repositories. So it's not quite a full
solution.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Junio C Hamano
On Thu, Jul 14, 2011 at 01:26:32PM -0700, Junio C Hamano wrote:

> Jeff King <[hidden email]> writes:
>
> > There's also one other issue with generation numbers. How do you handle
> > grafts and object-replacement refs?  If you graft history, your embedded
> > generation numbers will all be junk, and you can't trust them.
>
> By the way, I doubt your "invalidate and recompute generation cache when
> replacement changes" would really work when we consider object transfer
> (which is the whole point of deprecating graft with object replacement
> mechanism). For the purpose of connectivity check during object transfer,
> we deliberately _ignore_ the object replacements, so you would at least
> want to have an ability to show the generation number according to the
> "true" history recorded in commits (which can come from Linus's in-commit
> generation number once everybody migrates) and the generation number that
> takes grafts and replacements into account (for which we cannot depend on
> in-commit record).

It should actually work in that scenario, at least with replace refs,
but the performance is suboptimal. The copy of git doing the object
transfer will turn off read_replace_refs, our validity token will
not match, we will see that our cache is no longer valid, and
regenerate it. Another run with replace-refs turned on will do the same
thing in reverse. Even two programs running simultaneously will still be
correct, because the cache is replaced atomically.

However, there are two issues:

  1. I don't think grafts have a "respect grafts" flag in the same way;
     I haven't looked at how the packing code decides not to respect
     them, but the "stir graft info into the checksum" data should use
     the same check.

  2. If you do a lot of object transfer, you will ping-pong back and
     forth between cache versions, which is inefficient. It would
     probably be better to store the cache that is valid under condition
     $SHA1 as:

       .git/cache/generations/$SHA1

     In most cases, you would have a single file (i.e., you are not
     using replace refs at all). But if you did, then you keep two
     separate caches, one for the view from replace-refs, and one for
     the standard view.

If we ignore replace refs and grafts, as Linus suggested, and always
store the true generation number, then we could generate it at pack time
(and even put it in the pack index if we want to deal with a version
bump there).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Junio C Hamano
Jeff King <[hidden email]> writes:

> It should actually work in that scenario, at least with replace refs,...
> regenerate it. Another run ...

I know; that is what I called "doubt it would really work". Having to
regenerate twice does not count as working.

> However, there are two issues:
>
>   1. I don't think grafts have a "respect grafts" flag in the same way;
>      I haven't looked at how the packing code decides not to respect
>      them, but the "stir graft info into the checksum" data should use
>      the same check.

I do not think graft and object transfer meshes well at all, so I wouldn't
worry about it.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King
On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <[hidden email]> wrote:
>
> However, I'm not 100% convinced leaving generation numbers out was a
> mistake. The git philosophy seems always to have been to keep the
> minimal required information in the DAG.

Yes.

And until I saw the patches trying to add generation numbers, I didn't
really try to push adding generation numbers to commits (although it
actually came up as early as July 2005, so the "let's use generation
numbers in commits" thing is *really* old).

In other words, I do agree that we should strive for minimal required
information.

But dammit, if you start using generation numbers, then they *are*
required information. The fact that you then hide them in some
unarchitected random file doesn't change anything! It just makes it
ugly and random, for chrissake!

I really don't understand your logic that says that the cache is
somehow cleaner. It's a random hack! It's saying "we don't have it in
the main data structure, so let's add it to some other one instead,
and now we have a consistency and cache generation problem instead".

Just look at the size of the patches in question. Your caching patches
are bigger and more complicated. Sure, part of it is that your series
adds the code to _use_ the generation number, but look purely at the
code to maintain them.

Why do you think the odd separate cache is somehow better than just
doing it right? Seriously? If we require the generation numbers, then
they have *become* that minimal information that we should save!

 And I think that has served us
> well, because we're not saddled with cruft that seemed like a good idea
> early on, but isn't.

Again - we discussed adding generation numbers about 6 years ago. We
clearly *should* have done it. Instead, we went with the hacky "let's
use commit time", that everybody really knew was technically wrong,
and was a hack, but avoided the need.

Now, six years later, you clearly are saying that we need the
generation numbers, but then you go off and try to say that they
should be in some secondary non-architected random collection of data
structures that isn't covered by the security and maintenance
guarantees that the core git objects are.

Dammit, one of the things that makes git special is that the data
structures are NOT random odd ad-hoc files. There is a design to them.

> Generation numbers are _completely_ redundant with the actual structure
> of history represented by the parent pointers.

Not true. That's only true if you add ".. if you parse the whole
history" to that statement.

And we've *never* parsed the whole history, because it's just too
expensive and doesn't scale. So right now we depend on commit dates
with a few hacks.

So no, generation numbers are not at all redundant. They are
fundamental. It's why we had this discussion six years ago.

> And so that seems a bit hack-ish to me.

Um? If you feel that way, then why the hell are you pushing your EVEN
MORE HACKISH CACHE PATCHES?

That's what this really boils down to. I think that if we have a value
that we need, then it should be recorded. In the data structures. Not
in some random other location that isn't part of the real git data
structures.

We don't do caches in git, because we don't NEED to. Sure, gitk has
it's hacky cache, but that's not core functionality.

I think it's a sign of good design that we can do a "find .git" and
explain every single file, and show that it's all core functionality
(again, with the exception of "gitk.cache", and I suspect that's
because gitk is a script, not because of any really fundamental data
issues), and explain it.

I think the *cache* is a hell of a lot more hacky than just doing it right.

> I liken it somewhat to the "don't store renames" debate.

That's total and utter bullshit.

Storing renames is *wrong*. I've explained a million times why it's
wrong. Doing it is a disaster. I know. I've used systems that did it.
It's crap. It's fundamentally information that is actively misleading
and WRONG. It's not even that you can do rename detection at run-time,
it's that you *HAVE* to do rename detection at run-time, because doing
it at commit time is simply utterly and fundamentally *wrong*.

Just look at "git blame -C" to remind yourself why rename information is wrong.

But even more importantly, look at git merges. Look at how git has
gotten merging right since pretty much day #1, and has absolutely no
issues with files that got generated two different ways. Look at every
SCM that tries to do rename detection, and look at how THEY CANNOT DO
MERGES RIGHT.

It's that simple. Rename detection is not about avoiding "redundant
data". It's about doing the right thing.

                          Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12345