Git commit generation numbers

classic Classic list List threaded Threaded
89 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
On Fri, Jul 15, 2011 at 04:10:23PM -0700, Linus Torvalds wrote:

> On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds
> <[hidden email]> wrote:
> >
> > For example, for the "git tag --contains" thing, what's the
> > performance effect of just skipping tags that are much older than the
> > commit we ask for?
>
> Hmm.
>
> Maybe there is something seriously wrong with this trivial patch, but
> it gave the right results for the test-cases I threw at it, and passes
> the tests.
>
> Before:
>
>    [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct
>
>    real 0m7.548s
>    user 0m7.344s
>    sys 0m0.116s
>
> After:
>
>    [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off
>
>    real 0m0.161s
>    user 0m0.140s
>    sys 0m0.016s
>
> and 'correct' and 'date-cut-off' both give the same answer.

Without even looking carefully at your patches for any minor mistakes, I
can tell you that the speedup you're seeing is approximately right.
Because it's almost exactly the same optimization I made in my
timestamp-based patches (links to which I sent you earlier today).

However, you can make it even faster. The "tag --contains" code will ask
"is_descendant_of" repeatedly for the same set of "want" commits. So you
end up traversing some parts of the graph over and over. My patches
share the marks over a set of contains traversals, so you only ever
touch each commit once. And that's what my patches do.

With yours, on my box:

  $ time git tag --contains HEAD~1000 >/dev/null
  real    0m0.113s
  user    0m0.104s
  sys     0m0.008s

and mine:

  $ time git tag --contains HEAD~1000 >/dev/null
  real    0m0.035s
  user    0m0.020s
  sys     0m0.012s

I suspect you can make the difference even more prominent by having more
tags, or by having multiple "want" commits.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Fri, Jul 15, 2011 at 04:36:40PM -0700, Linus Torvalds wrote:

> On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds
> <[hidden email]> wrote:
> >
> > I have fewer branches than tags, but I get something similar for "git
> > branch --contains":
>
> The time-based heuristic does seem to be important. If I just remove
> it, I get increasingly long times for things that aren't contained in
> my branches.
>
> And in fact, I think that is why the code used the merge-base helper
> functions - not because it wanted merge bases, but because the merge
> base stuff will work from either end until it decides things aren't
> relevant any more. Because *without* the time-based heuristics, the
> trivial "is this a descendant" algorithm ends up working very badly
> for the case where the target doesn't exist in the branches. Examples
> of NOT having a date-based cut-off, but just doing the straightforward
> (non-merge-base) ancestry walk:
>
>   time ~/git/git branch --contains v2.6.12
>   real 0m0.113s
>
>   [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39
>   real 0m3.691s

Yes, exactly. That is why my first patch (which goes to a recursive
search), takes about the same amount of time as "git rev-list --all"
(and I suspect your 3.691s above is similar). And then the second one
drops that again to .03s.

I think you are simply recreating the strategy and timings I have posted
several times now.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Christian Couder-2
In reply to this post by Ted Ts'o
On Fri, Jul 15, 2011 at 8:42 PM, Ted Ts'o <[hidden email]> wrote:

> On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote:
>> So I would like to repeat: I think our commit-date based hack has been
>> pretty successful. We've lived with it for years and years. Even the
>> "let's try to fix it by adding slop" code is from three years ago
>> (commit 7d004199d1), which means that for three years we never really
>> saw any serious problems. I forget what problem we actually did see -
>> I have this dim memory of it being Ted that had problems with a merge
>> because git picked a crap merge base, but that may just be my
>> Alzheimer's speaking.
>
> My original main issue was simply that "git tag --contains" and "git
> branch --contains" was either (a) incorrect, or (b) slower than
> popping up gitk and pulling the information out of the GUI.  The
> reason for (b) is because of gitk.cache.
>
> Maybe the answer then is creating a command-line tool (it doesn't have to
> be in "core" of git) which just pulls the dammned information out of
> gitk.cache....
>
> (Yes, it's gross, but I'm not worrying about the long-term
> architecture of git or anything high-falutin' like that.  I'm just a
> poor dumb user who just wants git tag --contains and git branch
> --contains to be fast and accurate...)

If  "git tag --contains" and "git branch --contains" give incorrect
answers because the commiter date is wrong in some commits, then why
not use "git replace" to "change" the commiter date in the commits
that have a wrong date? Is it because you don't want to use "git
replace", or because there is no script to do it automatically, or is
there another reason?

Thanks,
Christian.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

George Spelvin
In reply to this post by Linus Torvalds-3
> The thing I hate about it is very fundamental: I think it's a hack around a basic git
> design mistake. And it's a mistake we have known about for a long time.
>
> Now, I don't think it's a *fatal* mistake, but I do find it very broken to basically
> say "we made a mistake in the original commit design, and instead of fixing it we
> create a separate workaround for it".
>
> THAT I find distasteful. My reaction is that if we're going to add generation
> numbers, then were should just do it the way we should have done them originally,
> rather than as some separate hack.

There are a few design mistakes in git.  The way the object type
and size are prefixed to the data for hasing purposes, which prevents
aligned fetching from memory-mapped data in the hashing code, isn't too
pretty either.

But git has generally preferred to avoid storing information that can
be recomputed.  File renames are the big example.  given this, why the
heck store generation numbers?

They *can* be computed on demand, so arguably they *should*.  Cacheing is
then an optimization, just like packs, pack indexes, the hashed object
storage directories, and all that.


I'm in the "make it a cache" camp, honestly.  


For example, here's a different possible generation number scheme.
By making the generation number a cache, it becomes a valid alternative
to experiment with.

Simply store a topologically sorted list of commits.  Each commit's
position can serve as a generation number, and is greater than the
positions of all ancestors.  But by using the offset within the list,
the number is stored implicitly.

Generation numbers don't have to be consecutive as long as they're
correctly ordered, so you could, e.g. choose to make them unique.

I don't think this is actually worth it; I'm just using it as a
not-completely-insane example of a different design that nonetheless
achieves the same goal.

Why freeze this in the object format?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Long, Martin
> Why freeze this in the object format?

Because if you put it in the object format, then it gets pushed and
pulled around, thereby putting generation numbers in every clone.

I'm starting to think put them in the object store, for exactly that
reason, and to start moving repositories in a direction where the look
more like they would if this had been done correctly from the start.

Then, because some operations are still going to create a lot of
traversals, a cache is always an option to improve the performance in
that area.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by George Spelvin
On Sun, Jul 17, 2011 at 11:27 AM, George Spelvin <[hidden email]> wrote:
>
> There are a few design mistakes in git.  The way the object type
> and size are prefixed to the data for hasing purposes, which prevents
> aligned fetching from memory-mapped data in the hashing code, isn't too
> pretty either.

Why would you ever care? That makes no sense.

> But git has generally preferred to avoid storing information that can
> be recomputed.  File renames are the big example.  given this, why the
> heck store generation numbers?

Guys, please don't bring up file renames. I explained once already why
bringing up file renames just makes you look like a f^&% moron.

Let me explain one more time:

 - Storing file renames is STUPID. It's stupid for very fundamental
reasons that have absolutely *NOTHING* to do with "it can be computed
later".

It's fundamentally stupid because it will FOREVER SCREW UP YOUR DATA,
and because it will make merging an unmitigated disaster and make your
repository depend on how you *created* your data, rather than on what
the data is. It will totally break the situation of one person doing a
rename, while another person does something else to the metadata (eg a
create of the same filename).

Trying to track file identities will leave to very fundamentally
unsolvable issues like "which file identity do we choose when two
different files get the same name", or "which file identity will we
choose when one file splits in two".

Git doesn't track renames, because unlike pretty much every other SCM
out there, git really does have a good design, and because I damn well
understood the real problems.

So bringing it up as an example of "we don't store it because we can
compute it" is really totally idiotic. It's a sign of not
understanding the problems with renames. Stop doing it. That argument
is totally irrelevant. Really.

It's like saying "We shouldn't do generation numbers because fish
don't use bicycles". The only thing that kind of argument does is to
make me convinced that you don't understand the problem enough to be
worth even arguing with. It is not only a worthless argument, but it
makes your every other argument suspect.

Comprende? Stop it.

> They *can* be computed on demand, so arguably they *should*.

Umm, no.

That's actually a really bad argument.

There are valid things that we "should" do, but they have nothing to
do with "if something can be done, it should be done". That's just a
crazy argument.

A thing we really *should* do is perform well. And be really reliable.
And support a distributed workflow.

Those are real arguments that aren't about "just because it's there".

Now, some of those arguments can then be used to say "don't bother
storing redundant data". For example, redundant data takes disk space
and network bandwidth, and if something can be recomputed cheaply (ie
if it doesn't have a negative impact on performance), then redundant
data is just bad.

And what appears like a much better argument (right now) is that some
data isn't needed AT ALL, because you can make do with other data
entirely (ie dates).

But "just because we could recompute it" is a bad bad reason.

The thing is, the very basic design of git is all about *incomplete*
DAG traversal. The DAG traversal part is pretty obvious and simple,
but the *partial* thing really is very very important. We absolutely
need it for reasonable scalability. We've spent a *lot* of time in git
development on trying to perform really well by avoiding work. Not
just in revision traversal, but in many other areas too (like making
diff and merge much faster by being able to handle whole identical
recursive subdirectories by just checking the SHA1, for example).

That's a *really* fundamental design issue in git. Performance was
always a primary goal. And by primary, I really mean primary. As in
"more important than just about anything else".  There were other
primary goals, but really not very many.

And there really aren't very good ways to limit DAG traversal.
Generation numbers are one of the very few fundamental ones. We hacked
around it with dates, and it works pretty well in practice (well
enough that I'm certainly ok with the hack), but it's definitely one
of the areas where git simply does something "wrong". It's simply not
a entirely reliable algorithm, and that fact makes me a bit
uncomfortable with it.

(Now, in theory, a global *approximate* time is theoretically possible
in a distributed environment, and as such it's arguable that "global
time with a slop that is based on the speed of light and knowledge of
location" is at least theoretically sound. So the real problem with
commit dates is that people simply don't have good clocks. So it's a
practical problem rather than a theoretical one, and it's a practical
problem that doesn't really cause enough problems in practice to not
be workable. But I'm making excuses for it, and I _know_ I'm making
excuses for it, so I'm not really happy about it)

And it's just about the only area where I am aware of git doing
something "wrong". Which is why I would like to have had generation
numbers even though the dates do work.

Anyway, to get back to the actual issue of caching vs not caching: if
you think "we could compute it dynamically" means that we should, then
we damn well shouldn't cache it either - why cache it, when you could
just compute it. And if it's worth it to waste resources on the cache
in order to avoid performance issues, then it damn well would be ok to
waste (fewer) resources on just saving the generation number in the
object data base. And make that *fundamental* fix to a hack that git
has had since pretty much day one.

And btw, git didn't have the date-based hack originally, because I
didn't think it would be problematic enough. I thought that we could
do universally efficient partial DAG traversal - not having to go all
the way to the root -  based purely on the DAG. The code in
"everybody_uninteresting()" tries to be that "limit DAG traversal by
only looking at the DAG itself", and it works for many simple
situations. But it turns out that it does *not* work for many other
cases.

So the generation number really is very very fundamnetal. It's
absolutely not some "additional information that can be computed",
because the whole AND ONLY point of having the number is to not
compute it.

We are never interested in the generation number for its own sake. We
are only interested in it in order to avoid having to look at the rest
of the DAG.

So no, the number fundamentally isn't computable, because computing it
obviates the need for it.

                           Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

George Spelvin
> So the generation number really is very very fundamnetal. It's
> absolutely not some "additional information that can be computed",
> because the whole AND ONLY point of having the number is to not
> compute it.
>
> We are never interested in the generation number for its own sake. We
> are only interested in it in order to avoid having to look at the rest
> of the DAG.

You're making my point and somehow not seeing it.

What you're describing here is the archetpical cache.

The only reason for having a memory cache is to avoid accessing memory!
The only reason for having a TLB is to avoid walking the page tables!
The only reason for having a page cache is to avoid hitting the disk!
The only reason for having a dcache is to avoid traversing the file
system directories!

And yes, the only reason for having a generation number cache is to avoid
traversing the DAG.  D'oh.  Do you think this is somehow news to anyone?

The fundamental nature of a cache is that it lets you look something up
quickly that you could compute but don't want to.

I'm slapping my forehead like Homer Simpson here.  The fact that computing
the generation number is expensive is why it's worth cacheing.  But the
fact that it *can* be computed is a reason not to clutter the published
commit object format with it.


The generation number is NOT FUNDAMENTAL.  It contains no information
that's not already in the DAG.  The danger of putting it into a commit
is that you'll do it wrong, and thereby screw everything up.

If we have broken code that generates a broken cache, we fix the code
and the bugs magically go away.

If we have broken code that generates a broken commit object, we have
a huge problem.

Just like we don't ship pack indexes around, but recompute them on arrival.
The index is essential for performance, but it's absolutely non-essential
for correctness.


As a general design principle, the exported data structures, like the
commits, should be as simple as possible.  Do not include extraneous
or redundant data, because then you have to deal with the possibility
of inconsistency.  This leads to bugs.  (Frequently buffer overflow bugs.)

Maybe it would have been worth violating that principle during the initial
git design.  I still see a good argument for not doing that even if we
had a time machine.

But now that the commit format is established and widely used, the argument
has far more force.  Changing the commit format provides zero functionality
gain, and the performance gain can be obtained a different way.

Maybe a bit more code, but nothing extraordinary.

To me, the KISS principle says "don't change the commit format!"

Now, you complain about code complexity.  But this is a read-only cache.
The generation number of a commit object never changes.  There's no update
operation.  Like an I-cache, if there's ever any problem, throw it away.

Arguing that "the patch to put it in the commit object is smaller" is
stupidly short-sighted.  Now every version of git from now until forever
has to support both kinds of commit objects.  (And browsing old git
trees will forever be slow.)

You only take on that sort of legacy support burden if you absolutely have to.

> But "just because we could recompute it" is a bad bad reason.

Bull puckey.  You're ugly and stupid and WRONG.

It's an excellent reason.  I'm amazed that you're not seeing it.
The principle is "don't include redundant data in a transport format."
Because it can be recomputed, it's redundant.  Therefore, it shouldn't
be included in the transport format.

It's exactly the same principle as "don't store the indexes in the
database dump" and "don't store filename hashes in file system
archives".

This is a principle, not an iron-clad rule.  It can be violated for
good and sufficient reasons, notably performance.

But in this case, we can get the performance without it.  Without,
in fact, changing the git transport format at all.

And "don't change a widely-used transport format" is ANOTHER important
principle.  Backward-compatible is much better than incompatible, but
far better to avoid changing it at all.

Breaking two such principles without an absolutely iron-clad reason is
ugly and stupid and wrong.

(As you well know, the more general principle is "don't store redundant
data AT ALL unless you need to for performance".  Redundant data is A
Bad Thing.  It can get out of sync.  But if you have to, a private cache
is much better than a exchange format.)


Put another way, it IS stupid, it IS expendable, and therefore it SHOULD go.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Sun, Jul 17, 2011 at 4:39 PM, George Spelvin <[hidden email]> wrote:
>
> I'm slapping my forehead like Homer Simpson here.  The fact that computing
> the generation number is expensive is why it's worth cacheing.  But the
> fact that it *can* be computed is a reason not to clutter the published
> commit object format with it.

And I'm slapping *my* forehead.

Nobody has *ever* given a reason why the cache would be better than
just making it explicit.

That's my issue.

Why is that so hard for people to understand? The cache is just EXTRA WORK.

To take your TLB example: it's like having a TLB for a page table that
would be as easy to just create in a way that it's *faster* to look up
in the actual data structure than it would be to look up in the cache.

Or to take your disk cache example: wouldn't you say that a disk cache
is a F&*&ING BAD IDEA if it is slower than the disk it caches?

Seriously.

                    Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Christian Couder-2
On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:

> If  "git tag --contains" and "git branch --contains" give incorrect
> answers because the commiter date is wrong in some commits, then why
> not use "git replace" to "change" the commiter date in the commits
> that have a wrong date? Is it because you don't want to use "git
> replace", or because there is no script to do it automatically, or is
> there another reason?

That would work. There are a few tricky things, though:

  1. Most commits have less than 100 skewed commits. But some have many
     (e.g., thousands in the mesa repo). How well does git cope with
     large numbers of replace refs, performance-wise?

  2. Declaring which commits are skewed is actually tricky. You can find
     a commit whose timestamp is less than the timestamp of one of its
     ancestors. But you don't know whether it is skewed, or the
     ancestor.

     If you are implementing a list of commits whose timestamps
     shouldn't be used for traversal cutoff, it doesn't really matter
     who is _right_; you just care about whether the timestamps are
     strictly increasing from that point.

     But once you start replacing commits, you need to put in a
     reasonable value for the timestamp. So you may well be replacing a
     perfectly valid commit with one that has bogus, skewed information
     in the commit timestamp.

  3. Any value you put in is actually going to be a lie during things
     like "git log --pretty=raw". That may be OK. But it is letting an
     optimization meant to make traversal fast and accurate bleed into
     the actual data we show the user.

  4. Sometimes we need to do traversals on the real objects (e.g.,
     because we are doing upload-pack). To get the benefit, those
     traversals would presumably need to look at both the original
     object and the replacement, use the timestamp from the replacement
     for traversal, but otherwise use the original object.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

George Spelvin
In reply to this post by Linus Torvalds-3
> Nobody has *ever* given a reason why the cache would be better than
> just making it explicit.

I thought I listed a few.  Let me be clearer.

1) It involves changing the commit format.  Since the change is
   backward-compatible, it's not too bad, but this is still fundamentally
   A Bad Thing, to be avoided if possible.

2) It can't be retrofitted to help historical browsing.

3) You have to support commits without generation numbers forever.
   This is a support burden.  If you can generate generation numbers for
   an entire repository, including pre-existing commits, you can *throw
   out* the commit date heuristic code entirely.

4) It can't be made to work with grafts or replace objects.

5) It includes information which is redundant, but hard to verify,
   in git objects.  Leading to potentially bizarre and version-dependent
   behaviour if it's wrong.  (Checking that the numbers are consistent
   is the same work as regenerating a cache.)

6) It makes git commits slightly larger.  (Okay, that's reaching.)

> Why is that so hard for people to understand? The cache is just EXTRA WORK.

That's why it *might* have been a good idea to include the number in
the original design.  But now that the design is widely deployed, it's
better to avoid changing the design if not necessary.

With a bit of extra work, it's not necessary.

> To take your TLB example: it's like having a TLB for a page table that
> would be as easy to just create in a way that it's *faster* to look up
> in the actual data structure than it would be to look up in the cache.

You've subtly jumped points.  The original point was that it's worth
precomputing and storing the generation numbers.  I was trying to
say that this is fundamentally a caching operation.

Now we're talking about *where* to store the cached generation numbers.

Your point, which is a very valid one, is that they are to be stored
on disk, exactly one per commit, can be computed when the commit is
generated, and are accessed at the same time as the commit, so it makes
all kinds of sense to store them *with* the commits.  As part of them,
even.

This has the huge benefit that it does away with the need for a *separate*
data structure.  (Kinda sorts like the way AMD stores instruction
boundaries in the L1 I-cache, avoiding the need for a separate data
structure.)

I'm arguing that, despite this annoying overhead, there are valid reasons
to want to store it separately.  There are some practical ones, but the
basic one is an esthetic/maintainability judgement of "less cruft in
the commit objects is worth more cruft in the code".

Git has done very well partly *because* of the minimality of its basic
persistent object database format.  I think we should be very reluctant
to add to that without a demonstrated need that *cannot* be met in
another way.


In this particular case, a TLB is not a transport format.  It's okay
to add redundant cruft to make it faster, because it only lasts until
the next reboot.  (A more apropos, software-oriented analogy might be
"struct page".)

A git commit object *is* a transport format, one specifically designed
for transporting data a very long way forward in time, so it should be
designed with considerable care, and cruft ruthlessly eradicated.

Whatever you add to it has to be supported by every git implementation,
forever.  As does every implementation bug ever produced.

A cache, on the other hand, is purely a local implementation detail.
It can be changed between versions with much less effort.

I agree it's more implementation work.  But the upside is a cleaner
struct commit.  Which is a very good thing.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Anthony Van de Gejuchte
On 18-jul-2011, at 07:13, George Spelvin wrote:

>> Nobody has *ever* given a reason why the cache would be better than
>> just making it explicit.
>
> I thought I listed a few.  Let me be clearer.
>
> 1) It involves changing the commit format.  Since the change is
>   backward-compatible, it's not too bad, but this is still fundamentally
>   A Bad Thing, to be avoided if possible.

Git is designed to ignore data in this case afaik, so I do not see any
reason why backwards-compatibility gets broken here.

>
> 2) It can't be retrofitted to help historical browsing.

I like to see more (valid) arguments, as I do not see what you are
trying to explain.

>
> 3) You have to support commits without generation numbers forever.
>   This is a support burden.  If you can generate generation numbers for
>   an entire repository, including pre-existing commits, you can *throw
>   out* the commit date heuristic code entirely.

I'll give you a few months to rethink at this statement until this
feature does get used widely. I think there was never a moment where
we would ever think to rebuild older commits as this would break the
hash of the commits where many people are potential looking for.

>
> 4) It can't be made to work with grafts or replace objects.
>
> 5) It includes information which is redundant, but hard to verify,
>   in git objects.  Leading to potentially bizarre and version-dependent
>   behaviour if it's wrong.  (Checking that the numbers are consistent
>   is the same work as regenerating a cache.)

The data is *consistent* as long as the hash doesn't change, storing the
data in the commits *can* reduce resource and makes calculations cheaper.
Therefore, I think there are enough reasons to add the generation number
in the commit. Yes, many data can be calculated or can be an overhead,
but as Torvalds already said, it can be used as consistency check.

If the data does get wrong, then its probably caused by something stupid
enough to break the rules. Yes, this is a problem but I think there are
already enough reasons given, look back to the archives of this topic.

Ok, there is one possible thing that *can* go wrong and that is when you
are changing history with generation numbers with an older git client.

(And thats a good reason to communicate with others as clear as possible
about this feature, but its still not version-dependent as it doesn't
require a client to use it)

>
> 6) It makes git commits slightly larger.  (Okay, that's reaching.)
>
>> Why is that so hard for people to understand? The cache is just EXTRA WORK.
>
> That's why it *might* have been a good idea to include the number in
> the original design.  But now that the design is widely deployed, it's
> better to avoid changing the design if not necessary.
>
> With a bit of extra work, it's not necessary.
>
>> To take your TLB example: it's like having a TLB for a page table that
>> would be as easy to just create in a way that it's *faster* to look up
>> in the actual data structure than it would be to look up in the cache.
>
> You've subtly jumped points.  The original point was that it's worth
> precomputing and storing the generation numbers.  I was trying to
> say that this is fundamentally a caching operation.
>
> Now we're talking about *where* to store the cached generation numbers.
>
> Your point, which is a very valid one, is that they are to be stored
> on disk, exactly one per commit, can be computed when the commit is
> generated, and are accessed at the same time as the commit, so it makes
> all kinds of sense to store them *with* the commits.  As part of them,
> even.
>
> This has the huge benefit that it does away with the need for a *separate*
> data structure.  (Kinda sorts like the way AMD stores instruction
> boundaries in the L1 I-cache, avoiding the need for a separate data
> structure.)
>
> I'm arguing that, despite this annoying overhead, there are valid reasons
> to want to store it separately.  There are some practical ones, but the
> basic one is an esthetic/maintainability judgement of "less cruft in
> the commit objects is worth more cruft in the code".
>
> Git has done very well partly *because* of the minimality of its basic
> persistent object database format.  I think we should be very reluctant
> to add to that without a demonstrated need that *cannot* be met in
> another way.
>
>
> In this particular case, a TLB is not a transport format.  It's okay
> to add redundant cruft to make it faster, because it only lasts until
> the next reboot.  (A more apropos, software-oriented analogy might be
> "struct page".)
>
> A git commit object *is* a transport format, one specifically designed
> for transporting data a very long way forward in time, so it should be
> designed with considerable care, and cruft ruthlessly eradicated.
>
> Whatever you add to it has to be supported by every git implementation,
> forever.  As does every implementation bug ever produced.
>
> A cache, on the other hand, is purely a local implementation detail.
> It can be changed between versions with much less effort.
>
> I agree it's more implementation work.  But the upside is a cleaner
> struct commit.  Which is a very good thing.

A cache would use more resources because they can become invalid at any
point and *should* be recalculated by every client. We are processing
data that *can* be reused by everybody with a git client which has this
specific feature, but does not break anything with an older client.

So please, calculate things only once as this may save a *lot* of time :-)

I would see more advantage in a cache if the data could differs on
every client, but that still doesn't mean that you should use one.

> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Maybe I shouldn't even have responded to this as I tend not to agree with
the given opinions to use a cache, even when I think that Torvalds starts
throwing arguments as well for certain reasons, but thats probably my wrong
thinking at it.


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

George Spelvin
>> 1) It involves changing the commit format.  Since the change is
>>   backward-compatible, it's not too bad, but this is still fundamentally
>>   A Bad Thing, to be avoided if possible.

> Git is designed to ignore data in this case afaik, so I do not see any
> reason why backwards-compatibility gets broken here.

That's what I just wrote.  "The change is backward-compatible"
is a simpler and shorter way of writing "it doesn't break
backwards-compatibility" (to put the generation number in the commit
object).

I just said that *any* change is still undesirable.

>> 2) It can't be retrofitted to help historical browsing.

> I like to see more (valid) arguments, as I do not see what you are
> trying to explain.

I apologize for being unclear.  I meant that if you store the generation
in the commit, then you can't add generation numbers to an existing
repository ("retrofit") in order to speed up --contains and --topo-sort
operations on pre-existing git repositories.

(Without recomputing all the hashes and breaking the ability to merge
with people not using the feature.)

As Linus points out, this is not likely to be a major performance issue
in practice, as operations like finding merge bases overwhelmingly
use recent objects (which will have generation numbers once the feature
goes in), but it is a measurable disadvantage.

>> 3) You have to support commits without generation numbers forever.
>>   This is a support burden.  If you can generate generation numbers for
>>   an entire repository, including pre-existing commits, you can *throw
>>   out* the commit date heuristic code entirely.

> I'll give you a few months to rethink at this statement until this
> feature does get used widely. I think there was never a moment where
> we would ever think to rebuild older commits as this would break the
> hash of the commits where many people are potential looking for.

I'm afraid that your English grammar is sufficiently mangled here that
I don't understand *your* point.  Which is a shame because it's
one of my more important points.

Storing the generation number inside the commit means that a commit
with a generation number has a different hash than a commit without one.
This means that people won't want to break the hashes of existing commits
by adding them.  In many cases, ever.

Which means that git will have to be able to work without the generation
numbers forever.

If the generation numbers are stored in a separate data structure that
can be added to an existing repository, then a new version of git can
do that when needed.  Which lets git depend on always having the the
generation numbers to do all history walking and stop using commit date
based heuristics completely.

>> 4) It can't be made to work with grafts or replace objects.
>>
>> 5) It includes information which is redundant, but hard to verify,
>>   in git objects.  Leading to potentially bizarre and version-dependent
>>   behaviour if it's wrong.  (Checking that the numbers are consistent
>>   is the same work as regenerating a cache.)

> The data is *consistent* as long as the hash doesn't change, storing the
> data in the commits *can* reduce resource and makes calculations cheaper.

You're mixing up two issues.  Storing the generation number *anywhere*
can make calculations cheaper.  Storing them in the commit is indeed the
*simplest* place, but the calculation cost point is equally true if the
numbers are stored somewhere else.

As for consistency...

I'm defining "consistent" as consistency between the generation number
and the parent pointers.  This is the property that the history-walking
optimizations depend on.

A commit's generation number is consistent if it is larger than the
generation number of any of its parents.  (Optionally, you
may require that it be larger by exatly 1.)

A generation number is *not* consistent if is less than or equal to the
generation number of one of its parents.

If this happens, history walking code that uses the generation numbers
will not produce correct output.

Further, the nature of the incorrectness will depend on implementation
details ("potentially bizarre and version-dependent behaviour") of the
history-walking code.

By computing the generation numbers when needed, the entire "what happens
if someone makes a commit with an inconsistent generation number"
problem goes away.  It goes from "not likely to happen" or "somthing
that has to be checked for when receiving objects" to "can't happen".

The computation to verify that an incoming commit's generation number
is consistent is exactly the same computation needed to compute the
generation number it should have: look up all parent commit generation
numbers and take the maximum.  The only question is whether we store
the result after computing it, or compare with the included generation
number and possibly print an error message.


For example, suppose I generate a commit with a generation number of
UINT_MAX.  Will this crash git?  That's a new error condition the code
has to worry about.  If I generate the generation number locally, I know
that can't happen in any repository that I can download in a reasonable
period of time.

If we had generation numbers from day 1, we could just require that they
always be checked, and an inconsistent object could be always rejected.

But since old git versions ignore the generation number in commits, a
bad generation number could spread a long way before someone notices it.
It becomes a visible problem.  Not a really big one (I'm pretty sure
that refusing to pull it introduces no security holes), but it's an
error condition that we have to actually think about.


> A cache would use more resources because they can become invalid at any
> point and *should* be recalculated by every client. We are processing
> data that *can* be reused by everybody with a git client which has this
> specific feature, but does not break anything with an older client.
>
> So please, calculate things only once as this may save a *lot* of time :-)

This is silly.  The cache can't become invalid except by disk corruption,
which can corrupt numbers stored in the commit object just the same.
(The corruption can be detected by git-fsck, but that's also true
independent of where the numbers are stored.)

And the work to recalculate the numbers is far less than the work to
garbage collect, or repack, or generate the index of an incoming pack,
or any of a dozen operations that are normally done by all clients.
(Don't get me started on rename detection!)

This is a completely misplaced optimization.  Walking every commit in
the repository takes a few seconds and enough memory that we don't want
to do it every "git log" operation, but it's barely perceptible compared
to other repository maintenance operations.

Do it once when you install a new git software version and then you can
forget about it.

> I would see more advantage in a cache if the data could differs on
> every client, but that still doesn't mean that you should use one.

If you use grafts or replace objects, it can be.  That's my point 4)
above.  Supporting these makes maintaining a cache trickier, but it's
simply impossible to do with in-commit generation numbers.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Christian Couder
In reply to this post by Jeff King
On Monday 18 July 2011 05:41:06 Jeff King wrote:

> On Sat, Jul 16, 2011 at 11:16:45AM +0200, Christian Couder wrote:
> > If  "git tag --contains" and "git branch --contains" give incorrect
> > answers because the commiter date is wrong in some commits, then why
> > not use "git replace" to "change" the commiter date in the commits
> > that have a wrong date? Is it because you don't want to use "git
> > replace", or because there is no script to do it automatically, or is
> > there another reason?
>
> That would work. There are a few tricky things, though:
>
>   1. Most commits have less than 100 skewed commits. But some have many
>      (e.g., thousands in the mesa repo). How well does git cope with
>      large numbers of replace refs, performance-wise?

If it did not cope well, it should be possible to improve the performance.

Anyway, another way to fix the problem with "git replace" could be to create
branches with commits that have a fixed commiter date and then to use "git
replace" only to connect these branches to the graph.

For example if you have this:

A - B - X1 - X2 - X3 - C - D

where X1, X2 and X3 are skewed, then you can create this:

A - B - X1 - X2 - X3 - C - D
         \ Y1 - Y2 - Y3

where Y1, Y2, Y3 are the same as X1, X2, X3 except they are not skewed.
Then you only need to do "git replace X3 Y3" so you create only one replace
ref.

>
>   2. Declaring which commits are skewed is actually tricky. You can find
>      a commit whose timestamp is less than the timestamp of one of its
>      ancestors. But you don't know whether it is skewed, or the
>      ancestor.
>
>      If you are implementing a list of commits whose timestamps
>      shouldn't be used for traversal cutoff, it doesn't really matter
>      who is _right_; you just care about whether the timestamps are
>      strictly increasing from that point.
>
>      But once you start replacing commits, you need to put in a
>      reasonable value for the timestamp. So you may well be replacing a
>      perfectly valid commit with one that has bogus, skewed information
>      in the commit timestamp.

Perhaps but with "git replace" you can choose to create new replace refs and
deprecate the old replace refs to fix this where you got it wrong.

It would be easier to do that if "git replace" supported sub directories like
"refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs
more easily.

For example you could create new refs in "refs/replace/clock-skew/ted-
july-2011-2/" if you found a better fix. And then use these new refs instead of
those in "refs/replace/clock-skew/ted-july-2011/".

>   3. Any value you put in is actually going to be a lie during things
>      like "git log --pretty=raw". That may be OK. But it is letting an
>      optimization meant to make traversal fast and accurate bleed into
>      the actual data we show the user.

With replace refs, the user could choose the "lies" told to him/her by
selecting the replace refs or set of replace refs that are used.

As commits are immutable, when they are created with bad data, the best we can
do is let the user choose if they want to see the original or another "fixed"
version. Because the original will always be "true" in a way.

>   4. Sometimes we need to do traversals on the real objects (e.g.,
>      because we are doing upload-pack). To get the benefit, those
>      traversals would presumably need to look at both the original
>      object and the replacement, use the timestamp from the replacement
>      for traversal, but otherwise use the original object.

Yeah, or maybe when we do traversals on real objects we could afford not to
rely on commiter date or some other "fragile" data.

Thanks,
Christian.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
On Tue, Jul 19, 2011 at 06:14:38AM +0200, Christian Couder wrote:

> >      But once you start replacing commits, you need to put in a
> >      reasonable value for the timestamp. So you may well be replacing a
> >      perfectly valid commit with one that has bogus, skewed information
> >      in the commit timestamp.
>
> Perhaps but with "git replace" you can choose to create new replace refs and
> deprecate the old replace refs to fix this where you got it wrong.
>
> It would be easier to do that if "git replace" supported sub directories like
> "refs/replace/clock-skew/ted-july-2011/", so you could manage the replace refs
> more easily.

I think all of the arguments I cut from your email are reasonable, but
the crux of the issue comes down to this point.

If you are interested in actually correcting the skew, then yes, replace
refs are a good solution. But doing so is going to involve somebody
looking at the commits and deciding which ones are wrong, and what they
should be. And maybe that's a good thing to do for people who really
care about cleaning history.

But for something like "speed up revision traversal by assuming commit
timestamps are roughly increasing", we want something very automated,
and what is needs to say is much weaker (not "this is what this commit
_should_ say", but rather "this commit might be right, but it is not a
good point for cutting off a traversal"). So that's a much easier
problem, and it's easy to do in an automated way.

So I think while you could use replace refs to handle this issue, it is
not always going to be the right solution, and there is room for
something simpler (and weaker).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Nicolas Pitre-2
In reply to this post by George Spelvin
On Mon, 18 Jul 2011, George Spelvin wrote:

> Storing the generation number inside the commit means that a commit
> with a generation number has a different hash than a commit without one.
> This means that people won't want to break the hashes of existing commits
> by adding them.  In many cases, ever.
>
> Which means that git will have to be able to work without the generation
> numbers forever.

I've been diverting myself from $day_job by reading through this thread.  
Still, I couldn't make my mind between having the generation number
stored in the commit object or in a separate cache by reading all the
arguments for each until now. Admittedly I'm not as involved in the
design of Git as I once was, so my comments can be considered with the
same proportions.

Obviously, with a perfect design, we would have had gen numbers from the
beginning.  But we did mistakes, and now have to regret and live with
them (and yes I have my own share of responsibility for some of those
regrets which are now embodied in the Git data format).

> If the generation numbers are stored in a separate data structure that
> can be added to an existing repository, then a new version of git can
> do that when needed.  Which lets git depend on always having the the
> generation numbers to do all history walking and stop using commit date
> based heuristics completely.

To me this is the killer argument.  Being able to forget about the
broken date heuristics entirely and simplify the code is what makes the
external cache so fundamentally better as it can be applied to any
existing repositories.  And it has no backward compatibility issues as
old Git version won't work any worse if they can't make any usage of
that cache.

The alternative of having to sometimes use the generation number,
sometimes use the possibly broken commit date, makes for much more
complicated code that has to be maintained forever.  Having a solution
that starts working only after a certain point in history doesn't look
eleguant to me at all.  It is not like having different pack formats
where back and forth conversions can be made for the _entire_ history.

And if you don't care about graft/replace then the cached data is
immutable just like the in-commit version would, so there is no
consistency issues.  If you do care about graft/replace (or who knows
what other dag alteration scheme might be created in 5 years from now)
then a separate cache will be required _anyway_, regardless of any
in-commit gen number.

So to say that if a generation number is _really_ needed, then it should
go in a separate cache.  Saying that if we would have done it initially
then it would have been inside the commit object is not a good enough
justification to do it today if it can't be applied to the whole of
already existing repositories and avoid special cases.

I however have not formed any opinion on that fundamental question i.e.
whether or not gen numbers are worth it in today's conditions. Neither
did I think about the actual cache format (I don't think that adding it
to the pack index is a good idea if grafts are to be honored) which
certainly has bearing on that fundamental question too.

But I don't see the point of starting to add them now to commit objects,
even if we regret not doing it initially, simply because having them
appear randomly based on the Git version/implementation being used is
still much uglier than some ad hoc cache or even not having them at all.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

George Spelvin
> The alternative of having to sometimes use the generation number,
> sometimes use the possibly broken commit date, makes for much more
> complicated code that has to be maintained forever.  Having a solution
> that starts working only after a certain point in history doesn't look
> eleguant to me at all.  It is not like having different pack formats
> where back and forth conversions can be made for the _entire_ history.

It seemed like a pretty strong argument to me, too.

> And if you don't care about graft/replace then the cached data is
> immutable just like the in-commit version would, so there is no
> consistency issues.  If you do care about graft/replace (or who knows
> what other dag alteration scheme might be created in 5 years from now)
> then a separate cache will be required _anyway_, regardless of any
> in-commit gen number.

A possible workaround would be to keep track of the largest generation
number skew introduced by any graft, and add that safety factor into
the history-walking code, but that would be painful if you replace a
single large commit with an equivalent long development history, such
as adding a historical development tree behind a recently-cut-off one.
or development history You can do a workaround at the expense of ine

> Neither did I think about the actual cache format (I don't think that
> adding it to the pack index is a good idea if grafts are to be honored)
> which certainly has bearing on that fundamental question too.

I was thinking of something very close to the V2 pack format.
http://book.git-scm.com/7_the_packfile.html
A magic number, a 256-entry fanout table, a sorted list of 20-byte hashes,
followed by a matching list of 4-byte generation numbers.

Ending with a 20-byte hash of the replaces and grafts state that this
cache is valid for, and a hash of the cache itself.

A bit of code factoring should make it easy to share much of the code.


It would certainly be possible to share the SHA1 table in an existing
pack index and store the generation numbers of the base (no replacement)
case, but you'd have to store null values for all the non-commit objects.

That takes 4 bytes per object, while a separate list of commits
takes 24 bytes per commit.  A separate list is better if commits
are less than 1/6 of all objects.

Looking at git's own object database, we have:
 66125 blobs   (45.50%)
 49292 trees   (33.92%)
 29554 commits (20.33%)
   362 tags    ( 0.25%)
145333 total

So we're actually a bit over the 16.66% optimum. but it's not far enough
to be a real efficiency problem.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

David Lang
On Wed, 20 Jul 2011, George Spelvin wrote:

>> The alternative of having to sometimes use the generation number,
>> sometimes use the possibly broken commit date, makes for much more
>> complicated code that has to be maintained forever.  Having a solution
>> that starts working only after a certain point in history doesn't look
>> eleguant to me at all.  It is not like having different pack formats
>> where back and forth conversions can be made for the _entire_ history.
>
> It seemed like a pretty strong argument to me, too.

except that you then have different caches on different systems. If the
generation number is part of the repository then it's going to be the same
for everyone.

in either case, you still have the different heristics depending on what
version of git someone is running

David Lang
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Nicolas Pitre-2
On Wed, 20 Jul 2011, [hidden email] wrote:

> On Wed, 20 Jul 2011, George Spelvin wrote:
>
> > > The alternative of having to sometimes use the generation number,
> > > sometimes use the possibly broken commit date, makes for much more
> > > complicated code that has to be maintained forever.  Having a solution
> > > that starts working only after a certain point in history doesn't look
> > > eleguant to me at all.  It is not like having different pack formats
> > > where back and forth conversions can be made for the _entire_ history.
> >
> > It seemed like a pretty strong argument to me, too.
>
> except that you then have different caches on different systems.

So what?

> If the generation number is part of the repository then it's going to
> be the same for everyone.

The actual generation number will be, and has to be, the same for
everyone with the same repository content, regardless of the cache used.  
It is a well defined number with no room to interpretation.

> in either case, you still have the different heristics depending on what
> version of git someone is running

Indeed.


Nicolas
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Phil Hord (hordp)
On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
> On Wed, 20 Jul 2011, [hidden email] wrote:
>
>> If the generation number is part of the repository then it's going to
>> be the same for everyone.
> The actual generation number will be, and has to be, the same for
> everyone with the same repository content, regardless of the cache used.
> It is a well defined number with no room to interpretation.

Nonsense.

Even if the generation number is well-defined and shared by all clients,
the only quasi-essential definition is "for each A in ancestors_of(B),
gen(A) < gen(B)".

In practice, the actual generation number *will be the same* for
everyone with the same repository content, unless and until someone
develops a different calculation method.  But there is no reason to
require that the number *has to be* the same for everyone unless you
expect (or require) everyone to share their gen-caches.

Surely there will be a competent and efficient gen-cache API.  But most
code can just ask if B --contains A or even just use rev-list and
benefit from the increased speed of the answer.  Because most code
doesn't really care about the gen numbers themselves, but only the speed
of determining ancestry.

Phil

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

David Lang
On Wed, 20 Jul 2011, Phil Hord wrote:

> On 07/20/2011 07:36 PM, Nicolas Pitre wrote:
>> On Wed, 20 Jul 2011, [hidden email] wrote:
>>
>>> If the generation number is part of the repository then it's going to
>>> be the same for everyone.
>> The actual generation number will be, and has to be, the same for
>> everyone with the same repository content, regardless of the cache used.
>> It is a well defined number with no room to interpretation.
>
> Nonsense.
>
> Even if the generation number is well-defined and shared by all clients, the
> only quasi-essential definition is "for each A in ancestors_of(B), gen(A) <
> gen(B)".
>
> In practice, the actual generation number *will be the same* for everyone
> with the same repository content, unless and until someone develops a
> different calculation method.  But there is no reason to require that the
> number *has to be* the same for everyone unless you expect (or require)
> everyone to share their gen-caches.

and I think this is why Linus is not happy with a cache. He is seeing this
as something that has significantly more value if it is going to be
consistant in a distributed manner than if it's just something calculated
locally that can be different from other systems.

if it's just locally generated, then I could easily see generation numbers
being different on different people's ssstems, dependin on the order that
they see commits (either locally generated or pulled from others)

If it's part of the commit, then as that commit gets propogated the
generation number gets propogated as well, and every repository will agree
on what the generation number is for any commit that's shared.

I agree that this consistancy guarantee seems to be valuable.

> Surely there will be a competent and efficient gen-cache API.  But most code
> can just ask if B --contains A or even just use rev-list and benefit from the
> increased speed of the answer.  Because most code doesn't really care about
> the gen numbers themselves, but only the speed of determining ancestry.

in that case, why bother with generation numbers at all? the improved data
based heristic seems to solve that problem.

David Lang

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12345