Git commit generation numbers

classic Classic list List threaded Threaded
89 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Geert Bosch

On Jul 14, 2011, at 21:19, Linus Torvalds wrote:
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

Generation numbers never will be required information, because we
can always compute them. These numbers are really much more similar
to other pack index information than anything else.

<aside>
Sometimes I wish we'd have general "depth" information for each
SHA1, which would be the maximum number of steps in the DAG to reach
a leaf. This way, if we want to do something like "git log
drivers/net/slip.c", we don't have to bother reading the majority
of trees that have a depth less than two. The depth can also be used
as a limiter for "contains" operations, where we want to see if
commit X contains commit Y: depth (X) has to be at least depth (Y).

However, any such notion, wether generation or depth or whatever
else we'll think of tomorrow, is something particular to a certain
implementation of git. It does not add anything to the information
we stored.
</aside>

I don't think my commit should have a different SHA1 from yours,
because your tree has a more generation numbers than mine.

The beauty and genius of GIT is that it just takes the minimum
amount of data needed to uniquely identify the information to be
stored, and stores that in a UNIQUE format. By allowing generation
numbers to either be present or absent, that's all broken.

It's like computing the SHA1 of compressed data: it doesn't depend
on the data we store, just about the particular representation we
choose. Fortunately we have done away with the first mistake.

So, if you're going to add generation numbers, there has to be a
flag day, after which generation numbers are required everywhere.
Of course it would be possible to recognize "old style" commits
and convert them on the fly, but that is true for pretty much
any format change. However, adding redundant information seems
like a poor excuse for having a flag day.

Storing generation data in pack indices on the other hand makes
perfect sense: when we generate these indices, we do complete
traversals and have all required information trivially at hand.  We
can never have that many loose objects, so lack of generation
information there isn't a big deal. By storing generation information
in the index, we can be sure it is consistent with the data contained
in the pack, so there are no cache invalidation issues.

I know I must have missed some stupid and obvious reason why
this is all wrong, I just don't quite see it yet.

  -Geert
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Thu, Jul 14, 2011 at 06:19:30PM -0700, Linus Torvalds wrote:

> Yes.
>
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
>
> In other words, I do agree that we should strive for minimal required
> information.
>
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!

So you don't see a difference between storing the information directly
in the commit object, where it affects the sha1 of the commit, and
calculating and storing it somewhere else? That is what seems ungit to
me. You aren't adding new information to the DAG (note I said "DAG" and
not commit) that is not already there, but you are changing the ids of
commits in the DAG.

I'm not saying that's a reason to ultimately reject the idea of putting
generation numbers in commit objects. But it is a reason to give us
pause and figure out if there are other solutions, because it will be
the first time such redundant information has been added. And that's
what I've been trying to do during this discussion with you: work out
what the options are and evaluate them.

> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

Are packfiles unclean, or a random hack? How about pack indices? What
about Nico's and Shawn's ideas for a packv4 that would gain efficiency
by storing objects not in their whole format, but in a way that would
make tree examination faster (but would be able to restore the whole
objects byte for byte)?

Those things rely on the idea that the git DAG is a data model that we
present to the user, but that we're allowed to do things behind the
scenes to make things faster. We're allowed to make an index of offsets
of objects in the packfile for faster lookup. Why are we not allowed to
use an index for other object data if it will speed up our local
algorithms?

Again, I'm not saying that the patches I posted are necessarily the
answer. Maybe my cache implementation sucks. Maybe the value should go
into a pack index instead. Maybe the whole idea is stupid. But I don't
think it's worth rejecting out-of-hand the idea that the generation
number might be stored outside of the commit object. I do think it's
worth talking about what the actual downsides are, as compared to other
options.

For example, you mentioned there a consistency problem in the paragraph
above. What is it?

If you mean the problem with refs/replace, then yes, that is an open
problem to be solved (though not a hard one, as I already mentioned a
solution elsewhere). But is that problem better or worse with this
solution versus an embedded generation number? It seems to me that an
embedded generation number is even worse.

> Just look at the size of the patches in question. Your caching patches
> are bigger and more complicated. Sure, part of it is that your series
> adds the code to _use_ the generation number, but look purely at the
> code to maintain them.

It's 300 lines of code. That can also be used to store arbitrary
meta-information for commits. I've already achieved significant speedups
in some workflows by caching patch-id calculations, which would reuse
this code.  And I certainly don't think _that_ should go into the commit
object.

Yes, it's more complex than simply adding a generation number to the
commit header. But simply adding a generation number does not actually
give the 100-fold speedup I'm seeing. So again, I'm not interested in
rejecting solutions out of hand; I'm interested in things like: is the
complexity of the cache worth this speedup? What other options do we
have, and what speedup do they provide? Do we care enough about this
speedup to even bother?

> Why do you think the odd separate cache is somehow better than just
> doing it right? Seriously? If we require the generation numbers, then
> they have *become* that minimal information that we should save!

What do you do when generation numbers don't match the DAG represented
by the parent pointers? Are you proposing to just ignore it? I'm not
asking that question adversarially; ignoring may be the sane thing to
do, and we say "generation numbers are to be trusted, even if they don't
match parent pointers".

> Now, six years later, you clearly are saying that we need the
> generation numbers, but then you go off and try to say that they
> should be in some secondary non-architected random collection of data
> structures that isn't covered by the security and maintenance
> guarantees that the core git objects are.

I don't think I said we clearly need them. I said we can get speedups by
using them, and I showed some patches. I _also_ posted patches showing
how to accomplish similar speedups using timestamps. Note that all of my
patches started with "RFC". I am trying to figure out which is the best
way to proceed.

And why _would_ they need to be covered by the security and maintenance
guarantees of core objects? You can trivially calculate them from the
core objects. Are pack indices also a "secondary non-architected random
collection of data structures"?

> Dammit, one of the things that makes git special is that the data
> structures are NOT random odd ad-hoc files. There is a design to them.

There is just as much documentation and design for the new file format I
added as there is for pack indices (in fact, they're quite similar in
design). I really see them at the same level: something we calculate to
speed up some algorithms, but something we could regenerate at any time
if we felt like.

> > And so that seems a bit hack-ish to me.
>
> Um? If you feel that way, then why the hell are you pushing your EVEN
> MORE HACKISH CACHE PATCHES?

Please, there is really no need to shout. And I find it quite silly that
you would refer to me as "pushing" these patches when they have been
clearly listed as RFC, and everything I have posted in the nearby
threads has been about comparing different strategies (with patches and
timings for some of those other strategies!).

> We don't do caches in git, because we don't NEED to. Sure, gitk has
> it's hacky cache, but that's not core functionality.

I'm sorry to tell you that there is already a cache for external
conversion of diffs for blobs. And that I have a patch series which
makes "git cherry" much more pleasant to use by caching patch ids.

Do we "need" those? No, of course not. Git works just fine without them,
albeit a bit slower. But is it sometimes worth making a space-time
tradeoff to make some algorithms faster? I think it sometimes is,
depending on the space and time factors, and the complexity of the
storage (e.g., consistency problems with caching).

> I think it's a sign of good design that we can do a "find .git" and
> explain every single file, and show that it's all core functionality
> (again, with the exception of "gitk.cache", and I suspect that's
> because gitk is a script, not because of any really fundamental data
> issues), and explain it.

Would it make you happier if we stored the generation data in the pack
index when we index the packs?

> I think the *cache* is a hell of a lot more hacky than just doing it right.

You still haven't explained how we would "do it right" and get the same
speedups. When I responded to your initial email, your answers were
along the lines of "we could cache fewer things". If your position is to
damn the speedup, the cache is not worth the complexity, I can buy that.
If your position is that the complexity is not worth it, and we are
better off to keep using timestamps, I can buy that. If your position is
that you can find a clever way, using only the generation numbers in
newly created commits, to get similar speedups in "git {tag,branch}
--contains", I'd love to hear it.

> > I liken it somewhat to the "don't store renames" debate.
>
> That's total and utter bullshit.
>
> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.

Yes, I am well aware that stored renames are wrong for merging. The
problem is that they not a function of a tree state (which is what a
commit stores), but rather of the difference between two states. So when
you diff the commit's state with some other arbitrary merge-base, any
renames recorded at commit time would be worthless.

But consider another case. Each time I run "git log -M --raw", I compute
the same renames over and over. Let's say I have a case in which this is
annoyingly slow, and want to speed it up. The state of a particular
commit and the state of its parents are invariants for a particular sha1
commit id; this is a fundamental property of git, as you well know. So
for a given rename-detection algorithm (and any parameters it has), the
set of renames between the states will also be an invariant.

Now imagine I create a persistent cache mapping the commit sha1 for some
sane default set of rename algorithm parameters to a set of rename
pairs. My annoyingly slow "log -M" is now faster, and I'm happier.

I think you encounter a similar set of questions here as you do with the
concept of a generation header. If the information is an invariant for a
particular commit sha1, can we and should we store it in the commit
object? Is the speedup worth the complexity of a cache? What are the
circumstances under which the cache is not applicable, and how often do
they come up? Can we accurately detect when the cache is not applicable?

And that is why I compared it to the idea of storing renames. Please
note that I did _not_ say they were exactly the same situation, or that
the answers to one set of questions were the same as the answers to
another.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jakub Narębski
In reply to this post by Linus Torvalds-3
Linus Torvalds <[hidden email]> writes:

> On Thu, Jul 14, 2011 at 1:31 PM, Jeff King <[hidden email]> wrote:
> >
> > However, I'm not 100% convinced leaving generation numbers out was a
> > mistake. The git philosophy seems always to have been to keep the
> > minimal required information in the DAG.
>
> Yes.
>
> And until I saw the patches trying to add generation numbers, I didn't
> really try to push adding generation numbers to commits (although it
> actually came up as early as July 2005, so the "let's use generation
> numbers in commits" thing is *really* old).
>
> In other words, I do agree that we should strive for minimal required
> information.
>
> But dammit, if you start using generation numbers, then they *are*
> required information. The fact that you then hide them in some
> unarchitected random file doesn't change anything! It just makes it
> ugly and random, for chrissake!
>
> I really don't understand your logic that says that the cache is
> somehow cleaner. It's a random hack! It's saying "we don't have it in
> the main data structure, so let's add it to some other one instead,
> and now we have a consistency and cache generation problem instead".

You store redundant information, one that is used to speed up
calculations, in a cache.
 
[...]
> > Generation numbers are _completely_ redundant with the actual structure
> > of history represented by the parent pointers.

What is more important the perceived structure of history can change
by three mechanisms:

 * grafts
 * replace objects
 * shallow clone

I can understand that you don't want to worry about grafts - they are
a terrible hack.  We can simply turn off using generation numbers
stored in commit if they are present.

The problem with shallow clones is only at beginning, when some of
commits in shallow repository does not have generation numbers.  You
cannot simply calculate generation number for a new commit in such
case.

But what about REPLACE OBJECTS?  If one for example use "git replace"
on root commit to join contemporary repository with historical
repository... this is not addressed in your emails.


And let's not forget the fact that we need cache for old commits which
don't have yet generation number in a commit.


BTW. you are not fair comparing size of code.  

First, some of Peff code is about _using_ generation numbers, which
will be needed regardless of whether generation numbers are stored in
cache or packfile index, or whether they are embedded in commit
objects.

Second, with generation number commit header you need to write fsck
code, and have to consider size of this yet-to-be-written code.

[...]
> > I liken it somewhat to the "don't store renames" debate.
>
> That's total and utter bullshit.

I think Peff meant here that if you make mistakes in calculating
rename info or generation number, and have incorrect information
stored in commit object, you are f**ked.
 
> Storing renames is *wrong*. I've explained a million times why it's
> wrong. Doing it is a disaster. I know. I've used systems that did it.
> It's crap. It's fundamentally information that is actively misleading
> and WRONG. It's not even that you can do rename detection at run-time,
> it's that you *HAVE* to do rename detection at run-time, because doing
> it at commit time is simply utterly and fundamentally *wrong*.
>
> Just look at "git blame -C" to remind yourself why rename information is wrong.

Also doing full code movement and copying detection (that is what "git
blame -C" does) rather than simplistic whole-file rename detection is
pretty much impossible at commit time.

Nb. most SCMs that use path-id based rename tracking require that user
explicitly marks renames using "scm move" or "scm rename" (well,
Mercurial has a tool for rename detection before commit, "hg
addremove").  But asking user to mark code movements is simply
infeasible.
 
> But even more importantly, look at git merges. Look at how git has
> gotten merging right since pretty much day #1, and has absolutely no
> issues with files that got generated two different ways. Look at every
> SCM that tries to do rename detection, and look at how THEY CANNOT DO
> MERGES RIGHT.
>
> It's that simple. Rename detection is not about avoiding "redundant
> data". It's about doing the right thing.

Well, rename tracking supporters say that heuristic rename detection
can be wrong.


By the way, what happened to "wholesame directory rename detection"
patches?  Without them in the situation where one side renamed
directory, and other created new file in said directory git on merge
creates file in re-created old name of directory...

--
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Long, Martin
I strongly agree with Linus that the cache should not form part of the
solution to this problem, but could maybe be a later add-on, which
improved performance.

There is a possible improvement, which may remove the need for the
cache. It doesn't solve the issue of broken numbers, but I think the
key to that is just to ensure the traversal algorithm is
deterministic, stable, and immutable.

Firstly, I presume the generation number would not form part of the
SHA1 calculation? No? Cool.

When calculating a generation number by doing a traversal, would it
not be possible to update some, or all, commit objects touched, with
their generation numbers. Again, this would be expensive, but there
would possibly be even quicker gains than Linus's original proposal to
just add numbers to the new commit.

A compromise might be to only update some commits - notably those with
2 or more parents, so that both parents don't need to be traversed,
and possibly every nth commit (to give regular checkpoints that can be
utilised when traversing a branch). I would suggest commits with 2
children for the latter, but with my limited knowledge of the
implementation, I understand that Is more difficult to find.
Obviously, these numbers would only be pegged locally, and wouldn't by
synced on push, as they already exist on the far end. However, it
could be possible to run a process on a bare repo to shoot through and
peg commits, then at least new clones will be "well pegged"

Martin Long
UK
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Long, Martin
>
> Firstly, I presume the generation number would not form part of the
> SHA1 calculation? No? Cool.

I suspect this may be where my suggestion falls down. Though I suspect
there is a case for object metadata which doesn't form part of the
SHA. Would generation number tampering be a concern?

Caching offers the ability to store that metadata, to provide the same
performance gain, but maintain the integrity of the SHA chain.
However, it does still leave the generation number liable to
tampering, meaning a generic non-SHA metadata solution might be
better.

TBH, there are few situations where historical generations are useful
- finding gen numbers of tags is one of them. Most cases are going to
be for new commits, and in that case, a few new commits at the tip of
each branch will very quickly reduce the number of traversals. What
use case would really create enough traversals that it should be a
performance concern?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King
On Fri, Jul 15, 2011 at 12:46 AM, Jeff King <[hidden email]> wrote:
>
> So you don't see a difference between storing the information directly
> in the commit object, where it affects the sha1 of the commit, and
> calculating and storing it somewhere else?

Sure, I see the difference. And I think it's uglier to have two
different places for required information.

>  That is what seems ungit to
> me. You aren't adding new information to the DAG (note I said "DAG" and
> not commit) that is not already there, but you are changing the ids of
> commits in the DAG.

Umm. It's redundant, but so what? We have tons of redundant
information in there already. Those commits are very explicitly using
a 40-byte ASCII representation of the 20-byte SHA1 names. The very
original deeper object structure is also redundant: we repeat the
object size in the object itself, even though it's part of the
implicit object format itself.

We also very purposefully repeat the type of the object there, even
though the type is basically always redundant (in fact, the core git
functions require you to give the type of the object as part of the
lookup, and will error out if the SHA1 points to the wrong type). That
was one of my original design decisions, exactly because I wanted the
redundancy for verification.

Redundancy isn't a problem. It's a source of sanity checking.

I'm not seeing why you are harping on it.

I think it's much worse to have the same information in two different
places where it can cause inconsistencies that are hard to see and may
not be repeatable. If git ever finds the wrong merge base (because,
say, the generation numbers are wrong), I want it to be a *repeatable*
thing. I want to be able to repeat on the git mailing list "hey, guys,
look at what happens when I try to merge commits ABC and XYZ". If you
go "yeah, it works for me", then that is bad.

What I tried very hard to do in the git data structures is to make
them (a) immutable (so the DAG could never have two-way links, for
example) and (b) "simple".

Right now, we do *have* a "generation number". It's just that it's
very easy to corrupt even by mistake. It's called "committer date". We
could improve on it.

> Are packfiles unclean, or a random hack? How about pack indices?

No. Neither of them are unclean or random. The original git design was
very much about thinking of the object space as a "filesystem". Now,
the original object layout actually used the native OS filesystem, and
I naively thought that would be ok. Using  aspecialized filesystem
instead doesn't really change anything. It's not fundamentally
different from the difference between running git on ext3 or btrfs or
nfs or whatever. In fact, I think we've had more filesystem-related
bugs wrt NFS than we've had with pack-files.

The pack indices are actually kind of ugly - and I would have
preferred having them in the same file instead of having the worry of
consistency across two different files. They *are* the kind of thing
that could cause local inconsistency, but they are fairly simple, and
they have some serious protection in them (ie they aren't just SHA1'd
in themselves, they contain a SHA1 of the pack-file they index in them
to make sure that any inconsistency is findable). Again, that's
"redundancy". But I consider the packfile/index to be just a
filesystem. It really fundamentally *is* that.

Partly for that reason, I do think that if the generation count was
embedded in the pack-file, that would not be an "ugly" decision. The
pack-files have definitely become "core git data structures", and are
more than just a local filesystem representation of the objects:
they're obviously also the data transport method, even if the rules
there are slightly different (no index, thank god, and incomplete
"thin" packs).

That said, I don't think a generation count necessarily "fits" in the
pack-file. They are designed to be incremental, so it's not very
natural there. But I do think it would be conceptually prettier to
have the "depth of commit" be part of the "filesystem" data than to
have it as a separate ad-hoc cache.

> Those things rely on the idea that the git DAG is a data model that we
> present to the user, but that we're allowed to do things behind the
> scenes to make things faster.

.. and that is relevant to this discussion exactly *how*?

It's not. It's totally irrelevant. I certainly would never walk away
from the DAG model. It's a fundamental git decision, and it's the
correct one.

But it all boils down to one simple issue: we should have added
generation counts back in 2005. It's likely the *one* data format
decision that I regret. Using commit dates was wrong. Everybody knew
it was wrong, but we ended up going with it just to keep the format
constant.

If I had realized how small the patch was to add generation counters,
and that it wouldn't have broken backwards compatibility (ie fsck
doesn't start complaining). I would have done it originally, instead
of all the crazy hacks we did for commit date verification.

And that is what this discussion fundamentally boils down to for me.

If we should have fixed it in the original specification, we damn well
should fix it today. It's been "ignorable" because it's just not been
important enough. But if git now adds a fundamental cache for them,
then that information is clearly no longer "not important enough".

                               Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Drew Northup
In reply to this post by Long, Martin

On Fri, 2011-07-15 at 16:33 +0100, Long, Martin wrote:
> >
> > Firstly, I presume the generation number would not form part of the
> > SHA1 calculation? No? Cool.
>
> I suspect this may be where my suggestion falls down. Though I suspect
> there is a case for object metadata which doesn't form part of the
> SHA. Would generation number tampering be a concern?

If you take Jeff's perspective on the purpose of generation numbers
(representing metadata about the DAG in a more readily-available format)
then "tampering" is not really a concern as the metadata is merely local
(to the running instance of Git) ephemera that we can cache between runs
for the sake of efficiency. Linus' perspective on generation numbers
seems to be of a more hard and fast type of data.

So, are we really talking about [corpus] generation numbers (used to
describe the state of the DAG in the way one describes his known family
tree) or are we talking about _revision_numbers_ (used to describe the
commit, as Subversion does)? I think we've got two (or more) groups
talking about different things (and aims) and trying to use the same
words to do so.

> Caching offers the ability to store that metadata, to provide the same
> performance gain, but maintain the integrity of the SHA chain.
> However, it does still leave the generation number liable to
> tampering, meaning a generic non-SHA metadata solution might be
> better.

I'm not sure where you are going with this. I wouldn't think "tampering"
with _current_DAG-based ephemera would do much other than create a
performance hit. If you are really talking about a static
_revision_number_ then that belongs in the commit, where it cannot be
changed (and may be completely meaningless when taken out of context, as
SVN revision numbers are). What such a number may entail is probably up
for discussion, but perhaps in a different thread.

> TBH, there are few situations where historical generations are useful
> - finding gen numbers of tags is one of them. Most cases are going to
> be for new commits, and in that case, a few new commits at the tip of
> each branch will very quickly reduce the number of traversals. What
> use case would really create enough traversals that it should be a
> performance concern?

The answer to this is found in a previous thread
http://article.gmane.org/gmane.comp.version-control.git/176807

(remember, generation number vs. revision number...)

Also, please don't cull the CC list! (Added Geert Bosch)

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Shawn Pearce
In reply to this post by Linus Torvalds-3
On Fri, Jul 15, 2011 at 09:10, Linus Torvalds
<[hidden email]> wrote:
> Right now, we do *have* a "generation number". It's just that it's
> very easy to corrupt even by mistake. It's called "committer date". We
> could improve on it.
...
> If I had realized how small the patch was to add generation counters,
> and that it wouldn't have broken backwards compatibility (ie fsck
> doesn't start complaining). I would have done it originally, instead
> of all the crazy hacks we did for commit date verification.

What about going forward making the requirement that a new commit must
have a committer date whose date is >= the maximum date of its
parents?

We could also add a check during fast-forward merges to refuse to
perform the merge if the incoming commit has a committer date too far
forward in the future (e.g. more than 5 minutes). If you pull from a
moron whose system clock is set such that the committer date isn't a
proxy for generation number, Git would just refuse the merge, and you
could ask them to fix their objects.

--
Shawn.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Fri, Jul 15, 2011 at 9:18 AM, Shawn Pearce <[hidden email]> wrote:
>
> What about going forward making the requirement that a new commit must
> have a committer date whose date is >= the maximum date of its
> parents?

So you suggest just making commit dates be the generation number.

I'd be ok with that. It's basically what we've been doing for the last
six years.

But in that case, we shouldn't be doing the generation count cache either.

Btw, I do agree that we probably should add a warning for the case
("your clock is wrong - your commit date is before the commit date of
your parents") and maybe require the use of "-f" or something to
override it. That would certainly be a good thing quite independently
of anything else. So regardless of generation counts, it's probably
worth it.

But if you think commit date is good enough for generation counts -
and I'm not arguing against it - then please tell me why you would
then want to have a separate generation count cache.

So I would like to repeat: I think our commit-date based hack has been
pretty successful. We've lived with it for years and years. Even the
"let's try to fix it by adding slop" code is from three years ago
(commit 7d004199d1), which means that for three years we never really
saw any serious problems. I forget what problem we actually did see -
I have this dim memory of it being Ted that had problems with a merge
because git picked a crap merge base, but that may just be my
Alzheimer's speaking.

Obviously there are cases where we miss some merge base and it doesn't
really end up mattering, so we may well have a *ton* of commits that
have bad dates, but they just haven't affected us enough for us to
care. That's fine too - I dislike how our algorithm isn't truly
reliable, but at the same time I think we're so robust that it all
works regardless.

So I think it's ugly and fairly hacky, but it has worked well enough
in practice. I dislike our commit dates, but I don't _hate_ them. I do
think it was a mistake, but not one I'm especially ashamed of.

So why do I dislike the generation count cache so much? I dislike it
exactly because

  "if the commit date isn't good enough, then dammit, we should have
just added a generation count".

And if we should have added it six years ago, then we should add it
today. Not say "oh, we made a mistake six years ago, let's work around
the mistake instead of fixing it".

That's really what it boils down to. Let's not paper over a mistake.
Either we need the generation depth or we don't. And if we do need it,
we should replace the date-based hackery with it (where "replace" may
well be "still fall back on our traditional date-based hackery in the
absense of generation counters").

But if we decide that we don't really need generation counters AT ALL,
and can just continue with the commit date hack, then I'm personally
ok with that too.

So to me, it's a "either or" situation. Either the commit dates are
good enough, or we should add generation counts to the commits.

But in *neither* case is it ok to do some external cache to work around it.

                          Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Ted Ts'o
On Fri, Jul 15, 2011 at 09:44:21AM -0700, Linus Torvalds wrote:
> So I would like to repeat: I think our commit-date based hack has been
> pretty successful. We've lived with it for years and years. Even the
> "let's try to fix it by adding slop" code is from three years ago
> (commit 7d004199d1), which means that for three years we never really
> saw any serious problems. I forget what problem we actually did see -
> I have this dim memory of it being Ted that had problems with a merge
> because git picked a crap merge base, but that may just be my
> Alzheimer's speaking.

My original main issue was simply that "git tag --contains" and "git
branch --contains" was either (a) incorrect, or (b) slower than
popping up gitk and pulling the information out of the GUI.  The
reason for (b) is because of gitk.cache.

Maybe the answer then is creating a command-line tool (it doesn't have to
be in "core" of git) which just pulls the dammned information out of
gitk.cache....

(Yes, it's gross, but I'm not worrying about the long-term
architecture of git or anything high-falutin' like that.  I'm just a
poor dumb user who just wants git tag --contains and git branch
--contains to be fast and accurate...)

                                                - Ted
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Luck, Tony
In reply to this post by Linus Torvalds-3
On Fri, Jul 15, 2011 at 9:44 AM, Linus Torvalds
<[hidden email]> wrote:
> Btw, I do agree that we probably should add a warning for the case
> ("your clock is wrong - your commit date is before the commit date of
> your parents") and maybe require the use of "-f" or something to
> override it. That would certainly be a good thing quite independently
> of anything else. So regardless of generation counts, it's probably
> worth it.

What if my clock is wrong in the opposite direction - set to some time
out in 2025.
It would pass the check you propose and let the commit go in - but would
cause problems for everyone if that tree was pulled into upstream.

You'd also want a check in pull(merge) that none of the commits being
added were in the future (as defined by the time on your machine).

-Tony
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Fri, Jul 15, 2011 at 11:46 AM, Tony Luck <[hidden email]> wrote:
>
> What if my clock is wrong in the opposite direction - set to some time
> out in 2025.
> It would pass the check you propose and let the commit go in - but would
> cause problems for everyone if that tree was pulled into upstream.

I think Shawn suggested that we just notice it at merge time.

But yes, it's why (a) I'd suggest we have a "-f" to override and (b) I
do think that generation counts are a better idea. You could still
screw them up, but it would be due to an outright bug or malicious
behavior, rather than simple incompetence on the part of a user.

Incompetent users (where "date on the machine set to the wrong
century" is just _one_ sign of incompetence) are something git should
pretty much take for granted. It may not be the common case, but it's
certainly something we should design for and take into account.

In contrast, if somebody *wants* to screw his repository up by
re-writing objects with "git hash-object" etc, be my guest. We should
just make sure fsck catches anything serious.

So I would suggest checking the date regardless of any generation
count issues, because it would possibly find badly configured machines
that should be fixed. The same way we complain when we find no name.

Whether it should then be a correctness issue or not is kind of separate.

> You'd also want a check in pull(merge) that none of the commits being
> added were in the future (as defined by the time on your machine).

I don't think you need to care about "none of the commits", just
making sure the tip is reasonable. That would not only be expensive,
and not what we normally do (we show the diff against endpoints, not
all changes, etc). It would also cause problems for "fixed"
repositories (ie anything that has historical dates that are wrong,
but are ok now).

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Ted Ts'o
On Fri, Jul 15, 2011 at 11:42 AM, Ted Ts'o <[hidden email]> wrote:
>
> My original main issue was simply that "git tag --contains" and "git
> branch --contains" was either (a) incorrect, or (b) slower than
> popping up gitk and pulling the information out of the GUI.  The
> reason for (b) is because of gitk.cache.

With "original issue" I actually meant the case that caused us to add
the "slop" commit (7d004199d1). But I was too lazy to try to find the
archives from March 2008..

                  Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
In reply to this post by Linus Torvalds-3
On Fri, Jul 15, 2011 at 09:10:48AM -0700, Linus Torvalds wrote:

> I think it's much worse to have the same information in two different
> places where it can cause inconsistencies that are hard to see and may
> not be repeatable. If git ever finds the wrong merge base (because,
> say, the generation numbers are wrong), I want it to be a *repeatable*
> thing. I want to be able to repeat on the git mailing list "hey, guys,
> look at what happens when I try to merge commits ABC and XYZ". If you
> go "yeah, it works for me", then that is bad.

Having the information in two different places is my concern, too. And I
think the fundamental difference between putting it inside or outside
the commit sha1 (where outside encompasses putting it in a cache, in the
pack-index, or whatever), is that I see the commit sha1 as somehow more
"definitive". That is, it is the sole data we pass from repo to repo
during pushes and pulls, and it is the thing that is consistency-checked
by hashes.

So if there is an inconsistency between what the parent pointers
represent, and what the generation number in "outside" storage says,
then the outside storage is wrong, and the parent pointers are the right
answer. It becomes a lot more fuzzy to me if there is an inconsistency
between what the parent pointers represent, and what the generation
number says.

How should that situation be handled? Should fsck check for it and
complain? Should we just ignore it, even though it may cause our
traversal algorithms to be inaccurate? Like clock skew, there's not much
that can be done if the commits are published.

Those are serious questions that I think should be considered if we are
going to put a generation header into the commit object, and I haven't
seen answers for them yet.

> Partly for that reason, I do think that if the generation count was
> embedded in the pack-file, that would not be an "ugly" decision. The
> pack-files have definitely become "core git data structures", and are
> more than just a local filesystem representation of the objects:
> they're obviously also the data transport method, even if the rules
> there are slightly different (no index, thank god, and incomplete
> "thin" packs).
>
> That said, I don't think a generation count necessarily "fits" in the
> pack-file. They are designed to be incremental, so it's not very
> natural there. But I do think it would be conceptually prettier to
> have the "depth of commit" be part of the "filesystem" data than to
> have it as a separate ad-hoc cache.

Sure, I would be fine with that. When you say "packfile", do you mean
the the general concept, as in it could go in the pack index as opposed
to the packfile itself? Or specifically in the packfile? The latter
seems a lot more problematic to me in terms of implementation.

> > Those things rely on the idea that the git DAG is a data model that we
> > present to the user, but that we're allowed to do things behind the
> > scenes to make things faster.
>
> .. and that is relevant to this discussion exactly *how*?

Because keeping the generation information outside of the DAG keeps the
model we present to the user simple (and not just the user; the
information that we present to other programs), but lets git still use
the information without calculating it from scratch each time. Just like
we present the data as a DAG of loose objects via things like "git
cat-file", even though the underlying storage inside a packfile may be
very different. I just don't see those two ideas as fundamentally
different.

> It's not. It's totally irrelevant. I certainly would never walk away
> from the DAG model. It's a fundamental git decision, and it's the
> correct one.

Of course not. I never suggested we should.

> And that is what this discussion fundamentally boils down to for me.
>
> If we should have fixed it in the original specification, we damn well
> should fix it today. It's been "ignorable" because it's just not been
> important enough. But if git now adds a fundamental cache for them,
> then that information is clearly no longer "not important enough".

OK, so let's say we add generation headers to each commit. What happens
next? Are we going to convert algorithms that use timestamps to use
commit generations? How are we going to handle performance issues when
dealing with older parts of history that don't have generations?

Again, those are serious questions that need answered. I respect that
you think the lack of a generation header is a design decision that
should be corrected. As I said before, I'm not 100% sure I agree, but
nor do I completely disagree (and I think it largely boils down to a
philosophical distinction, which I think you will agree should take a
backseat to real, practical concerns). But it's not 2005, and we have a
ton of history without generation numbers. So adding them now is only
one piece of the puzzle.

What's your solution for the rest of it?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
On Fri, Jul 15, 2011 at 03:48:07PM -0400, Jeff King wrote:

> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?
>
> Again, those are serious questions that need answered. I respect that
> you think the lack of a generation header is a design decision that
> should be corrected. As I said before, I'm not 100% sure I agree, but
> nor do I completely disagree (and I think it largely boils down to a
> philosophical distinction, which I think you will agree should take a
> backseat to real, practical concerns). But it's not 2005, and we have a
> ton of history without generation numbers. So adding them now is only
> one piece of the puzzle.
>
> What's your solution for the rest of it?

I just read some of your later emails to others in the thread. It seems
like your answer is "assume the timestamp-based limiting is good enough
for old history".

I'm OK with that. It obviously falls down in a few specific situations,
but certainly has not been an unbearable problem for the past 5 years.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Jeff King
On Fri, Jul 15, 2011 at 12:48 PM, Jeff King <[hidden email]> wrote:
>
> Having the information in two different places is my concern, too. And I
> think the fundamental difference between putting it inside or outside
> the commit sha1 (where outside encompasses putting it in a cache, in the
> pack-index, or whatever), is that I see the commit sha1 as somehow more
> "definitive". That is, it is the sole data we pass from repo to repo
> during pushes and pulls, and it is the thing that is consistency-checked
> by hashes.

Sure. That is also the data that is the same for everybody.

That's a big deal, in the sense that it's the only thing we should
rely on if we want consistent behavior. Immediately if core
functionality starts using any other data, behavior becomes "local".
And I think that's really *really* dangerous.

Sure, we have "local behavior" in a lot of small details. We very much
intentionally have it in the ref-logs, and since branches and tags are
local we also have it in things like "--decorate", which obviously
depends on exactly which local refs you have.

We also have local behavior in things like .git/config etc files, so
git can behave very differently for different people even with what is
otherwise an identical repository.

So local behavior is good and expected for some things. We *want* it
for things like colorization decisions, we want it for aliases, and we
want it for branch naming.

But really core behavior shouldn't depend on local information. I
think it would be wrong if something like a merge base decision would
be based on any local information.

For example, should it matter whether something is packed or not? I
really don't think so. That's a pretty random implementation detail,
and if we get different end results because some commit happens to be
packed, vs not packed (because, say, we'd be hiding generation
information in the pack) that would be wrong.

Now, we do have things like merge resolution caches etc (which
obviously do save and use local information again), but I think that's
pretty well clarified.

> So if there is an inconsistency between what the parent pointers
> represent, and what the generation number in "outside" storage says,
> then the outside storage is wrong, and the parent pointers are the right
> answer. It becomes a lot more fuzzy to me if there is an inconsistency
> between what the parent pointers represent, and what the generation
> number says.

So I really don't see why you harp on that. If the generation counters
are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT.

That's a big issue.

Sure, they may be LYING, but that's a different thing entirely. They
will be lying to everybody consistently. There would never be any
question about what the generation number of a commit is.

See what I'm trying to say? There's no way that they would cause
different behavior for different people. Everything is 100%
consistent.

The exact same thing is true of commit dates, btw. They may be
confused as hell, and they may cause us to do bad things when we
traverse the history, but different clocks on different machines will
still not cause git to act differently on different machines. There's
no possibility of inconsistency.

(Of course, different *versions* of git may traverse the history
differently, since we've changed the heuristics over time. So we do
have that kind of inconsistent behavior, where we give different
results from different versions of git).

And btw, having "incorrect" data in the git objects is not the end of
the world. You can generate merge commits that simply have the wrong
parents. That will be confusing as hell to the user, and it will make
future merges not work very well, but it's a bug in the archive, and
that's "ok". The developers may not be very happy about it. In fact,
afaik we've had a few cases like that in the kernel tree, because
early git had bugs where it would not properly forget parents after a
failed merge. Most of them are ARM-related, because the ARM tree was
one of the first users of git (outside of me, but I had fewer issues
with what happens when things go wrong).

So I would not be *too* shocked if we'd end up with "odd" generation
counts due to some odd bug. It sounds unlikely, but my point is that
that is not at all what I'd *worry* about.

> How should that situation be handled? Should fsck check for it and
> complain? Should we just ignore it, even though it may cause our
> traversal algorithms to be inaccurate? Like clock skew, there's not much
> that can be done if the commits are published.

Right. I simply think it's not a big deal.

IOW, if we would rely on generation counts instead of clock dates,
maybe the generation counts would have occasional problems too, but I
suspect they'd be *much* rarer than time-based issues, because at
least the generation count is a well-defined number rather than a
random thing we pick out of emails and badly maintained machines.

That said, I'm not 100% sure at all that we want generation numbers at
all. Their use is pretty limited. If we had had them from the
beginning, I think we would simply have replaced the date-based commit
list sorting with a generation-number-based one, and it should have
been possible to guarantee that we never output a parent before the
commit in rev-parse.

As it is, I have to admit that looking at it, I shudder at changing
the current date-based logic and replacing it with a "date or
generation number".

The date-based one, despite all its fuzziness and not being very well
defined ("Global clock in a distributed system? You're a moron") and
up being a *nice* heuristic for certain human interaction. So it's not
a wonderful solution from a technical standpoint, but it does have (I
think) some nice UI advantages.

(For an example of that: using "--topo-sort" for revision history may
be a very good thing technically, but even if it wasn't for the fact
that it's more expensive, I think that our largely time-based default
order for "git log" in many ways is a better interface for humans. Of
course, when mixed with actually giving a history graph, that changes,
because then you want the "related" commits to group together, rather
than by time. So I think it's just basically a fuzzy area, without any
clear hard rules - which is probably why using that fuzzy timestamp
works so well in practice)

> Those are serious questions that I think should be considered if we are
> going to put a generation header into the commit object, and I haven't
> seen answers for them yet.

I do agree that the really *big* question is "do we even need it at
all". I do like perhaps just tightening the commit timestamp rules.
Because I do think they would probably work very well for the
"contains" problem too.

With the exact same fuzzy downsides, of course. Timestamps aren't
perfect, and they need that annoying fuzz factor thing.

>> That said, I don't think a generation count necessarily "fits" in the
>> pack-file. They are designed to be incremental, so it's not very
>> natural there. But I do think it would be conceptually prettier to
>> have the "depth of commit" be part of the "filesystem" data than to
>> have it as a separate ad-hoc cache.
>
> Sure, I would be fine with that. When you say "packfile", do you mean
> the the general concept, as in it could go in the pack index as opposed
> to the packfile itself? Or specifically in the packfile? The latter
> seems a lot more problematic to me in terms of implementation.

I was thinking the "general" issue - it might make most sense to put
them in the index.
>> If we should have fixed it in the original specification, we damn well
>> should fix it today. It's been "ignorable" because it's just not been
>> important enough. But if git now adds a fundamental cache for them,
>> then that information is clearly no longer "not important enough".
>
> OK, so let's say we add generation headers to each commit. What happens
> next? Are we going to convert algorithms that use timestamps to use
> commit generations? How are we going to handle performance issues when
> dealing with older parts of history that don't have generations?

So I do think the _initial_ question need to be the other way around:
do we have to have generation numbers at all?

I think it's likely a design misfeature not to have them, but
considering that we don't, and have been able to make do without for
so long, I'm also perfectly willing to believe that we could speed up
"contains" dramatically with the same kind of (crazy and inexact)
tricks we use for merge bases.

(Looking at a profile, a third - and the top entry - of the "git tag
--contains" profile cost is just in "clear_commit_marks()" - not doing
any real work, rather *undoing* the work in order to re-do things. So
it's entirely possible that the real issue is simply that
"in_merge_bases()" is badly done, and we could speed things up a lot
independently of anything else).

For example, for the "git tag --contains" thing, what's the
performance effect of just skipping tags that are much older than the
commit we ask for?

                Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Jeff King
On Fri, Jul 15, 2011 at 02:17:26PM -0700, Linus Torvalds wrote:

> > Having the information in two different places is my concern, too. And I
> > think the fundamental difference between putting it inside or outside
> > the commit sha1 (where outside encompasses putting it in a cache, in the
> > pack-index, or whatever), is that I see the commit sha1 as somehow more
> > "definitive". That is, it is the sole data we pass from repo to repo
> > during pushes and pulls, and it is the thing that is consistency-checked
> > by hashes.
>
> Sure. That is also the data that is the same for everybody.
>
> That's a big deal, in the sense that it's the only thing we should
> rely on if we want consistent behavior. Immediately if core
> functionality starts using any other data, behavior becomes "local".
> And I think that's really *really* dangerous.

Yes, I see your argument. I just don't think it's all that big a deal,
because the information is so easily derived from data that _is_ the
same for everybody (and when you _do_ want it to be different locally,
because you are grafting, that is easy to do).

But I think at this point we have both said all there is to say. There
is no actual data to be brought forth in this argument, and we obviously
disagree on this point. So I think we may have to agree to disagree.

And as I said before, I am willing to concede generation numbers in the
commit header. But we need the rest of the solution, too.

> So I really don't see why you harp on that. If the generation counters
> are in the objects THEY BY DEFINITION CANNOT BE INCONSISTENT.
>
> That's a big issue.
>
> Sure, they may be LYING, but that's a different thing entirely. They
> will be lying to everybody consistently. There would never be any
> question about what the generation number of a commit is.
>
> See what I'm trying to say? There's no way that they would cause
> different behavior for different people. Everything is 100%
> consistent.

Read my email again. I am clearly talking about inconsistency between
two data items in the sha1-checked DAG itself. You then proceed to yell
at me that they are not inconsistent, now talking about inconsistency
between different people with the same DAG, but different caches. In
other words, you are talking about an entirely different type of
inconsistency. And then you proceed to say that the generation numbers
may be lying, which is _exactly_ what I meant when I said inconsistency.

I don't mind arguing with you, even if I think you use capital letters
too frequently; but when you do use them, please take care that I really
am being a bonehead, and it is not you misrepresenting what I said.

As to lying (aka inconsistency between items within the DAG), you say:

> And btw, having "incorrect" data in the git objects is not the end of
> the world. You can generate merge commits that simply have the wrong
> parents. That will be confusing as hell to the user, and it will make
> future merges not work very well, but it's a bug in the archive, and
> that's "ok". The developers may not be very happy about it. In fact,
> afaik we've had a few cases like that in the kernel tree, because
> early git had bugs where it would not properly forget parents after a
> failed merge. Most of them are ARM-related, because the ARM tree was
> one of the first users of git (outside of me, but I had fewer issues
> with what happens when things go wrong).

No, it's not the end of the world. I just think it's worse than the
possibility of inconsistency between two users' idea of the graph,
because the bug stays with you for all of history, instead of getting
fixed with a new version of git.

> That said, I'm not 100% sure at all that we want generation numbers at
> all. Their use is pretty limited. If we had had them from the
> beginning, I think we would simply have replaced the date-based commit
> list sorting with a generation-number-based one, and it should have
> been possible to guarantee that we never output a parent before the
> commit in rev-parse.
>
> As it is, I have to admit that looking at it, I shudder at changing
> the current date-based logic and replacing it with a "date or
> generation number".
>
> The date-based one, despite all its fuzziness and not being very well
> defined ("Global clock in a distributed system? You're a moron") and
> up being a *nice* heuristic for certain human interaction. So it's not
> a wonderful solution from a technical standpoint, but it does have (I
> think) some nice UI advantages.

That is the conclusion I am coming to, also. I don't find the external
cache as odious as you obviously do. But that was why I posted the
patches with an RFC tag. I wanted to see how painful people found the
concept. But if it's too ugly a concept, I think the path of least
resistance is just making timestamps suck less (by using more consistent
and robust skew avoidance[1] in our various algorithms, and by perhaps
taking more care to notify the user of skew early, before commits are
published).

And then we don't really need generation numbers anymore.  As elegant as
they might have been if they were there from day one, it's just not
worth the hassle of maintaining the dual solution.

[1] We use "N slop commits" in some places and "allow 86400 seconds of
    skew" in other places. We should probably use both, and apply them
    consistently.

> > Those are serious questions that I think should be considered if we are
> > going to put a generation header into the commit object, and I haven't
> > seen answers for them yet.
>
> I do agree that the really *big* question is "do we even need it at
> all". I do like perhaps just tightening the commit timestamp rules.
> Because I do think they would probably work very well for the
> "contains" problem too.
>
> With the exact same fuzzy downsides, of course. Timestamps aren't
> perfect, and they need that annoying fuzz factor thing.

Yeah. But in practice, that fuzz is really easy to implement, has worked
pretty well so far, and doesn't actually hurt performance measurably,
because skew is rare, and a constant, small timestamp tends to equate to
a constant, small number of commits.

> > Sure, I would be fine with that. When you say "packfile", do you mean
> > the the general concept, as in it could go in the pack index as opposed
> > to the packfile itself? Or specifically in the packfile? The latter
> > seems a lot more problematic to me in terms of implementation.
>
> I was thinking the "general" issue - it might make most sense to put
> them in the index.

If we were to go the cache route, I think I am leaning that way, too, if
only because we don't duplicate the 20-byte sha1 per commit, which keeps
our I/O down.

> > OK, so let's say we add generation headers to each commit. What happens
> > next? Are we going to convert algorithms that use timestamps to use
> > commit generations? How are we going to handle performance issues when
> > dealing with older parts of history that don't have generations?
>
> So I do think the _initial_ question need to be the other way around:
> do we have to have generation numbers at all?

No, we don't need them. My "contains" patches were already implemented
using timestamps, and it's pretty fast. They fall down only in the face
lying timestamps (i.e., skew). The whole reason to switch to generation
headers was that we could assume they would be correct, and our
algorithms using them would be more likely to be correct.

And I do think a generation header would be more likely to be correct
than a timestamp, if only because timestamps are harder to get right.

> I think it's likely a design misfeature not to have them, but
> considering that we don't, and have been able to make do without for
> so long, I'm also perfectly willing to believe that we could speed up
> "contains" dramatically with the same kind of (crazy and inexact)
> tricks we use for merge bases.

Already done. I can point you to the patches if you want.

> For example, for the "git tag --contains" thing, what's the
> performance effect of just skipping tags that are much older than the
> commit we ask for?

It's as fast as using generations. See these two patches:

  http://article.gmane.org/gmane.comp.version-control.git/150261

  http://article.gmane.org/gmane.comp.version-control.git/150262

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
In reply to this post by Linus Torvalds-3
On Fri, Jul 15, 2011 at 2:17 PM, Linus Torvalds
<[hidden email]> wrote:
>
> For example, for the "git tag --contains" thing, what's the
> performance effect of just skipping tags that are much older than the
> commit we ask for?

Hmm.

Maybe there is something seriously wrong with this trivial patch, but
it gave the right results for the test-cases I threw at it, and passes
the tests.

Before:

   [torvalds@i5 linux]$ time git tag --contains v2.6.24 > correct

   real 0m7.548s
   user 0m7.344s
   sys 0m0.116s

After:

   [torvalds@i5 linux]$ time ~/git/git tag --contains v2.6.24 > date-cut-off

   real 0m0.161s
   user 0m0.140s
   sys 0m0.016s

and 'correct' and 'date-cut-off' both give the same answer.

The date-based "slop" thing is (at least *meant* to be - note the lack
of any extensive testing) "at least five consecutive commits that have
dates that are more than five days off".

Somebody should double-check my logic. Maybe I'm doing something
stupid. Because that's a *big* difference.

                     Linus

patch.diff (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
On Fri, Jul 15, 2011 at 4:10 PM, Linus Torvalds
<[hidden email]> wrote:
>
> Maybe there is something seriously wrong with this trivial patch, but
> it gave the right results for the test-cases I threw at it, and passes
> the tests.
>
> Before:

I have fewer branches than tags, but I get something similar for "git
branch --contains":

  [torvalds@i5 linux]$ time git branch --contains v2.6.12 | sha1sum
  9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

  real 0m4.205s
  user 0m4.112s
  sys 0m0.084s
  [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.12 | sha1sum
  9d4224eec98ec7b0bcd5331dfa5badb9ef1fd510  -

  real 0m0.112s
  user 0m0.100s
  sys 0m0.008s

ie identical results, except one took 4.2s and with the patch it took 0.1s.

This is all hot-cache, of course, and on a fast machine.

                    Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Git commit generation numbers

Linus Torvalds-3
And one last comment:

On Fri, Jul 15, 2011 at 4:16 PM, Linus Torvalds
<[hidden email]> wrote:
>
> I have fewer branches than tags, but I get something similar for "git
> branch --contains":

The time-based heuristic does seem to be important. If I just remove
it, I get increasingly long times for things that aren't contained in
my branches.

And in fact, I think that is why the code used the merge-base helper
functions - not because it wanted merge bases, but because the merge
base stuff will work from either end until it decides things aren't
relevant any more. Because *without* the time-based heuristics, the
trivial "is this a descendant" algorithm ends up working very badly
for the case where the target doesn't exist in the branches. Examples
of NOT having a date-based cut-off, but just doing the straightforward
(non-merge-base) ancestry walk:

  time ~/git/git branch --contains v2.6.12
  real 0m0.113s

  [torvalds@i5 linux]$ time ~/git/git branch --contains v2.6.39
  real 0m3.691s

and what ends up happening is that in the latter case, every branch
walks all the way to the root and checks every commit (walking all the
merges too). While in the first case, it's very quick because it will
find that particular commit when it walk straight backwards (so it
doesn't even have to do a lot of recursion - the first branch that
hits that commit will be a success), so it won't have to look at all
the side ways of getting there.

Of course, the above particular difference happens to be due to the
"depth-first" implementation working well for the thing I am searching
for.  But it does show that the date-based cut-off matters due to
traversal issues like that.

                          Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12345