Git performance results on a large repository

Re: Git performance results on a large repository

David Mohs
Joshua Redstone <joshua.redstone <at> fb.com> writes:

> To get a bit abstract for a moment, in an ideal world, it doesn't seem like
> performance constraints of a source-control-system should dictate how we
> choose to structure our code. Ideally, seems like we should be able to choose
> to structure our code in whatever way we feel maximizes developer
> productivity. If development and code/release management seem easier in a
> single repo, then why not make an SCM that can handle it? This is one reason
> I've been leaning towards figuring out an SCM approach that can work well with
> our current practices rather than changing them as a prerequisite for good SCM
> performance.

I certainly agree with this perspective---that our tools should support our
use cases and not the other way around. However, I'd like you to consider that
the size of this hypothetical repository might be giving you some useful
information on the health of the code it contains. You might consider creating
separate repositories simply to promote good modularization. It would involve
some up-front effort and certainly some pain, but this work itself might be
beneficial to your codebase without even considering the improved performance
of the version control system.

My concern here is that it may be extremely difficult to make a single piece
of software scale for a project that can grow arbitrarily large. You may add
some great performance improvements to git to then find that your bottleneck
is the filesystem. That would enlarge the scope of your work and would likely
make the project more difficult to manage.

If you are able to prove me wrong, the entire software community will benefit
from this work. However, before you embark upon a technical solution to your
problem, I would urge you to consider the possible benefits of a non-technical
solution, specifically restructuring your code and/or teams into more
independent modules. You might find benefits from this approach that extend
beyond source code control, which could make it the solution with the least
amount of overall risk.

Thanks for starting this valuable discussion.

-David


Re: Git performance results on a large repository

Joey Hess
In reply to this post by Duy Nguyen
Nguyen Thai Ngoc Duy wrote:
> The "interface to report which files have changed" is exactly "git
> update-index --[no-]assume-unchanged" is for. Have a look at the man
> page. Basically you can mark every file "unchanged" in the beginning
> and git won't bother lstat() them. What files you change, you have to
> explicitly run "git update-index --no-assume-unchanged" to tell git.
>
> Someone on HN suggested making assume-unchanged files read-only to
> avoid 90% of accidental changes to a file without telling git. When
> the assume-unchanged bit is cleared, the file is made read-write again.

That made me think about using assume-unchanged with git-annex since it
already has read-only files.

But, here's what seems a misfeature... If an assume-unstaged file has
modifications and I git add it, nothing happens. To stage a change, I
have to explicitly git update-index --no-assume-unchanged and only then
git add, and then I need to remember to reset the assume-unstaged bit
when I'm done working on that file for now. Compare with running git mv
on the same file, which does stage the move despite assume-unstaged. (So
does git rm.)
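
(For concreteness, the round-trip described above looks something like
this; foo.c is a hypothetical file name:)

  git update-index --assume-unchanged foo.c    # git stops lstat()ing it
  echo change >> foo.c
  git add foo.c                                # stages nothing (the misfeature above)
  git update-index --no-assume-unchanged foo.c
  git add foo.c                                # now the change is staged
  git update-index --assume-unchanged foo.c    # remember to re-set the bit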

--
see shy jo


Re: Git performance results on a large repository

Matt Graham
In reply to this post by Joshua Redstone
On Sat, Feb 4, 2012 at 18:05, Joshua Redstone <[hidden email]> wrote:
> [ wanted to reply to my initial msg, but wasn't subscribed to the list at time of mailing, so replying to most recent post instead ]
>
> Matt Graham:  I don't have file stats at the moment.  It's mostly code files, with a few larger data files here and there.    We also don't do sparse checkouts, primarily because most people use git (whether on top of SVN or not), which doesn't support it.


This doesn't help your original goal, but while you're still working
with git-svn, you can do sparse checkouts. Use --ignore-paths when you
do the original clone and it will filter out directories that are not
of interest.
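
(A sketch of that; the URL and the path regex here are made up:)

  git svn clone --ignore-paths='^(bigdata|thirdparty)/' \
      https://svn.example.com/repo/trunk mycheckout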

We used this at Etsy to keep git svn checkouts manageable when we
still had a gigantic svn repo.  You've repeatedly said you don't want
to reorganize your repos but you may find this writeup informative
about how Etsy migrated to git (which included a healthy amount of repo
manipulation).
http://codeascraft.etsy.com/2011/12/02/moving-from-svn-to-git-in-1000-easy-steps/

Re: Git performance results on a large repository

Joshua Redstone
Hi all,

Nguyen, thanks for pointing out the assume-unchanged part.  That, and
especially the suggestion of making assume-unchanged files read-only is
interesting.  It does require explicit specification of what's changed.
Hmm, I wonder if that could be a candidate API through which something
like  CoW file system could let git know what's changed.  Btw, I think you
asked earlier, but the index compresses from 158MB to 58MB - keep in mind
that the majority of file names in the repo are synthetic, so take that with a
big grain of salt.

Joey, it sounds like it might be good if git-mv and other commands were
consistent in how they treat the assume-unchanged bit.

David Mohs:  Yeah, it's an open question whether we'd be better off
somehow forcing the repos to split apart more.  As a practical matter,
what may happen is that we incrementally solve our problem by addressing
pain points as they come up (e.g., git status being slow).  One risk with
that approach is that it leads to overly short-term thinking and we get
stuck in a local minimum.  I totally agree that good modularization and
code health is valuable.  I think sometimes that getting to good
modularization does involve some technical work - like maybe moving
functionality between systems so they split apart better, having some
notion of versioning and dependency and managing that, and so forth.    I
suppose the other aspect to the problem is that we want to make sure we
have a good source-control story even if the modularization effort takes a
long time - we'd rather not end up in a race between long-term
modularization efforts and source-control performance going south too
fast.  I suppose this comes back to the desire that modularization not be
a prerequisite for good source-control performance.  Oh, and in case I
didn't mention it - we are working on modularization and splitting off
large chunks of code, both into separable libraries as well as into
separate services, but it's a long-term process.

Matt, some of our repos are still on SVN, many are on pure-git.  One of
the main ones that is on SVN is, at least at the moment, not amenable to
sparse checkouts because of its structure.

Tomas, yeah, I think one of the big questions is how little technical work
can we get away with, and where's the point of maximum leverage in terms
of how much engineering time we invest.

Greg,  'git commit' does some stat'ing of every file, even with all those
flags - for example, I think one case is that, just in case any
pre-commit hooks touched any files, it re-stats everything.  Regarding the
perf numbers, I ran it on a beefy linux box.  Have you tried doing your
measurements with the drop_caches trick to make sure the file cache is
totally cold?  Sorry for the dumb question, but how do I check the vnode
cache size?

David Lang and David Barr, I generated the pack files by doing a repack:
"git repack -a -d -f --max-pack-size=10g --depth=100 --window=250"  after
generating the repo.

One other update, the command I was running to get a histogram of all
files in the repo finally completed.  The histogram (counting file size in
bytes) is:

[       0.0 -        6.4): 3
[       6.4 -       41.3): 27
[      41.3 -      265.7): 6
[     265.7 -     1708.1): 652594
[    1708.1 -    10980.6): 673482
[   10980.6 -    70591.6): 19519
[   70591.6 -   453814.3): 1583
[  453814.3 -  2917451.4): 276
[ 2917451.4 - 18755519.0): 61
[18755519.0 - 120574242.0]: 4
n=1347555 mean=3697.917708, median=1770.000000, stddev=122940.890559

The smaller files are all text (code), and the large ones are probably
binary.

Cheers,
Josh




Re: Git performance results on a large repository

Greg Troxel

Joshua Redstone <[hidden email]> writes:

> Greg,  'git commit' does some stat'ing of every file, even with all those
> flags - for example, I think one case is that, just in case any
> pre-commit hooks touched any files, it re-stats everything.

That seems ripe for skipping.  If I understand correctly, what's being
committed is the index, not the working dir contents, so it would follow
that a pre-commit hook changing a file is a bug.

> Regarding the perf numbers, I ran it on a beefy linux box.  Have you
> tried doing your measurements with the drop_caches trick to make sure
> the file cache is totally cold?

On NetBSD, there should be a clear cache command for just this reason,
but I'm not sure there is.  So I did

  sysctl -w kern.maxvnodes=1000 # seemed to take a while
  ls -lR # wait for those to be faulted in
  sysctl -w kern.maxvnodes=500000

Then, git status on my repo churned the disk for a long time.

  real    2m7.121s
  user    0m3.086s
  sys     0m7.577s

and then again right away

  real    0m6.497s
  user    0m2.533s
  sys     0m3.010s

That repo has 217852 files (a real source tree with a few binaries, not
synthetic).

> Sorry for the dumb question, but how do I check the vnode cache size?

On BSD, sysctl kern.maxvnodes.  I would assume that on Linux there is
some max size for the vnode cache, and that a stat of a file in that
cache is faster than going to the filesystem (even if reading from
cached disk blocks).  But I really don't know how that works in Linux.
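
(For reference, the rough Linux equivalent of this cache-clearing trick
is the drop_caches knob mentioned earlier; the dentry/inode cache
counters are also visible under /proc:)

  sync
  echo 3 > /proc/sys/vm/drop_caches    # as root: drop page cache, dentries and inodes
  cat /proc/sys/fs/dentry-state        # dentry cache counters
  cat /proc/sys/fs/inode-nr            # in-core inodes: allocated and unused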

I was going to say that if your vnode cache isn't big enough, then the
warm run won't be so much faster than the cold run, but that's not true,
because the fs blocks will be in the block cache and it will still help.


Re: Git performance results on a large repository

Sam Vilain
In reply to this post by Joshua Redstone
 > Sam Vilain: Thanks for the pointer, I didn't realize that
 > fast-import was bi-directional.  I used it for generating the
 > synthetic repo.  Will look into using it the other way around.
 > Though that still won't speed up things like git-blame,
 > presumably?

It could, because blame is an operation which primarily works on
the source history with little reference to the working copy.  Of
course this will depend on the quality of the implementation
server-side.  Blame should suit distribution over a cluster, as
it is mostly involved with scanning candidate revisions for
string matches which is the compute intensive part.  Coming up
with candidate revisions has its own cost and can probably also
be distributed, but just working on the lowest loop level might
be a good place to start.

What it doesn't help with is local filesystem operations.  For
this I think a different approach is required: if you can tie
into fam or a similar inode change notification system, then you
should be able to avoid the entire recursive stat on 'git
status'.  I'm not sure --assume-unchanged on its own is a good
idea; you could easily miss things.  Those stats are useful.
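
(A minimal sketch of that idea using inotifywait, purely illustrative;
git itself has no such hook today:)

  # record which paths change under the worktree, so a wrapper around
  # 'git status' could restrict itself to just those paths
  inotifywait -m -r -e modify,create,delete,move --format '%w%f' . \
      >> .git/changed-paths &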

Making the index able to hold just changes to the checked-out
tree, as others have mentioned, would also save the massive reads
and writes you've identified.  Perhaps a more high performance
back-end could be developed.

 > The sparse-checkout issue you mention is a good one.

It's actually been on the table since at least GitTogether 2008;
there's been some design discussion on it and I think it's just
one of those features which doesn't have enough demand yet for it
to be built.  It keeps coming up but not from anyone with the
inclination or resources to make it happen.  There is a protocol
issue, but this should be able to fit into the current extension
system.

 > There is a good question of how to support quick checkout,
 > branch switching, clone, push and so forth.

Sure.  It will be much more network intensive as you are
replacing the part which normally has a very fast link through
the buffercache to pack files etc.  A hybrid approach is also
possible, where objects are fetched individually via fast-import
and cached in a local .git repo.  And I have a hunch that LZOP
compression of the stream may also be a win, but as with all of
these ideas, it would be done after profiling identifies it as a choke
point rather than just because it sounds good.

 > I'll look into the approaches you suggest.  One consideration
 > is coming up with a high-leverage approach - i.e. not doing
 > heavy dev work if we can avoid it.

Right.  You don't actually need to port the whole of git to Hadoop
initially; to begin with it can just pass through all commands to a
server-side git fast-import process.  When you find specific operations
which are slow, those operations can be implemented using a
Hadoop back-end, with the rest left to standard git.  If done using
a useful plug-in system, these systems could be accepted by the core
project as an enterprise scaling option.

This could let you get going with the knowledge that the scaling option
is there should the need arise.

 > On the other hand, it would be nice if we (including the entire
 > community:) ) improve git in areas that others that share
 > similar issues benefit from as well.

Like I say, a lot of people have run into this already...

HTH,
Sam

Re: Git performance results on a large repository

Duy Nguyen
In reply to this post by Sam Vilain
On Sat, Feb 4, 2012 at 5:40 AM, Sam Vilain <[hidden email]> wrote:
> There have also been designs at various times for sparse check-outs; ie
> check-outs where you don't check out the root of the repository but a
> sub-tree.

There is a sparse checkout feature in git (hopefully from one of the
designs you mentioned) and it can check out subtrees. The only problem
in this case is that it maintains the full index, so it only solves half
of the problem (the stat calls); reading/writing the large index still
slows everything down.
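
(For reference, that feature is enabled roughly like this; the path
pattern is just an example:)

  git config core.sparseCheckout true
  echo 'www/' >> .git/info/sparse-checkout
  git read-tree -mu HEAD    # re-populate the worktree; paths outside the patterns disappear
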
--
Duy

Re: Git performance results on a large repository

David Lang
In reply to this post by Joshua Redstone
On Mon, 6 Feb 2012, Joshua Redstone wrote:

> David Lang and David Barr, I generated the pack files by doing a repack:
> "git repack -a -d -f --max-pack-size=10g --depth=100 --window=250"  after
> generating the repo.

how many pack files does this end up creating?

I think that doing a full repack the way you did will group all revisions
of a given file into a pack.

while what I'm saying is that if you create the packs based on time,
rather than space efficiency of the resulting pack files, you may end up
not having to go through as much data when doing things like a git blame.

what you did was

initialize repo
4M commits
repack

what I'm saying is

initialize repo
loop
    500K commits
    repack (and set pack to .keep so it doesn't get overwritten)

so you will end up with ~8 sets of pack files, but time based so that when
you only need recent information you only look at the most recent pack
file. If you need to go back through all time, the multiple pack files
will be a little more expensive to process.

this has the added advantage that the 8 small repacks should be cheaper
than the one large repack as it isn't trying to cover all commits each
time.
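
(A rough shell sketch of that loop; the repack options mirror the
command quoted above:)

  # after each batch of ~500K commits:
  git repack -a -d --max-pack-size=10g --depth=100 --window=250
  for p in .git/objects/pack/pack-*.pack; do
      k=${p%.pack}.keep
      [ -e "$k" ] || touch "$k"    # a .keep file makes later repacks leave this pack alone
  done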

David Lang

Re: Git performance results on a large repository

Emanuele Zattin
In reply to this post by Joshua Redstone
Joshua Redstone <joshua.redstone <at> fb.com> writes:

>
> Hi Git folks,
>

Hello everybody!

I would just like to contribute a small set of blog posts
about this issue and a possible solution.
Sorry for the tone in which I wrote those posts,
but I think there are some valid points in there.

https://gist.github.com/1758346

BR,

Emanuele Zattin


Re: Git performance results on a large repository

Duy Nguyen
In reply to this post by Joey Hess
On Mon, Feb 6, 2012 at 10:40 PM, Joey Hess <[hidden email]> wrote:
>> Someone on HN suggested making assume-unchanged files read-only to
>> avoid 90% of accidental changes to a file without telling git. When
>> the assume-unchanged bit is cleared, the file is made read-write again.
>
> That made me think about using assume-unchanged with git-annex since it
> already has read-only files.
>
> But, here's what seems a misfeature...

because, well.. assume-unchanged was designed to avoid stat() and
nothing else. We are basing a new feature on top of it.

> If an assume-unstaged file has
> modifications and I git add it, nothing happens. To stage a change, I
> have to explicitly git update-index --no-assume-unchanged and only then
> git add, and then I need to remember to reset the assume-unstaged bit
> when I'm done working on that file for now. Compare with running git mv
> on the same file, which does stage the move despite assume-unstaged. (So
> does git rm.)

This is normal in the lock-based "checkout/edit/checkin" model. mv/rm
operates on directory content, which is not "locked - no edit allowed"
(in our case --assume-unchanged) in git. But the lock-based model does
not map really well to git anyway. That model does not have the index
(which may make things more complicated). Also, at the index level, git
does not really understand directories.

I think we could add a protection layer to the index, where any changes
(including removal) to an index entry are only allowed if the entry is
"unlocked" (i.e. no assume-unchanged bit). Locked entries are read-only
and have the assume-unchanged bit set. "git (un)lock" would be introduced
as new UI. Does that make assume-unchanged friendlier?
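
(Something close to that UI can be approximated today with a pair of
aliases; this sketch covers only the read-only-plus-bit part, not the
proposed index-level protection:)

  git config alias.lock \
      '!f() { chmod a-w -- "$@" && git update-index --assume-unchanged -- "$@"; }; f'
  git config alias.unlock \
      '!f() { git update-index --no-assume-unchanged -- "$@" && chmod u+w -- "$@"; }; f'

  git lock path/to/file.c      # file becomes read-only and git stops stat'ing it
  git unlock path/to/file.c    # writable again, changes visible to git add/status
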
--
Duy

Re: Git performance results on a large repository

Joshua Redstone
Hi Nguyen,
I like the notion of using --assume-unchanged to cut down the set of
things that git considers may have changed.
It seems to me that there may still be situations that require operations
on the order of the # of files in the repo and hence may still be slow.
Following is a list of potential candidates that occur to me.

1. Switching branches, especially if you switch to an old branch.
Sometimes I've seen branch switching take a long time even for a branch I
thought was close to where HEAD was.

2. Interactive rebase in which you reorder a few commits close to the tip
of the branch (I observed this taking a long time, but haven't profiled it
yet).  I include here other types of cherry-picking of commits.

3. Any working directory operations that fail part-way through and make
you want to do a 'git reset --hard' or at least a full 'git-status'.  That
is, when you have reason to believe that files marked 'assume-unchanged' may
have accidentally changed.

4. Operations that require rewriting the index - I think git-add is one?

If the working-tree representation is the full set of all files
materialized on disk and it's the same as the representation of files
changed, then I'm not sure how to avoid some of these without playing file
system games or using wrapper scripts.

What do you (or others) think?


Josh



Re: Git performance results on a large repository

Duy Nguyen
On Fri, Feb 10, 2012 at 4:06 AM, Joshua Redstone <[hidden email]> wrote:

> Hi Nguyen,
> I like the notion of using --assume-unchanged to cut down the set of
> things that git considers may have changed.
> It seems to me that there may still be situations that require operations
> on the order of the # of files in the repo and hence may still be slow.
> Following is a list of potential candidates that occur to me.
>
> 1. Switching branches, especially if you switch to an old branch.
> Sometimes I've seen branch switching take a long time even for a branch I
> thought was close to where HEAD was.
>
> 2. Interactive rebase in which you reorder a few commits close to the tip
> of the branch (I observed this taking a long time, but haven't profiled it
> yet).  I include here other types of cherry-picking of commits.
>
> 3. Any working directory operations that fail part-way through and make
> you want to do a 'git reset --hard' or at least a full 'git-status'.  That
> is, when you have reason to believe that files marked 'assume-unchanged' may
> have accidentally changed.

All these involve unpack_trees(), which is a full-tree operation. The
bigger your worktree is, the slower it is. Another good reason to
split unrelated parts into separate repositories.


--
Duy

Re: Git performance results on a large repository

Christian Couder-2
Hi,

On Fri, Feb 10, 2012 at 8:12 AM, Nguyen Thai Ngoc Duy <[hidden email]> wrote:
>
> All these involve unpack_trees(), which is a full-tree operation. The
> bigger your worktree is, the slower it is. Another good reason to
> split unrelated parts into separate repositories.

Maybe having different "views" would be enough to make a smaller
worktree and history, so that things are much faster for a developer?

(I already suggested "views" based on "git replace" in this thread:
http://thread.gmane.org/gmane.comp.version-control.git/177146/focus=177639)
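
(One ingredient of such a view might be cutting off old history with
git replace; this is only a sketch, not necessarily what the linked
proposal describes, using an arbitrary cut point 1000 commits back:)

  cut=$(git rev-parse HEAD~1000)
  # make a parentless copy of that commit and show it in place of the original
  # (naive: assumes no commit-message line starts with "parent ")
  new=$(git cat-file commit $cut | sed '/^parent /d' |
        git hash-object -t commit -w --stdin)
  git replace $cut $new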

Best regards,
Christian.

Re: Git performance results on a large repository

Duy Nguyen
On Fri, Feb 10, 2012 at 4:39 PM, Christian Couder
<[hidden email]> wrote:

> Hi,
>
> On Fri, Feb 10, 2012 at 8:12 AM, Nguyen Thai Ngoc Duy <[hidden email]> wrote:
>>
>> All these involve unpack_trees(), which is a full-tree operation. The
>> bigger your worktree is, the slower it is. Another good reason to
>> split unrelated parts into separate repositories.
>
> Maybe having different "views" would be enough to make a smaller
> worktree and history, so that things are much faster for a developper?
>
> (I already suggested "views" based on "git replace" in this thread:
> http://thread.gmane.org/gmane.comp.version-control.git/177146/focus=177639)

That's more or less what I did with the subtree clone series [1] and
ended up doing narrow clone [2]. The only difference between the two
is how they handle the partial worktree/index. The former uses git-replace
to seal any holes; the latter tackles it at the pathspec level and is
generally more elegant.

The worktree part from that work should be usable in full clone too. I
am reviving the series and going to repost it soon. Have a look [3] if
you are interested.

[1] http://thread.gmane.org/gmane.comp.version-control.git/152347
[2] http://thread.gmane.org/gmane.comp.version-control.git/155427
[3] https://github.com/pclouds/git/commits/narrow-clone
--
Duy