Hi Git folks,
We (Facebook) have been investigating source control systems to meet our growing needs. We already use git fairly widely, but have noticed it getting slower as we grow, and we want to make sure we have a good story going forward. We're debating how to proceed and would like to solicit people's thoughts.

To better understand git scalability, I've built up a large, synthetic repository and measured a few git operations on it. I summarize the results here.

The test repo has 4 million commits, linear history and about 1.3 million files. The size of the .git directory is about 15GB, and it has been repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100 --window=250'. This repack took about 2 days on a beefy machine (i.e., lots of RAM and flash). The size of the index file is 191 MB. I can share the script that generated it if people are interested - it basically picks 2-5 files, modifies a line or two, adds a few lines at the end consisting of random dictionary words, occasionally creates a new file, commits all the modifications and repeats.

I timed a few common operations with both a warm OS file cache and a cold cache, i.e., I did an 'echo 3 | tee /proc/sys/vm/drop_caches' and then did the operation in question a few times (the first timing is the cold timing, the next few are the warm timings). The following results are on a server with an average hard drive (i.e., not flash) and > 10GB of RAM.

'git status': 39 minutes cold, and 24 seconds warm.

'git blame': 44 minutes cold, 11 minutes warm.

'git add' (appending a few chars to the end of a file and adding it): 7 seconds cold and 5 seconds warm.

'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet --no-status': 41 minutes cold, 20 seconds warm. I also hacked a version of git to remove the three or four places where 'git commit' stats every file in the repo, and this dropped the times to 30 minutes cold and 8 seconds warm.

The git performance we observed here is too slow for our needs. So the question becomes: if we want to keep using git going forward, what's the best way to improve performance? It seems clear we'll probably need some specialized servers (e.g., to perform git-blame quickly) and maybe specialized file system integration to detect what files have changed in a working tree.

One way to get there is to make some deep modifications to git internals - for example, to create abstractions and interfaces that allow plugging in the specialized servers. Another way is to leave git internals as they are and develop a layer of wrapper scripts around all the git commands that do the necessary interfacing. The wrapper scripts seem perhaps easier in the short term, but may lead to increasing divergence from how git behaves natively, and also add a layer of complexity.

Thoughts?

Cheers,
Josh
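For concreteness, the cold/warm measurement procedure described above amounts to roughly the following (a sketch; the number of warm repetitions isn't specified, and dropping the caches needs root):

    # flush the OS page cache, then time the same operation repeatedly
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    time git status    # first run: cold-cache timing
    time git status    # later runs: warm-cache timings
    time git status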
On Fri, Feb 3, 2012 at 15:20, Joshua Redstone <[hidden email]> wrote:
> We (Facebook) have been investigating source control systems to meet our
> growing needs. We already use git fairly widely, but have noticed it
> getting slower as we grow, and we want to make sure we have a good story
> going forward. We're debating how to proceed and would like to solicit
> people's thoughts.

Where I work we also have a relatively large Git repository. Around 30k files, a couple of hundred thousand commits, clone size around half a GB.

You haven't supplied background info on this, but it really seems to me like your testcase is converting something like a humongous Perforce repository directly to Git.

While you /can/ do this, it's not a good idea. You should split up repositories at the boundaries where code or data doesn't directly cross over; e.g., there's no reason why you need HipHop PHP in the same repository as Cassandra or the Facebook chat system, is there?

While Git could do better with large repositories (in particular, applying commits in interactive rebase seems to slow down on bigger repositories), there's only so much you can do about stat-ing 1.3 million files.

A structure that would make more sense would be to split up that giant repository into a lot of other repositories. Most of them probably have no direct dependencies on other components, but even those that do can sometimes just use some other repository as a submodule.

Even if you have the requirement that you'd like to roll out *everything* at a certain point in time, you can still solve that with a super-repository that has all the other ones as submodules, and creates a tag for every rollout or something like that.
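A minimal sketch of that super-repository arrangement (the repository names and URLs below are made up):

    # a super-repo whose only contents are submodule pointers, one per component
    git init deploy-super && cd deploy-super
    git submodule add git://example.com/hiphop.git hiphop
    git submodule add git://example.com/chat.git chat
    git commit -m "pin component revisions for rollout"
    git tag rollout-2012-02-03    # one tag per rollout

Each rollout then just updates the submodule pointers and adds another tag.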
Hi Ævar,
Thanks for the comments. I've included a bunch more info on the test repo below. It is based on a growth model of two of our current repositories (i.e., it's not a Perforce import). We already have some of the easily separable projects in separate repositories, like HPHP. If we could split our largest repos into multiple ones, that would help the scaling issue. However, the code in those repos is rather interdependent and we believe it'd hurt more than help to split it up, at least for the medium-term future. We derive a fair amount of benefit from the code sharing and keeping things together in a single repo, so it's not clear when it'd make sense to get more aggressive about splitting things up.

Some more information on the test repository: the working directory is 9.5 GB and the median file size is 2 KB. The average depth of a directory (counting the number of '/'s) is 3.6 levels and the average depth of a file is 4.6. More detailed histograms of the repository composition are below:

------------------------

Histogram of depth of every directory in the repo
(dirs=`find . -type d` ; (for dir in $dirs; do t=${dir//[^\/]/}; echo ${#t} ; done) | ~/tmp/histo.py)
* The .git directory itself has only 161 files, so although included, it doesn't affect the numbers significantly.

[0.0 - 1.3): 271
[1.3 - 2.6): 9966
[2.6 - 3.9): 56595
[3.9 - 5.2): 230239
[5.2 - 6.5): 67394
[6.5 - 7.8): 22868
[7.8 - 9.1): 6568
[9.1 - 10.4): 420
[10.4 - 11.7): 45
[11.7 - 13.0]: 21
n=394387 mean=4.671830, median=5.000000, stddev=1.272658

Histogram of depth of every file in the repo
(files=`git ls-files` ; (for file in $files; do t=${file//[^\/]/}; echo ${#t} ; done) | ~/tmp/histo.py)
* 'git ls-files' does not prefix entries with ./ the way the 'find' command above does, hence why the average appears to be about the same as the directory stats.

[0.0 - 1.3]: 1274
[1.3 - 2.6]: 35353
[2.6 - 3.9]: 196747
[3.9 - 5.2]: 786647
[5.2 - 6.5]: 225913
[6.5 - 7.8]: 77667
[7.8 - 9.1]: 22130
[9.1 - 10.4]: 1599
[10.4 - 11.7]: 164
[11.7 - 13.0]: 118
n=1347612 mean=4.655750, median=5.000000, stddev=1.278399

Histogram of file sizes (for the first 50k files - this command takes a while):
files=`git ls-files` ; (for file in $files; do stat -c%s $file ; done) | ~/tmp/histo.py

[       0.0 -       4.7): 0
[       4.7 -      22.5): 2
[      22.5 -     106.8): 0
[     106.8 -     506.8): 0
[     506.8 -    2404.7): 31142
[    2404.7 -   11409.9): 17837
[   11409.9 -   54137.1): 942
[   54137.1 -  256866.9): 53
[  256866.9 - 1218769.7): 18
[ 1218769.7 - 5782760.0]: 5
n=49999 mean=3590.953239, median=1772.000000, stddev=42835.330259

Cheers,
Josh

On 2/3/12 9:56 AM, "Ævar Arnfjörð Bjarmason" <[hidden email]> wrote:

> [snip]
Joshua,
You have an interesting use case. If I were you, I'd consider investigating the git fast-import protocol. It has become bi-directional, and is essentially socket access to a git repository with read and transactional update capability. With a few more commands implemented, it may even be capable of providing all the functionality required for command-line git use.

It is already possible for the ".git" directory to be a file: this case is used for submodules in git 1.7.8 and higher. For this use case, there would be an extra field in the ".git" file which is created. It would indicate the hostname (and port) to connect its internal 'fast-import' stream to. 'clone' would consist of creating this file, and then getting the server to stream the objects from its pack to the client.

With the hard-working part of git on the other end of a network service, you could back it with a re-implementation of git which is written to be distributed in Hadoop. There are at least two similar implementations of git that are like this: one for Cassandra which was written by GitHub as a research project, and Google's implementation on top of their BigTable/GFS/whatever. As the git object storage model is write-only and content-addressed, it should git this kind of scaling well.

There have also been designs at various times for sparse check-outs; i.e., check-outs where you don't check out the root of the repository but a sub-tree. With both of these features, clients could easily check out a small part of the repository very quickly. This is probably the only case where SVN still does better than git, which is a particular blocker for use cases like repositories with large binaries in them and for projects such as the one you have (another one with a similar problem was KDE, where their projects moved around the repository a lot, and refactoring touched many projects simultaneously at times).

It's a large undertaking, alright.

Sam, just another git community propeller-head.

On 2/3/12 9:00 AM, Joshua Redstone wrote:

> [snip]
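For reference, both mechanisms Sam mentions already have small on-disk forms today; a sketch (the paths are illustrative):

    # a ".git" that is a file rather than a directory just points elsewhere
    cat some/submodule/.git
    # gitdir: ../../.git/modules/some/submodule

    # sparse checkout (git 1.7.0 and later): populate only a sub-tree
    git config core.sparseCheckout true
    echo "www/" >> .git/info/sparse-checkout
    git read-tree -mu HEAD    # re-apply the checkout with the new filter

Note that sparse checkout only limits the working tree; the full history and index are still local, which is why the remote 'fast-import' backend idea goes further.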
On 2/3/12 2:40 PM, Sam Vilain wrote:
> As the git object storage model is write-only and content-addressed,
> it should git this kind of scaling well.
            ^^^
Could have sworn I typed 'suit' there. My fingers have auto-correct ;-)

Sam
In reply to this post by Joshua Redstone
Hi Josh,
On Fri, Feb 3, 2012 at 17:00, Joshua Redstone <[hidden email]> wrote:
> [snip]
>
> Some more information on the test repository: the working directory is
> 9.5 GB and the median file size is 2 KB. The average depth of a directory
> (counting the number of '/'s) is 3.6 levels and the average depth of a
> file is 4.6. More detailed histograms of the repository composition are
> below:

Do you have a histogram of the types of files in the repo?

And as suggested earlier, is svn working for you now because it allows sparse checkout? I imagine the stats for svn on the full repo would be comparable to or worse than what you measured with git?
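In case it's useful, a quick way to get a rough file-type histogram (a sketch, grouping tracked files by extension):

    # count tracked files by extension (directory names stripped first)
    git ls-files | sed 's!.*/!!' | awk -F. 'NF>1 {print $NF; next} {print "(no extension)"}' | sort | uniq -c | sort -rn | head -20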
In reply to this post by Joshua Redstone
On Fri, Feb 3, 2012 at 6:20 AM, Joshua Redstone <[hidden email]> wrote:
> [snip]
>
> The git performance we observed here is too slow for our needs. So the
> question becomes: if we want to keep using git going forward, what's the
> best way to improve performance? It seems clear we'll probably need some
> specialized servers (e.g., to perform git-blame quickly) and maybe
> specialized file system integration to detect what files have changed in a
> working tree.

Have you considered upgrading all of engineering to SSDs? 200+GB SSDs are under $400 USD nowadays.

-clee
In reply to this post by Joshua Redstone
> The test repo has 4 million commits, linear history and about 1.3 million
> files. The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'. This repack took about 2 days on a beefy machine (i.e.,
> lots of RAM and flash). The size of the index file is 191 MB. I can share

Are you willing to give up all or part of the history in your working repository? I've heard of larger projects starting from scratch (i.e., copying all of the files into a brand new repo). You can keep your old repo around for archival purposes.

Also, how much of your repo is code versus static assets? You could move all of your static assets (images, css, maybe some js?) into another repo, and then merge the two repos together at build time if you absolutely need them deployed together.

Here are a couple of strategies for doing a partial truncate:

http://stackoverflow.com/questions/4515580/how-do-i-remove-the-old-history-from-a-git-repository
http://bogdan.org.ua/2011/03/28/how-to-truncate-git-history-sample-script-included.html

-Zeki
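One common recipe for that kind of truncation, sketched from memory rather than taken from the links above (<cutoff> stands for the commit that becomes the new root; this rewrites history, so try it on a throwaway clone):

    # graft <cutoff> as a parentless commit, then make the graft permanent
    echo $(git rev-parse <cutoff>) > .git/info/grafts
    git filter-branch -- --all
    rm .git/info/grafts
    # delete filter-branch's backup refs, then drop the unreachable old objects
    git for-each-ref --format='%(refname)' refs/original | xargs -n 1 git update-ref -d
    git gc --prune=now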
In reply to this post by Ævar Arnfjörð Bjarmason
On Feb 3, 2012, at 9:56 AM, Ævar Arnfjörð Bjarmason wrote:

> [snip]

I concur. I'm working at a company with many years of development history in several huge CVS repos, and we are slowly but surely migrating the codebase from CVS to Git.

Split the things up. This will allow you to reorganize things better, and there are IMHO no downsides. As for rollout - I think this job should be given to a build/release system that has the ability to gather the necessary code from different repos and tag it properly.

Just my 2 cents.

Thanks,
Eugene
In reply to this post by Joshua Redstone
Joshua Redstone wrote:
> The test repo has 4 million commits, linear history and about 1.3 million
> files.

Have you tried separating these two factors, to see how badly each is affecting performance? If the number of commits is the problem (seems likely for git blame at least), a shallow clone would avoid that overhead.

I think that git often writes .git/index inefficiently when staging files (though your `git add` is pretty fast) and committing. It rewrites the whole file to .git/index.lock and then renames it over .git/index at the end. I have code that keeps a journal of changes to avoid rewriting the index repeatedly, but it's application specific. Fixing git to write the index more intelligently is something I'd like to see.

Hint for git status: `git status .` in a smaller subdirectory will be much faster than the default that stats everything.

--
see shy jo
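A quick way to test the number-of-commits factor on its own (a sketch; the depth value and paths are placeholders):

    # clone only the most recent 1000 commits' worth of history
    git clone --depth 1000 file:///path/to/big-repo shallow-copy
    cd shallow-copy
    time git blame some/file    # compare against the full-history timing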
In reply to this post by Joshua Redstone
On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <[hidden email]> wrote:
> I timed a few common operations with both a warm OS file cache and a cold
> cache, i.e., I did an 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings). The following results are on a server
> with average hard drive (i.e., not flash) and > 10GB of RAM.
>
> 'git status': 39 minutes cold, and 24 seconds warm.
>
> 'git blame': 44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it): 7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status': 41 minutes cold, 20 seconds warm. I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

Have you tried "git update-index --assume-unchanged"? That should reduce mass lstat() and hopefully improve the above numbers. The interface is not exactly easy to use, but if it has significant gain, then we can try to improve the UI.

On the index size issue, ideally we should make minimal writes to the index instead of rewriting the 191 MB index. An improvement we could do now is to compress it, reducing disk footprint and thus disk I/O. If you compress the index with gzip, how big is it?
--
Duy
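A sketch of both experiments:

    # mark every tracked file assume-unchanged so git skips lstat()ing them
    git ls-files -z | xargs -0 git update-index --assume-unchanged
    time git status

    # see how well the 191 MB index compresses
    gzip -c .git/index | wc -c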
In reply to this post by Joshua Redstone
Joshua Redstone <joshua.redstone <at> fb.com> writes:
> The git performance we observed here is too slow for our needs. So the
> question becomes: if we want to keep using git going forward, what's the
> best way to improve performance? It seems clear we'll probably need some
> specialized servers (e.g., to perform git-blame quickly) and maybe
> specialized file system integration to detect what files have changed in a
> working tree.

Hi Joshua,

sounds like you have everything in a single .git. Split up the massive repository into separate, smaller .git repositories.

For example, the Android code base is quite big. They use the repo tool to manage a number of separate .git repositories as one big aggregate "repository".

Cheers,
Slinky
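For reference, the aggregate workflow with repo looks roughly like this (the manifest URL is made up):

    # fetch a manifest that lists many git repositories, then
    # check them all out side by side as one big source tree
    repo init -u https://example.com/manifests/platform.git -b master
    repo sync -j8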
In reply to this post by Duy Nguyen
[ wanted to reply to my initial msg, but wasn't subscribed to the list at time of mailing, so replying to most recent post instead ]
Thanks to everyone for the questions and suggestions. I'll try to respond here.

One high-level clarification - this synthetic repo for which I've reported perf times is representative of where we think we'll be in the future. Git is slow but marginally acceptable for today. We want to start planning now for any big changes we need to make going forward.

Evgeny Sazhin, Slinky and Ævar Arnfjörð Bjarmason suggested splitting up the repo into multiple, smaller repos. I indicated before that we have a lot of cross-dependencies. Our largest repo by number of files and commits is the repo containing the front-end server. It is a large code base in which the tight integration of various components results in many of the cross-dependencies. We are working slowly to split things up more, for example into services, but that is a long-term process.

To get a bit abstract for a moment: in an ideal world, it doesn't seem like the performance constraints of a source-control system should dictate how we choose to structure our code. Ideally, it seems like we should be able to structure our code in whatever way we feel maximizes developer productivity. If development and code/release management seem easier in a single repo, then why not make an SCM that can handle it? This is one reason I've been leaning towards figuring out an SCM approach that can work well with our current practices, rather than changing them as a prerequisite for good SCM performance.

Sam Vilain: Thanks for the pointer, I didn't realize that fast-import was bi-directional. I used it for generating the synthetic repo. Will look into using it the other way around, though that still won't speed up things like git-blame, presumably? The sparse-checkout issue you mention is a good one. There is a good question of how to support quick checkout, branch switching, clone, push and so forth. I'll look into the approaches you suggest. One consideration is coming up with a high-leverage approach - i.e., not doing heavy dev work if we can avoid it. On the other hand, it would be nice if we (including the entire community :) ) improved git in areas that others who share similar issues would benefit from as well.

Matt Graham: I don't have file stats at the moment. It's mostly code files, with a few larger data files here and there. We also don't do sparse checkouts, primarily because most people use git (whether on top of SVN or not), which doesn't support it.

Chris Lee: When I was building up the repo (e.g., doing lots of commits, before I started using fast-import), I noticed that flash was not much faster - stat'ing the whole repo takes a lot of kernel time, even with flash. My hunch is that we'd see similar issues with other operations, like git-blame.

Zeki Mokhtarzada: Dumping history I think would speed up operations for which we don't care about old history, like git-blame where we only want to see recent modifications. We'd also need a good story for other kinds of operations. In my mental model of git scalability, I categorize git structures into three kinds: those for reasoning about history, those for the index, and those for the working directory (yeah, I know these don't map precisely to actual on-disk things like the object store, including trees, etc.).
One scaling approach we've been thinking of is to focus on each individually: develop a specialized thing to handle history commands efficiently (git-blame, git-log, git-diff, etc.), something to speed up or bypass the index, and something to make large changes to the working directory quickly.

Joey Hess: Separating the factors is a good suggestion. My hunch is that the various git operations test the performance issues in isolation. For example, git-status performance depends just on the number of files, not on the depth of history. On the other hand, my guess is that git-blame performance is more a function of the length of history rather than the number of files. Though, certainly with compression and indexing in pack files, I could imagine there being cross-effects between length of history and number of files. The git-status suggestion definitely helps when you know which directory you are concerned about. Often I'm lazy and stat the repo root, so I trade off slowness for being more sure I'm not missing anything.

@Joey, I think you're also touching on a good meta point, which is that there's probably no silver bullet here. If we want git to efficiently handle repos that are large across a number of dimensions (size, # commits, # files, etc.), there are multiple parts of git that would need enhancement of some form.

Nguyen Thai Ngoc Duy: At which point in the test flow should I insert git-update-index? I'm happy to try it out. Will compress the index when I next get to a terminal. My guess is it'll compress a bunch. It's also conceivable that, if there were an external interface in git to attach other systems to efficiently report which files have changed (e.g., via file-system integration), we could omit managing the index in many cases. I know that would be a big change, but the benefits are intriguing.

Cheers,
Josh
In reply to this post by Duy Nguyen
One more follow-on thought. I imagine that most consumers of git are nowhere near the scale of the test repo that I described. They may still benefit from efforts to improve git support for large repos. A few possible reasons:
1. The performance improvements should speed things up for smaller repos as well.

2. They may find their repos growing to a 'large scale' at some point in the future.

3. Any code cleanup as part of an effort to support git scalability is good for code base health and, e.g., would facilitate future modifications that may more directly affect them.

Cheers,
Josh
In reply to this post by Joshua Redstone
Joshua Redstone <[hidden email]> writes:

> The test repo has 4 million commits, linear history and about 1.3 million
> files. The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'. This repack took about 2 days on a beefy machine (i.e.,
> lots of RAM and flash). The size of the index file is 191 MB. I can share
> the script that generated it if people are interested - it basically picks
> 2-5 files, modifies a line or two and adds a few lines at the end
> consisting of random dictionary words, occasionally creates a new file,
> commits all the modifications and repeats.

I have a repository with about 500K files, a 3.3G checkout, a 1.5G .git, and about 10K commits. (This is a real repository, not a test case.) So not as many commits by a lot, but the size seems not so far off.

> I timed a few common operations with both a warm OS file cache and a cold
> cache, i.e., I did an 'echo 3 | tee /proc/sys/vm/drop_caches' and then did
> the operation in question a few times (first timing is the cold timing,
> the next few are the warm timings). The following results are on a server
> with average hard drive (i.e., not flash) and > 10GB of RAM.
>
> 'git status': 39 minutes cold, and 24 seconds warm.

Both of these numbers surprise me. I'm using NetBSD, whose stat implementation isn't as optimized as Linux's (you didn't say, but I'm assuming you're on Linux). On a years-old desktop, git status seems to be about a minute semi-cold and 5s warm (once I set the vnode cache to over 500K, vs. the 350K default for a 2G RAM machine).

So on the warm status, I wonder how big your vnode cache is and whether you've exceeded it, and I don't follow the cold time at all. Probably some sort of profiling within git status would be illuminating.

> 'git blame': 44 minutes cold, 11 minutes warm.
>
> 'git add' (appending a few chars to the end of a file and adding it): 7
> seconds cold and 5 seconds warm.
>
> 'git commit -m "foo bar3" --no-verify --untracked-files=no --quiet
> --no-status': 41 minutes cold, 20 seconds warm. I also hacked a version
> of git to remove the three or four places where 'git commit' stats every
> file in the repo, and this dropped the times to 30 minutes cold and 8
> seconds warm.

> One way to get there is to do some deep code modifications to git
> internals, to, for example, create some abstractions and interfaces that
> allow plugging in the specialized servers. Another way is to leave git
> internals as they are and develop a layer of wrapper scripts around all
> the git commands that do the necessary interfacing. The wrapper scripts
> seem perhaps easier in the short-term, but may lead to increasing
> divergence from how git behaves natively and also a layer of complexity.

Having hooks for a blame server cache, etc. sounds sensible. Having a way to call blame sort of like with --since, and then keep updating it (e.g., in emacs) to earlier times, sounds useful.
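For anyone wanting to reproduce that tuning, the NetBSD vnode cache limit is a sysctl (a sketch; the value is just the one mentioned above):

    # inspect and raise the kernel's vnode cache limit
    sysctl kern.maxvnodes
    sysctl -w kern.maxvnodes=500000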
In reply to this post by Joshua Redstone
On Sun, Feb 5, 2012 at 1:05 AM, Joshua Redstone <[hidden email]> wrote:
> It's also conceivable that, if there were an external interface in git
> to attach other systems to efficiently report which files have changed
> (e.g., via file-system integration), we could omit managing the index
> in many cases. I know that would be a big change, but the benefits are
> intriguing.

The "interface to report which files have changed" is exactly what "git update-index --[no-]assume-unchanged" is for. Have a look at the man page. Basically you can mark every file "unchanged" in the beginning and git won't bother to lstat() them. Whatever files you do change, you have to explicitly run "git update-index --no-assume-unchanged" to tell git about.

Someone on HN suggested making assume-unchanged files read-only, to avoid 90% of the cases of accidentally changing a file without telling git. When the assume-unchanged bit is cleared, the file is made read-write again.
--
Duy
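A sketch of that workflow for editing one file under the assume-unchanged regime (the read-only trick is the HN suggestion, not something git does for you):

    # start editing: let git stat this file again and make it writable
    git update-index --no-assume-unchanged path/to/file
    chmod u+w path/to/file

    # ... edit, git add, git commit as usual ...

    # done editing: freeze it again
    git update-index --assume-unchanged path/to/file
    chmod a-w path/to/file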
In reply to this post by Joshua Redstone
On Fri, 3 Feb 2012, Joshua Redstone wrote:
> The test repo has 4 million commits, linear history and about 1.3 million
> files. The size of the .git directory is about 15GB, and has been
> repacked with 'git repack -a -d -f --max-pack-size=10g --depth=100
> --window=250'. This repack took about 2 days on a beefy machine (i.e.,
> lots of RAM and flash). The size of the index file is 191 MB.

This may be a silly thought, but what if, instead of one pack file of your entire history (4 million commits), you create multiple packs (say, every half million commits) and mark all but the most recent pack as .keep (so that they won't be modified by a repack)?

That way things that only need to worry about recent history (blame, etc.) will probably never have to go past the most recent pack file or two.

I may be wrong, but I think that when git is looking for 'similar files' for delta compression, it limits its search to the current pack, so this will also keep you from searching the entire project history.

David Lang
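A rough sketch of how the .keep marking works on disk (the pack name is a placeholder):

    # a pack is left alone by future 'git repack -a -d' runs once a
    # matching .keep file sits next to it
    ls .git/objects/pack/
    # pack-1234abcd.pack  pack-1234abcd.idx
    touch .git/objects/pack/pack-1234abcd.keep
    git repack -a -d    # only repacks objects outside the kept pack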
On Sun, Feb 5, 2012 at 3:30 PM, <[hidden email]> wrote:
> [snip]
>
> I may be wrong, but I think that when git is looking for 'similar files' for
> delta compression, it limits its search to the current pack, so this will
> also keep you from searching the entire project history.

I don't know if there is an easy way to determine it with the current tools in git, but one useful statistic for tuning packing performance is the size of the largest component in the delta-chain graph. The significance of this number is that the product of window size and maximum depth need not be larger than it.

I've found that with some older repositories I could have a depth as low as 3 and still get good performance from a moderate window size.
--
David Barr
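git can at least report the delta-chain depth distribution of an existing pack, which gives a feel for how much of the configured --depth is actually used (a sketch):

    # the summary at the end shows how many objects sit at each chain length
    git verify-pack -v .git/objects/pack/pack-*.idx | grep 'chain length'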
In reply to this post by Duy Nguyen
On 2/4/12 7:53 AM, Nguyen Thai Ngoc Duy wrote:
> On Fri, Feb 3, 2012 at 9:20 PM, Joshua Redstone <[hidden email]> wrote:
>> [snip]
>
> Have you tried "git update-index --assume-unchanged"? That should
> reduce mass lstat() and hopefully improve the above numbers. The
> interface is not exactly easy to use, but if it has significant gain,
> then we can try to improve the UI.
>
> On the index size issue, ideally we should make minimal writes to the
> index instead of rewriting the 191 MB index. An improvement we could do
> now is to compress it, reducing disk footprint and thus disk I/O. If you
> compress the index with gzip, how big is it?

If you're not afraid to add filesystem-specific code to git, you could leverage the btrfs find-new command (or use the ioctl directly) to quickly find changed files since a certain point in time. Other CoW filesystems may have similar mechanisms. You could for example store the last generation id in an index extension; that's what those extensions are for, right?

tom
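For reference, the btrfs side of that idea looks roughly like this (a sketch; the path and generation numbers are made up):

    # list every file modified in this subvolume since generation 12345
    btrfs subvolume find-new /path/to/worktree 12345

    # passing an impossibly high generation prints only the current one,
    # which is what you'd record (say, in an index extension) for next time
    btrfs subvolume find-new /path/to/worktree 99999999 | tail -1
    # transid marker was 67890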
On Sun, Feb 5, 2012 at 10:01 PM, Tomas Carnecky <[hidden email]> wrote:
> [snip]
>
> If you're not afraid to add filesystem-specific code to git, you could
> leverage the btrfs find-new command (or use the ioctl directly) to quickly
> find changed files since a certain point in time. Other CoW filesystems may
> have similar mechanisms. You could for example store the last generation id
> in an index extension; that's what those extensions are for, right?

Sure, they could be stored as index extensions. I'm more concerned about the index size.

I guess fs-specific code, if properly implemented (e.g., clean, handling repos crossing fs boundaries, moving repos...), may get Junio's approval. There was also talk of using NTFS's journal (or something like it) in msysgit for a similar goal.
--
Duy