inotify to minimize stat() calls

classic Classic list List threaded Threaded
88 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

inotify to minimize stat() calls

artagnon
Hi,

For large repositories, many simple git commands like `git status`
take a while to respond.  I understand that this is because of large
number of stat() calls to figure out which files were changed.  I
overheard that Mercurial wants to solve this problem using itnotify,
but the idea bothers me because it's not portable.  Will Git ever
consider using inotify on Linux?  What is the downside?

Ram
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Junio C Hamano
Ramkumar Ramachandra <[hidden email]> writes:

> ...  Will Git ever
> consider using inotify on Linux?  What is the downside?

I think this has come up from time to time, but my understanding is
that nobody thought things through to find a good layer in the
codebase to interface to an external daemon that listens to inotify
events yet.  It is not something like "somebody decreed that we
would never consider because of such and such downsides."  We are
not there yet.




--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Junio C Hamano
Junio C Hamano <[hidden email]> writes:

> Ramkumar Ramachandra <[hidden email]> writes:
>
>> ...  Will Git ever
>> consider using inotify on Linux?  What is the downside?
>
> I think this has come up from time to time, but my understanding is
> that nobody thought things through to find a good layer in the
> codebase to interface to an external daemon that listens to inotify
> events yet.  It is not something like "somebody decreed that we
> would never consider because of such and such downsides."  We are
> not there yet.

I checked read-cache.c and preload-index.c code.  To get the
discussion rolling, I think something like the outline below may be
a good starting point and a feasible weekend hack for somebody
competent:

 * At the beginning of preload_index(), instead of spawning the
   worker thread and doing the lstat() check ourselves, we open a
   socket to our daemon (see below) that watches this repository and
   make a request for lstat update.  The request will contain:

    - The SHA1 checksum of the index file we just read (to ensure
      that we and our daemon share the same baseline to
      communicate); and

    - the pathspec data.

   Our daemon, if it already has a fresh data available, will give
   us a list of <path, lstat result>.  Our main process runs a loop
   that is equivalent to what preload_thread() runs but uses the
   lstat() data we obtained from the daemon.  If our daemon says it
   does not have a fresh data (or somehow our daemon is dead), we do
   the work ourselves.

 * Our daemon watches the index file and the working tree, and
   waits for the above consumer.  First it reads the index (and
   remembers what it read), and whenever an inotify event comes,
   does the lstat() and remembers the result.  It never writes
   to the index, and does not hold the index lock.  Whenever the
   index file changes, it needs to reload the index, and discard
   lstat() data it already has for paths that are lost from the
   updated index.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
On Sat, Feb 9, 2013 at 5:45 AM, Junio C Hamano <[hidden email]> wrote:

> Junio C Hamano <[hidden email]> writes:
>
>> Ramkumar Ramachandra <[hidden email]> writes:
>>
>>> ...  Will Git ever
>>> consider using inotify on Linux?  What is the downside?
>>
>> I think this has come up from time to time, but my understanding is
>> that nobody thought things through to find a good layer in the
>> codebase to interface to an external daemon that listens to inotify
>> events yet.  It is not something like "somebody decreed that we
>> would never consider because of such and such downsides."  We are
>> not there yet.
>
> I checked read-cache.c and preload-index.c code.  To get the
> discussion rolling, I think something like the outline below may be
> a good starting point and a feasible weekend hack for somebody
> competent:
>
>  * At the beginning of preload_index(), instead of spawning the
>    worker thread and doing the lstat() check ourselves, we open a
>    socket to our daemon (see below) that watches this repository and

Can we replace "open a socket to our daemon" with "open a special file
in .git to get stat data written by our daemon"? TCP/IP socket means
system-wide daemon, not attractive. UNIX socket is not available on
Windows (although there may be named pipe, I don't know).

>    make a request for lstat update.  The request will contain:
>
>     - The SHA1 checksum of the index file we just read (to ensure
>       that we and our daemon share the same baseline to
>       communicate); and
>
>     - the pathspec data.
>
>    Our daemon, if it already has a fresh data available, will give
>    us a list of <path, lstat result>.  Our main process runs a loop
>    that is equivalent to what preload_thread() runs but uses the
>    lstat() data we obtained from the daemon.  If our daemon says it
>    does not have a fresh data (or somehow our daemon is dead), we do
>    the work ourselves.
>
>  * Our daemon watches the index file and the working tree, and
>    waits for the above consumer.  First it reads the index (and
>    remembers what it read), and whenever an inotify event comes,
>    does the lstat() and remembers the result.  It never writes
>    to the index, and does not hold the index lock.  Whenever the
>    index file changes, it needs to reload the index, and discard
>    lstat() data it already has for paths that are lost from the
>    updated index.


--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Junio C Hamano
Duy Nguyen <[hidden email]> writes:

> Can we replace "open a socket to our daemon" with "open a special file
> in .git to get stat data written by our daemon"? TCP/IP socket means
> system-wide daemon, not attractive. UNIX socket is not available on
> Windows (although there may be named pipe, I don't know).

I do not think TCP/IP socket is too bad (you have to be able to read
the index file to be able to ask questions to the daemon to begin
with, so you must have list of paths already; the answer from the
daemon would not leak anything more sensitive than you can already
know), and UNIX domain socket is not too bad either.

Just like the implementation detail of the daemon itself may differ
on platforms (does Windows have the identical inotify interface?  I
doubt it), I expect the RPC mechanism between the daemon and the
client would be platform dependent.  So take that "open a socket" as
a generic way to say "have these two communicate with some magic",
nothing more.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Junio C Hamano
In reply to this post by Junio C Hamano
Junio C Hamano <[hidden email]> writes:

> I checked read-cache.c and preload-index.c code.  To get the
> discussion rolling, I think something like the outline below may be
> a good starting point and a feasible weekend hack for somebody
> competent:
>
>  * At the beginning of preload_index(), instead of spawning the
>    worker thread and doing the lstat() check ourselves, we open a
>    socket to our daemon (see below) that watches this repository and
>    make a request for lstat update.  The request will contain:
>
>     - The SHA1 checksum of the index file we just read (to ensure
>       that we and our daemon share the same baseline to
>       communicate); and
>
>     - the pathspec data.
>
>    Our daemon, if it already has a fresh data available, will give
>    us a list of <path, lstat result>.  Our main process runs a loop
>    that is equivalent to what preload_thread() runs but uses the
>    lstat() data we obtained from the daemon.  If our daemon says it
>    does not have a fresh data (or somehow our daemon is dead), we do
>    the work ourselves.
>
>  * Our daemon watches the index file and the working tree, and
>    waits for the above consumer.  First it reads the index (and
>    remembers what it read), and whenever an inotify event comes,
>    does the lstat() and remembers the result.  It never writes
>    to the index, and does not hold the index lock.  Whenever the
>    index file changes, it needs to reload the index, and discard
>    lstat() data it already has for paths that are lost from the
>    updated index.

I left the details unsaid in thee above because I thought it was
fairly obvious from the nature of the "outline", but let me spend a
few more lines to avoid confusion.

 - The way the daemon "watches" the changes to the working tree and
   the index may well be very platform dependent.  I said "inotify"
   above, but the mechanism does not have to be inotify.

 - The channel the daemon and the client communicates would also be
   system dependent.  UNIX domain socket in $GIT_DIR/ with a
   well-known name would be one possibility but it does not have to
   be the only option.

 - The data given from the daemon to the client does not have to
   include full lstat() information.  They start from the same index
   info, and the only thing preload_index() wants to know is for
   which paths it should call ce_mark_uptodate(ce), so the answer
   given by our daemon can be a list of paths.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Robert Zeh-2
The delay for commands like git status is much worse on Windows than Linux; for my workflow I would be happy with a Windows only implementation.

From the description so far, I have some question: how does the daemon get started and stopped?  Is there one per repository --- this seems to be implied by putting the unix domain socket in $GIT_DIR. Could we automatically reject connections from anything other than localhost when using TCP?

Robert Zeh

On Feb 8, 2013, at 8:56 PM, Junio C Hamano <[hidden email]> wrote:

> Junio C Hamano <[hidden email]> writes:
>
>> I checked read-cache.c and preload-index.c code.  To get the
>> discussion rolling, I think something like the outline below may be
>> a good starting point and a feasible weekend hack for somebody
>> competent:
>>
>> * At the beginning of preload_index(), instead of spawning the
>>   worker thread and doing the lstat() check ourselves, we open a
>>   socket to our daemon (see below) that watches this repository and
>>   make a request for lstat update.  The request will contain:
>>
>>    - The SHA1 checksum of the index file we just read (to ensure
>>      that we and our daemon share the same baseline to
>>      communicate); and
>>
>>    - the pathspec data.
>>
>>   Our daemon, if it already has a fresh data available, will give
>>   us a list of <path, lstat result>.  Our main process runs a loop
>>   that is equivalent to what preload_thread() runs but uses the
>>   lstat() data we obtained from the daemon.  If our daemon says it
>>   does not have a fresh data (or somehow our daemon is dead), we do
>>   the work ourselves.
>>
>> * Our daemon watches the index file and the working tree, and
>>   waits for the above consumer.  First it reads the index (and
>>   remembers what it read), and whenever an inotify event comes,
>>   does the lstat() and remembers the result.  It never writes
>>   to the index, and does not hold the index lock.  Whenever the
>>   index file changes, it needs to reload the index, and discard
>>   lstat() data it already has for paths that are lost from the
>>   updated index.
>
> I left the details unsaid in thee above because I thought it was
> fairly obvious from the nature of the "outline", but let me spend a
> few more lines to avoid confusion.
>
> - The way the daemon "watches" the changes to the working tree and
>   the index may well be very platform dependent.  I said "inotify"
>   above, but the mechanism does not have to be inotify.
>
> - The channel the daemon and the client communicates would also be
>   system dependent.  UNIX domain socket in $GIT_DIR/ with a
>   well-known name would be one possibility but it does not have to
>   be the only option.
>
> - The data given from the daemon to the client does not have to
>   include full lstat() information.  They start from the same index
>   info, and the only thing preload_index() wants to know is for
>   which paths it should call ce_mark_uptodate(ce), so the answer
>   given by our daemon can be a list of paths.
> --
> To unsubscribe from this list: send the line "unsubscribe git" in
> the body of a message to [hidden email]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
In reply to this post by Junio C Hamano
Junio C Hamano wrote:

> Junio C Hamano <[hidden email]> writes:
>
>> I checked read-cache.c and preload-index.c code.  To get the
>> discussion rolling, I think something like the outline below may be
>> a good starting point and a feasible weekend hack for somebody
>> competent:
>>
>>  * At the beginning of preload_index(), instead of spawning the
>>    worker thread and doing the lstat() check ourselves, we open a
>>    socket to our daemon (see below) that watches this repository and
>>    make a request for lstat update.  The request will contain:
>>
>>     - The SHA1 checksum of the index file we just read (to ensure
>>       that we and our daemon share the same baseline to
>>       communicate); and
>>
>>     - the pathspec data.
>>
>>    Our daemon, if it already has a fresh data available, will give
>>    us a list of <path, lstat result>.  Our main process runs a loop
>>    that is equivalent to what preload_thread() runs but uses the
>>    lstat() data we obtained from the daemon.  If our daemon says it
>>    does not have a fresh data (or somehow our daemon is dead), we do
>>    the work ourselves.
>>
>>  * Our daemon watches the index file and the working tree, and
>>    waits for the above consumer.  First it reads the index (and
>>    remembers what it read), and whenever an inotify event comes,
>>    does the lstat() and remembers the result.  It never writes
>>    to the index, and does not hold the index lock.  Whenever the
>>    index file changes, it needs to reload the index, and discard
>>    lstat() data it already has for paths that are lost from the
>>    updated index.
>
> I left the details unsaid in thee above because I thought it was
> fairly obvious from the nature of the "outline", but let me spend a
> few more lines to avoid confusion.
>
>  - The way the daemon "watches" the changes to the working tree and
>    the index may well be very platform dependent.  I said "inotify"
>    above, but the mechanism does not have to be inotify.

Is the BSD kernel's inotify the same as the one on Linux?  Must we
design something that's generic enough from the start?

More importantly, do you know of a platform-independent inotify
implementation in C?  A quick Googling turned up QFileSystemWatcher
[1], a part of QT.

[1]: http://qt-project.org/doc/qt-4.8/qfilesystemwatcher.html

>  - The channel the daemon and the client communicates would also be
>    system dependent.  UNIX domain socket in $GIT_DIR/ with a
>    well-known name would be one possibility but it does not have to
>    be the only option.

UNIX domain sockets are also preferred because we'd never want to
connect to a watch daemon over the network?

Then the communication channel code also has to be generic enough.

>  - The data given from the daemon to the client does not have to
>    include full lstat() information.  They start from the same index
>    info, and the only thing preload_index() wants to know is for
>    which paths it should call ce_mark_uptodate(ce), so the answer
>    given by our daemon can be a list of paths.

Right.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
In reply to this post by Robert Zeh-2
Robert Zeh wrote:
> From the description so far, I have some question: how does the daemon get started and stopped?  Is there one per repository ...

What about getting systemd to watch everything for us?  Then we can
just have one daemon reporting filesystem changes over one global
socket.  It's API should be the inotify subset:

   systemd_add_watch
   systemd_remove_watch

Except systemd_add_watch also accepts a UNIX socket to send lstat
events to.  Our preload_index() is just reduced to making one
systemd_add_watch() call the very first time and updating the index as
necessary.  Now, what about desktops with huge uptimes (like mine)?
Won't they get polluted with too many useless watches over time?
Simple: timeout.  If nobody reads from the UNIX socket for two hours
after a systemd_add_watch, execute systemd_remove_watch automatically.

Someone must implement a similar daemon on other platforms reporting
information in exactly the same way (although with different
internals).  IP sockets are system-wide and all platforms have them,
so the communication channel is also standardized.

This is much better than Junio's suggestion to study possible
implementations on all platforms and designing a generic daemon/
communication channel.  That's no weekend project.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
Ramkumar Ramachandra wrote:
> What about getting systemd to watch everything for us?  Then we can
> just have one daemon reporting filesystem changes over one global
> socket.  It's API should be the inotify subset:

Er, not one global socket: many little sockets as described later.
(The idea was just forming while I was writing this paragraph)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
In reply to this post by artagnon
Ramkumar Ramachandra wrote:
> What about getting systemd to watch everything for us?

systemd is the perfect candidate!  It already has an inotify watcher:
see systemd.path(5).  Can't be used as-is because it spawns processes
on events, which is a non-scalable design.  Secondly, it uses static
.path files to define the rules which is no good for us.  So, we need
to add an API to it, and ask it to report events over IP sockets.  The
API part is simple too, because it already has a DBUS API for many
things [1]; it's just a matter of extending it.

Yes, I know.  This introduces dbus as an additional optional
non-portable dependency.  Do you have suggestions for alternatives
that aren't complicated?

[1]: http://www.freedesktop.org/wiki/Software/systemd/dbus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
On Sat, Feb 9, 2013 at 7:53 PM, Ramkumar Ramachandra <[hidden email]> wrote:
> Ramkumar Ramachandra wrote:
>> What about getting systemd to watch everything for us?
>
> systemd is the perfect candidate!

How about this as a start? I did not really check what it does, but it
does not look complicate enough to pull systemd in.

http://article.gmane.org/gmane.comp.version-control.git/151934

Youo may want to search the mail archive. This topic has come up a few
times before, there may be other similar patches.
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
Duy Nguyen wrote:
> How about this as a start? I did not really check what it does, but it
> does not look complicate enough to pull systemd in.
>
> http://article.gmane.org/gmane.comp.version-control.git/151934

Clever hack.  I didn't know that there was a switch called
core.ignoreStat which will disable automatic lstat() calls altogether.
 So, Finn advises that we set this switch and run igit instead of git.
 There's a git-inotify-daemon which runs inotifywait with -m forever,
updating a modified_files hash.  When it is sent a TERM from igit
(which is what happens immediately upon execution), it writes all this
collected information about modified files to a named pipe that igit
passes to it.  igit then does a git update-index --assume-unchained
--stdin to read the data from the pipe.  Towards the end of its life,
igit starts up a fresh git-inotify-daemon for future invocations.

Finn notes in the commit message that it offers no speedup, because
.gitignore files in every directory still have to be read.  I think
this is silly: we really should be caching .gitignore, and touching it
only when lstat() reports that the file has changed.

As far as a real implementation that we'd want to merge into git.git
is concerned, I have a few comments:
Running multiple daemons on-the-fly for monitoring filesystem changes
is not elegant at all.  Keeping track of the state of so many loose
daemons is a hard problem: how do we ensure any semblance of
reliability without that?  Systemd is a very big improvement over the
legacy of a hundred loose shell scripts that SysVInit demanded.  It
monitors and babysits daemons; it uses cgroups to even kill
misbehaving daemons.  I can inspect running daemons at any time, and
have a uniform way to start/ stop/ restart them.

Okay, now you're asking me to consider a system-wide daemon
independent of systemd.  It has to run with root privileges so it has
access to everyone's repositories, which means that people have to
trust it beyond doubt.  What does it do?  It has a generic API to
watch filesystem paths and report events over an IP socket.  Do you
think that this will only be useful to git?  Every other version
control system (and presumably many other pieces of software) will
want to use it.  One huge downside I see of making this part of
systemd is Ubuntu.  They've decided not to use systemd for some
unfathomable reason.

Really, the elephant in the room right now seems to be .gitignore.
Until that is fixed, there is really no use of writing this inotify
daemon, no?  Can someone enlighten me on how exactly .gitignore files
are processed?

> Youo may want to search the mail archive. This topic has come up a few
> times before, there may be other similar patches.

The thread you linked me to is a 2010 email, and now it's 2013.  We've
been silent about inotify for three years?

Thanks for your inputs, Duy.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

artagnon
Ramkumar Ramachandra wrote:

> Okay, now you're asking me to consider a system-wide daemon
> independent of systemd.  It has to run with root privileges so it has
> access to everyone's repositories, which means that people have to
> trust it beyond doubt.  What does it do?  It has a generic API to
> watch filesystem paths and report events over an IP socket.  Do you
> think that this will only be useful to git?  Every other version
> control system (and presumably many other pieces of software) will
> want to use it.  One huge downside I see of making this part of
> systemd is Ubuntu.  They've decided not to use systemd for some
> unfathomable reason.

After some thought, I've decided that extending systemd is not the way
to go.  And the dbus API is really an overkill.  Writing a simple
system-wide daemon shouldn't be a challenge; the hard part is getting
git to use it properly.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Junio C Hamano
In reply to this post by artagnon
Ramkumar Ramachandra <[hidden email]> writes:

> This is much better than Junio's suggestion to study possible
> implementations on all platforms and designing a generic daemon/
> communication channel.  That's no weekend project.

It appears that you misunderstood what I wrote.  That was not "here
is a design; I want it in my system.  Go implemment it".

It was "If somebody wants to discuss it but does not know where to
begin, doing a small experiment like this and reporting how well it
worked here may be one way to do so.", nothing more.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
In reply to this post by artagnon
On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
<[hidden email]> wrote:

> Finn notes in the commit message that it offers no speedup, because
> .gitignore files in every directory still have to be read.  I think
> this is silly: we really should be caching .gitignore, and touching it
> only when lstat() reports that the file has changed.
>
> ...
>
> Really, the elephant in the room right now seems to be .gitignore.
> Until that is fixed, there is really no use of writing this inotify
> daemon, no?  Can someone enlighten me on how exactly .gitignore files
> are processed?

.gitignore is a different issue. I think it's mainly used with
read_directory/fill_directory to collect ignored files (or not-ignored
files). And it's not always used (well, status and add does, but diff
should not). I think wee need to measure how much mass lstat
elimination gains us (especially on big repos) and how much
.gitignore/.gitattributes caching does. I don't think .gitignore has
such a big impact though. strace on git.git tells me "git status"
issues about 2500 lstat calls, and just 740 open+getdents calls (on
total 3800 syscalls). I will think if we can do something about
.gitignore/.gitattributes.
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
On Sun, Feb 10, 2013 at 12:24:58PM +0700, Duy Nguyen wrote:

> On Sun, Feb 10, 2013 at 12:10 AM, Ramkumar Ramachandra
> <[hidden email]> wrote:
> > Finn notes in the commit message that it offers no speedup, because
> > .gitignore files in every directory still have to be read.  I think
> > this is silly: we really should be caching .gitignore, and touching it
> > only when lstat() reports that the file has changed.
> >
> > ...
> >
> > Really, the elephant in the room right now seems to be .gitignore.
> > Until that is fixed, there is really no use of writing this inotify
> > daemon, no?  Can someone enlighten me on how exactly .gitignore files
> > are processed?
>
> .gitignore is a different issue. I think it's mainly used with
> read_directory/fill_directory to collect ignored files (or not-ignored
> files). And it's not always used (well, status and add does, but diff
> should not). I think wee need to measure how much mass lstat
> elimination gains us (especially on big repos) and how much
> .gitignore/.gitattributes caching does.

OK let's count. I start with a "standard" repository, linux-2.6. This
is the number from strace -T on "git status" (*). The first column is
accumulated time, the second the number of syscalls.

top syscalls sorted     top syscalls sorted
by acc. time            by number
----------------------------------------------
0.401906 40950 lstat    0.401906 40950 lstat
0.190484 5343 getdents 0.150055 5374 open
0.150055 5374 open 0.190484 5343 getdents
0.074843 2806 close 0.074843 2806 close
0.003216 157 read 0.003216 157 read

The following patch pretends every entry is uptodate without
lstat. With the patch, we can see refresh code is the cause of mass
lstat, as lstat disappears:

0.185347 5343 getdents  0.144173 5374 open
0.144173 5374 open 0.185347 5343 getdents
0.071844 2806 close 0.071844 2806 close
0.004918 135 brk 0.003378 157 read
0.003378 157 read 0.004918 135 brk

-- 8< --
diff --git a/read-cache.c b/read-cache.c
index 827ae55..94d8ed8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1018,6 +1018,10 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
  if (ce_uptodate(ce))
  return ce;

+#if 1
+ ce_mark_uptodate(ce);
+ return ce;
+#endif
  /*
  * CE_VALID or CE_SKIP_WORKTREE means the user promised us
  * that the change to the work tree does not matter and told
-- 8< --

The following patch eliminates untracked search code. As we can see,
open+getdents also disappears with this patch:

0.462909 40950 lstat   0.462909 40950 lstat
0.003417 129 brk       0.003417 129 brk
0.000762 53 read       0.000762 53 read
0.000720 36 open       0.000720 36 open
0.000544 12 munmap     0.000454 33 close

So from syscalls point of view, we know what code issues most of
them. Let's see how much time we gain be these patches, which is an
approximate of the gain by inotify support. This time I measure on
gentoo-x86.git [1] because this one has really big worktree (100k
files)

        unmodified  read-cache.c  dir.c     both
real    0m0.550s    0m0.479s      0m0.287s  0m0.213s
user    0m0.305s    0m0.315s  0m0.201s  0m0.182s
sys     0m0.240s    0m0.157s  0m0.084s  0m0.030s

and the syscall picture on gentoo-x86.git:

1.106615 101942 lstat    1.106615 101942 lstat
0.667235 47083 getdents 0.641604 47114 open
0.641604 47114 open 0.667235 47083 getdents
0.286711 23573 close 0.286711 23573 close
0.005842 350 brk 0.005842 350 brk

We can see that shortcuting untracked code gives bigger gain than
index refresh code. So I have to agree that .gitignore may be the big
elephant in this particular case.

Bear in mind though this is Linux, where lstat is fast. On systems
with slow lstat, these timings could look very different due to the
large number of lstat calls compared to open+getdents. I really like
to see similar numbers on Windows.

read_directory/fill_directory code is mostly used by "git add" (not
with -u) and "git status", while refresh code is executed in add,
checkout, commit/status, diff, merge. So while smaller gain, reducing
lstat calls could benefit in more cases.

A relatively slow "git add" is acceptable. "git status" should be
fast. Although in my workflow, I do "git diff [--stat] [--cached]"
much more often than "git status" so relatively slow "git status" does
not hurt me much. But people may do it differently.

On speeding up read_directory with inotify support. I haven't thought
it through, but I think we could save (or get it via socket) a list of
untracked files in .git, regardless ignore status, with the help from
inotify. When this list is verified valid, read_directory could be
modified to traverse the tree using this list (plus the index) instead
of opendir+readdir. Not sure how the change might look though.


[1] http://git-exp.overlays.gentoo.org/gitweb/?p=exp/gentoo-x86.git;a=summary

(*) the script to produce those numbers is

-- 8< --
#!/bin/sh

export LANG=C
strace -T "$@" 2>&1 >/dev/null |
        sed 's/\(^[^(]*\)(.*<\([0-9.]*\)>$/\1 \2/' |
        awk '{
          sec[$1]+=$2;
          count[$1]++;
        }
        END {
          for (i in sec)
            printf("%f %d %s\n", sec[i], count[i], i);
          }' >/tmp/s

sort -nr /tmp/s | head -n5
sort -nrk2 /tmp/s | head -n5
-- 8< --
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
On Sun, Feb 10, 2013 at 06:17:32PM +0700, Duy Nguyen wrote:
> The following patch eliminates untracked search code. As we can see,
> open+getdents also disappears with this patch:
>
> 0.462909 40950 lstat   0.462909 40950 lstat
> 0.003417 129 brk       0.003417 129 brk
> 0.000762 53 read       0.000762 53 read
> 0.000720 36 open       0.000720 36 open
> 0.000544 12 munmap     0.000454 33 close

.. and the patch is missing:

-- 8< --
diff --git a/dir.c b/dir.c
index 57394e4..1963c6f 100644
--- a/dir.c
+++ b/dir.c
@@ -1439,8 +1439,10 @@ int read_directory(struct dir_struct *dir, const char *path, int len, const char
  return dir->nr;
 
  simplify = create_simplify(pathspec);
+#if 0
  if (!len || treat_leading_path(dir, path, len, simplify))
  read_directory_recursive(dir, path, len, 0, simplify);
+#endif
  free_simplify(simplify);
  qsort(dir->entries, dir->nr, sizeof(struct dir_entry *), cmp_name);
  qsort(dir->ignored, dir->ignored_nr, sizeof(struct dir_entry *), cmp_name);
-- 8< --
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

demerphq
In reply to this post by Duy Nguyen
On 10 February 2013 12:17, Duy Nguyen <[hidden email]> wrote:
> Bear in mind though this is Linux, where lstat is fast. On systems
> with slow lstat, these timings could look very different due to the
> large number of lstat calls compared to open+getdents. I really like
> to see similar numbers on Windows.

Is windows stat really so slow? I encountered this perception in
windows Perl in the past, and I know that on windows Perl stat
*appears* slow compared to *nix, because in order to satisfy the full
*nix stat interface, specifically the nlink field, it must open and
close the file*. As of 5.10 this can be disabled by setting a magic
var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
improvement to the performance of the Perl level stat implementation.
I would not be surprised if the cygwin implementation of stat() has
the same issue as Perl did, and that stat appears much slower than it
actually need be if you don't care about the nlink field.

Yves
* http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492

--
perl -Mre=debug -e "/just|another|perl|hacker/"
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: inotify to minimize stat() calls

Duy Nguyen
On Sun, Feb 10, 2013 at 8:26 PM, demerphq <[hidden email]> wrote:
> On 10 February 2013 12:17, Duy Nguyen <[hidden email]> wrote:
>> Bear in mind though this is Linux, where lstat is fast. On systems
>> with slow lstat, these timings could look very different due to the
>> large number of lstat calls compared to open+getdents. I really like
>> to see similar numbers on Windows.
>
> Is windows stat really so slow?

I can't say. I haven't used Windows for months (and git on Windows for years)..

> I encountered this perception in
> windows Perl in the past, and I know that on windows Perl stat
> *appears* slow compared to *nix, because in order to satisfy the full
> *nix stat interface, specifically the nlink field, it must open and
> close the file*. As of 5.10 this can be disabled by setting a magic
> var ${^WIN32_SLOPPY_STAT} to a true value, which makes a significant
> improvement to the performance of the Perl level stat implementation.
> I would not be surprised if the cygwin implementation of stat() has
> the same issue as Perl did, and that stat appears much slower than it
> actually need be if you don't care about the nlink field.

The native port of git uses get_file_attr (in
compat/mingw.c:do_lstat()) to simulate lstat and always sets nlink to
1. I assume this means git does not care about nlink field. I don't
know about cygwin though.

> Yves
> * http://perl5.git.perl.org/perl.git/blob/HEAD:/win32/win32.c#l1492
--
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
12345