Make the git codebase thread-safe

classic Classic list List threaded Threaded
49 messages Options
123
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Stefan Zager-2
On Wed, Feb 12, 2014 at 12:06 PM, Junio C Hamano <[hidden email]> wrote:

> Stefan Zager <[hidden email]> writes:
>
>> ...  I used the Very Sleepy profiler
>> to see where all the time was spent on Windows: 55% of the time was
>> spent in OpenFile, and 25% in CloseFile (both in win32).
>
> This is somewhat interesting.
>
> When we check things out, checkout_paths() has a list of paths to be
> checked out, and iterates over them and call checkout_entry().
>
> I wonder if you can:
>
>  - introduce a version of checkout_entry() that takes file
>    descriptors to write to;
>
>  - have an asynchronous helper threads that pre-open the paths to be
>    written out and feed <ce, file descriptor to be written> to a
>    queue;
>
>  - restructure that loop so that it reads the <ce, file descriptor
>    to be written> from the queue, performs the actual writing out,
>    and then feeds <file descriptor to be closed> to another queue; and
>
>  - have another asynchronous helper threads that reads <file
>    descriptor to be closed> from the queue and close them.
>
> Calls to write (and preparation of data to be written) will then
> remain single-threaded, but it sounds like that codepath is not the
> bottleneck in your measurement, so....

Yes, I considered that as well.  At a minimum, that would still
require attr.c to implement thread locking, since attribute files must
be parsed to look for stream filters.  I have already done that work.

But I'm not sure it's the best long-term approach to add convoluted
custom threading solutions to each git operation as it appears on the
performance radar.  I'm hoping to make the entire code base more
thread-friendly, so that threading can be added in a more natural and
idiomatic (and less painful) way.

For example, the most natural way to add threading to checkout would
be in the loops over the index in check_updates() in unpack-trees.c.
If attr.c and sha1_file.c were thread-safe, then it would be possible
to thread checkout entirely in check_updates(), with a pretty compact
code change.  I have already done the work in attr.c; sha1_file.c is
hairier, but do-able.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Mike Hommey-3
In reply to this post by Karsten Blees-2
On Wed, Feb 12, 2014 at 12:00:19PM +0100, Karsten Blees wrote:

> Am 12.02.2014 04:43, schrieb Duy Nguyen:
> > On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <[hidden email]> wrote:
> >> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
> >>> We in the chromium project have a keen interest in adding threading to
> >>> git in the pursuit of performance for lengthy operations (checkout,
> >>> status, blame, ...).  Our motivation comes from hitting some
> >>> performance walls when working with repositories the size of chromium
> >>> and blink:
> >> +1 from Gentoo on performance improvements for large repos.
> >>
> >> The main repository in the ongoing Git migration project looks to be in
> >> the 1.5GB range (and for those that want to propose splitting it up, we
> >> have explored that option and found it lacking), with very deep history
> >> (but no branches of note, and very few tags).
> >
> > From v1.9 shallow clone should work for all push/pull/clone... so
> > history depth does not matter (on the client side). As for
> > gentoo-x86's large worktree, using index v4 and avoid full-tree
> > operations (e.g. "status .", not "status"..) should make all
> > operations reasonably fast. I plan to make "status" fast even without
> > path limiting with the help of inotify, but that's not going to be
> > finished soon. Did I miss anything else?
> >
>
> Regarding git-status on msysgit, enable core.preloadindex and core.fscache (as of 1.8.5.2).
>
> There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to
> keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1
> instead of C:\Program Files).

You can use GetLongPathNameW to get the latter from the former.

Mike
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Junio C Hamano
In reply to this post by Stefan Zager-2
On Wed, Feb 12, 2014 at 12:27 PM, Stefan Zager <[hidden email]> wrote:

> On Wed, Feb 12, 2014 at 12:06 PM, Junio C Hamano <[hidden email]> wrote:
>> Stefan Zager <[hidden email]> writes:
>>
>> Calls to write (and preparation of data to be written) will then
>> remain single-threaded, but it sounds like that codepath is not the
>> bottleneck in your measurement, so....
>
> Yes, I considered that as well.  At a minimum, that would still
> require attr.c to implement thread locking, since attribute files must
> be parsed to look for stream filters.  I have already done that work.

I would have imagined that use of the attribute system belongs to "write and
preparation of data to be written" category, i.e. the single threaded
part of the
kludge I outlined.

> But I'm not sure it's the best long-term approach to add convoluted
> custom threading solutions to each git operation as it appears on the
> performance radar.

Yeah, it depends on how clean and non-intrusive an abstraction we can make.
The kludge I outlined is certainly not very pretty.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Mike Hommey-3
In reply to this post by David Kastrup
On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:

> Stefan Zager <[hidden email]> writes:
>
> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <[hidden email]> wrote:
> >
> >> Really, give the above patch a try.  I am taking longer to finish it
> >> than anticipated (with a lot due to procrastination but that is,
> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> and nontrivial progress reports for my freely available work on GNU
> >> LilyPond).
> >
> > I will give that a try.  How much of a performance improvement have
> > you clocked?
>
> Depends on file type and size.  With large files with lots of small
> changes, performance improvements get more impressive.
>
> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> (performance improvement about a factor of 3), a large file in the style
> of /usr/share/dict/words clocking in at a factor of about 5.
>
> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> are no improvements in system time (I/O) except for patch 4 of the
> series which helps perhaps 20% or so.
>
> So the benefits of the patch will come into play mostly for big, bad
> files on Windows: other than that, the I/O time is likely to be the
> dominant player anyway.

How much fragmentation does that add to the files, though?

Mike
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Karsten Blees-2
In reply to this post by Mike Hommey-3
Am 13.02.2014 00:03, schrieb Mike Hommey:

> On Wed, Feb 12, 2014 at 12:00:19PM +0100, Karsten Blees wrote:
>> Am 12.02.2014 04:43, schrieb Duy Nguyen:
>>> On Wed, Feb 12, 2014 at 9:02 AM, Robin H. Johnson <[hidden email]> wrote:
>>>> On Tue, Feb 11, 2014 at 05:54:51PM -0800,  Stefan Zager wrote:
>>>>> We in the chromium project have a keen interest in adding threading to
>>>>> git in the pursuit of performance for lengthy operations (checkout,
>>>>> status, blame, ...).  Our motivation comes from hitting some
>>>>> performance walls when working with repositories the size of chromium
>>>>> and blink:
>>>> +1 from Gentoo on performance improvements for large repos.
>>>>
>>>> The main repository in the ongoing Git migration project looks to be in
>>>> the 1.5GB range (and for those that want to propose splitting it up, we
>>>> have explored that option and found it lacking), with very deep history
>>>> (but no branches of note, and very few tags).
>>>
>>> From v1.9 shallow clone should work for all push/pull/clone... so
>>> history depth does not matter (on the client side). As for
>>> gentoo-x86's large worktree, using index v4 and avoid full-tree
>>> operations (e.g. "status .", not "status"..) should make all
>>> operations reasonably fast. I plan to make "status" fast even without
>>> path limiting with the help of inotify, but that's not going to be
>>> finished soon. Did I miss anything else?
>>>
>>
>> Regarding git-status on msysgit, enable core.preloadindex and core.fscache (as of 1.8.5.2).
>>
>> There's no inotify on Windows, and I gave up using ReadDirectoryChangesW to
>> keep fscache up to date, as it _may_ report DOS file names (e.g. C:\PROGRA~1
>> instead of C:\Program Files).
>
> You can use GetLongPathNameW to get the latter from the former.
>
> Mike
>

Except if its a delete or rename notification...my final ReadDirectoryChangesW version cached the files by their long _and_ short names, but was so complex that it slowed most commands down rather than speeding them up :-)
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

brian m. carlson
In reply to this post by Stefan Zager-2
On Tue, Feb 11, 2014 at 05:54:51PM -0800, Stefan Zager wrote:
> To this end, I'd like to start submitting patches that make the code
> base generally more thread-safe and thread-friendly.  Right after this
> email, I'm going to send the first such patch, which makes the global
> list of pack files (packed_git) internal to sha1_file.c.

I'm definitely interested in this if it also works on POSIX systems.  At
work, we have a 7.6 GiB repo (packed)[0], so while performance is not
bad, I certainly wouldn't object if it were better.

[0] Using du -sh.  For comparison, the Linux kernel repo is 1.4 GiB.

--
brian m. carlson / brian with sandals: Houston, Texas, US
+1 832 623 2791 | http://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: RSA v4 4096b: 88AC E9B2 9196 305B A994 7552 F1BA 225C 0223 B187

signature.asc (836 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

David Kastrup
In reply to this post by Mike Hommey-3
Mike Hommey <[hidden email]> writes:

> On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
>> Stefan Zager <[hidden email]> writes:
>>
>> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <[hidden email]> wrote:
>> >
>> >> Really, give the above patch a try.  I am taking longer to finish it
>> >> than anticipated (with a lot due to procrastination but that is,
>> >> unfortunately, a large part of my workflow), and it's cutting into my
>> >> "paychecks" (voluntary donations which to a good degree depend on timely
>> >> and nontrivial progress reports for my freely available work on GNU
>> >> LilyPond).
>> >
>> > I will give that a try.  How much of a performance improvement have
>> > you clocked?
>>
>> Depends on file type and size.  With large files with lots of small
>> changes, performance improvements get more impressive.
>>
>> Some ugly real-world examples are the Emacs repository, src/xdisp.c
>> (performance improvement about a factor of 3), a large file in the style
>> of /usr/share/dict/words clocking in at a factor of about 5.
>>
>> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
>> are no improvements in system time (I/O) except for patch 4 of the
>> series which helps perhaps 20% or so.
>>
>> So the benefits of the patch will come into play mostly for big, bad
>> files on Windows: other than that, the I/O time is likely to be the
>> dominant player anyway.
>
> How much fragmentation does that add to the files, though?

Uh, git-blame is a read-only operation.  It does not add fragmentation
to any file.  The patch will add a diff of probably a few dozen hunks to
builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
that I expect even

    git blame builtin/blame.c

to be faster than before.  But that interpretation of your question
probably tries to make too much sense out of what is just nonsense in
the given context.

--
David Kastrup
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Johannes Sixt-2
In reply to this post by Stefan Zager
Am 2/12/2014 20:30, schrieb Stefan Zager:
> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <[hidden email]> wrote:
>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>> ReOpenFile, that's fantastic. Thanks a lot!
>>
>> ...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?
>
> Right, that is an issue.  From our perspective, it's well past time to
> drop XP support.

Not from mine.

-- Hannes
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

David Kastrup
In reply to this post by David Kastrup
David Kastrup <[hidden email]> writes:

> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> are no improvements in system time (I/O) except for patch 4 of the
> series which helps perhaps 20% or so.
>
> So the benefits of the patch will come into play mostly for big, bad
> files on Windows: other than that, the I/O time is likely to be the
> dominant player anyway.
>
> If you have benchmarked the stuff, for annoying cases expect I/O time
> to go down maybe 10-20%, and user time to drop by a factor of 4.
> Under GNU/Linux, that makes for a significant overall improvement.  On
> Windows, the payback is likely quite less because of the worse I/O
> performance.  Pity.

But of course, you can significantly reduce the relevant file
open/close/search times by running

    git gc --aggressive

While this does not actually help with performance in GNU/Linux (though
with file space), dealing with few but compressed files under Windows is
likely a reasonably big win since the uncompression happens in user
space and cannot be bungled by Microsoft (apart from bad memory
management strategies).

--
David Kastrup
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

David Kastrup
In reply to this post by Johannes Sixt-2
Johannes Sixt <[hidden email]> writes:

> Am 2/12/2014 20:30, schrieb Stefan Zager:
>> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <[hidden email]> wrote:
>>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>>> ReOpenFile, that's fantastic. Thanks a lot!
>>>
>>> ...but should be loaded dynamically via GetProcAddress, or are we
>>> ready to drop XP support?
>>
>> Right, that is an issue.  From our perspective, it's well past time to
>> drop XP support.
>
> Not from mine.

XP has not even reached end of life yet.  As a point of comparison,
there are tensions on the Emacs developer list several times a decade
because some people suggest it might be time to drop the MSDOS port
and/or the associated restriction of having filenames be unique in the
8+3 naming scheme.

--
David Kastrup
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Mike Hommey-3
In reply to this post by David Kastrup
On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:

> Mike Hommey <[hidden email]> writes:
>
> > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> >> Stefan Zager <[hidden email]> writes:
> >>
> >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <[hidden email]> wrote:
> >> >
> >> >> Really, give the above patch a try.  I am taking longer to finish it
> >> >> than anticipated (with a lot due to procrastination but that is,
> >> >> unfortunately, a large part of my workflow), and it's cutting into my
> >> >> "paychecks" (voluntary donations which to a good degree depend on timely
> >> >> and nontrivial progress reports for my freely available work on GNU
> >> >> LilyPond).
> >> >
> >> > I will give that a try.  How much of a performance improvement have
> >> > you clocked?
> >>
> >> Depends on file type and size.  With large files with lots of small
> >> changes, performance improvements get more impressive.
> >>
> >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> >> (performance improvement about a factor of 3), a large file in the style
> >> of /usr/share/dict/words clocking in at a factor of about 5.
> >>
> >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> >> are no improvements in system time (I/O) except for patch 4 of the
> >> series which helps perhaps 20% or so.
> >>
> >> So the benefits of the patch will come into play mostly for big, bad
> >> files on Windows: other than that, the I/O time is likely to be the
> >> dominant player anyway.
> >
> > How much fragmentation does that add to the files, though?
>
> Uh, git-blame is a read-only operation.  It does not add fragmentation
> to any file.  The patch will add a diff of probably a few dozen hunks to
> builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> that I expect even
>
>     git blame builtin/blame.c
>
> to be faster than before.  But that interpretation of your question
> probably tries to make too much sense out of what is just nonsense in
> the given context.

Sorry, I thought you were talking about write operations, not reads.

Mike
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Mike Hommey-3
On Thu, Feb 13, 2014 at 06:34:39PM +0900, Mike Hommey wrote:

> On Thu, Feb 13, 2014 at 07:04:02AM +0100, David Kastrup wrote:
> > Mike Hommey <[hidden email]> writes:
> >
> > > On Wed, Feb 12, 2014 at 08:15:24PM +0100, David Kastrup wrote:
> > >> Stefan Zager <[hidden email]> writes:
> > >>
> > >> > On Wed, Feb 12, 2014 at 10:50 AM, David Kastrup <[hidden email]> wrote:
> > >> >
> > >> >> Really, give the above patch a try.  I am taking longer to finish it
> > >> >> than anticipated (with a lot due to procrastination but that is,
> > >> >> unfortunately, a large part of my workflow), and it's cutting into my
> > >> >> "paychecks" (voluntary donations which to a good degree depend on timely
> > >> >> and nontrivial progress reports for my freely available work on GNU
> > >> >> LilyPond).
> > >> >
> > >> > I will give that a try.  How much of a performance improvement have
> > >> > you clocked?
> > >>
> > >> Depends on file type and size.  With large files with lots of small
> > >> changes, performance improvements get more impressive.
> > >>
> > >> Some ugly real-world examples are the Emacs repository, src/xdisp.c
> > >> (performance improvement about a factor of 3), a large file in the style
> > >> of /usr/share/dict/words clocking in at a factor of about 5.
> > >>
> > >> Again, that's with an SSD and ext4 filesystem on GNU/Linux, and there
> > >> are no improvements in system time (I/O) except for patch 4 of the
> > >> series which helps perhaps 20% or so.
> > >>
> > >> So the benefits of the patch will come into play mostly for big, bad
> > >> files on Windows: other than that, the I/O time is likely to be the
> > >> dominant player anyway.
> > >
> > > How much fragmentation does that add to the files, though?
> >
> > Uh, git-blame is a read-only operation.  It does not add fragmentation
> > to any file.  The patch will add a diff of probably a few dozen hunks to
> > builtin/blame.c.  Do you call that "fragmentation"?  It is small enough
> > that I expect even
> >
> >     git blame builtin/blame.c
> >
> > to be faster than before.  But that interpretation of your question
> > probably tries to make too much sense out of what is just nonsense in
> > the given context.
>
> Sorry, I thought you were talking about write operations, not reads.

Specifically, I thought you were talking about git checkout.

Mike
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Zachary Turner
In reply to this post by Karsten Blees-2
Karsten Blees <karsten.blees <at> gmail.com> writes:

>
> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
> > On Wed, Feb 12, 2014 at 7:34 PM, Stefan Zager <szager <at> google.com>
wrote:
> >> On Wed, Feb 12, 2014 at 10:27 AM, Erik Faye-Lund <kusmabite <at>
gmail.com> wrote:
> >>> On Wed, Feb 12, 2014 at 7:20 PM, Stefan Zager <szager <at> google.com>
wrote:
> >>>>
> >>>> I don't want to steal the thunder of my coworker, who wrote the
> >>>> implementation.  He plans to submit it upstream soon-ish.  It relies
> >>>> on using the lpOverlapped argument to ReadFile(), with some
additional

> >>>> tomfoolery to make sure that the implicit position pointer for the
> >>>> file descriptor doesn't get modified.
> >>>
> >>> Is the code available somewhere? I'm especially interested in the
> >>> "additional tomfoolery to make sure that the implicit position pointer
> >>> for the file descriptor doesn't get modified"-part, as this was what I
> >>> ended up butting my head into when trying to do this myself.
> >>
> >> https://chromium-review.googlesource.com/#/c/186104/
> >
> > ReOpenFile, that's fantastic. Thanks a lot!
>
> ...but should be loaded dynamically via GetProcAddress, or are we ready to
drop XP support?
>
>

Original patch author here.  In trying to prepare this patch to use
GetProcAddress to load dynamically, I've run into a bit of a snag.  
NO_THREAD_SAFE_PREAD is a compile-time flag, which will be incompatible with
any attempt to make this a runtime decision a la LoadLibrary /
GetProcAddress.  On XP, we would need to fallback to the single-threaded
path, and on Vista+ we would use the thread-able path, and obviously this
decision could not be made until runtime.

If MinGW were the only configuration using NO_THREAD_SAFE_PREAD, I would
just remove it entirely, but it appears Cygwin configuration uses it also.

Suggestions?  

One possibility is to disallow (by convention, perhaps), the use of pread()
and read() against the same fd.  The only reason ReOpenFile is necessary at
all is because some code somewhere is mixing read-styles against the same
fd.

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Stefan Zager
In reply to this post by Johannes Sixt-2
On Thu, Feb 13, 2014 at 12:27 AM, Johannes Sixt <[hidden email]> wrote:

> Am 2/12/2014 20:30, schrieb Stefan Zager:
>> On Wed, Feb 12, 2014 at 11:22 AM, Karsten Blees <[hidden email]> wrote:
>>> Am 12.02.2014 19:37, schrieb Erik Faye-Lund:
>>>> ReOpenFile, that's fantastic. Thanks a lot!
>>>
>>> ...but should be loaded dynamically via GetProcAddress, or are we ready to drop XP support?
>>
>> Right, that is an issue.  From our perspective, it's well past time to
>> drop XP support.
>
> Not from mine.

All this really means is that the build config will test WIN_VER, and
there will need to be an additional binary distribution of msysgit for
newer Windows.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Karsten Blees-2
In reply to this post by Zachary Turner
Am 13.02.2014 19:38, schrieb Zachary Turner:

> The only reason ReOpenFile is necessary at
> all is because some code somewhere is mixing read-styles against the same
> fd.
>

I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).

I tried without ReOpenFile and it seems to work like a charm, or am I missing something?

----8<----
ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
{
        DWORD bytes_read;
        OVERLAPPED overlapped;
        memset(&overlapped, 0, sizeof(overlapped));
        overlapped.Offset = (DWORD) offset;
        overlapped.OffsetHigh = (DWORD) (offset >> 32);

        if (!ReadFile((HANDLE) _get_osfhandle(fd), buf, count, &bytes_read,
            &overlapped)) {
                errno = err_win_to_posix(GetLastError());
                return -1;
        }
        return (ssize_t) bytes_read;
}

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Stefan Zager
On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <[hidden email]> wrote:
> Am 13.02.2014 19:38, schrieb Zachary Turner:
>
>> The only reason ReOpenFile is necessary at
>> all is because some code somewhere is mixing read-styles against the same
>> fd.
>>
>
> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).

That is, apparently, a bald-faced lie in the ReadFile API doc.  First
implementation didn't use ReOpenFile, and it crashed all over the
place.  ReOpenFile fixed it.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Zachary Turner
To elaborate a little bit more, you can verify with a sample program
that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
position.  The documentation doesn't actually state one way or
another.   My original attempt at a patch didn't have the ReOpenFile,
and we experienced regular read corruption.  We scratched our heads
over it for a bit, and then hypothesized that someone must be mixing
read styles, which led to this ReOpenFile workaround, which
incidentally also solved the corruption problems.  We wrote a similar
sample program to verify that when using ReOpenHandle, and changing
the file pointer of the duplicated handle, that the file pointer of
the original handle is not modified.

We did not actually try to identify the source of the mixed read
styles, but it seems like the only possible explanation.

On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <[hidden email]> wrote:

> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <[hidden email]> wrote:
>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>>
>>> The only reason ReOpenFile is necessary at
>>> all is because some code somewhere is mixing read-styles against the same
>>> fd.
>>>
>>
>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).
>
> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
> implementation didn't use ReOpenFile, and it crashed all over the
> place.  ReOpenFile fixed it.
>
> Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Karsten Blees-2
Am 14.02.2014 00:09, schrieb Zachary Turner:

> To elaborate a little bit more, you can verify with a sample program
> that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
> position.  The documentation doesn't actually state one way or
> another.   My original attempt at a patch didn't have the ReOpenFile,
> and we experienced regular read corruption.  We scratched our heads
> over it for a bit, and then hypothesized that someone must be mixing
> read styles, which led to this ReOpenFile workaround, which
> incidentally also solved the corruption problems.  We wrote a similar
> sample program to verify that when using ReOpenHandle, and changing
> the file pointer of the duplicated handle, that the file pointer of
> the original handle is not modified.
>
> We did not actually try to identify the source of the mixed read
> styles, but it seems like the only possible explanation.
>
> On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <[hidden email]> wrote:
>> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees <[hidden email]> wrote:
>>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>>>
>>>> The only reason ReOpenFile is necessary at
>>>> all is because some code somewhere is mixing read-styles against the same
>>>> fd.
>>>>
>>>
>>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify the HANDLE's file position, so you should be able to mix read()/pread() however you like (as long as read() is only called from one thread).
>>
>> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
>> implementation didn't use ReOpenFile, and it crashed all over the
>> place.  ReOpenFile fixed it.
>>
>> Stefan

Damn...you're right, multi-threaded git-index-pack works fine, but some tests fail badly. Mixed reads would have to be from git_mmap, which is the only other caller of pread().

A simple alternative to ReOpenHandle is to reset the file pointer to its original position, as in compat/pread.c::git_pread. Thus single-theaded code can mix read()/pread() at will, but multi-threaded code has to use pread() exclusively (which is usually the case anyway). A main thread using read() and background threads using pread() (which is technically allowed by POSIX) will fail with this solution.

This version passes the test suite on msysgit:

----8<----
ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
{
        DWORD bytes_read;
        OVERLAPPED overlapped;
        off64_t current;
        memset(&overlapped, 0, sizeof(overlapped));
        overlapped.Offset = (DWORD) offset;
        overlapped.OffsetHigh = (DWORD) (offset >> 32);

        current = lseek64(fd, 0, SEEK_CUR);

        if (!ReadFile((HANDLE)_get_osfhandle(fd), buf, count, &bytes_read, &overlapped)) {
                errno = err_win_to_posix(GetLastError());
                return -1;
        }

        lseek64(fd, current, SEEK_SET);

        return (ssize_t) bytes_read;
}

--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Zachary Turner
(Gah, sorry if you're receiving multiple emails to your personal
addresses, I need to get used to manually setting Plain-text mode
every time I send a message).

For the mixed read, we wouldn't be looking for another caller of
pread() (since it doesn't care what the file pointer is), but instead
a caller of read() or lseek() (since those do depend on the current
file pointer).  In index-pack.c, I see two possible culprits:

1) A call to xread() from inside fill()
2) A call to lseek in parse_pack_objects()

Do you think these could be related?  If so, maybe that opens up some
other solutions?

BTW, the version you posted isn't thread safe.  Suppose thread A and
thread B execute this function at the same time.  A executes through
the ReadFile(), but does not yet reset the second lseek64.  B then
executes the first lseek64(), storing off the modified file pointer.
Then A finishes, then B finishes.  At the end, the file pointer is
still modified.

On Fri, Feb 14, 2014 at 11:15 AM, Zachary Turner <[hidden email]> wrote:

> For the mixed read, we wouldn't be looking for another caller of pread()
> (since it doesn't care what the file pointer is), but instead a caller of
> read() or lseek().  In index-pack.c, I see two possible culprits:
>
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
>
> Do you think these could be related?  If so, maybe that opens up some other
> solutions?
>
> BTW, the version you posted isn't thread safe.  Suppose thread A and thread
> B execute this function at the same time.  A executes through the
> ReadFile(), but does not yet reset the second lseek64.  B then executes the
> first lseek64(), storing off the modified file pointer.  Then A finishes,
> then B finishes.  At the end, the file pointer is still modified.
>
>
>
> On Fri, Feb 14, 2014 at 11:04 AM, Karsten Blees <[hidden email]>
> wrote:
>>
>> Am 14.02.2014 00:09, schrieb Zachary Turner:
>> > To elaborate a little bit more, you can verify with a sample program
>> > that ReadFile with OVERLAPPED does in fact modify the HANDLE's file
>> > position.  The documentation doesn't actually state one way or
>> > another.   My original attempt at a patch didn't have the ReOpenFile,
>> > and we experienced regular read corruption.  We scratched our heads
>> > over it for a bit, and then hypothesized that someone must be mixing
>> > read styles, which led to this ReOpenFile workaround, which
>> > incidentally also solved the corruption problems.  We wrote a similar
>> > sample program to verify that when using ReOpenHandle, and changing
>> > the file pointer of the duplicated handle, that the file pointer of
>> > the original handle is not modified.
>> >
>> > We did not actually try to identify the source of the mixed read
>> > styles, but it seems like the only possible explanation.
>> >
>> > On Thu, Feb 13, 2014 at 2:53 PM, Stefan Zager <[hidden email]> wrote:
>> >> On Thu, Feb 13, 2014 at 2:51 PM, Karsten Blees
>> >> <[hidden email]> wrote:
>> >>> Am 13.02.2014 19:38, schrieb Zachary Turner:
>> >>>
>> >>>> The only reason ReOpenFile is necessary at
>> >>>> all is because some code somewhere is mixing read-styles against the
>> >>>> same
>> >>>> fd.
>> >>>>
>> >>>
>> >>> I don't understand...ReadFile with OVERLAPPED parameter doesn't modify
>> >>> the HANDLE's file position, so you should be able to mix read()/pread()
>> >>> however you like (as long as read() is only called from one thread).
>> >>
>> >> That is, apparently, a bald-faced lie in the ReadFile API doc.  First
>> >> implementation didn't use ReOpenFile, and it crashed all over the
>> >> place.  ReOpenFile fixed it.
>> >>
>> >> Stefan
>>
>> Damn...you're right, multi-threaded git-index-pack works fine, but some
>> tests fail badly. Mixed reads would have to be from git_mmap, which is the
>> only other caller of pread().
>>
>> A simple alternative to ReOpenHandle is to reset the file pointer to its
>> original position, as in compat/pread.c::git_pread. Thus single-theaded code
>> can mix read()/pread() at will, but multi-threaded code has to use pread()
>> exclusively (which is usually the case anyway). A main thread using read()
>> and background threads using pread() (which is technically allowed by POSIX)
>> will fail with this solution.
>>
>> This version passes the test suite on msysgit:
>>
>> ----8<----
>> ssize_t mingw_pread(int fd, void *buf, size_t count, off64_t offset)
>> {
>>         DWORD bytes_read;
>>         OVERLAPPED overlapped;
>>         off64_t current;
>>         memset(&overlapped, 0, sizeof(overlapped));
>>         overlapped.Offset = (DWORD) offset;
>>         overlapped.OffsetHigh = (DWORD) (offset >> 32);
>>
>>         current = lseek64(fd, 0, SEEK_CUR);
>>
>>         if (!ReadFile((HANDLE)_get_osfhandle(fd), buf, count, &bytes_read,
>> &overlapped)) {
>>                 errno = err_win_to_posix(GetLastError());
>>                 return -1;
>>         }
>>
>>         lseek64(fd, current, SEEK_SET);
>>
>>         return (ssize_t) bytes_read;
>> }
>>
>
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Reply | Threaded
Open this post in threaded view
|

Re: Make the git codebase thread-safe

Stefan Zager
In reply to this post by Karsten Blees-2
On Fri, Feb 14, 2014 at 11:15 AM, Zachary Turner <[hidden email]> wrote:
> For the mixed read, we wouldn't be looking for another caller of pread()
> (since it doesn't care what the file pointer is), but instead a caller of
> read() or lseek().  In index-pack.c, I see two possible culprits:
>
> 1) A call to xread() from inside fill()
> 2) A call to lseek in parse_pack_objects()
>
> Do you think these could be related?  If so, maybe that opens up some other
> solutions?

From my observations, it's not that simple.  As you pointed out to me
before, fill() is only called before the threading part of the code,
and lseek is only called after the threading part; and the lseek() is
lseek(fd, 0, SEEK_CUR), so it's purely advisory.

Also, here is the error output we got before you added ReOpenFile:

remote: Total 2514467 (delta 1997300), reused 2513040 (delta 1997113)
Checking connectivity... error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
error: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
does not match index
warning: packfile
d:/src/chromium2/_gclient_src_a6y1bf/.git/objects/pack/pack-3b8d06040ac37f14d0b43859926f1153fea61a7a.pack
cannot be accessed
fatal: bad object e0f9f23f765a45e6d80863a8f881ee735c9347fe


The 'Checking connectivity...' message comes from builtin/clone.c,
which runs in a separate process from builtin/index-pack.c.  What this
suggests to me is that file descriptors for the loose object files are
not being flushed or closed properly before index-pack finishes.


> BTW, the version you posted isn't thread safe.  Suppose thread A and thread
> B execute this function at the same time.  A executes through the
> ReadFile(), but does not yet reset the second lseek64.  B then executes the
> first lseek64(), storing off the modified file pointer.  Then A finishes,
> then B finishes.  At the end, the file pointer is still modified.

Yes, that.  I would also point out that in our experiments, ReOpenFile
is not nearly as expensive as its name might suggest.  Since the
solution using ReOpenFile is pretty solidly thread-safe (at least as
far as we can tell), I'm in favor of using it unless or until we
properly root-case the failure.


Stefan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
123