clone hang prevention / timeout?

clone hang prevention / timeout?

Jason Vas Dias
It appears GIT has no way of specifying a timeout for a clone operation:
if the server decides not to complete a get request, the clone can
hang forever.
Is this correct?
This appears to be what I am seeing in a script that attempts many
successive clone operations, e.g. of
git://anongit.freedesktop.org/xorg/* . The script
occasionally hangs in a clone; I can see with netstat + strace that the TCP
connection is open and GIT is trying to read.
Is there any option I can specify to make the clone time out, or do I have to
strace the git process manually and send it a signal once a hang is detected?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [hidden email]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: clone hang prevention / timeout?

Eric Wong
Jason Vas Dias <[hidden email]> wrote:
> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

git uses SO_KEEPALIVE for all connections it makes, so the effective
timeout is whatever your kernel TCP keepalive knobs are set to.

By default, it's very long (around 2 hours), but you can change them
using the tcp_keepalive_* knobs in /proc/sys/net/ipv4/ under Linux.
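
As a rough illustration of those knobs, the sketch below estimates the worst-case time before the kernel gives up on an idle, dead connection: tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes. The function names are made up for this example, and the fallback values (7200, 75, 9) are the usual Linux defaults, assumed here only for when the /proc files cannot be read.

```shell
# Worst case for noticing a dead idle peer:
#   idle time before the first probe + (probe interval * unanswered probes)
keepalive_detect_secs() {
    # $1 = tcp_keepalive_time, $2 = tcp_keepalive_intvl, $3 = tcp_keepalive_probes
    echo $(( $1 + $2 * $3 ))
}

# Read one knob from /proc, falling back to a default if it is unavailable.
read_knob() {
    if [ -r "/proc/sys/net/ipv4/$1" ]; then
        cat "/proc/sys/net/ipv4/$1"
    else
        echo "$2"
    fi
}

# Estimate for the machine this runs on.
current_keepalive_detect_secs() {
    keepalive_detect_secs \
        "$(read_knob tcp_keepalive_time 7200)" \
        "$(read_knob tcp_keepalive_intvl 75)" \
        "$(read_knob tcp_keepalive_probes 9)"
}
```

With the default values, `keepalive_detect_secs 7200 75 9` prints 7875, i.e. a little over two hours. Lowering a knob as root, e.g. `sysctl -w net.ipv4.tcp_keepalive_time=600`, shortens that for every socket on the machine that uses SO_KEEPALIVE, not just git's.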

I suppose we can do shorter timeouts (at least under Linux) via
setsockopt(.. TCP_KEEP*) knobs, or we can call poll() ourselves
to timeout connections.  However, git packing operations on the
server can take a long time; so it might be bad to timeout
manually unless we know the connection is really dead.

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

I added git:// support for SO_KEEPALIVE in commit e47a8583a202
("enable SO_KEEPALIVE for connected TCP sockets")
back in 2011 (v1.7.10),
and http:// support later in 2013 (v1.8.5) with
commit a15d069a1986 ("http: enable keepalive on TCP sockets")

Re: clone hang prevention / timeout?

Jeff King
On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

Yes. Git's protocol has no timeouts, though each side is generally
either writing or reading at any moment, and so an interrupted
connection should cause either EPIPE or EOF, ending the process. The
exceptions I have seen are:

 - protocol / implementation bugs that cause a true deadlock. At this
   point we've fixed all known cases, but that doesn't mean there aren't
   bugs lurking.

 - the network drops out in such a way that the OS doesn't realize the
   connection is gone, and the reading side is left waiting for input
   forever.

I think the TCP keepalive stuff that Eric mentioned should address the
latter, though I don't know how well it works in practice. We used to
sometimes see processes hung for days on GitHub, but it's been a long
time. I don't recall if it was pre-v1.8.5 (which introduced
SO_KEEPALIVE), or if we made some other change (we have a load-balancing
layer in front that has more aggressive timeouts).

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

There are periods where a git client may have to wait for a while in
read() while the other side is quiet (e.g., when the other side is badly
packed and needs to do a lot of up-front CPU work to prepare the
packfile). Since v1.8.4.2, the server side of a clone should generate
application-level keepalive packets, so that the client never sees
silence for more than ~5 seconds. The freedesktop servers appear to be
on v2.1.4, so a long read() as you're seeing probably is a real hang.
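
For anyone running their own server: the keepalive interval described above appears to correspond to the uploadpack.keepAlive configuration variable (documented default of 5 seconds, with 0 disabling the packets). A minimal sketch of setting it follows; the function name and the repository path passed to it are placeholders for this example, and it has no effect on a server you do not control.

```shell
# Set the upload-pack keepalive interval (in seconds) for one server-side
# repository, then print the stored value to confirm it.
set_upload_keepalive() {
    # $1 = path to the repository on the server, $2 = interval in seconds
    git -C "$1" config uploadpack.keepAlive "$2" &&
    git -C "$1" config --get uploadpack.keepAlive
}
```

For example, `set_upload_keepalive /srv/git/project.git 5` (a hypothetical path) would make the stock 5-second interval explicit.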

Note that pushing has a similar problem (the client may wait a long time
while the server chews on the uploaded packfile before reporting
status). There are no keepalives in that direction, though I have a
series there that I need to polish and submit.

-Peff

Re: clone hang prevention / timeout?

Jeff King
On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

Oh, one other thing you might consider is something like "timeout" from
GNU coreutils, which puts a hard cap on the length of time a process can
run.

It's totally unaware of the state of the process, though, so if you
really do have a clone which takes an hour, it might very well kill it
at 99% complete. It has no mechanism for "gee, this process looks like
it hasn't done anything for 5 minutes".
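
A minimal sketch of that hard cap, assuming GNU coreutils "timeout" is installed; the wrapper name run_capped is made up for this example. It sends TERM when the cap is reached and KILL ten seconds later if the process has not exited.

```shell
# Run a command under a hard wall-clock cap.
# Exit status is 124 if the cap was hit, 137 if KILL had to be used,
# otherwise the command's own exit status.
run_capped() {
    cap_secs=$1
    shift
    timeout -s TERM -k 10 "$cap_secs" "$@"
}
```

Usage would look like `run_capped 3600 git clone "$url"`, with $url standing in for the repository being cloned. A status of 124 means the cap was reached, and the partial clone directory is best removed before any retry.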

I don't know offhand of a general tool for that.
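
One way to approximate "hasn't done anything for N seconds" is to watch whether the command's output is still growing. The sketch below is only that, a sketch: the function name is hypothetical, it treats silence as a hang (so a legitimately quiet phase would also be killed), and it holds the command's output in a temporary file until the command ends.

```shell
# Run a command and kill it if its combined stdout/stderr stops growing
# for idle_secs consecutive seconds.
# Returns the command's exit status, or 124 if it was killed for being idle.
run_with_idle_timeout() {
    idle_secs=$1
    shift
    log=$(mktemp) || return 1
    "$@" >"$log" 2>&1 &
    pid=$!
    last_size=0
    idle=0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$idle" -ge "$idle_secs" ]; then
            # Still alive but silent for too long: treat it as hung.
            kill -TERM "$pid" 2>/dev/null
            wait "$pid" 2>/dev/null
            rm -f "$log"
            return 124
        fi
        sleep 1
        size=$(( $(wc -c <"$log") ))
        if [ "$size" -ne "$last_size" ]; then
            last_size=$size
            idle=0
        else
            idle=$((idle + 1))
        fi
    done
    wait "$pid"
    status=$?
    cat "$log"      # replay the captured output once the command is done
    rm -f "$log"
    return "$status"
}
```

For a clone this could be called as `run_with_idle_timeout 300 git clone --progress "$url"`, where $url is a placeholder; --progress makes git write progress to stderr even when it is not attached to a terminal, so a healthy transfer keeps the log growing.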

-Peff

Re: clone hang prevention / timeout?

Jason Vas Dias
Thanks very much Eric & Jeff for your replies.

Personally, I would recommend setting SO_RCVTIMEO for GIT server
sockets to a fixed default (e.g. 5 mins), settable by a
'--receive-timeout' argument or configuration parameter.

The problem I was trying to overcome was cloning all the repositories under
https://anongit.freedesktop.org/xorg/* .

About 4 git clones would succeed in succession, but then typically the 5th
would hang in read() forever - I left one such hung 'git clone' for nearly an
hour and it had not progressed or timed out . I tried inserting a delay of
up to 30 seconds between clones, but this did not help.

Maybe freedesktop.org's GIT server is too overloaded and they have
to resort to disabling 1 out of every 5 successive GIT clone operations
from the same connection or something.

Here is my solution, in case anyone else needs it :

<quote><pre>
       # Last-seen instruction pointer per child PID, and seconds it has been unchanged.
       eips=()
       counts=()
       declare -i failed=0
       rm -f /tmp/git.pid   # do not pick up a stale PID from a previous run
       # The background subshell records its own PID, then execs git, so the
       # recorded PID is the PID of the main git process.
       { echo "$BASHPID" >/tmp/git.pid;
         GIT_TRACE=2 exec git clone "${proto}://${user}anongit.freedesktop.org/${repo}$name"; } &
       while [ ! -s /tmp/git.pid ]; do sleep 1; done
       git_pid="$(cat /tmp/git.pid)"
       while [ -d "/proc/$git_pid" ]; do
           # One "pid eip" line per child of the main git process.
           IFS=$'\n'
           declare -a kids=($(ps --ppid "$git_pid" -o 'pid=,eip='))
           unset IFS
           declare -i n_kids=${#kids[@]} kid_n
           for ((kid_n=0; kid_n < n_kids; kid_n+=1)); do
             declare -a ke=(${kids[kid_n]})
             kid=${ke[0]}
             eip=${ke[1]}
             if [ ! -v "eips[$kid]" ]; then
                eips[$kid]="$eip"
             elif [ "${eips[$kid]}" = "$eip" ]; then
                # Same EIP as on the previous poll: count one more idle second.
                counts[$kid]=$(( ${counts[$kid]:-0} + 1 ))
                if (( ${counts[$kid]} >= 30 )); then
                   echo "child process $kid of git main process $git_pid appears to be stuck - killing it."
                   kill -TERM "$kid"
                   ((failed=1))
                fi
             else
                # EIP moved, so the child is making progress: reset its counter.
                eips[$kid]="$eip"
                counts[$kid]=''
             fi
          done
          sleep 1
       done
       wait
</pre></quote>

This is part of a script that reads a list of the Xorg projects,
sets $repo to the top-level subdirectory and $name to the project name,
and initiates the GIT clone.
It deems any GIT _CHILD_ process (e.g. git-index-pack) whose
instruction pointer register (EIP) has not changed for 30 seconds to be
"hung".
There is logic at the end to retry all the failed clones.
It does work, but is far from pretty.
It sure would be nice if GIT had a timeout mechanism!

Thanks & Regards,
Jason






On 13/04/2016, Jeff King <[hidden email]> wrote:

> On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:
>
>> Is there any option I can specify to get the clone to timeout, or do I
>> manually
>> have to strace the git process and send it a signal after a hang is
>> detected?
>
> Oh, one other thing you might consider, it something like "timeout" from
> GNU coreutils, which puts a hard cap on the length of time a process can
> run.
>
> It's totally unaware of the state of the process, though, so if you
> really do have a clone which takes an hour, it might very well kill it
> at 99% complete. It has no mechanism for "gee, this process looks like
> it hasn't done anything for 5 minutes".
>
> I don't know offhand of a general tool for that.
>
> -Peff
>

Re: clone hang prevention / timeout?

Eric Wong
Jason Vas Dias <[hidden email]> wrote:
> Thanks very much Eric & Jeff for your reply .
>
> Personally, I would recommend setting the SO_RECVTIMEO for GIT server
> sockets to a fixed default (eg. 5mins) , settable by a
> '--receive-timeout'   argument or configuration parameter .

(apologies for the delay, I thought I replied earlier :x)

SO_RCVTIMEO only triggers EAGAIN, and AFAIK the git read/write
wrappers are used to transparently retry on EAGAIN...  So it's
not so simple as doing a single setsockopt.

> The problem I was trying to overcome was cloning all the repositories under
> https://anongit.freedesktop.org/xorg/* .
>
> About 4 git clones would succeed in succession, but then typically the 5th
> would hang in read() forever - I left one such hung 'git clone' for nearly an
> hour and it had not progressed or timed out . I tried inserting a delay of
> up to 30 seconds between clones, but this did not help.

Are you in contact with any of the admins of that server to
help?  Is the problematic repo any larger or in any way
stranger than the others?

> Maybe freedesktop.org's GIT server is too overloaded and they have
> to resort to disabling 1 out of 5 GIT successive clone operations from
> same connection or something.

Anyways I've been thinking about overloaded git servers, lately.
Pack generation on big repos is painful, and having lots of slow
clients can tie up server memory.  So maybe an HTTP server
which can switch between dumb and smart operation depending on
load could be useful for the resource-constrained.

> Here is my solution, in case anyone else needs it :

It'd be nice to get an strace to know where in the clone process
it hangs to help the admin figure out how far things got.
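
A sketch of how such a trace could be collected and summarized. The helper names are hypothetical, and the parsing assumes the usual layout of `strace -f -tt -o FILE` output (PID, timestamp, then the call), which may differ between strace versions.

```shell
# Trace a command and all of its children, with timestamps, into a file.
trace_to_file() {
    trace_file=$1
    shift
    strace -f -tt -o "$trace_file" "$@"
}

# Print "pid syscall" for the last system call recorded for each process.
# For a hung clone, this is the call each process is stuck in.
last_syscalls() {
    awk '$3 ~ /^[a-z_0-9]+\(/ {
             name = $3
             sub(/\(.*/, "", name)   # keep only the syscall name
             last[$1] = name
         }
         END { for (pid in last) print pid, last[pid] }' "$1" | sort -n
}
```

For example: `trace_to_file clone.trace git clone "$url"` (with $url as a placeholder), then, once the clone appears stuck, `last_syscalls clone.trace` from another terminal.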

And please don't top-post, it's a waste of resources.