"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Stefan Sperling <stsp@stsp.name>
Subject:
Re: make 'got send' detect closed connections
To:
Omar Polo <op@omarpolo.com>
Cc:
gameoftrees@openbsd.org
Date:
Wed, 13 Sep 2023 22:19:13 +0200

Download raw body.

Thread
On Wed, Sep 13, 2023 at 08:08:42PM +0200, Omar Polo wrote: > On 2023/09/13 12:28:53 +0200, Stefan Sperling <stsp@stsp.name> wrote:
> > On Wed, Sep 13, 2023 at 12:18:12AM +0200, Omar Polo wrote:
> > > On 2023/09/12 21:17:38 +0200, Stefan Sperling <stsp@stsp.name> wrote:
> > The idea certainly works (see diff below):
> > 
> > $  got send -b qwx gothub
> > Connecting to "gothub" ssh://stsp@gothub.org/src.git
> > Host key fingerprint is SHA256:HT7ohAIc4knlmUm3vFUyyK9/DW/sFccZuPG0oNwBWYA
> > 15671 commits coloredgotsh: operation timed out
> > 
> > got: server unexpectedly closed the connection
> > $
> > 
> > However, the patch as I wrote it has unintended side effects.
> > For some reason I am seeing sporadic errors during test runs once
> > we start catching sigchld.
> > There are two chldren involved (got-send-pack and ssh), which might
> > be relatad to this issue.
> 
> Yeah, I forgot that we have other children other than ssh... sorry.
> 
> It starts to get tricky... 
> 
> > I don't have time right now to dig into how we could avoid this.
> > 
> > $ ./send.sh
> > test_send_basic got: ppoll: Interrupted system call
> > got clone command failed unexpectedly
> > test failed; leaving test data in /tmp/got-test-send_basic-uyH6MHTmyu
> > test_send_rebase_required ok
> > test_send_rebase_required_overwrite got: ppoll: Interrupted system call
> 
> This is SIGCHLD interrupting ppoll() in lib/pollfd.c...
> 
> We'd need something like this on top of your diff but then I'm not
> sure.  Your initial diff was more self-contained.
> 
> If you really want to stop the coloring early then your original diff
> is ok op@.  I remembered that the progress callback is ratelimited, so
> there's no need to worry about issuing too many poll()s.

Thanks!

Here is a working diff using SIGCHLD. It contains your pollfd fix
and also switches from ERR_CANCELLED to ERR_EOF such that we actually
show the error message. (My original draft was already using ERR_EOF
but I sent out a version where I had changed this code for some reason.)

There were subtle test failures from test_send_delete and test_send_config.
In these tests ssh exits early under normal conditions. I have added a
check that prevents dead connection detection for pack files that contain
no objects. This seems fine since we will never spend a huge amount of
time on an empty pack file so there's no need to detect broken connections.

In theory we could use sigaction(2) to avoid getting sigchld when a
child is stopped rather than exiting. We could ten also use siginfo to
find the PID the signal was delivered for, ignoring CHLD signals that
are sent for got-send-pack rather than ssh.
However, the OpenBSD kernel does not implement siginfo_t for SIGCHLD so
all of this would be rather pointless at present. And I don't know how
well sigaction would work in -portable.

-----------------------------------------------
 make 'got send' detect dead ssh connections
 
 In case the ssh process dies, stop immediately instead of generating a
 pack file and uploading it into the void.
 
diff fa9997e790c81002782c0bb2747fa2050576dbb3 971e9c458f41642dc6536996119a20741a2d8e35
commit - fa9997e790c81002782c0bb2747fa2050576dbb3
commit + 971e9c458f41642dc6536996119a20741a2d8e35
blob - 73812ddbb0f2e3a0fee927fc0ae5b873c04e385a
blob + 590c12f1e0ba86dee29097b453ac6ea0a789cc4e
--- got/got.c
+++ got/got.c
@@ -70,6 +70,7 @@
 
 static volatile sig_atomic_t sigint_received;
 static volatile sig_atomic_t sigpipe_received;
+static volatile sig_atomic_t sigchld_received;
 
 static void
 catch_sigint(int signo)
@@ -83,6 +84,11 @@ catch_sigpipe(int signo)
 	sigpipe_received = 1;
 }
 
+static void
+catch_sigchld(int signo)
+{
+	sigchld_received = 1;
+}
 
 struct got_cmd {
 	const char	*cmd_name;
@@ -246,6 +252,7 @@ main(int argc, char *argv[])
 
 	signal(SIGINT, catch_sigint);
 	signal(SIGPIPE, catch_sigpipe);
+	signal(SIGCHLD, catch_sigchld);
 
 	for (i = 0; i < nitems(got_commands); i++) {
 		const struct got_error *error;
@@ -9537,6 +9544,7 @@ struct got_send_progress_arg {
 	int printed_something;
 	int sent_something;
 	struct got_pathlist_head *delete_branches;
+	pid_t *sendpid;
 };
 
 static const struct got_error *
@@ -9552,7 +9560,21 @@ send_progress(void *arg, int ncolored, int nfound, int
 	int print_colored = 0, print_found = 0, print_trees = 0;
 	int print_searching = 0, print_total = 0;
 	int print_deltify = 0, print_written = 0, print_sent = 0;
+	int sendstatus;
 
+	if (sigchld_received) {
+		sigchld_received = 0;
+		if (nobj_total > 0) {
+			if (waitpid(*a->sendpid, &sendstatus, WNOHANG) == -1)
+				return got_error_from_errno("waitpid");
+			if (WIFEXITED(sendstatus)) {
+				*a->sendpid = -1;
+				return got_error_msg(GOT_ERR_EOF, "server "
+				    "unexpectedly closed the connection");
+			}
+		}
+	}
+
 	if (a->verbosity < 0)
 		return NULL;
 
@@ -10030,6 +10052,7 @@ cmd_send(int argc, char *argv[])
 	spa.last_p_written = -1;
 	spa.verbosity = verbosity;
 	spa.delete_branches = &delete_branches;
+	spa.sendpid = &sendpid;
 	error = got_send_pack(remote_name, &branches, &tags, &delete_branches,
 	    verbosity, overwrite_refs, sendfd, repo, send_progress, &spa,
 	    check_cancelled, NULL);
blob - 32efcd7f1a512fe6f1444aaba4ff7e8aa6d051b1
blob + c99cc04285b05fd46dad35c51ef3ac7896672ad4
--- lib/pollfd.c
+++ lib/pollfd.c
@@ -43,6 +43,8 @@ got_poll_fd(int fd, int events, int timeout)
 		return got_error_from_errno("sigemptyset");
 	if (sigaddset(&sigset, SIGWINCH) == -1)
 		return got_error_from_errno("sigaddset");
+	if (sigaddset(&sigset, SIGCHLD) == -1)
+		return got_error_from_errno("sigaddset");
 
 	n = ppoll(pfd, 1, timeout == INFTIM ? NULL : &ts, &sigset);
 	if (n == -1)