"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Stefan Sperling <stsp@stsp.name>
Subject:
Re: gotd: Don't use WNOHANG when waiting for child processes
To:
Omar Polo <op@omarpolo.com>
Cc:
Josiah Frentsos <jfrent@tilde.team>, gameoftrees@openbsd.org
Date:
Mon, 19 Jun 2023 23:39:04 +0200

Download raw body.

Thread
On Sat, Mar 04, 2023 at 11:58:46AM +0100, Omar Polo wrote:
> On 2023/03/03 09:35:59 -0500, Josiah Frentsos <jfrent@tilde.team> wrote:
> > Fixes this:
> > 
> >     2023-03-03T13:51:15.155Z silicon gotd[7715]: child PID 0 terminated; signal 12
> >     2023-03-03T13:51:15.186Z silicon last message repeated 419 times
> 
> I don't think the diff is wrong per-se, wait_for_child is just a
> wrapper around waitpid and we're interested in one child only anyway,
> so I *think* WNOHANG is not much important.  I'm curious why it's
> needed though.
> 
> waitpid(WNOHANG) returns 0 when "there are no stopped or exited
> children".  Since we call it always after kill_proc() maybe it's a
> timing issue: we call waitpid(WNOHANG) before the signal is handled
> and the other process exits?

Given that wait_for_child() runs in libevent handlers (via client_disconnect)
I don't think we want to call waitpid() in blocking mode. We should stay in
control over how long we are going to wait for the child to exit.
Otherwise, if waitpid() never returns for some reason (for example, because
the child is spinning on the CPU and doesn't exit) then all of gotd will
hang because further events will not be processed.

The diff below gives up after 5 seconds. Good enough? Too long?

I did consider retrying with kill -9 upon timeout. But for now I would rather
be able to attach gdb to a child that doesn't terminate to figure out why.

diff /home/stsp/src/got
commit - 9628f36dac5ed5319b482a020f06ff9737a0c1f0
path + /home/stsp/src/got
blob - f15597dbd0e7299cbf018874f9d81d6c8dcaf403
file + gotd/gotd.c
--- gotd/gotd.c
+++ gotd/gotd.c
@@ -271,15 +271,23 @@ wait_for_child(pid_t child_pid)
 {
 	pid_t pid;
 	int status;
+	int tries = 5;
 
-	log_debug("waiting for child PID %ld to terminate",
-	    (long)child_pid);
-
 	do {
+		log_debug("waiting for child PID %ld to terminate",
+		    (long)child_pid);
+
 		pid = waitpid(child_pid, &status, WNOHANG);
 		if (pid == -1) {
 			if (errno != EINTR && errno != ECHILD)
 				fatal("wait");
+		} else if (pid == 0) {
+			if (--tries <= 0) {
+				log_warnx("child PID %ld did not terminate",
+				    (long)child_pid);
+				break;
+			}
+			sleep(1);
 		} else if (WIFSIGNALED(status)) {
 			log_warnx("child PID %ld terminated; signal %d",
 			    (long)pid, WTERMSIG(status));