got-notify-http: fix unicode handling

From: Stefan Sperling <stsp@stsp.name>
Subject: Re: got-notify-http: fix unicode handling
To: Omar Polo <op@omarpolo.com>
Cc: gameoftrees@openbsd.org
Date: Thu, 28 Mar 2024 09:45:41 +0100

Thread

2024-03-28 07:21 Omar Polo:
got-notify-http: fix unicode handling
- 2024-03-28 08:45 Stefan Sperling:
  got-notify-http: fix unicode handling
- - 2024-04-04 09:44 op@omarpolo.com:
    got-notify-http: fix unicode handling
  - - 2024-04-04 10:32 Stefan Sperling:
      got-notify-http: fix unicode handling

On Thu, Mar 28, 2024 at 08:21:17AM +0100, Omar Polo wrote:
> JSON strings are made of UNICODE codepoints, of which only the control
> characters, \ and " need to be escaped.  Furthermore, per RFC8259:
> 
> : JSON text exchanged between systems that are not part of a closed
> : ecosystem MUST be encoded using UTF-8.
> 
> so when POSTing the notifications the JSON text has to be encoded in
> UTF-8.
> 
> The current code is wrong because it escapes every *byte* over 0x7F
> as \uXXXX, and this causes mis-decoding issues.
> 
> isu8cont() as far as I can see will happily accept surrogate pairs and
> overlong sequences (since it doesn't parse), which will cause an error
> on the receiving side while decoding the JSON.

Right, such sequences should be filtered and/or replaced.
Eventually we should do this for SMTP notifications, too.

> I don't think I can reasonably use mbtowc() either since it will use the
> current locale which is problematic in -portable.
> 
> So, I'm bundling my favourite utf8 decoder (DFAs are lovely) and using
> that to read the text.  Upon decoding error the replacement character
> U+FFFD is emitted in the JSON string, all the bytes considered so far
> discarded and the decoder restarted with the next byte.  (Not the only
> technique, just the simplest to implement.)

I didn't know about this decoder, it is interesting!

Does the use of U+FFFD do something specific in JSON?
What about using '?' like we do in OpenBSD base tools?

In any case, ok by me. We can keep tweaking in-tree.
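
For readers following along, here is a minimal sketch of the approach
described in the quoted text: pass well-formed UTF-8 through unchanged,
escape only control characters, \ and ", and emit U+FFFD on a decoding
error. It is not the code from the patch. It uses a plain hand-written
validator rather than the DFA decoder Omar mentions, and the names
utf8_seq() and json_escape() are hypothetical. It also assumes that
"bytes considered so far" includes the byte that triggered the error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Check one UTF-8 sequence at s (n >= 1 bytes available).
 * Returns 1 and sets *used to its length if well-formed; returns 0 and
 * sets *used to the number of bytes to discard otherwise.
 */
static int
utf8_seq(const unsigned char *s, size_t n, size_t *used)
{
	unsigned char lo = 0x80, hi = 0xBF;
	size_t need, i;

	if (s[0] < 0x80) {
		*used = 1;
		return 1;
	} else if (s[0] >= 0xC2 && s[0] <= 0xDF)
		need = 1;
	else if (s[0] >= 0xE0 && s[0] <= 0xEF)
		need = 2;
	else if (s[0] >= 0xF0 && s[0] <= 0xF4)
		need = 3;
	else {
		*used = 1;	/* stray continuation, C0/C1, F5..FF */
		return 0;
	}

	/* The first continuation byte has a narrower range for some leads. */
	if (s[0] == 0xE0)
		lo = 0xA0;	/* reject overlong 3-byte forms */
	else if (s[0] == 0xED)
		hi = 0x9F;	/* reject encoded surrogates */
	else if (s[0] == 0xF0)
		lo = 0x90;	/* reject overlong 4-byte forms */
	else if (s[0] == 0xF4)
		hi = 0x8F;	/* reject anything above U+10FFFF */

	for (i = 1; i <= need; i++) {
		if (i >= n) {
			*used = n;	/* truncated at end of input */
			return 0;
		}
		if (s[i] < lo || s[i] > hi) {
			*used = i + 1;	/* drop the offending byte too */
			return 0;
		}
		lo = 0x80;
		hi = 0xBF;
	}
	*used = need + 1;
	return 1;
}

/*
 * Write in[0..len) as a quoted, NUL-terminated JSON string into out.
 * Returns 0 on success, -1 if out is too small.
 */
static int
json_escape(char *out, size_t outsz, const unsigned char *in, size_t len)
{
	size_t i = 0, o = 0, used, plen;
	const char *piece;
	char tmp[8];

	assert(out != NULL && in != NULL);
	if (outsz < 3)
		return -1;
	out[o++] = '"';
	while (i < len) {
		if (!utf8_seq(in + i, len - i, &used)) {
			piece = "\xEF\xBF\xBD";	/* U+FFFD encoded as UTF-8 */
			plen = 3;
		} else if (in[i] == '"' || in[i] == '\\') {
			tmp[0] = '\\';
			tmp[1] = (char)in[i];
			piece = tmp;
			plen = 2;
		} else if (in[i] < 0x20) {
			plen = (size_t)snprintf(tmp, sizeof(tmp), "\\u%04x",
			    (unsigned int)in[i]);
			piece = tmp;
		} else {
			piece = (const char *)in + i;	/* copy verbatim */
			plen = used;
		}
		if (o + plen + 2 > outsz)	/* keep room for '"' and NUL */
			return -1;
		memcpy(out + o, piece, plen);
		o += plen;
		i += used;
	}
	out[o++] = '"';
	out[o] = '\0';
	return 0;
}
```

Because the output stays UTF-8, a receiver sees U+FFFD as an ordinary
character inside the string; substituting '?' instead would only
change the three-byte replacement in json_escape().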
