From: Stefan Sperling <stsp@stsp.name>
Subject: Re: got-notify-http: fix unicode handling
To: Omar Polo <op@omarpolo.com>
Date: Thu, 28 Mar 2024 09:45:41 +0100

On Thu, Mar 28, 2024 at 08:21:17AM +0100, Omar Polo wrote:
> JSON strings are made of UNICODE codepoints, of which only the control
> characters, \ and " need to be escaped.  Furthermore, per RFC8259:
> : JSON text exchanged between systems that are not part of a closed
> : ecosystem MUST be encoded using UTF-8.
> so when POSTing the notifications the JSON text has to be encoded in
> UTF-8.
> The current code is wrong because it escapes each *byte* over 0x7F as
> \uXXXX, and this causes mis-decoding issues.
> isu8cont() as far as I can see will happily accept surrogate pairs and
> overlong sequences (since it doesn't parse), which will cause an error
> on the receiving side while decoding the JSON.

Right, such sequences should be filtered and/or replaced.
Eventually we should do this for SMTP notifications, too.
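
To illustrate the byte-wise escaping problem quoted above, here is a
minimal C sketch. The helper name escape_bytewise() is hypothetical and
this is not the code from got-notify-http; it only reproduces the
described behaviour of turning every byte over 0x7F into its own \uXXXX
escape.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, for illustration only: escape every byte above
 * 0x7F as \u00XX. This is the byte-wise behaviour described as wrong.
 * outlen must be at least 1.
 */
static void
escape_bytewise(const char *in, char *out, size_t outlen)
{
	size_t used = 0;

	out[0] = '\0';
	for (; *in != '\0'; in++) {
		unsigned char c = (unsigned char)*in;
		int n;

		if (c > 0x7F)
			n = snprintf(out + used, outlen - used, "\\u%04x",
			    (unsigned int)c);
		else
			n = snprintf(out + used, outlen - used, "%c", c);
		if (n < 0 || (size_t)n >= outlen - used) {
			out[used] = '\0';	/* drop the partial escape */
			break;			/* output buffer is full */
		}
		used += (size_t)n;
	}
}
```

Fed the UTF-8 encoding of "é" (bytes 0xC3 0xA9), this yields
\u00c3\u00a9, which a JSON parser reads as the two code points U+00C3
and U+00A9 instead of the single code point U+00E9.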

> I don't think I can reasonably use mbtowc() either since it will use the
> current locale which is problematic in -portable.
> So, I'm bundling my favourite utf8 decoder (DFAs are lovely) and using
> that to read the text.  Upon decoding error the replacement character
> U+FFFD is emitted in the JSON string, all the bytes considered so far
> discarded and the decoder restarted with the next byte.  (Not the only
> technique, just the simplest to implement.)

I didn't know about this decoder, it is interesting!

Does the use of U+FFFD do something specific in JSON?
What about using '?' like we do in OpenBSD base tools?

In any case, ok by me. We can keep tweaking in-tree.
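
The replace-on-error strategy quoted above can be sketched roughly as
follows. This is only an illustration under stated assumptions: it uses
a plain hand-written validator rather than the table-driven DFA decoder
from the patch, the names utf8_seqlen(), json_escape() and REPLACEMENT
are hypothetical, and on error it skips a single byte, so it may emit
more replacement characters than the restart strategy described in the
patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REPLACEMENT "\xef\xbf\xbd"	/* U+FFFD encoded as UTF-8 */

/*
 * Return the length (1-4) of the well-formed UTF-8 sequence at the
 * start of s, or 0 if the bytes there are not well-formed. Rejects
 * overlong forms, surrogates and code points above U+10FFFF.
 */
static size_t
utf8_seqlen(const unsigned char *s, size_t len)
{
	uint32_t cp, min;
	size_t i, n;

	if (len == 0)
		return 0;
	if (s[0] < 0x80)
		return 1;
	if ((s[0] & 0xE0) == 0xC0) {
		n = 2; cp = s[0] & 0x1F; min = 0x80;
	} else if ((s[0] & 0xF0) == 0xE0) {
		n = 3; cp = s[0] & 0x0F; min = 0x800;
	} else if ((s[0] & 0xF8) == 0xF0) {
		n = 4; cp = s[0] & 0x07; min = 0x10000;
	} else
		return 0;	/* stray continuation or invalid lead byte */

	if (len < n)
		return 0;	/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((s[i] & 0xC0) != 0x80)
			return 0;
		cp = (cp << 6) | (s[i] & 0x3F);
	}
	if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	return n;
}

/*
 * Write the contents of a JSON string (without the surrounding quotes)
 * for the NUL-terminated input. Valid UTF-8 is copied through as-is;
 * only control characters, '"' and '\' are escaped. Malformed bytes
 * are replaced. Returns 0 on success, -1 if out is too small.
 */
static int
json_escape(const char *in, char *out, size_t outlen)
{
	const unsigned char *s = (const unsigned char *)in;
	size_t len = strlen(in), used = 0, n, chunklen;
	const char *chunk;
	char tmp[8];

	if (outlen == 0)
		return -1;
	while (len > 0) {
		n = utf8_seqlen(s, len);
		if (n == 0) {
			chunk = REPLACEMENT; chunklen = 3;
			n = 1;	/* skip one bad byte and resynchronise */
		} else if (*s == '"' || *s == '\\') {
			tmp[0] = '\\'; tmp[1] = (char)*s;
			chunk = tmp; chunklen = 2;
		} else if (*s < 0x20) {
			chunklen = (size_t)snprintf(tmp, sizeof(tmp),
			    "\\u%04x", (unsigned int)*s);
			chunk = tmp;
		} else {
			chunk = (const char *)s; chunklen = n;
		}
		if (chunklen >= outlen - used)
			return -1;	/* no room for chunk plus NUL */
		memcpy(out + used, chunk, chunklen);
		used += chunklen;
		s += n; len -= n;
	}
	out[used] = '\0';
	return 0;
}
```

In this sketch the replacement is just a string constant, so using "?"
instead of U+FFFD would be a one-line change to REPLACEMENT; either
choice produces a valid JSON string.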