"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Omar Polo <op@omarpolo.com>
Subject:
Re: got patch: add flag to ignore whitespace?
To:
Stefan Sperling <stsp@stsp.name>
Cc:
Christian Weisgerber <naddy@mips.inka.de>, gameoftrees@openbsd.org
Date:
Sun, 03 Jul 2022 11:46:15 +0200

Download raw body.

Thread
Stefan Sperling <stsp@stsp.name> wrote:
> On Sun, Jul 03, 2022 at 09:24:10AM +0200, Omar Polo wrote:
> > However, this got me thinking: maybe ignoring _every_ whitespace is too
> > lax?  What about ignoring only the leading and trailing ones?  Regress
> > passes and the mangled patch sent by Mark still applies.
> 
> Hmm. Not sure about leading whitespace, this could get complicated quickly.
> 
> Some email clients convert tabs ('\t') to spaces (' ') and mangle patches.
> We should have a regress test involving code which is indented by tabs
> in the original and target file, and indented by spaces in the patch.

this is already partially covered in test_patch_whitespace, where the
patch has

 1) one context line with whitespace removed
 2) one context line without the leading space
 3) one context line indented with four spaces where the file only uses tabs
 4) one added line with spaces instead of tabs

it produces what I think it's the best we can hope to: only the added
line has spaces, the others are kept verbatim from the original file (or
blob in case of a merge.)

> It would be difficult for 'got patch' to repair such indentation problems
> automatically and meet user expectations. Some styles, including our own,
> mix tabs and spaces on the same line. 'got patch' would have to infer the
> indentation style from the target file that is being patched, and then apply
> lines with bad whitespace such that tabs vs. spaces are preserved in a
> way the user expects.

I'm not trying to turn 'got patch' into a code formatter!

> Which is impossible to get right in general, since
> we don't know which whitespace was added by the author or the email program.
> This may be why other patch tools leave this behaviour off by default.

Exactly, we don't know wether a whitespace was there as intended or
mangled.  That's why the diff in test_patch_whitespace ends up adding a
line that's indented with spaces in a C program otherwise indented with
tabs: we can't know if it was intended or not.

Let me stress this point: the laxiness is only applied when matching the
context of the diff and when we use it we notify the user.  For the
other lines (i.e. additions) we will never know.

The only possibility I can think of is a patch where the '+' lines have
mangled whitespaces and the other don't.  This would be the only case
where we apply the patch without complaining, but so would do every
other patch implementation, with or without laxiness on whitespaces.

> Because this is just whitespace, and our output already hints that whitespace
> was mangled, I still think -w should be the default for 'got patch'. But
> we should definitely provide an option to disable it.

If the general agreement is to provide an option to disable it (or have
it disabled by default and opt in with -w) it's fine for me.  I can work
on it, no problem.

However, I'm still not completely convinced it's a good idea.  We
already inform the user when we have to use heuristics to apply the diff
(at offset or now with ignoring whitespaces) quite clearly.  In a script
one could do something like this

	if got patch | grep ^@; then
		echo woops, patch not applied cleanly
		exit 1
	fi

and if a patch was mangled beyond the point of sanity and was still
applied, one can still use 'got revert' to get rid of the edits and
either complain to the sender or work their way manually somehow.

(Admittedly one annoyance would be to have some local edits, then apply
a diff and not being able to revert to just the prior local edits.  This
is already an "issue" today and can be trivially solved by commiting the
local edits before applying a patch.)

Furthermore, if one really wants badly a "perfect application" of diffs,
they should also make sure that the 'got patch' used only the diff3
merge and not the search/replace, as it can be confused easily.

(still easy to do from a script: got patch | grep '^[^G]')

> Another problem with having -w on by default is that it is possible to
> encode harmful things in unicode which end up rendering only whitespace
> on the terminal. We should make sure that we are really looking at ASCII
> characters ' ' and \t, instead of using isspace() (which may do
> who-knows-what on systems running got-portable).

I thought it was safe.  I'm calling isspace with a char, so I'm
expecting only 255 possible values, none of which should be mapped to
weird codepoints.  (rethinking, I probably don't want to know what other
system do :D)

Even if it were safe however, isspace is still too much here since it
also match \f and some other characters.  Checking only ' ' and \t seems
better.