"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Omar Polo <op@omarpolo.com>
Subject:
Re: diff vs CRLF
To:
gameoftrees@openbsd.org
Date:
Tue, 26 Jul 2022 12:49:03 +0200

Download raw body.

Thread
  • Omar Polo:

    diff vs CRLF

    • Omar Polo:

      diff vs CRLF

    • Stefan Sperling:

      diff vs CRLF

Omar Polo <op@omarpolo.com> wrote:
> `got diff' has some issues when dealing with CRLF files.  I stumbled on
> this when looking at the history of
> 
> 	https://github.com/fcorbelli/zpaqfranz
> 
> for example:
> 
> 	% got log -l1 -p -c 90eafd8 | head -n30
> 	...
> 	--- zpaqfranz.cpp
> 	+++ zpaqfranz.cpp
> 	@@ -1,87483 +1,45670 @@
> 	-///////// EXPERIMENTAL BUILD
> 	\ No newline at end of file
> 	+///////// EXPERIMENTAL BUILD
> 	...
> 	+///////// https://sourceforge.net/projects/zpaqfranz/
> 	
> 	-///////// The source is a mess
> 	\ No newline at end of file
> 	+///////// I had to reluctantly move parts of the comments ...
> 
> the "\ Now newline at end of file" is spammed very frequently across the
> diff.
> 
> i tracked this down to diff_data_atomize_text_lines_{fd,mmap}: there's
> an error in accounting for \r\n.  it's easy to reproduce with a smaller
> file:
> 
> 	% printf 'test\r\n' > crlf
> 	% got add crlf && got ci -m +crlf
> 	% printf 'test 2\r\n' > crlf
> 	% got di
> 	...
> 	@@ -1,2 +1,2 @@
> 	-test
> 	+test 2
> 
> 	%
> 
> notice the empty line after 'test 2' and the wrong line offsets in the
> hunk header, should be '@@ -1 +1 @@'.

I failed to explain how this error becomes the "\ No newline" previously
mentioned.  The issue is that we fail to correctly compute the end of
the line in CRLF files.  The code is written to support them, but in
reality it stops at the \r.

later, diff_output_trailing_newline_msg sees that the character at the
end of the line is not a line feed and assumes that the original file
was lacking a final newline.

> (not that it matters that much; `got patch', Larry' patch(1) and GNU'
> patch don't work on CRLF files anyway, this is just aesthetic for the
> reader)
> 
> patch below fixes the accounting in `got di' so it doesn't produce these
> strange results on CRLF-delimited files.  Should apply in diff.git too.
> 
> ok?
> 
> P.S.: it's just me or the regress of diff.git is broken?  i can't get it
> to run cleanly after a fresh checkout.  got regress is happy though.
> 
> P.P.S: we care enough about CRLF to add some regress?
> 
> diff /home/op/w/got
> commit - 10a16316a03005a07c45b2bbf1b5644b64e846fb
> path + /home/op/w/got
> blob - 0531fabe30530664069034f1c3166fe9d18a6e54
> file + lib/diff_atomize_text.c
> --- lib/diff_atomize_text.c
> +++ lib/diff_atomize_text.c
> @@ -92,7 +92,7 @@ diff_data_atomize_text_lines_fd(struct diff_data *d)
>  			r = fread(buf, sizeof(char), sizeof(buf), d->root->f);
>  			if (r == 0 && ferror(d->root->f))
>  				return errno;
> -			if (r == 1 && buf[0] == '\n' )
> +			if (r > 0 && buf[0] == '\n')
>  				line_end++;
>  		}
>  
> @@ -151,11 +151,9 @@ diff_data_atomize_text_lines_mmap(struct diff_data *d)
>  
>  		/* When not at the end of data, the line ending char ('\r' or
>  		 * '\n') must follow */
> -		if (line_end < end)
> +		if (line_end < end && *line_end == '\r')
>  			line_end++;
> -		/* If that was an '\r', also pull in any following '\n' */
> -		if (line_end < end - 1 && line_end[0] == '\r' &&
> -		    line_end[1] == '\n')
> +		if (line_end < end && *line_end == '\n')
>  			line_end++;
>  
>  		/* Record the found line as diff atom */