From: neeels <got@kleinekatze.de>
Subject: Re: diff algo implementation ("duff")
To: gameoftrees@openbsd.org
Date: Wed, 22 Jan 2020 17:37:55 +0100

On Wed, Jan 22, 2020 at 12:56:33PM +0100, Martin Pieuchot wrote:
> On 22/01/20(Wed) 04:32, neeels wrote:
> > [...] 
> > The first step would probably be some automated profiling with large amounts of
> > test data, so that we get feedback on whether things improve or are getting
> > worse. Any simple ideas?
> 
> What kind of tests cases do you want?  Huge diffs in term of lines
> added/removed like sys/dev/drm or Mesa updates?  Or moving a function in
> a driver file and making sure the diff is as small as possible?

We don't necessarily want a minimal number of diff lines, because a highly
fragmented diff is worse than one that shows humans intuitively what is going
on. At best, the line count could be interesting to show as a metric.

The most interesting part is how long it takes to calculate a diff.

And maybe how much peak RAM did it consume?

We could use real-world data, like producing diffs for the entire history of
project X. Not sure how best to do that: first dump lots of source files from a
real git repo onto disk and run the diff cmdline tool on them, or first
integrate into got and measure blame performance? Anyhow, we'd ideally want to
measure the pure diff time, not the process management / object parsing / disk
wait.
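
A minimal sketch of how the pure diff time and peak RAM could be measured in
process, assuming the inputs are already loaded before the timer starts. The
names bench_fn, bench_result, bench_run and fake_diff are made up for
illustration; fake_diff only stands in for a call into the diff code and is
not the real diff_main() signature.

```c
#define _XOPEN_SOURCE 700	/* expose clock_gettime() and getrusage() */
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>
#include <time.h>

/* Hypothetical wrapper type for the code under test. */
typedef int (*bench_fn)(void *arg);

struct bench_result {
	double	seconds;	/* wall-clock time spent inside fn only */
	long	peak_rss;	/* ru_maxrss; the unit is platform dependent */
	int	rc;		/* return value of fn */
};

/*
 * Time one call to fn(arg) with the monotonic clock.  File loading and
 * object parsing should happen before this call so they are not counted.
 */
int
bench_run(bench_fn fn, void *arg, struct bench_result *res)
{
	struct timespec t0, t1;
	struct rusage ru;

	assert(fn != NULL && res != NULL);

	if (clock_gettime(CLOCK_MONOTONIC, &t0) == -1)
		return -1;
	res->rc = fn(arg);
	if (clock_gettime(CLOCK_MONOTONIC, &t1) == -1)
		return -1;
	res->seconds = (double)(t1.tv_sec - t0.tv_sec) +
	    (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;

	/* Peak RSS covers the whole process, including the loaded input. */
	if (getrusage(RUSAGE_SELF, &ru) == -1)
		return -1;
	res->peak_rss = ru.ru_maxrss;
	return 0;
}

/* Stand-in workload so the sketch can be tried without a real diff. */
int
fake_diff(void *arg)
{
	volatile unsigned long sum = 0;
	unsigned long i, n = *(unsigned long *)arg;

	for (i = 0; i < n; i++)
		sum += i;
	return 0;
}
```

Since ru_maxrss is a per-process high-water mark, comparing peak RAM between
algorithms would probably need one process per measured diff, while the timing
part can loop over many file pairs in a single process.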

> That means on top of the algorithms you implemented we could start
> looking at the integration of this new pieces of code :o)  Did you look
> at the current diffreg() interface?  It is used in:
> 
> 	usr.bin/cvs
> 	usr.bin/diff
> 	usr.bin/rcs
> 	got/lib/

I looked at diffreg() only briefly and have forgotten the details already;
anyhow, I'm confident that from the diff_result struct that the current
diff_main() spits out, you can easily generate any kind of result structure.

(As an optimization, maybe we can later store results directly in the final
result format? But that shouldn't have any measurable impact, really.)

(Also now thinking that a streamy way of printing results could be nice, to
start seeing output before the entire diff is completed. That is possible with
some code changes: instead of passing out one final result, provide a callback
that gets solved diff chunks as they come up. But I'd also save that for
later?)
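
A rough sketch of what such a callback interface could look like. All names
here (struct chunk, chunk_cb, emit_chunks, print_chunk, count_chunk) are
hypothetical and not taken from the current code, and emit_chunks only walks a
pre-solved array instead of running a diff algorithm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* All names below are hypothetical; none come from the actual diff code. */
enum chunk_type { CHUNK_SAME, CHUNK_MINUS, CHUNK_PLUS };

struct chunk {
	enum chunk_type	type;
	size_t		left_start, left_count;		/* lines in old file */
	size_t		right_start, right_count;	/* lines in new file */
};

/* Called once per solved chunk; a nonzero return aborts the diff. */
typedef int (*chunk_cb)(const struct chunk *c, void *cb_arg);

/*
 * Stand-in for the solver: hands each chunk to the callback as soon as
 * it is available, instead of returning one final result.
 */
int
emit_chunks(const struct chunk *chunks, size_t n, chunk_cb cb, void *cb_arg)
{
	size_t i;
	int rc;

	assert(cb != NULL);
	for (i = 0; i < n; i++) {
		rc = cb(&chunks[i], cb_arg);
		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Example consumer: print a one-line summary per chunk to a FILE. */
int
print_chunk(const struct chunk *c, void *cb_arg)
{
	static const char sign[] = { ' ', '-', '+' };
	FILE *out = cb_arg;

	if (fprintf(out, "%c left %zu+%zu right %zu+%zu\n", sign[c->type],
	    c->left_start, c->left_count,
	    c->right_start, c->right_count) < 0)
		return -1;
	return 0;
}

/* Example consumer: count the chunks seen so far. */
int
count_chunk(const struct chunk *c, void *cb_arg)
{
	(void)c;
	(*(size_t *)cb_arg)++;
	return 0;
}
```

For the output to be printable as it arrives, a real solver would have to
deliver chunks in file order, or the consumer would have to buffer the ones
that arrive early.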

> > Another thing, so far it is called just "diff", which is asking for huge name
> > conflicts and confusion with previous diff.
> > I have had "duff" as a local alias for a unidiff (diff -u) for a long time, so
> > I think I want to name this project "duff", and make unidiff output the
> > default (in case the so far simplistic cmdline tool becomes install-worthy...)
> 
> I'd love to see your project become a drop-in replacement for OpenBSD's
> diff(1).

Maybe then we should discuss it in a more general mailing list?
Would that be tech@ ?

> So I don't see any problem with naming it "diff" :o)  Being
> compliant with POSIX and with the commonly use arguments is the tricky
> part.

Should be able to take the current diff(1) and swap out only the diff algo,
right? As long as there aren't too many mad spaghetti tentacles.


A side question, slightly OT, and just out of curiosity: if I were to integrate
this diff code in a GPL project, as the copyright holder I can just relicense
it, right? If I did that, would I then destroy the ability to use the code in
OpenBSD? Or would they become two separate projects that were once based on the
same code? Would they then follow their own contribution paths, and not be able
to benefit from each other's improvements somehow? It's not something I want to
do really, just wondering how the different license worlds interact.

~N