From: neeels
Subject: Re: diff algo implementation ("duff")
To: gameoftrees@openbsd.org
Date: Wed, 22 Jan 2020 17:37:55 +0100

On Wed, Jan 22, 2020 at 12:56:33PM +0100, Martin Pieuchot wrote:
> On 22/01/20(Wed) 04:32, neeels wrote:
> > [...]
> > The first step would probably be some automated profiling with large amounts of
> > test data, so that we get feedback on whether things improve or are getting
> > worse. Any simple ideas?
>
> What kind of tests cases do you want? Huge diffs in term of lines
> added/removed like sys/dev/drm or Mesa updates? Or moving a function in
> a driver file and making sure the diff is as small as possible?

We don't necessarily want the minimal number of diff lines, because a highly
fragmented diff is worse than one that shows humans intuitively what is going
on. At best, the line count could be interesting to show as a metric.

The most interesting part is how long it takes to calculate a diff.
And maybe how much peak RAM it consumed?

We could use real world data, like producing diffs for the entire history of
project X. Not sure how best to do that: first dump lots of source files from
a real git repo onto disk and run the diff cmdline tool on them, or first
integrate into got and measure blame performance?

Anyhow, we'd ideally want to measure the pure diff time, not the process
management / object parsing / disk wait.

> That means on top of the algorithms you implemented we could start
> looking at the integration of this new pieces of code :o) Did you look
> at the current diffreg() interface? It is used in:
>
> usr.bin/cvs
> usr.bin/diff
> usr.bin/rcs
> got/lib/

I looked at diffreg, but not much, and have forgotten all about it already.
Anyhow, I'm confident that from the diff_result struct that the current
diff_main() spits out, you can easily generate any kind of result structure.
(As an optimization we could maybe later store directly in the final result
format? But that shouldn't have any measurable impact, really.)

(Also now thinking that a streaming way of printing results could be nice, to
start seeing output before the entire diff is completed. That is possible with
some code changes: instead of passing out one final result, provide a callback
that gets solved diff chunks as they come up. But I'd also save that for
later?)

> > Another thing, so far it is called just "diff", which is asking for huge name
> > conflicts and confusion with previous diff.
> > I have had "duff" as a local alias for a unidiff (diff -u) for a long time, so
> > I think I want to name this project "duff", and make unidiff output the
> > default (in case the so far simplistic cmdline tool becomes install-worthy...)
>
> I'd love to see your project become a drop-in replacement for OpenBSD's
> diff(1).

Maybe then we should discuss it on a more general mailing list?
Would that be tech@ ?

> So I don't see any problem with naming it "diff" :o) Being
> compliant with POSIX and with the commonly use arguments is the tricky
> part.

We should be able to take the current diff(1) and swap out only the diff algo,
right? As long as there aren't too many mad spaghetti tentacles.

A side question, slightly OT, and just out of curiosity: if I were to integrate
this diff code in a GPL project, as the copyright holder I can just relicense
it, right? If I did that, would I then destroy the ability to use the code in
OpenBSD? Or would they become two separate projects that were once based on the
same code? Would they then follow their own contribution paths, and not be
able to benefit from each other's improvements somehow?
It's not something I really want to do, just wondering how the different
license worlds interact.

~N
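
P.S. To make the callback idea above a bit more concrete, here is a rough
sketch of what such an interface might look like. Every name in it (struct
chunk, chunk_cb, emit_chunks(), print_chunk(), count_changed()) is invented for
illustration and does not match the existing diff_result. emit_chunks() merely
walks a precomputed array; a real implementation would presumably invoke the
callback from inside the algorithm as soon as a section is solved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* All names below are hypothetical; the real result layout may differ. */
enum chunk_type { CHUNK_SAME, CHUNK_MINUS, CHUNK_PLUS };

struct chunk {
	enum chunk_type	  type;
	const char	**lines;	/* lines without trailing newline */
	size_t		  nlines;
};

/* Called once per solved chunk; a non-zero return aborts the run. */
typedef int (*chunk_cb)(const struct chunk *c, void *arg);

/*
 * Stand-in for the diff driver: hand each chunk to the callback in order
 * and stop early if the callback reports an error.
 */
static int
emit_chunks(const struct chunk *chunks, size_t nchunks, chunk_cb cb, void *arg)
{
	size_t i;
	int rc;

	for (i = 0; i < nchunks; i++) {
		rc = cb(&chunks[i], arg);
		if (rc != 0)
			return rc;
	}
	return 0;
}

/* Example callback: print each line with a ' ', '-' or '+' prefix. */
static int
print_chunk(const struct chunk *c, void *arg)
{
	static const char prefix[] = { ' ', '-', '+' };
	FILE *out = arg;
	size_t i;

	for (i = 0; i < c->nlines; i++) {
		if (fprintf(out, "%c%s\n", prefix[c->type], c->lines[i]) < 0)
			return -1;
	}
	return 0;
}

/* Example callback: only count changed lines, no output at all. */
static int
count_changed(const struct chunk *c, void *arg)
{
	size_t *n = arg;

	if (c->type != CHUNK_SAME)
		*n += c->nlines;
	return 0;
}
```

With this shape, a caller that still wants one final result can simply collect
the chunks in its callback, so the streaming interface would not rule out the
current way of working.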