"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
neeels <got@kleinekatze.de>
Subject:
diff algo implementation ("duff")
To:
gameoftrees@openbsd.org
Date:
Wed, 22 Jan 2020 04:32:24 +0100

Download raw body.

Thread
Hi got,

at u2k20, by stsp's request I have started to write a Myers + Patience diff
algorithm.

In the meantime stsp has provided a git repository to put it in:
https://git.gameoftrees.org/gitweb/?p=diff.git;a=tree
It appears that it can't be cloned with https yet. stsp?

I've reached a very interesting status with the diff code:

- various diff algorithms fully work AFAICT:
  - plain Myers, the one that needs quadratic space but only one traversal;
  - Myers divide and conquer;
  - Patience diff;
  - "None" diff, it simply barfs out the least efficient difference.

- there is a nice scheme to plug these algorithms into each other.
  By pointing diff_algo_config structs at each other, you can do various
  sequences and fallbacks. The default behavior now is:
  1. Try plain Myers, but if it needs more than 4Mb of RAM, fallback to
  2. Patience diff. When it is done splitting,
  3. again do Patience. If Patience can no longer split and needs help, do
  4. Myers divide-and-conquer. When that has split off a smaller chunk,
     go back to 1.
  5. Should this end up recursing ridiculously deep (default 1024),
     then fall back to dumping an inefficient diff of that chunk.
  Even as I'm writing this down, I am thinking of better combinations; the
  point is that this is easily configurable.

- algorithms don't recurse, but instead one algorithm does a full pass, records
  solved and unsolved chunks -- then frees all its transitory memory -- and
  then the next deeper algorithm continues to do its pass: less peak mem.

This is a pretty much ideal playground for experimenting with different
algorithms, writing benchmarks and trying out optimizations.

The first step would probably be some automated profiling with large amounts of
test data, so that we get feedback on whether things improve or are getting
worse. Any simple ideas?

Then, here are some avenues to explore:

- After each division into a smaller chunk, trivially swallow all adjacent
  lines that are equal in both files. 'git' does this, I don't yet.

- when patience diff recurses on itself, it could re-use previous line matching
  state to speed up (for each non-unique line, remember where the next one was,
  so if that is outside the smaller chunk we know whether it is now unique
  without more iterations).

- the patience specific state is still in members of struct diff_atom; if it
  were stored separately, other diff algos would need less memory.

- there is a myers divide-and-conquer case where the two traces walk past each
  other: set DEBUG to 1 in lib/debug.h, recompile, and
  cd test; ../diff/diff test008*
  You will notice that the traces found no midpoint and it falls back to
  "none" -- because of a match both at the start and end of the files.
  Find a way to not walk almost the entire file just to fail.

I hope that the coding style matches got, and that the structuring and massive
amount of code comments allow diff hacking fun to come up also for others than
me?

Another thing, so far it is called just "diff", which is asking for huge name
conflicts and confusion with previous diff.
I have had "duff" as a local alias for a unidiff (diff -u) for a long time, so
I think I want to name this project "duff", and make unidiff output the
default (in case the so far simplistic cmdline tool becomes install-worthy...)

Let me know if you have feedback of any kind: coding style sucks for OpenBSD?
All those algorithms are crap? Boundless appreciation? :)

~N