diff algo implementation ("duff")

Hi got, at u2k20, by stsp's request I have started to write a Myers + Patience diff algorithm. In the meantime stsp has provided a git repository to put it in: https://git.gameoftrees.org/gitweb/?p=diff.git;a=tree It appears that it can't be cloned with https yet. stsp? I've reached a very interesting status with the diff code: - various diff algorithms fully work AFAICT: - plain Myers, the one that needs quadratic space but only one traversal; - Myers divide and conquer; - Patience diff; - "None" diff, it simply barfs out the least efficient difference. - there is a nice scheme to plug these algorithms into each other. By pointing diff_algo_config structs at each other, you can do various sequences and fallbacks. The default behavior now is: 1. Try plain Myers, but if it needs more than 4Mb of RAM, fallback to 2. Patience diff. When it is done splitting, 3. again do Patience. If Patience can no longer split and needs help, do 4. Myers divide-and-conquer. When that has split off a smaller chunk, go back to 1. 5. Should this end up recursing ridiculously deep (default 1024), then fall back to dumping an inefficient diff of that chunk. Even as I'm writing this down, I am thinking of better combinations; the point is that this is easily configurable. - algorithms don't recurse, but instead one algorithm does a full pass, records solved and unsolved chunks -- then frees all its transitory memory -- and then the next deeper algorithm continues to do its pass: less peak mem. This is a pretty much ideal playground for experimenting with different algorithms, writing benchmarks and trying out optimizations. The first step would probably be some automated profiling with large amounts of test data, so that we get feedback on whether things improve or are getting worse. Any simple ideas? Then, here are some avenues to explore: - After each division into a smaller chunk, trivially swallow all adjacent lines that are equal in both files. 'git' does this, I don't yet. - when patience diff recurses on itself, it could re-use previous line matching state to speed up (for each non-unique line, remember where the next one was, so if that is outside the smaller chunk we know whether it is now unique without more iterations). - the patience specific state is still in members of struct diff_atom; if it were stored separately, other diff algos would need less memory. - there is a myers divide-and-conquer case where the two traces walk past each other: set DEBUG to 1 in lib/debug.h, recompile, and cd test; ../diff/diff test008* You will notice that the traces found no midpoint and it falls back to "none" -- because of a match both at the start and end of the files. Find a way to not walk almost the entire file just to fail. I hope that the coding style matches got, and that the structuring and massive amount of code comments allow diff hacking fun to come up also for others than me? Another thing, so far it is called just "diff", which is asking for huge name conflicts and confusion with previous diff. I have had "duff" as a local alias for a unidiff (diff -u) for a long time, so I think I want to name this project "duff", and make unidiff output the default (in case the so far simplistic cmdline tool becomes install-worthy...) Let me know if you have feedback of any kind: coding style sucks for OpenBSD? All those algorithms are crap? Boundless appreciation? :) ~N

2020-01-22 03:32 neeels:
diff algo implementation ("duff")
- 2020-01-22 11:56 Martin Pieuchot:
  diff algo implementation ("duff")
- - 2020-01-22 16:37 neeels:
    diff algo implementation ("duff")