reuse delta-base objects during pack file creation

Stefan Sperling <stsp@stsp.name> wrote:
> When cloning large repositories from gotd (e.g. ports.git) there is a
> noticeable tail end during the deltification phase where progress becomes
> slower. Time spent deltifying the last, say, 10% of objects may even
> exceed the time spent on the first 90% of objects.
>
> The reason is that we would be reusing deltas before we hit 90% and then
> run deltification on the remaining objects. In this deltification phase we
> do not copy delta-bases as they are, but we deltify them.
> This takes time, and we gain very little from doing this work. Someone else
> has already spent effort finding optimal delta bases while the pack file
> which stores the delta-base object was created. And by deltifying delta-bases
> we make delta chains longer, which can make unpacking slower.
>
> Git reuses delta-base objects directly, too, so this is a proven approach.
>
> We need two changes for this optimization, both combined in the diff below.
>
> 1) Add flags to the raw object data structure which tell us whether a raw
> object was found in a pack file and, if so, whether it is stored as a delta
> or as a delta-base (i.e. a verbatim copy of the object).
>
> 2) Skip the deltification loop for packed delta-bases.
> We must still initialize m->dtab for these objects since they might be
> used as a base during deltification of other objects. But we can skip the
> expensive hunt for delta-base candidates. The code which writes out the
> pack file will see that no delta has been calculated for these objects and
> simply copy them to the generated pack file as-is.
>
> Loose objects and non-reused deltified objects are still (re-)deltified
> as before. To speed things up further in the future we could look into
> reusing deltas from multiple pack files. Though I am not sure if the
> additional complexity would be worth it. Well maintained repositories
> have one large pack file and a handful of small ones.
>
> While copying objects between packs we have decompression/recompression
> overhead which i believe Git manages to avoid. This might be another
> path to improving performance in the future.
>
> ok?

I've almost missed this. obviously ok op@

> M  lib/got_lib_object.h                       |   4+   0-
> M  lib/got_lib_privsep.h                      |   3+   2-
> M  lib/object_open_io.c                       |   8+   3-
> M  lib/object_open_privsep.c                  |  22+  18-
> M  lib/pack_create.c                          |  19+   0-
> M  lib/privsep.c                              |   4+   2-
> M  libexec/got-read-object/got-read-object.c  |   1+   1-
> M  libexec/got-read-pack/got-read-pack.c      |   4+   1-
>
> 8 files changed, 65 insertions(+), 27 deletions(-)
>
> commit - 9456c7974d487ec39d90e4fd16887cf464d3841e
> commit + 14e4379835382ffd9432c3ee46ddb4f6e3c87a9d
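
For readers without the diff at hand, the two changes described in the
quoted message can be sketched roughly as follows. This is an illustration
only, not the code from the commit: the flag names, struct meta and
search_delta_base() are hypothetical stand-ins, and dtab_ready merely
models the initialization of m->dtab mentioned in the message.

```c
/* Hypothetical flag names; the real ones in got_lib_object.h may differ. */
#define RAWOBJ_PACKED       0x01 /* raw object was found in a pack file */
#define RAWOBJ_PACKED_BASE  0x02 /* stored verbatim there, i.e. a delta-base */

/* Simplified stand-in for the per-object metadata used while packing. */
struct meta {
	int flags;
	int dtab_ready; /* models m->dtab having been initialized */
	int searched;   /* set once the costly delta-base search has run */
};

/* Stand-in for the expensive hunt for delta-base candidates. */
static void
search_delta_base(struct meta *m)
{
	m->searched = 1;
}

/*
 * Walk all objects. Every object gets its delta table set up, because
 * it may serve as a base for some other object. Packed delta-bases then
 * skip the search; the pack writer would later copy them as-is.
 * Returns the number of objects that went through the search.
 */
static int
deltify_all(struct meta *objs, int n)
{
	int i, nsearched = 0;

	for (i = 0; i < n; i++) {
		struct meta *m = &objs[i];

		m->dtab_ready = 1;
		if ((m->flags & RAWOBJ_PACKED) &&
		    (m->flags & RAWOBJ_PACKED_BASE))
			continue; /* reuse the existing delta-base untouched */

		search_delta_base(m); /* loose or non-reused delta objects */
		nsearched++;
	}
	return nsearched;
}
```

With one loose object, one packed delta and one packed delta-base as
input, only the first two would reach search_delta_base() in this sketch,
while all three still have their delta table initialized.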

2026-02-20 21:49 Stefan Sperling:
reuse delta-base objects during pack file creation
- 2026-02-25 06:02 Omar Polo:
  reuse delta-base objects during pack file creation