From: "Omar Polo" <op@omarpolo.com>
Subject: Re: reuse delta-base objects during pack file creation
To: Stefan Sperling <stsp@stsp.name>
Cc: gameoftrees@openbsd.org
Date: Wed, 25 Feb 2026 07:02:19 +0100

Stefan Sperling <stsp@stsp.name> wrote:
> When cloning large repositories from gotd (e.g. ports.git) there is a
> noticeable tail end during the deltification phase where progress becomes
> slower. Time spent deltifying the last, say, 10% of objects may even
> exceed the time spent on the first 90% of objects.
> 
> The reason is that we would be reusing deltas before we hit 90% and then
> run deltification on the remaining objects. In this deltification phase we
> do not copy delta-bases as they are, but we deltify them.
> This takes time, and we gain very little from doing this work. Someone else
> has already spent effort finding optimal delta bases while the pack file
> which stores the delta-base object was created. And by deltifying delta-bases
> we make delta chains longer, which can make unpacking slower.
> 
> Git reuses delta-base objects directly, too, so this is a proven approach.
> 
> We need two changes for this optimization, both combined in the diff below.
> 
> 1) Add flags to the raw object data structure which tell us whether a raw
> object was found in a pack file and, if so, whether it is stored as a delta
> or as a delta-base (i.e. a verbatim copy of the object).
> 
> 2) Skip the deltification loop for packed delta-bases.
> We must still initialize m->dtab for these objects since they might be
> used as a base during deltification of other objects. But we can skip the
> expensive hunt for delta-base candidates. The code which writes out the
> pack file will see that no delta has been calculated for these objects and
> simply copy them to the generated pack file as-is.
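
The two changes described above can be illustrated with a minimal,
self-contained sketch. This is not the code from the diff: the flag names,
both structs, and deltify_all() are hypothetical stand-ins, and the real
definitions live in lib/got_lib_object.h and lib/pack_create.c. It only
shows the control flow, assuming a packed object without the delta flag is
a delta-base.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag names, chosen for this sketch only. */
#define RAW_OBJ_PACKED		0x01	/* object was found in a pack file */
#define RAW_OBJ_PACKED_DELTA	0x02	/* ...and is stored there as a delta */

struct raw_obj {
	unsigned int flags;
};

/* Simplified stand-in for the per-object pack metadata. */
struct pack_meta {
	struct raw_obj *raw;
	int have_dtab;	/* stands in for m->dtab being initialized */
	int deltified;	/* set when the delta-base search ran */
};

/* A packed object without the delta flag is a verbatim delta-base. */
static int
is_packed_delta_base(const struct raw_obj *raw)
{
	return (raw->flags & RAW_OBJ_PACKED) &&
	    !(raw->flags & RAW_OBJ_PACKED_DELTA);
}

/* Returns the number of objects for which the base search was run. */
static int
deltify_all(struct pack_meta *meta, int n)
{
	int i, searched = 0;

	assert(meta != NULL);

	for (i = 0; i < n; i++) {
		struct pack_meta *m = &meta[i];

		/* Always needed: m may serve as a base for other objects. */
		m->have_dtab = 1;

		/* Skip the expensive hunt; the writer copies m as-is. */
		if (is_packed_delta_base(m->raw))
			continue;

		/* Placeholder for the real delta-base candidate search. */
		m->deltified = 1;
		searched++;
	}
	return searched;
}
```

With one loose object, one packed delta-base, and one packed delta that was
not reused, only the loose object and the delta go through the search; the
delta-base still gets its table but is otherwise left alone.
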
> 
> Loose objects and non-reused deltified objects are still (re-)deltified
> as before. To speed things up further in the future we could look into
> reusing deltas from multiple pack files. Though I am not sure if the
> additional complexity would be worth it. Well-maintained repositories
> have one large pack file and a handful of small ones.
> 
> While copying objects between packs we have decompression/recompression
> overhead which I believe Git manages to avoid. This might be another
> path to improving performance in the future.
> 
> ok?

I've almost missed this.

obviously ok op@

> M  lib/got_lib_object.h                       |   4+   0-
> M  lib/got_lib_privsep.h                      |   3+   2-
> M  lib/object_open_io.c                       |   8+   3-
> M  lib/object_open_privsep.c                  |  22+  18-
> M  lib/pack_create.c                          |  19+   0-
> M  lib/privsep.c                              |   4+   2-
> M  libexec/got-read-object/got-read-object.c  |   1+   1-
> M  libexec/got-read-pack/got-read-pack.c      |   4+   1-
> 
> 8 files changed, 65 insertions(+), 27 deletions(-)
> 
> commit - 9456c7974d487ec39d90e4fd16887cf464d3841e
> commit + 14e4379835382ffd9432c3ee46ddb4f6e3c87a9d