From: Omar Polo Subject: bump deltify table increase step To: gameoftrees@openbsd.org Date: Fri, 28 Jul 2023 10:40:48 +0200 we're pretty slow at deltifying things. it's OK, git itself is bad with big files, but we can be too slow with moderately-sized files. I noticed this when a couple of days ago I hit the gotd timeout trying to push some changes on a private repository because got was taking too long to deltify ~100mb of data splitted mostly in 3-4 files. (The good news is that gotd timeout works!) Most of this slowness comes from resizing the table. Performance degrades when the table is mostly filled, and by incrementing its size by 64 slot every time we end up reallocating it quite a lot for big files. There's also the memory usage to keep low though, as this code is also used in gotd, we don't want the table to become 'too big' by e.g. doubling its size at every resize. Note however that there's nothing preventing the table to grow arbitrarly large, it just takes forever and that we don't have cancellation in place in this code path. I've set up a repository to do some tests: it's a standalone binary that wraps the deltify code: . The main branch is the exact same code we have in got, the branch `256' bumps the resize step from 64 to 256. Just as-is the improvements are noticeable; on my hw deltifying 20M of /dev/random takes around 8 seconds, when before it was at 75 seconds. Nothing stops us from using an even bigger step size (with 512 deltifying 20M takes ~2.5 seconds), but since we're already pretty conservative in the amount of memory used and I'm one of the first to have hit this I decided to stay on the low side. The memory usage doesn't grow considerably: the entries of the table are 24 bytes long on amd64 (two off_t and one uint32_t, plus padding), so in the worst case we end up having allocate 24 * 256 = ~6k of memory. Future work on this front includes adding cancellation support. thoughts? ----------------------------------------------- commit c96c4f62e57c75a03c3594fd3a70ab41c1200e76 (256, origin/256) from: Omar Polo date: Fri Jul 28 08:11:58 2023 UTC bump the increment size to 256 diff 3931424de03023f577cc02f637c60e65ca80ffd3 c96c4f62e57c75a03c3594fd3a70ab41c1200e76 commit - 3931424de03023f577cc02f637c60e65ca80ffd3 commit + c96c4f62e57c75a03c3594fd3a70ab41c1200e76 blob - 904fe2806d161dfaee79db5f0f3c73e691d3f4dc blob + 99b5d418b8cec83d130d52be4e5022d0310f8c31 --- lib/deltify.c +++ lib/deltify.c @@ -140,18 +140,18 @@ addblk(struct got_delta_table *dt, FILE *f, off_t file dt->blocks[i].offset = offset; dt->blocks[i].hash = h; dt->nblocks++; - if (dt->nalloc < dt->nblocks + 64) { + if (dt->nalloc < dt->nblocks + 256) { struct got_delta_block *db; size_t old_size = dt->nalloc; db = dt->blocks; - dt->blocks = calloc(dt->nalloc + 64, + dt->blocks = calloc(dt->nalloc + 256, sizeof(struct got_delta_block)); if (dt->blocks == NULL) { err = got_error_from_errno("calloc"); dt->blocks = db; return err; } - dt->nalloc += 64; + dt->nalloc += 256; /* * Recompute all block positions. Hash-based indices of blocks * in the array depend on the allocated length of the array. @@ -204,18 +204,18 @@ addblk_mem(struct got_delta_table *dt, uint8_t *data, dt->blocks[i].offset = offset; dt->blocks[i].hash = h; dt->nblocks++; - if (dt->nalloc < dt->nblocks + 64) { + if (dt->nalloc < dt->nblocks + 256) { struct got_delta_block *db; size_t old_size = dt->nalloc; db = dt->blocks; - dt->blocks = calloc(dt->nalloc + 64, + dt->blocks = calloc(dt->nalloc + 256, sizeof(struct got_delta_block)); if (dt->blocks == NULL) { err = got_error_from_errno("calloc"); dt->blocks = db; return err; } - dt->nalloc += 64; + dt->nalloc += 256; /* * Recompute all block positions. Hash-based indices of blocks * in the array depend on the allocated length of the array.