"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Stefan Sperling <stsp@stsp.name>
Subject:
Re: teach gotadmin cleanup to remove redundant packfiles
To:
Omar Polo <op@omarpolo.com>
Cc:
gameoftrees@openbsd.org
Date:
Sat, 17 Jun 2023 10:16:35 +0200

Download raw body.

Thread
On Fri, Jun 16, 2023 at 11:25:27PM +0200, Omar Polo wrote:
> A packfile becomes "redundant" when another packfile
> contains all its objects; this usually happens as more objects are
> added to a repository and the user issues a `gotadmin pack'

This assumption is only true for 'gotadmin pack -a'. Without -a, only
loose objects which do not yet appear in any pack file will be packed.
 
> The idea is simple, as per stsp' explanation on irc a while ago: find
> the biggest packfile by number of objects, throw everything in a
> idset, then iterate the other pack files and see if they're redundant.

> +If redundant copies of packed objects exist in loose form, such
> +redundant copies will be purged.
> +If all the objects of a pack file are present in another pack file,
> +such redundant pack file will be purged.

With my suggested tweak to the algorithm I describe below, I would
reword this to cover the more general case:

  If all the objects of a pack file are present in other pack files,
  the redundant pack file will be purged.

> +.Pp
> +For compatibility with Git, if a matching file
> +.Pa .keep
> +exists for a given pack file, such pack file won't be removed.

Grammar: s/such pack/this pack/

> +.Pp
>  Objects will usually become unreferenced as a result of deleting
>  branches or tags with
>  .Cm got branch -d
> @@ -261,6 +268,10 @@ In order to determine the set of objects which are ref
>  .Cm got ref -d
>  may also leave unreferenced objects behind.
>  .Pp
> +Pack files will usually become redundant as a result of creating
> +bigger packs with
> +.Cm gotadmin pack .

As noted above, this example needs -a to be effective.

> +	TAILQ_FOREACH(pe, &repo->packidx_paths, entry) {

By default the above list is in readdir(3) order, with entries for the
most recently accessed packs moved to the front of the list.
I think we should instead be using a custom list of pack index files here,
sorted by the amount of object IDs they contain in descending order.
Rationale below.

> +		if (cancel_cb) {
> +			err = (*cancel_cb)(cancel_arg);
> +			if (err)
> +				break;
> +		}
> +
> +		packidx_path = pe->path;
> +		err = got_repo_get_packidx(&packidx, packidx_path,
> +		    repo);
> +		if (err)
> +			goto done;
> +
> +		if (packidx == best) {
> +			remove = 0;
> +		} else {
> +			nobjects = be32toh(packidx->hdr.fanout_table[0xff]);
> +			for (i = 0; i < nobjects; ++i) {
> +				pid = &packidx->hdr.sorted_ids[i];
> +
> +				memset(&id, 0, sizeof(id));
> +				memcpy(&id.sha1, pid->sha1, sizeof(id.sha1));
> +
> +				if (!got_object_idset_contains(idset, &id))
> +					break;
> +			}
> +
> +			remove = (i == nobjects);

If we sorted the list of pack index files by the amount of objects in
descending order then we could keep adding more IDs to 'idset' here,
provided 'remove' is false. This way we would detect objects in any
smaller packs which are made redundant by any bigger packs, instead
of only catching objects which are made redundant by the one largest
pack file.

We should then be able to eliminate some redundancy without having to
re-pack everything, but only re-packing a particular branch such as 'main'.
This might be useful for people who have fetched some branches but generally
do not fetch very often. They would usually end up with a few large pack
files instead of many small ones.

To simulate this you could create two or 3 branches with unique commits and
pack each such branch individually, using gotadmin pack -x to cut off the
common ancestors found in the main branch, and create some redundant copies
of each such pack. This way you could extend the regression test to cover a
case where two or three packs comprise the full set and the rest are redundant.