On Tue, Jul 11, 2023 at 03:35:06PM +0200, Omar Polo wrote:
> The gotadmin cleanup algorithm for pack files is inadequate because it
> only works in terms of unique objects contained in pack files, not
> about reachability.
> Let's say you have a pack file which holds objects for a few temporary
> branches that are long gone. Those objects are unreachable, but since
> the objects in that pack file are not present in any other pack (are
> 'unique') then gotadmin cleanup won't delete that pack.
> Here's an attempt at reworking this by taking into consideration if
> the objects are also reachable or not. I've refactored a bit how we
> cleanup since now I need to walk the objects unconditionally and
> there's some churn due to swapping the order of the arguments in the
> reporting function.
> cleanup.sh contains also some changes to address previous mistakes.
> cmp(1) returns nonzero when the files differ and since the output was
> changed we can't grep it like that anymore, so use -q and check the
> number of pack files later with gotadmin info.
> one more thing that should be done but will be addressed as follow-up
> is to avoid removing pack files that are 'too new', much like we do
> with loose object, and for the same reason.
Ok. And I agree that we should also be looking at timestamps of
pack files before deleting them.
This diff is indeed a bit hard to review because of the refactoring
changes you mentioned. It would have been easier to review if the
swapping of arguments and such had been left out for later.
If I understood correctly then the core change which changes pack
deletion heuristic is this part?
> @@ -1352,6 +1276,9 @@ pack_is_redundant(int *redundant, struct got_repositor
> if (got_object_idset_contains(idset, &id))
> + if (!got_object_idset_contains(traversed_ids, &id))
> + continue;
> *redundant = 0;
> err = got_object_idset_add(idset, &id, NULL);
> if (err)