"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Omar Polo <op@omarpolo.com>
Subject:
Re: teach gotadmin cleanup to remove redundant packfiles
To:
Stefan Sperling <stsp@stsp.name>
Cc:
gameoftrees@openbsd.org
Date:
Sun, 18 Jun 2023 16:00:22 +0200

Download raw body.

Thread
On 2023/06/18 15:33:44 +0200, Stefan Sperling <stsp@stsp.name> wrote:
> > +	/*
> > +	 * For compatibility with Git, if a matching .keep file exist
> > +	 * don't delete the packfile.
> > +	 */
> > +	dot = strrchr(path, '.');
> > +	*dot = '\0';
> > +	if (strlcat(path, ".keep", sizeof(path)) >= sizeof(path))
> > +		return got_error(GOT_ERR_NO_SPACE);
> 
> Would be nice to use got_error_fmt here to print the overlong path.
> Because it comes from the filesystem the value is beyond our
> control and it would suck if someone hit this and couldn't easily
> figure out which NO_SPACE error case is being hit.

I'm not sure the return line could ever be hitted.

The path is of the form object/packs/pack-$hash.idx, then we change
the .idx with .keep and the other .something.  Hash is guaranteed to
be a sha1 digest so we can't pratically overflow the buffer.

not that I'm against a more explicit error message, just sharing the
reason for my choice.

> > +	if (faccessat(got_repo_get_fd(repo), path, F_OK, 0) == 0)
> > +		*remove = 0;
> > +
> > +	for (i = 0; i < nitems(ext); ++i) {
> > +		*dot = '\0';
> > +
> > +		if (strlcat(path, ext[i], sizeof(path)) >=
> > +		    sizeof(path))
> > +			return got_error(GOT_ERR_NO_SPACE);
> > +
> > +		if (fstatat(got_repo_get_fd(repo), path, &sb, 0) ==
> > +		    -1) {
> > +			if (errno == ENOENT &&
> > +			    strcmp(ext[i], ".pack") != 0 &&
> > +			    strcmp(ext[i], ".idx") != 0)
> > +				continue;
> > +			return got_error_from_errno2("fstatat", path);
> > +		}
> > +
> > +		*size_before += sb.st_size;
> > +		if (!*remove) {
> > +			*size_after += sb.st_size;
> > +			continue;
> > +		}
> > +
> > +		if (dry_run)
> > +			continue;
> > +
> > +		if (unlinkat(got_repo_get_fd(repo), path, 0) == -1) {
> 
> Should we lock the file before deleting it, like remove_packidx() which
> is used by gotadmin cleanup -p?
> In fact, should we call remove_packidx() here of unlinking .idx directly?
> 
> I see that git-repack doesn't seem to bother with such locking but it is
> confusing to have an "unlink_pack" function which doesn't take a lock
> and a "remove_packidx" function which takes a lock.

another good point.  I was only thinking of the case of cleanup in
parallel, and so assumed that the (soon to be added) locking around
the operation would have been fine.

I'll send a follow-up to change this.

> > +			if (errno == ENOENT &&
> > +			    strcmp(ext[i], ".pack") != 0 &&
> > +			    strcmp(ext[i], ".idx") != 0)
> 
> What is the reason for ignoring ENOENT for .bitmap etc. but not for
> .idx and .pack extensions? Can't we always treat ENOENT as "this
> file has already been removed somehow, which is fine by me"?

I just wanted to be paranoic.  i'll drop the explicit .pack and .idx
check (twice.)

> > +				continue;
> > +			return got_error_from_errno2("unlinkat",
> > +			    path);
> > +		}
> > +	}
> 
> > +		err = idset_add_from_packidx(&remove, repo, sorted[i].path,
> > +		    idset);
> 
> It's just a matter of taste but I would call the above function something
> that reads more like an actual english phrase.
> Maybe "scan_packidx" or "pack_is_redundant"?
> 
> > +		if (err)
> > +			goto done;
> > +		err = account_unlink_pack(repo, sorted[i].path, dry_run,
> > +		    &remove, size_before, size_after);
> 
> The name "account_unlink_pack" also seems a bit unclear to me.
> 
> I guess it is supposed to convey that disk space usage accounting is
> happening in here? I would assume such meta-data accounting can be
> implied even if it is not mentioned in the name of the function,
> especially since we are passing in parameters that provide a hint
> at size accounting.
> 
> At this point we have already determined that the pack is redundant.
> We would like to remove it unless in the special .keep case which can
> be handled internally. Could we call this function "purge_redundant_pack"?

i didn't like the names either but haven't come up with anything
better, i'll go with your suggestions.