"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From:
Stefan Sperling <stsp@stsp.name>
Subject:
Re: teach gotadmin cleanup to remove redundant packfiles
To:
Omar Polo <op@omarpolo.com>
Cc:
gameoftrees@openbsd.org
Date:
Sun, 18 Jun 2023 15:33:44 +0200

Download raw body.

Thread
Generally this looks great and I am fine with you committing this.
I have some suggestions below which you could follow up on if you
have time. Otherwise I will get to making some tweaks eventually.

On Sun, Jun 18, 2023 at 12:17:47PM +0200, Omar Polo wrote:
> @@ -261,6 +268,11 @@ In order to determine the set of objects which are ref
>  .Cm got ref -d
>  may also leave unreferenced objects behind.
>  .Pp
> +Pack files will usually become redundant as a result of creating

As already noted on IRC, "creating" in the above line
should be dropped.

> +repacking the repository with
> +.Nm
> +.Cm pack Fl a .
> +.Pp

> @@ -1228,6 +1229,200 @@ remove_packidx(int dir_fd, const char *relpath)
>  }
>  
>  static const struct got_error *
> +account_unlink_pack(struct got_repository *repo, const char *packidx_path,
> +    int dry_run, int *remove, off_t *size_before, off_t *size_after)
> +{
> +	static const char *ext[] = {".idx", ".pack", ".rev", ".bitmap",
> +	    ".promisor", ".mtimes"};
> +	struct stat sb;
> +	char *dot, path[PATH_MAX];
> +	size_t i;
> +
> +	if (strlcpy(path, packidx_path, sizeof(path)) >=
> +	    sizeof(path))
> +		return got_error(GOT_ERR_NO_SPACE);
> +
> +	/*
> +	 * For compatibility with Git, if a matching .keep file exist
> +	 * don't delete the packfile.
> +	 */
> +	dot = strrchr(path, '.');
> +	*dot = '\0';
> +	if (strlcat(path, ".keep", sizeof(path)) >= sizeof(path))
> +		return got_error(GOT_ERR_NO_SPACE);

Would be nice to use got_error_fmt here to print the overlong path.
Because it comes from the filesystem the value is beyond our
control and it would suck if someone hit this and couldn't easily
figure out which NO_SPACE error case is being hit.

> +	if (faccessat(got_repo_get_fd(repo), path, F_OK, 0) == 0)
> +		*remove = 0;
> +
> +	for (i = 0; i < nitems(ext); ++i) {
> +		*dot = '\0';
> +
> +		if (strlcat(path, ext[i], sizeof(path)) >=
> +		    sizeof(path))
> +			return got_error(GOT_ERR_NO_SPACE);
> +
> +		if (fstatat(got_repo_get_fd(repo), path, &sb, 0) ==
> +		    -1) {
> +			if (errno == ENOENT &&
> +			    strcmp(ext[i], ".pack") != 0 &&
> +			    strcmp(ext[i], ".idx") != 0)
> +				continue;
> +			return got_error_from_errno2("fstatat", path);
> +		}
> +
> +		*size_before += sb.st_size;
> +		if (!*remove) {
> +			*size_after += sb.st_size;
> +			continue;
> +		}
> +
> +		if (dry_run)
> +			continue;
> +
> +		if (unlinkat(got_repo_get_fd(repo), path, 0) == -1) {

Should we lock the file before deleting it, like remove_packidx() which
is used by gotadmin cleanup -p?
In fact, should we call remove_packidx() here of unlinking .idx directly?

I see that git-repack doesn't seem to bother with such locking but it is
confusing to have an "unlink_pack" function which doesn't take a lock
and a "remove_packidx" function which takes a lock.

> +			if (errno == ENOENT &&
> +			    strcmp(ext[i], ".pack") != 0 &&
> +			    strcmp(ext[i], ".idx") != 0)

What is the reason for ignoring ENOENT for .bitmap etc. but not for
.idx and .pack extensions? Can't we always treat ENOENT as "this
file has already been removed somehow, which is fine by me"?

> +				continue;
> +			return got_error_from_errno2("unlinkat",
> +			    path);
> +		}
> +	}

> +		err = idset_add_from_packidx(&remove, repo, sorted[i].path,
> +		    idset);

It's just a matter of taste but I would call the above function something
that reads more like an actual english phrase.
Maybe "scan_packidx" or "pack_is_redundant"?

> +		if (err)
> +			goto done;
> +		err = account_unlink_pack(repo, sorted[i].path, dry_run,
> +		    &remove, size_before, size_after);

The name "account_unlink_pack" also seems a bit unclear to me.

I guess it is supposed to convey that disk space usage accounting is
happening in here? I would assume such meta-data accounting can be
implied even if it is not mentioned in the name of the function,
especially since we are passing in parameters that provide a hint
at size accounting.

At this point we have already determined that the pack is redundant.
We would like to remove it unless in the special .keep case which can
be handled internally. Could we call this function "purge_redundant_pack"?