"GOT", but the "O" is a cute, smiling pufferfish. Index | Thread | Search

From: ori@eigenstate.org
Subject: Re: reuse deltas while packing
To: ori@eigenstate.org, stsp@stsp.name
Cc: gameoftrees@openbsd.org, naddy@mips.inka.de
Date: Wed, 09 Feb 2022 16:48:32 -0500

Quoth Stefan Sperling <stsp@stsp.name>:
> On Tue, Feb 08, 2022 at 05:38:22PM -0500, ori@eigenstate.org wrote:
> > Quoth Stefan Sperling <stsp@stsp.name>:
> > > 
> > > However, the deltification algorithms implemented by Git and Got are not
> > > the same. It is possible that a significant difference will always remain
> > > unless we rewrite code inherited from git9 and use a different approach.
> > > 
> > 
> > Are there any objects that it performs particularly
> > poorly on? I remember measuring, and it wasn't worse
> > by a huge margin (about 10% in my testing).
> > 
> > I'd be happy to look and improve the algorithm.
> > 
> 
> There is one change we made relative to git9 that could be relevant.
> We only try 3 objects back as delta bases whereas the original code tried
> 10 objects back. This was done to speed up packing without delta-reuse,
> and it did grow our pack files a bit. Relevant discussion with some people
> collecting data points was on IRC and is probably lost by now.
> https://git.gameoftrees.org/gitweb/?p=got.git;a=commit;h=4f4d853e5a672ea469a2532774867305712b418e
> 
> I could do a full pack run on the openbsd src repo and log the time it
> takes to deltify each object. That should give us a list of potential
> edge cases. Would that help?
> 
> I would not be surprised if some edge cases could be triggered with
> files beneath sys/dev/pci/drm/amd/include/asic_reg/ because these files
> are very slow to unpack during 'got checkout' and have already triggered
> various bugs in our handling of deltas while reading packs.
> 
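
To make the window trade-off above concrete, here is a toy model. It is
only a sketch, not the git9 or Got code: delta_cost and pick_bases are
made-up names, and the cost function is a crude estimate built on
Python's difflib rather than the real pack delta encoding.

```python
from difflib import SequenceMatcher


def delta_cost(base: str, target: str) -> int:
    """Toy estimate of a delta's size (not the real pack encoding).

    Counts the target characters that cannot be copied from the base,
    plus an assumed flat overhead of 4 per copy instruction.
    """
    sm = SequenceMatcher(None, base, target, autojunk=False)
    blocks = [b for b in sm.get_matching_blocks() if b.size > 0]
    copied = sum(b.size for b in blocks)
    return (len(target) - copied) + 4 * len(blocks)


def pick_bases(objects, window):
    """Try up to `window` preceding objects as the delta base for each object.

    Returns (total estimated size, list of chosen base indexes), where a
    base of None means the object is stored whole.
    """
    total = 0
    choices = []
    for i, obj in enumerate(objects):
        best_cost, best_base = len(obj), None  # fallback: store whole
        for j in range(max(0, i - window), i):
            cost = delta_cost(objects[j], obj)
            if cost < best_cost:
                best_cost, best_base = cost, j
        choices.append(best_base)
        total += best_cost
    return total, choices
```

In this model a larger window can never give a bigger total, since it
only adds candidate bases, but every extra candidate costs another
delta_cost call. That is the size versus speed trade-off in miniature.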

Poked around this a bit:

	$ git clone https://github.com/freebsd/freebsd-src.git
	Cloning into 'freebsd-src'...
	remote: Enumerating objects: 4626509, done.
	remote: Counting objects: 100% (10/10), done.
	remote: Compressing objects: 100% (5/5), done.
	remote: Total 4626509 (delta 5), reused 10 (delta 5), pack-reused 4626499
	
	$ du -sh .git/objects/pack/
	2.2G    .git/objects/pack/
	
	$ git repack -dFA
	Enumerating objects: 4626509, done.
	Counting objects: 100% (4626509/4626509), done.
	Compressing objects: 100% (4556343/4556343), done.
	Writing objects: 100% (4626509/4626509), done.
	Total 4626509 (delta 3206504), reused 0 (delta 0), pack-reused 0
	
	$ du -sh .git/objects/pack/
	1.6G    .git/objects/pack/

Meanwhile with git9, but starting with a repo cloned with
Torvalds git (since we pick a different set of commits):

	% dircp /mnt/term/tmp/freebsd-src freebsd-src
	% du -sh .git/objects/pack
	2.32957G	.git/objects/pack

	% git/repack
	% du -sh .git/objects/pack
	2.43166G	.git/objects/pack

So, for this repo, Torvalds git is better than us by a larger
margin than expected, but a smaller margin than observed with
Got. I'll look into improving the delta search.
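
For a rough sense of scale, the margins can be worked out from the du
figures above. This is back-of-the-envelope arithmetic only: it assumes
both du implementations report the same unit, and the 1.6G figure is
already rounded by du. The variable names are mine.

```python
# Pack directory sizes as reported by du in the transcripts above.
git_repacked = 1.6       # after "git repack -dFA"
git9_before = 2.32957    # git9 copy of the repo, before "git/repack"
git9_repacked = 2.43166  # after "git/repack"

# How much larger the git9 result is than the Torvalds git result.
vs_git = (git9_repacked / git_repacked - 1) * 100

# How much git/repack grew the pack it started from.
growth = (git9_repacked / git9_before - 1) * 100

print(f"git9 vs git repack: {vs_git:.0f}% larger")
print(f"git9 repack vs its starting pack: {growth:.1f}% larger")
```

Under those assumptions the git9 pack is roughly half again as large as
the Torvalds git one, and git/repack grew its starting pack by a few
percent.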

Also: GitHub is doing something different from git, and it
seems to be pretty close to what I'm doing.