Reimplemented xlocate as xbps-locate

Open friedelschoen opened this issue 1 year ago • 9 comments
I've implemented a new xbps tool, xbps-locate! xbps-rindex collects package data into index.plist inside *-repodata, but also collects the file lists into files.plist. xbps-locate fetches files.plist from the repo pool and searches it for the desired file. I cannot test

Also added to the TODO: the cleaning step of xbps-rindex doesn't clean files.plist yet.

I've also changed repo_open_* in lib/repo.c so that the archive iterator no longer just assumes that the files are in order (they are still written in order for compatibility) but checks the actual filename.

On my computer I have to manually disable _BSD_SOURCE and _SVID_SOURCE, so there is a commit for that. I don't know if it's only needed on my computer (Void Linux x86_64-musl).

Thanks for looking into my code!

friedelschoen avatar Jan 11 '24 23:01 friedelschoen

how does this affect repodata size?

why not make it part of xbps-query?

this also means x(bps-)locate loses the power of pcre and delta-updating the index

classabbyamp avatar Jan 11 '24 23:01 classabbyamp

_BSD_SOURCE was fixed in https://github.com/void-linux/xbps/commit/48c9879d33357254f5f405b3d5463708c9d074f9, rebase

classabbyamp avatar Jan 12 '24 00:01 classabbyamp

Making some calculations: about 60 bytes per file path plus some overhead, so let's say 100 bytes per file. void-packages has about 13000 packages with about 50 files each (I'm just guessing right now).

100 × 50 × 13'000 ≈ 65 MB uncompressed.
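
A quick sketch of that back-of-the-envelope estimate in shell. The function name `estimate_index_bytes` is made up for illustration, and the inputs are the guessed figures from this comment, not measurements:

```sh
# Rough index size: bytes per entry * files per package * number of packages.
# All three inputs are guesses, so the result is only an order of magnitude.
estimate_index_bytes() {
    echo $(( $1 * $2 * $3 ))
}

estimate_index_bytes 100 50 13000    # prints 65000000, i.e. about 65 MB
```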

Maybe using an extra file like x86_64-files which would be downloaded independently is an option if the overhead is too much.

You're right about losing the power of PCRE then, maybe a third-party library could help?

friedelschoen avatar Jan 12 '24 00:01 friedelschoen

> Making some calculations: about 60 bytes per file path plus some overhead, so let's say 100 bytes per file. void-packages has about 13000 packages with about 50 files each (I'm just guessing right now).
>
> 100 × 50 × 13'000 ≈ 65 MB uncompressed.

$ git -C .cache/xlocate.git/ grep '.' @ | wc -l
3528851
$ git -C .cache/xlocate.git/ grep '.' @ | cut -d: -f3- | wc -c
235757398

so at the very minimum 235 MB assuming single-byte ASCII characters only and no plist overhead

classabbyamp avatar Jan 12 '24 00:01 classabbyamp

Okay! I wasn't aware it would add that much overhead to include it directly in *-repodata, so I'll look into implementing something like x86_64-files and leaving the original *-repodata alone.

friedelschoen avatar Jan 12 '24 00:01 friedelschoen

It's worth noting that the existing xlocate index is large enough to already be in git, where it still takes ages to download if you don't already have a clone and can't take advantage of the delta updating that git provides. 235 MB is well within the territory of a download that can take ages on a mediocre internet connection; 5 years ago this would have been a 30~40 min project on the connection I had. Even with my current, much better connection, that's still an unreasonable amount of delay when simply trying to sync the repodata.

0x5c avatar Jan 12 '24 00:01 0x5c

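After some research, I made a plist with all the files in xlocate.git:

% cat ../make-plist.sh
# Emit one <key>/<array> pair per package file in the current directory.
# printf is used because echo's handling of "\t" differs between shells.
printf '<plist>\n'
printf '\t<dict>\n'

for pkg in *; do
	printf '\t\t<key>%s</key>\n' "$pkg"
	printf '\t\t<array>\n'
	# Take the first whitespace-separated field of each line.
	awk '{print $1}' "$pkg" | while IFS= read -r file; do
		printf '\t\t\t<string>%s</string>\n' "$file"
	done
	printf '\t\t</array>\n'
done
printf '\t</dict>\n'
printf '</plist>\n'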
% sh ../make-plist.sh | zstd -f9o ../files.zstd
/*stdin*\            :  5.21%   (   238 MiB =>   12.4 MiB, ../files.zstd)     
% find * -print -exec cat {} \; | zstd -f9o ../files.zstd
/*stdin*\            :  6.58%   (   197 MiB =>   13.0 MiB, ../files.zstd)  

13 MiB is still a lot to just include in *-repodata, so I would put it into a separate *-files file, which is fetched individually.

Downloading gcc-fortran, which is about 13 MB, takes 5.3s; cloning xlocate.git takes about 11s. Then updating the git repo is for sure faster, but how often is that needed if file lists don't really change with every version?

I cannot tell how accurate this comparison is or how linearly it behaves on slower networks. Please correct me if I'm wrong.

% time wget -O /dev/null https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
--2024-01-12 16:20:32--  https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
Resolving repo-default.voidlinux.org (repo-default.voidlinux.org)... 2a01:4f9:4b:42dc::d01, 65.21.160.177
Connecting to repo-default.voidlinux.org (repo-default.voidlinux.org)|2a01:4f9:4b:42dc::d01|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13833815 (13M) [application/octet-stream]
Saving to: ‘/dev/null’

100%[==============================================================================>]  13.19M  2.52MB/s    in 5.0s    

2024-01-12 16:20:37 (2.64 MB/s) - ‘/dev/null’ saved [13833815/13833815]


________________________________________________________
Executed in    5.30 secs      fish           external
   usr time  215.72 millis    0.10 millis  215.62 millis
   sys time  271.42 millis    1.01 millis  270.41 millis

% time git clone https://repo-default.voidlinux.org/xlocate/xlocate.git test
Cloning into 'test'...
Fetching objects: 18387, done.
Updating files: 100% (18503/18503), done.

________________________________________________________
Executed in   11.12 secs    fish           external
   usr time    2.86 secs    0.32 millis    2.86 secs
   sys time    1.43 secs    1.06 millis    1.43 secs
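
As a rough way to extrapolate these timings to other links, here is a minimal sketch that assumes transfer time scales linearly with size and ignores latency, TLS setup and server load. The helper name `download_seconds` and the 1 Mbit/s figure are assumptions for illustration; the byte counts and the 2.64 MiB/s rate come from the output in this thread:

```sh
# Estimated transfer time in whole seconds, rounded up.
# $1 = size in bytes, $2 = sustained throughput in bytes per second
download_seconds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# wget reported 2.64 MB/s in binary units, i.e. about 2768240 bytes/s.
download_seconds 13833815 2768240     # gcc-fortran package: prints 5
download_seconds 235757398 2768240    # uncompressed file list: prints 86

# Assumed slow link of 1 Mbit/s = 125000 bytes/s.
download_seconds 13631488 125000      # 13 MiB compressed index: prints 110
download_seconds 235757398 125000     # uncompressed file list: prints 1887
```

At the assumed 1 Mbit/s, the uncompressed list comes out at roughly 31 minutes, which is in line with the 30~40 min mentioned earlier in the thread.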

friedelschoen avatar Jan 12 '24 13:01 friedelschoen

> Then updating the git repo is for sure faster, but how often is that needed if file lists don't really change with every version?

That's the reason git is used: it provides the mechanism to download only the new parts of the index ("delta-updating"), keeping existing files as they are.
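
A minimal sketch of that pattern, assuming a plain (non-bare) clone: do the full clone once, then only fetch what changed. The function name `sync_index` and the `GIT` override variable are made up for illustration, and this is not necessarily how xlocate itself syncs:

```sh
# Allow the git binary to be overridden (e.g. GIT=echo for a dry run).
GIT=${GIT:-git}

# $1 = repository URL, $2 = local directory
sync_index() {
    if [ -d "$2/.git" ]; then
        # Clone already exists: git transfers only the missing objects.
        "$GIT" -C "$2" pull --ff-only
    else
        # First run: the whole history has to be downloaded once.
        "$GIT" clone "$1" "$2"
    fi
}

# Example (not run here because it needs network access):
# sync_index https://repo-default.voidlinux.org/xlocate/xlocate.git "$HOME/.cache/xlocate"
```

A single-file index fetched over HTTP has no equivalent of the second branch, so every update costs the full download again.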

0x5c avatar Jan 12 '24 16:01 0x5c

I've now re-implemented xlocate inside xbps-query (-o and --ownedhash) for better integration. From there, you can still search by file/link but also by hash! Every file hash is included in *arch*-files, so if you are looking for one exact file, you can find it without knowing its name or actual location. *arch*-files still shouldn't be too heavy. Maybe someone can have a look 👍🏼

Also, could someone with a binary repo build an index file with xbps-rindex and compare its speed with xlocate? I don't have the capacity to download a binary repo to test.
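
To illustrate the idea of a lookup by hash, here is a small sketch over a hypothetical flat index with one `pkgver<TAB>path<TAB>hash` record per line. That layout, the helper name `owned_by_hash` and the sample records are all assumptions for illustration; the real *arch*-files format and the xbps-query implementation may look quite different:

```sh
# $1 = index file (hypothetical tab-separated layout), $2 = hash to look for
owned_by_hash() {
    # Appending "" forces a string comparison, so digit-only values
    # are never compared as numbers.
    awk -F '\t' -v want="$2" '($3 "") == (want "") { print $1 ": " $2 }' "$1"
}

# Demo with made-up records and placeholder hashes.
idx=$(mktemp)
printf 'foo-1.0_1\t/usr/bin/foo\taaaa\n' > "$idx"
printf 'bar-2.3_1\t/usr/lib/libbar.so\tbbbb\n' >> "$idx"
owned_by_hash "$idx" bbbb    # prints: bar-2.3_1: /usr/lib/libbar.so
rm -f "$idx"
```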

friedelschoen avatar Jan 22 '24 22:01 friedelschoen