Reimplemented xlocate as xbps-locate
I've implemented a new xbps tool: xbps-locate! xbps-rindex still collects package metadata into index.plist inside *-repodata, but now also collects file lists into files.plist. xbps-locate fetches files.plist from the repo pool and searches it for the desired file. I cannot test this against a full binary repo myself.
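For illustration only (not the actual xbps-locate code), assuming files.plist ends up next to index.plist inside the *-repodata archive as described above, it could be pulled out and searched by hand like this (the path is just an example):
$ bsdtar -xOf x86_64-repodata files.plist | grep /usr/bin/gfortran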
Also added to the TODO: cleaning in xbps-rindex doesn't clean files.plist yet.
I've also changed repo_open_* in lib/repo.c so that the archive iterator no longer just assumes the files are in order (they are still written in order for compatibility) but checks the actual filename.
On my computer I have to manually disable _BSD_SOURCE and _SVID_SOURCE, so there is a commit for that; I don't know if it's only my machine. (Void Linux x86_64-musl)
Thanks for looking into my code!
how does this affect repodata size?
why not make it part of xbps-query?
this also means x(bps-)locate loses the power of pcre and delta-updating the index
_BSD_SOURCE was fixed in https://github.com/void-linux/xbps/commit/48c9879d33357254f5f405b3d5463708c9d074f9, rebase
Making some calculations: about 60 bytes per file path plus some overhead, so let's say 100 bytes per file. void-packages has about 13,000 packages with about 50 files each (just a guess for now).
100 × 50 × 13,000 ≈ 65 MB uncompressed.
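Spelled out with shell arithmetic (same assumed numbers as above):
$ echo $((100 * 50 * 13000))   # ~100 B/file × ~50 files/pkg × ~13,000 pkgs
65000000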
Maybe using an extra file like x86_64-files, which would be downloaded independently, is an option if the overhead is too much.
You're right about losing the power of PCRE then; maybe a third-party library?
> Making some calculations: about 60 bytes per file path plus some overhead, so let's say 100 bytes per file. void-packages has about 13,000 packages with about 50 files each (just a guess for now). 100 × 50 × 13,000 ≈ 65 MB uncompressed.
$ git -C .cache/xlocate.git/ grep '.' @ | wc -l
3528851
$ git -C .cache/xlocate.git/ grep '.' @ | cut -d: -f3- | wc -c
235757398
so at the very minimum 235 MB assuming single-byte ASCII characters only and no plist overhead
OK! I wasn't aware there was that much overhead to including it directly in *-repodata, so I'll look into implementing something like x86_64-files and leave the original *-repodata alone.
It's worth noting that the existing xlocate index is already large enough to live in git, where it still takes ages to download if you don't already have a clone and can't take advantage of the delta updating git provides. 235 MB is well within the territory of a download that can take ages on a mediocre Internet connection; 5 years ago this would have been a 30–40 min affair on the connection I had. Even with my current, much better connection, that's still an unreasonable amount of delay when simply trying to sync the repodata.
After some research, making a plist with all the files in xlocate.git:
% cat ../make-plist.sh
echo "<plist>"
echo "\t<dict>"
for pkg in *; do
	echo "\t\t<key>$pkg</key>"
	echo "\t\t<array>"
	for file in $(awk '{print $1}' $pkg); do
		echo "\t\t\t<string>$file</string>"
	done
	echo "\t\t</array>"
done
echo "\t</dict>"
echo "</plist>"
% sh ../make-plist.sh | zstd -f9o ../files.zstd
/*stdin*\ : 5.21% ( 238 MiB => 12.4 MiB, ../files.zstd)
% find * -print -exec cat {} \; | zstd -f9o ../files.zstd
/*stdin*\ : 6.58% ( 197 MiB => 13.0 MiB, ../files.zstd)
13 MiB is still a lot to just include in *-repodata, so I would put it into a separate *-files file, which is fetched individually.
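Hypothetically, fetching that could mirror how *-repodata is fetched today; this filename is made up following the x86_64-files idea above:
$ wget https://repo-default.voidlinux.org/current/musl/x86_64-musl-files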
Downloading gcc-fortran, which is about 13 MB, takes 5.3 s; cloning xlocate.git takes about 11 s. Updating the git repo is surely faster, but how often is that needed if file lists don't really change with every version?
I cannot tell how accurate this comparison is or how linearly it behaves on slower networks. Please correct me if I'm wrong.
% time wget -O /dev/null https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
--2024-01-12 16:20:32-- https://repo-default.voidlinux.org/current/musl/gcc-fortran-12.2.0_4.x86_64-musl.xbps
Resolving repo-default.voidlinux.org (repo-default.voidlinux.org)... 2a01:4f9:4b:42dc::d01, 65.21.160.177
Connecting to repo-default.voidlinux.org (repo-default.voidlinux.org)|2a01:4f9:4b:42dc::d01|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13833815 (13M) [application/octet-stream]
Saving to: ‘/dev/null’
100%[==============================================================================>] 13.19M 2.52MB/s in 5.0s
2024-01-12 16:20:37 (2.64 MB/s) - ‘/dev/null’ saved [13833815/13833815]
________________________________________________________
Executed in 5.30 secs fish external
usr time 215.72 millis 0.10 millis 215.62 millis
sys time 271.42 millis 1.01 millis 270.41 millis
% time git clone https://repo-default.voidlinux.org/xlocate/xlocate.git test
Cloning into 'test'...
Fetching objects: 18387, done.
Updating files: 100% (18503/18503), done.
________________________________________________________
Executed in 11.12 secs fish external
usr time 2.86 secs 0.32 millis 2.86 secs
sys time 1.43 secs 1.06 millis 1.43 secs
> Updating the git repo is surely faster, but how often is that needed if file lists don't really change with every version?
That's the reason git is used: it provides a mechanism to download only the new parts of the index ("delta updating"), keeping existing files as-is.
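To make that concrete, a minimal sketch using the clone URL from the timings above; the clone pays the full ~11 s once, and later fetches transfer only objects that changed since the last sync:
$ git clone --bare https://repo-default.voidlinux.org/xlocate/xlocate.git xlocate.git
$ git -C xlocate.git fetch origin master:master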
I've now re-implemented xlocate inside xbps-query (-o and --ownedhash) for better integration. From there you can still search by file/link, but also by hash! Every file hash is included in *arch*-files, so if you are looking for exactly one file, you can find it without knowing its name or actual location. *arch*-files still shouldn't be too heavy. Maybe someone can have a look 👍🏼
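Hypothetical invocations of the flags named above, just as a sketch of the interface (exact syntax is whatever the PR settles on):
$ xbps-query -R -o /usr/bin/gfortran          # search repos by file path
$ xbps-query -R --ownedhash <sha256-of-file>  # search repos by file hash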
Also, can someone with a binary repo build an index file with xbps-rindex and compare its speed with xlocate? I don't have the capacity to download a binary repo to test.