moss-d-legacy
Serverless repository
If I understand right, you want to store the files in a shared content-addressable store and then access them by symlinks. I wonder if it would be possible to store the files in IPFS, turning every PC where a package is installed into part of a distributed peer-to-peer repository. No explicit package files would be needed for online installation: when installing a package, the manager would fetch its files from IPFS, and when repairing or updating, it would fetch only the broken or changed files. The metadata should also be stored in IPFS.
Of course, some measures should be taken to let users maintain privacy and security. For example, by default:
- taking part in the distributed repo must be opt-in;
- only the latest version of a package may be announced;
- certain packages can carry metadata forbidding announcing them;
- for each package, and for each pair of packages, popularity is tracked (in a counting Bloom filter? see the sketch after this list), which gives an estimate of how much information is exposed by the fact that a certain set of packages is present on a certain machine; only the set of packages that carries the least information (the one shared by almost all users) is announced;
- unfortunately, not announcing also carries information; I don't know how to deal with that yet;
- users can opt in to distributing packages they don't need: their files are fetched into the IPFS store, but the symlinks are not created.
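Just to make the Bloom-filter idea above concrete, here is a minimal Python sketch of a counting Bloom filter tracking how common individual packages and package pairs are. All names (`CountingBloomFilter`, `record_install`, `announceable`) and the threshold policy are hypothetical, not part of any existing design:

```python
import hashlib

class CountingBloomFilter:
    """A minimal counting Bloom filter: k hashes over an array of counters."""

    def __init__(self, size=1 << 20, hashes=4):
        self.size = size
        self.hashes = hashes
        self.counters = [0] * size

    def _indexes(self, item: str):
        # Derive k indexes from a single SHA-256 digest of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item: str):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def remove(self, item: str):
        for idx in self._indexes(item):
            if self.counters[idx] > 0:
                self.counters[idx] -= 1

    def count(self, item: str) -> int:
        # The minimum counter is an upper bound on the true count.
        return min(self.counters[idx] for idx in self._indexes(item))


# Track popularity of single packages and of package pairs,
# then announce only the most common (least identifying) ones.
popularity = CountingBloomFilter()

def record_install(installed: list[str]):
    for name in installed:
        popularity.add(name)
    for a in installed:
        for b in installed:
            if a < b:
                popularity.add(f"{a}|{b}")

def announceable(installed: list[str], threshold: int) -> list[str]:
    # Hypothetical policy: announce only packages popular enough that
    # knowing they are installed reveals little about the machine.
    return [name for name in installed if popularity.count(name) >= threshold]
```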
> If I understand right, you want to store the files in a shared content-addressable store and then access them by symlinks.
The files in the local PC's content-addressable storage are 1) fetched as packages, 2) shared per transaction via hard links, and 3) may, in the future, not map to all of the sub-packages in the package to which they belong (due to user preference).
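To illustrate the store-plus-hard-links mechanics described above (the paths and layout are invented for the example, not moss's actual on-disk format): files are kept once under their content hash, and each transaction is materialised by hard-linking them into place.

```python
import hashlib
import os
import shutil

STORE = "/var/cache/example-store"        # hypothetical content-addressable store
TRANSACTIONS = "/var/cache/example-txns"  # hypothetical per-transaction roots

def store_file(src_path: str) -> str:
    """Copy a fetched file into the store, keyed by its SHA-256 hash."""
    with open(src_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    dst = os.path.join(STORE, digest)
    if not os.path.exists(dst):
        os.makedirs(STORE, exist_ok=True)
        shutil.copy2(src_path, dst)
    return digest

def materialise(txn_id: str, manifest: dict[str, str]) -> None:
    """Hard-link store entries into a transaction root; manifest maps install path to digest."""
    root = os.path.join(TRANSACTIONS, txn_id)
    for rel_path, digest in manifest.items():
        target = os.path.join(root, rel_path.lstrip("/"))
        os.makedirs(os.path.dirname(target), exist_ok=True)
        # Hard links mean the data exists once on disk, shared by every
        # transaction that references the same content hash.
        os.link(os.path.join(STORE, digest), target)
```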
Is there a use case for this proposal or is it just a "huh, wouldn't it be cool if (...)" thing where you're just playing with the idea?
> The files in the local PC's content-addressable storage are 1) fetched as packages, 2) shared per transaction via hard links, and 3) may, in the future, not map to all of the sub-packages in the package to which they belong (due to user preference).
Thanks for the info.
> Is there a use case for this proposal or is it just a "huh, wouldn't it be cool if (...)" thing where you're just playing with the idea?
Just a wild idea, inspired by things like BitTorrent and PeerTube. The use case is speeding up package fetching by using local peers (possibly at gigabit speeds and without consuming internet bandwidth, if the peers are on the same LAN, which is likely in organisations) and taking load off the central servers.
Is a reverse squid proxy not sufficient if the goal is to maximise caching and local bandwidth usage...?
No. A proxy is not serverless and it requires dedicated hardware. A distributed system just shifts the costs to end users: no need to set up a server, no need to maintain it, no need to upgrade it when it becomes overloaded. That is the killer feature of distributed systems: one just uses a client, and the "server" "magically" emerges.
Think of it as BitTorrent + DHT + webseeds on the central servers. When not enough machines have the update installed, it is fetched from the webseeds; when there are enough, it is fetched peer-to-peer.
The problem with usual update systems is that updates are distributed as packed archives; on installation the archives are unpacked and deleted, so in order to serve updates, clients would have to keep the archives around, which is a big storage overhead.
That's why the p2p layer has to be coupled with the update system. Instead of exposing update archives, the p2p level of the update system should allow fetching individual files. So it is not the archive as a whole that is checksummed, but the individual files in it, forming a Merkle DAG.
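A rough sketch of that idea (assuming plain SHA-256 and an invented manifest format, not IPFS's actual block layout): each file is hashed individually, the package root hashes the list of (path, file-hash) pairs, and an update only needs to fetch files whose leaf hash changed.

```python
import hashlib
import os

def file_hash(path: str) -> str:
    """Leaf node: the hash of one file's contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def package_manifest(root: str) -> dict[str, str]:
    """Map each relative path in the package to its content hash."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_hash(full)
    return manifest

def package_root_hash(manifest: dict[str, str]) -> str:
    """Inner node: hash over the sorted (path, hash) pairs, identifying the whole package."""
    node = "\n".join(f"{path} {digest}" for path, digest in sorted(manifest.items()))
    return hashlib.sha256(node.encode()).hexdigest()

def files_to_fetch(local: dict[str, str], remote: dict[str, str]) -> list[str]:
    """Only files whose leaf hash differs (or is missing locally) need fetching."""
    return [path for path, digest in remote.items() if local.get(path) != digest]
```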
About storage: certain compression libraries expose an API to pretrain a dictionary. zstd allows this, for example; unfortunately brotli (which in my experience has the best compression, even better than lzma for some data) has removed its API for creating dictionaries. So I guess storing a pretrained dictionary (it should be part of the package archive, so it is created only once) could reduce the cost of recompressing data before sending it over the net.
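For instance, with the python-zstandard bindings (just an illustration of the dictionary-training workflow, not something the package format specifies), a dictionary trained once at package-build time can be shipped alongside the package and reused when compressing individual files for transfer:

```python
import zstandard as zstd

def build_dictionary(sample_files: list[str], dict_size: int = 112_640) -> bytes:
    """Train a shared compression dictionary once, at package-build time."""
    samples = []
    for path in sample_files:
        with open(path, "rb") as f:
            samples.append(f.read())
    return zstd.train_dictionary(dict_size, samples).as_bytes()

def compress_with_dictionary(data: bytes, dict_bytes: bytes) -> bytes:
    """Compress one file for transfer, reusing the pretrained dictionary."""
    cctx = zstd.ZstdCompressor(dict_data=zstd.ZstdCompressionDict(dict_bytes))
    return cctx.compress(data)

def decompress_with_dictionary(blob: bytes, dict_bytes: bytes) -> bytes:
    """The receiver needs the same dictionary (shipped in the package) to decompress."""
    dctx = zstd.ZstdDecompressor(dict_data=zstd.ZstdCompressionDict(dict_bytes))
    return dctx.decompress(blob)
```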