Remote denylists and watching system (proposal)
The following are my thoughts on how to provide denylists so that clients can subscribe to them.
Server
- Using HTTP (poll), essentially an IPFS-gateway served file:
- Lists are made available over an http endpoint.
- Range requests are supported and accepted.
- ETag (set to the CID) and caching headers.
- I'd like to expand this to use Server Push / Notifications, but it is actually simpler, and fine, if clients poll regularly and check the ETag to see whether the content changed.
- Using IPFS:
- Denylist IPFS-host server publishes to pubsub topic denylist/<name>. Must be a signed message.
- The message includes the CID of the latest version of the list. This is published every minute, or when it is updated.
- The CID is the CID of the denylist which is a normal unixfs file (balanced chunking).
- UnixFS files have support for seeking out of the box, and only the necessary blocks are downloaded when looking for specific bytes.
Client
- HTTP: Client polls for the file every minute using a ranged request starting at the last byte read. A HEAD request can be done in advance to check the ETag and decide if a GET request is needed. New bytes are appended to the file on disk.
- IPFS: Client subscribes to the pubsub topic. If a new CID comes in, the client uses UnixFS to seek to the last byte read and appends the new bytes to the file on disk. The pubsub message can include more than the CID, for example a field to indicate whether redownloading the full file and processing it from the beginning is necessary.
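A minimal sketch of how a client might act on one pubsub message, in Python for brevity. The JSON payload and its field names (`cid`, `full_resync`) are assumptions made for illustration; the proposal does not define a wire format. Signature verification is left out, and the CID strings in the usage note are placeholders.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ClientState:
    last_cid: Optional[str] = None  # CID of the list version last processed
    offset: int = 0                 # bytes already appended to the local copy


def plan_update(state: ClientState, payload: bytes) -> Tuple[str, int]:
    """Decide what to do with one (already verified) pubsub message.

    Returns (action, start_offset), where action is one of:
      "skip"   - the CID is unchanged, nothing to do
      "append" - seek to start_offset in the new file and append the rest
      "resync" - redownload the whole file and reprocess from byte 0
    """
    msg = json.loads(payload)  # hypothetical JSON payload
    if msg["cid"] == state.last_cid:
        return ("skip", state.offset)
    if msg.get("full_resync", False):  # hypothetical field name
        return ("resync", 0)
    return ("append", state.offset)
```

For example, with `ClientState(last_cid="cidA", offset=100)`, a message announcing `cidA` again is skipped, one announcing `cidB` asks for bytes from offset 100 onwards, and one announcing `cidB` with `full_resync` set restarts from 0.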
HTTP polling has been introduced at #22
We want to leverage this and switch ipfs.io and dweb.link to use RAINBOW_DENYLISTS=https://badbits.dwebops.pub/badbits.deny.
Did some initial triage today:
- https://github.com/ipfs-shipyard/nopfs/issues/38 with proposal to leverage HTTP caching
- not a blocker, but will improve cache hits
- https://github.com/protocol/badbits.dwebops.pub/issues/32733 to make upstream badbits published at https://badbits.dwebops.pub/badbits.deny append-only
- @hsanjuan I would appreciate any info you remember on this: why is the upstream list sorted? Is it so that takedown automation PRs never cause merge conflicts? I'd like to fix upstream but need to understand the constraints first.
- Lmk if it is a rabbit hole. I suspect HTTP caching support together with detection of non-append-only lists makes more sense, as it is more generic.
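One way detection of non-append-only lists could work is to re-fetch a short window just before the local offset and compare it with the tail of the local copy. This is only a heuristic sketch, not how nopfs does it. `fetch_range` is a hypothetical callback standing in for a ranged HTTP request, and the entries in the usage note are made-up placeholder lines.

```python
from typing import Callable


def tail_matches(
    local: bytes,
    fetch_range: Callable[[int, int], bytes],
    probe: int = 64,
) -> bool:
    """Heuristic check that upstream still looks like local plus appended bytes.

    fetch_range(start, end) is assumed to return upstream bytes [start, end).
    If the last `probe` bytes of the local copy no longer match the same
    byte range upstream, the list was rewritten (for example re-sorted) and
    a full redownload is needed. A match does not prove that earlier bytes
    are unchanged.
    """
    if not local:
        return True  # nothing downloaded yet, nothing to compare
    start = max(0, len(local) - probe)
    return fetch_range(start, len(local)) == local[start:]
```

With a local copy of `b"entry-a\nentry-c\n"`, an upstream that only gained `entry-d` at the end passes the check, while an upstream that inserted `entry-b` in sorted position fails it.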
Hey, nopfs watches denylists and reads any new lines appended to them. Adding updates in append-only fashion allows it to do this without having to re-read the whole file.
I don't think #38 is a must. If the list upstream is append only, you:
- Download it
- Read from offset len(downloaded file) on every update using the Range header. If no updates happened you will be reading a 0-range and it's essentially a no-op (equivalent to checking an If-Modified-Since header); otherwise you obtain only the new part of the content, which you append to the local copy, and nopfs processes it accordingly.
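The steps above can be sketched roughly as follows. This is an illustration in Python, not the nopfs implementation, and the function names are made up. It assumes standard HTTP range semantics: 206 carries only the requested tail, 200 means the server ignored the Range header and sent the whole file, and 416 is what many servers return when the requested offset is at or past the end of the file, that is, when nothing new was appended.

```python
from typing import Dict


def range_header(local_size: int) -> Dict[str, str]:
    """Ask for everything from the first byte we do not have yet."""
    return {"Range": f"bytes={local_size}-"}


def apply_response(local: bytes, status: int, body: bytes) -> bytes:
    """Return the new local copy after one poll."""
    if status == 206:
        # Partial Content: body holds only the newly appended bytes.
        return local + body
    if status == 416:
        # Range Not Satisfiable: no bytes past our offset, so no update.
        return local
    if status == 200:
        # Server ignored Range and sent the full file; replace the copy.
        return body
    raise ValueError(f"unexpected HTTP status {status}")
```

A client holding 16 bytes would send `Range: bytes=16-` and feed the response status and body into `apply_response`.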
I don't know if you saw, but the badbits list is published in append-only format here: https://denyli.st/badbits.deny.txt and that is what I used for my defunct gateway.
I have a GitHub Action that reads https://badbits.dwebops.pub/badbits.deny, finds any new lines, and appends them to https://denyli.st/badbits.deny... so far so good, it's been going for months.
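The core of such a job could look something like the sketch below. This is a guess at the logic written in Python, not the actual action; `append_new_lines` is a hypothetical name.

```python
from typing import List


def append_new_lines(append_only: List[str], upstream: List[str]) -> List[str]:
    """Return the append-only list extended with lines seen only upstream.

    Existing lines keep their position; new lines are added at the end in
    the order they appear upstream. Lines that disappeared upstream are
    kept, since removing them would break the append-only property.
    """
    seen = set(append_only)
    result = list(append_only)
    for line in upstream:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result
```

For instance, if the append-only copy holds `["a", "c"]` and the sorted upstream becomes `["a", "b", "c"]`, the result is `["a", "c", "b"]`: the new entry lands at the end, so clients reading from their last offset pick it up.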
So you can use RAINBOW_DENYLISTS=https://denyli.st/badbits.deny.txt already. In the meantime I would update badbits to be append only and not have to rely on a 3rd party.