dnscrypt-resolvers dnscry.pt update script

Based on #868 and particularly https://github.com/DNSCrypt/dnscrypt-resolvers/issues/868#issuecomment-1911704288, this is a trivial script to update the dnscry.pt entries in v3/public-resolvers.md and v3/relays.md from https://www.dnscry.pt/resolvers.json.

No-longer relevant issues

The below issues appeared in the first version of this PR, but @Brueggus (dnscry.pt maintainer) has updated the published resolver data to resolve them both, per https://github.com/DNSCrypt/dnscrypt-resolvers/pull/945#issuecomment-2334304634

The first run generates a lot of churn, as compared to the data in this repo, the upstream data now encodes the port number in all of its DNS Stamps. It's easy to see this after format.py is run and the v1/dnscrypt-resolvers.csv is updated. It would be possible to strip the port 443 from the DNS Stamps and recalculate them, but doing so would make these stamps incomparable to upstream, and that cost me a bit of time trying to work out why they were all different before I checked with https://dnscrypt.info/stamps/.
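For reference, here is a minimal sketch of what "strip the port 443 and recalculate" could look like. It assumes the stamp layout described at https://dnscrypt.info/stamps/: a protocol byte (0x01 for DNSCrypt, 0x81 for an anonymized relay), 8 bytes of properties for DNSCrypt only, then a length-prefixed address. `strip_default_port` is a hypothetical helper name and is not part of this PR's script.

```python
import base64

PREFIX = "sdns://"

def strip_default_port(stamp, default_port=443):
    """Return the stamp with ':<default_port>' removed from its address, if present."""
    if not stamp.startswith(PREFIX):
        raise ValueError("not a DNS Stamp")
    payload = stamp[len(PREFIX):]
    # Stamps use unpadded URL-safe base64, so restore the padding before decoding.
    raw = base64.urlsafe_b64decode(payload + "=" * (-len(payload) % 4))
    if raw[0] == 0x01:      # DNSCrypt: protocol byte + 8 bytes of properties
        offset = 9
    elif raw[0] == 0x81:    # Anonymized DNSCrypt relay: protocol byte only
        offset = 1
    else:                   # Other protocols are left untouched in this sketch
        return stamp
    addr_len = raw[offset]
    addr = raw[offset + 1:offset + 1 + addr_len]
    rest = raw[offset + 1 + addr_len:]
    suffix = f":{default_port}".encode()
    if addr.endswith(suffix):
        addr = addr[:-len(suffix)]
    # Re-encode with the new length prefix and drop the base64 padding again.
    new_raw = raw[:offset] + bytes([len(addr)]) + addr + rest
    return PREFIX + base64.urlsafe_b64encode(new_raw).rstrip(b"=").decode()
```

Because the address length changes, the whole base64 payload changes as well, which is why stamps with and without the port look completely different even when nothing else about the server has changed.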

Also visible in v1/dnscrypt-resolvers.csv is some hard-to-avoid churn in entries where upstream has multiple resolvers with the same "location" key, as my dumb-trivial mechanism to resolve that may not pick the same one to rename as the one done previously.

This is visible for example with current dnscry.pt-amsterdam-ipv4 which ends up as dnscry.pt-amsterdam02-ipv4 (based on the public key), and the other entry with that location is now dnscry.pt-amsterdam-ipv4.

I'm not sure if this is really an issue for end users. An alternative would be to add the suffix to the location for any host with a hostname that isn't ___01, but that would rename existing dnscry.pt-singapore-ipv4 (again matching the host keys) to dnscry.pt-singapore03-ipv4. In this approach, Tokyo would also end up with only dnscry.pt-tokyo02-ipv4 and dnscry.pt-tokyo03-ipv4, but that's actually fair since current dnscry.pt-tokyo-ipv4 is gone, and gets replaced in-place with what upstream calls tyo03.dnscry.pt.

I didn't commit the result of running the scripts since I'm not set up for minisig: I'm not sure what the expectation here is, but it would be trivial to add such a commit if desired.

It might also make for cleaner history if I first updated all the dnscry.pt DNS Stamps in-place to have the :443 port number; then the churn would be much more readable.

The current output from execution looks like this:

> py -3 .\utils\update-dnscry.pt-entries.py

[v3/public-resolvers.md]
Duplicate entry: [dnscry.pt-amsterdam-ipv4] => [dnscry.pt-amsterdam02-ipv4]
Duplicate entry: [dnscry.pt-amsterdam-ipv6] => [dnscry.pt-amsterdam02-ipv6]
Duplicate entry: [dnscry.pt-hongkong-ipv4] => [dnscry.pt-hongkong02-ipv4]
Duplicate entry: [dnscry.pt-hongkong-ipv6] => [dnscry.pt-hongkong02-ipv6]
Duplicate entry: [dnscry.pt-losangeles-ipv4] => [dnscry.pt-losangeles02-ipv4]
Duplicate entry: [dnscry.pt-losangeles-ipv6] => [dnscry.pt-losangeles02-ipv6]

[v3/relays.md]
Duplicate entry: [dnscry.pt-anon-amsterdam-ipv4] => [dnscry.pt-anon-amsterdam02-ipv4]
Duplicate entry: [dnscry.pt-anon-amsterdam-ipv6] => [dnscry.pt-anon-amsterdam02-ipv6]
Duplicate entry: [dnscry.pt-anon-hongkong-ipv4] => [dnscry.pt-anon-hongkong02-ipv4]
Duplicate entry: [dnscry.pt-anon-hongkong-ipv6] => [dnscry.pt-anon-hongkong02-ipv6]
Duplicate entry: [dnscry.pt-anon-losangeles-ipv4] => [dnscry.pt-anon-losangeles02-ipv4]
Duplicate entry: [dnscry.pt-anon-losangeles-ipv6] => [dnscry.pt-anon-losangeles02-ipv6]
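
A minimal sketch of the kind of duplicate handling that produces output like the above. This is not the PR's actual code: `assign_unique_names` and its parameters are hypothetical, and it assumes entry names are built as prefix, lower-cased location without spaces, and address family.

```python
def assign_unique_names(locations, prefix="dnscry.pt", family="ipv4"):
    """Build one entry name per location, adding a two-digit suffix on collisions."""
    used = set()
    names = []
    for location in locations:
        slug = location.lower().replace(" ", "")
        base = f"{prefix}-{slug}-{family}"
        name, n = base, 1
        # The first entry seen keeps the bare name; later ones get 02, 03, ...
        while name in used:
            n += 1
            name = f"{prefix}-{slug}{n:02d}-{family}"
        if name != base:
            print(f"Duplicate entry: [{base}] => [{name}]")
        used.add(name)
        names.append(name)
    return names
```

Which entry keeps the bare name depends purely on input order, so if upstream reorders its servers the rename can land on a different host than before. That is the source of the churn described above.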

Aug 12 '24 16:08 TBBle

dnscry.pt maintainer here. Thanks for your great work! Keeping this repository in sync with the resolver lists I publish has been haunting me for months, so I am happy to see the progress made in this PR!

The first run generates a lot of churn, as compared to the data in this repo, the upstream data now encodes the port number in all of its DNS Stamps.

That's (unfortunately) due to laziness: I am taking the DNS Stamps for DNSCrypt and Anonymized DNS straight from the output of encrypted-dns-server. When I started the project, I only supported DNSCrypt, so there was no need to calculate any other DNS Stamps. Now changing this wouldn't be a big deal since I already calculate the stamps for DoT/DoH. I'll look into that.

Also visible in v1/dnscrypt-resolvers.csv is some hard-to-avoid churn in entries where upstream has multiple resolvers with the same "location" key, as my dumb-trivial mechanism to resolve that may not pick the same one to rename as the one done previously.

I have never properly implemented having multiple resolvers in the same location and that's why things become inconsistent. At the moment, I am (ab-)using the location field, which you find in the JSON as well, and add an incrementing value if needed. For example, the resolver tokyo-ipv4 in my resolvers.md (https://www.dnscry.pt/resolvers.md) shows "location": "Tokyo" in the JSON, tokyo02-ipv4 shows "location": "Tokyo 02" and so on.

I will have to make some adjustments here, but I don't think there's a better/more proper way than adding an incrementing value to locations which host more than one resolver.

Besides that: Is there anything I can change in the JSON output to make things easier for you?

Sep 06 '24 12:09 Brueggus

I think the stamp data in the JSON is fine, my concern was merely that this repo's existing (hand-maintained) stamps are either old, or were being recalculated to remove the port before publication, and so servers that haven't actually changed are showing a changed stamp when you compare the generated output to the existing data. That's why I was leaning towards a second PR to go first, which would update all the DNS stamps for dnscry.pt entries to include the port number, matching upstream, but not otherwise changing the content. That makes the diff resulting from running this PR's script much smaller.

As far as Location, then yeah, making them unique by including an incrementing key or something (I'd think the same as the DNS name, ideally) to ensure stability as servers appear and disappear makes sense. Tokyo is an example of where it's working well already, the concern was about Amsterdam, Hong Kong, and Los Angeles, where they do not have any such integer, and my quick-workaround mismatched the existing data in one of those cases, possibly surprising users if they update and don't re-select the desired service. Having the name be unique and upstream defined would let me remove the hack, and then any such mismatch or churn will only happen once, and be isolated to places where this repo and your upstream data source have an existing mismatch.

If there wasn't existing data, I wouldn't use Location as the unique name anyway, I'd prefer to derive it from the host-name. That would introduce a once-off churn for all downstream consumers of this list who are using dnscry.pt servers though, so it's probably not feasible at this point.

Of course, as you are the dnscry.pt maintainer, you are welcome to take this script, run it manually, apply any manual fixups or churn reduction you see fit, and submit the results as a PR against the data files. That might make things easier for the repo owner, as they would more-easily trust a data-only PR from you compared to a script from (random passing stranger) me.

Sep 06 '24 13:09 TBBle

or were being recalculated to remove the port before publication

I think that's the case. IIRC the stamps were taken as they are first and the port has been removed in a later commit.

I have just published new resolver lists (+ JSON) which have the port removed if the default port is used, which is the case for all resolvers at the moment. This change was overdue anyway to be compliant with the official (?) specifications for DNS stamps.

the concern was about Amsterdam, Hong Kong, and Los Angeles, where they do not have any such integer

Oh boy... until now I didn't even notice those were missing the incrementing key. I wonder how the clients handled the resolver lists containing two resolvers with the same identifier. Anyways, as a quick fix I've added the 02 in the location field so that the names are unique and the hack you added is no longer required. I still have to think of a proper way to implement this and I'd prefer to not tie the resolver name to the hostname so that users won't have to change their configs if I have to change servers... but that's out of scope here.
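
A quick check along these lines could catch repeated identifiers before a list is published. This is only a sketch: it assumes each entry in the resolver list starts with a `## name` heading, and `duplicate_names` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches level-2 headings whose text is a single whitespace-free name.
HEADING = re.compile(r"^## (\S+)\s*$", re.MULTILINE)

def duplicate_names(markdown_text):
    """Return the sorted entry names that appear more than once."""
    counts = Counter(HEADING.findall(markdown_text))
    return sorted(name for name, n in counts.items() if n > 1)
```

An empty list from `duplicate_names` means every entry name in the file is unique.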

Sep 06 '24 15:09 Brueggus

Awesome, thank you. I've rebased and rerun the scripts, and the churn is now much lower, so I included the output of the run as a commit for visibility.

Sep 07 '24 01:09 TBBle

Just a heads-up - the IP addresses of dnscry.pt-hongkong-ipv4 and dnscry.pt-hongkong-ipv6 have changed recently. I don't think I can push any changes to this PR.

If we can get this merged, this would help me a lot to keep this repo in sync with changes on my end.

Sep 17 '24 11:09 Brueggus

I've rerun the script and repushed the branch, and those addresses should now be updated.

I'm happy to trivially rebase if I'm pinged here, but I'm not actively tracking dnscry.pt data updates myself. I have not heard from the repo owner, so I'm not sure what expectation to have about timelines for merging this PR.

Sep 18 '24 13:09 TBBle

Thanks!

Until now, dnscry.pt updates were just copied from:

https://www.dnscry.pt/resolvers.md
https://www.dnscry.pt/anon-relays.md

Is there a difference between this and manually parsing the JSON file?

Sep 18 '24 16:09 jedisct1

There's no difference. The JSON is an export of the data my scripts use to generate the resolver files.

Sep 18 '24 16:09 Brueggus

Rebased for latest changes to master branch, which I see has had (manual, I assume) updates to the dnscry.pt servers, so the final commit's diff is much smaller now.

Is there a difference between this and manually parsing the JSON file?

It's not manual. ^_^ For me, that's a win by itself; YMMV.

For example, looking at the current diff, it has re-added the three servers you removed as not working in de9d69e32f76f94b57387da45644fa920f0bb57f. (Someone should probably report that to @Brueggus if not already aware...)

It also adds a bunch of anonymous relays which aren't currently in the list. I haven't checked carefully but it looks like that's the list of relays removed in 5762a73773db2d9ab66517c5fe21a3b7d3d0c1ca.

Either way, it makes it super-easy to see what's changed compared to manually parsing a JSON file. (Although I suspect we want to drop the last commit as it's now somewhat reverting deliberate manual changes.)

Edit: Confirmed that a quick git cherry-pick de9d69e32f76f94b57387da45644fa920f0bb57f 5762a73773db2d9ab66517c5fe21a3b7d3d0c1ca brings us back to the current state of the master branch (ignoring the sig-file changes), so I'll remove the final commit shortly.

Sep 18 '24 17:09 TBBle

But why use the JSON file instead of the already existing .md files?

Here are the scripts that have been used to update the dnscry.pt entries so far:

https://github.com/DNSCrypt/dnscrypt-resolvers/blob/next/utils/dnscry.pt-merge.py
https://github.com/DNSCrypt/dnscrypt-resolvers/blob/next/utils/dnscry.pt-relays-merge.py

They're very simple as they just add a prefix to the names. Using the JSON file looks way more complicated.
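
To illustrate the idea (this is not the content of the linked scripts, only a sketch under the assumption that each entry starts with a `## name` heading), adding a prefix to the names can be done with a single substitution. `add_prefix` is a hypothetical helper name.

```python
import re

def add_prefix(markdown_text, prefix="dnscry.pt-"):
    """Prepend `prefix` to every '## name' heading that does not already carry it."""
    pattern = re.compile(r"^## (?!" + re.escape(prefix) + r")(\S+)", re.MULTILINE)
    # A function replacement avoids treating characters in `prefix` as escapes.
    return pattern.sub(lambda m: f"## {prefix}{m.group(1)}", markdown_text)
```

The negative lookahead makes the operation safe to run twice on the same text.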

Sep 18 '24 18:09 jedisct1

Also, copying the .md files ensures that the resolver names are exactly the same whether one is using dnscry.pt as a source, or dnscrypt.info as a source.

Sep 18 '24 18:09 jedisct1

But why use the JSON file instead of the already existing .md files?

There's a built-in JSON parser in Python stdlib, so I didn't even need to think about parsing Markdown's various flavours.
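
As a rough illustration of how little is needed on the parsing side, here is a sketch using only the stdlib. The top-level shape (a list of objects) is an assumption made for the example; only the "location" key is taken from the discussion above, and `locations` is a hypothetical helper name.

```python
import json

# Hypothetical sample; the real resolvers.json may be structured differently.
SAMPLE = '[{"location": "Tokyo"}, {"location": "Tokyo 02"}]'

def locations(json_text):
    """Return the "location" value of each entry, in file order."""
    return [entry["location"] for entry in json.loads(json_text)]
```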

I also assumed upstream JSON was canonical, and any MD output was generated from that and liable to change.

When I came into this, both formats already existed and there wasn't any hint of an existing MD parser in the linked bug or repo that I saw.

Sep 19 '24 02:09 TBBle