htmlq
noscript
Trying to get a list of currently available Invidious instances, I started with
curl -s https://redirect.invidious.io | htmlq "noscript"
which gave me a list of all the noscript elements on the page, including the one I was looking for:
<noscript><div class="instances-list"><h2>Available instances</h2><ul class="list"><li><a href="https://invidious.snopyta.org">invidious.snopyta.org</a></li><li><a href="https://yewtu.be">yewtu.be</a></li><li><a href="https://invidious.kavin.rocks">invidious.kavin.rocks</a></li><li><a href="https://invidious-us.kavin.rocks">invidious-us.kavin.rocks</a></li><li><a href="https://invidious-jp.kavin.rocks">invidious-jp.kavin.rocks</a></li><li><a href="https://vid.puffyan.us">vid.puffyan.us</a></li><li><a href="https://invidious.namazso.eu">invidious.namazso.eu</a></li><li><a href="https://inv.riverside.rocks">inv.riverside.rocks</a></li><li><a href="https://vid.mint.lgbt">vid.mint.lgbt</a></li><li><a href="https://invidious.osi.kr">invidious.osi.kr</a></li><li><a href="https://invidio.xamh.de">invidio.xamh.de</a></li><li><a href="https://yt.artemislena.eu">yt.artemislena.eu</a></li></ul></div></noscript>
But when I tried to dig deeper to only get the list of URLs, it only gave me empty results, no matter what I tried:
$~ curl -s https://redirect.invidious.io | htmlq "noscript a"
$~ curl -s https://redirect.invidious.io | htmlq "noscript li"
$~ curl -s https://redirect.invidious.io | htmlq "noscript ul"
$~ curl -s https://redirect.invidious.io | htmlq "noscript div"
Is this an issue with noscript in general or with that specific site? And why does the plain "noscript" selector find what I am looking for in the first place?
Using htmlq 0.4.0 from AUR.
I know I can do
curl -s https://api.invidious.io/instances.json | jq -r '.[][1].uri'
because that is where the data shown outside the noscript element comes from, but this might still be a valid issue.
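For anyone who wants the same extraction without jq, here is a minimal Python sketch of what the filter `.[][1].uri` does. It assumes only what the filter itself implies: the JSON is a list of entries whose second element is an object with a "uri" key. The function name `instance_uris` and the two-entry sample are made up for illustration; the sample is not real API output.

```python
import json


def instance_uris(payload: str) -> list[str]:
    """Return the "uri" of the second element of every entry (like jq '.[][1].uri')."""
    entries = json.loads(payload)
    return [entry[1]["uri"] for entry in entries]


# Illustrative sample in the shape the jq filter implies (not real API output).
sample = json.dumps([
    ["yewtu.be", {"uri": "https://yewtu.be"}],
    ["vid.puffyan.us", {"uri": "https://vid.puffyan.us"}],
])

for uri in instance_uris(sample):
    print(uri)
```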
I wonder if this is because htmlq uses servo/html5ever under the hood. It looks like that code has a "scripting_enabled" option which defaults to true and then treats the contents of noscript elements as raw data, so there are no child elements for a selector to match:
https://github.com/servo/html5ever/blob/57eb334c0ffccc6f88d563419f0fbeef6ff5741c/html5ever/src/tree_builder/rules.rs#L118-L126
I couldn't see where to set that option to false, but as a complete hack of a workaround, you can do this for now:
$~ curl -s https://redirect.invidious.io | htmlq --text "noscript" | htmlq --attribute href .instances-list a
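The workaround above is a two-pass parse: first pull the noscript contents out as plain text, then parse that text as HTML in its own right. A rough Python sketch of the same idea, using only the standard library, might look like the following. The names `HrefCollector` and `noscript_hrefs` are hypothetical, the sample page is a shortened version of the markup quoted earlier, and the regex in the first pass is a simplification rather than a real HTML parser. Unlike the htmlq command, it collects every link inside noscript instead of filtering on .instances-list.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag that is fed in."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def noscript_hrefs(page: str) -> list[str]:
    # Pass 1: grab the contents of each noscript element as opaque text.
    blocks = re.findall(
        r"<noscript[^>]*>(.*?)</noscript>", page, flags=re.S | re.I
    )
    # Pass 2: parse that text as HTML and collect the link targets.
    collector = HrefCollector()
    for block in blocks:
        collector.feed(block)
    collector.close()
    return collector.hrefs


# Shortened sample based on the markup quoted in the question.
page = (
    '<html><body><noscript><div class="instances-list"><ul class="list">'
    '<li><a href="https://yewtu.be">yewtu.be</a></li>'
    '<li><a href="https://vid.puffyan.us">vid.puffyan.us</a></li>'
    "</ul></div></noscript></body></html>"
)

print("\n".join(noscript_hrefs(page)))
```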