Fix downloading a linked css resource with no extension
Summary
While trying to download a textbook written with PreTeXt, I ran into an issue where some Google icons it uses were not rendering properly. I found two bugs. The first was incorrect filtering of CSS resources in link tags. The second was in the ArchiveDownloader: when it hit a linked resource with no file extension, it placed the generated filename index.css at the root of the download, instead of nesting it into sub-directories for the resource's domain and URL path. As a result, resources referenced from within that CSS file resolved to paths outside the download directory entirely.
I saw this in the ArchiveDownloader run output, which meant that this URL was not being downloaded or rewritten:
```
Skipping node with src https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0
```
This is the relevant line from the source site:
```html
<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Material+Symbols+Outlined:opsz,wght,FILL,GRAD@24,400,0,0">
```
I found a bug in `css_node_filter`: it was excluding this tag even though it has a `rel` attribute. While attributes can be read off the Tag object with item access like `node["rel"]`, the `in` operator does not check them, because the attributes live in a sub-property (`attrs`) that the class re-exposes for item access.
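A minimal illustration of the BeautifulSoup behavior (this is not the `css_node_filter` code itself, just the Tag semantics the bug hinges on): `in` on a Tag tests membership in its children, so an attribute check has to go through `tag.attrs` or `tag.has_attr()`.

```python
from bs4 import BeautifulSoup

html = '<link rel="stylesheet" href="https://fonts.googleapis.com/css2?family=Foo">'
tag = BeautifulSoup(html, "html.parser").link

# Item access works: Tag.__getitem__ delegates to tag.attrs.
# (rel is multi-valued in bs4, so this returns a list.)
print(tag["rel"])          # ['stylesheet']

# But `in` on the Tag tests membership in tag.contents (its children),
# not its attributes, so this is False even though the attribute exists:
print("rel" in tag)        # False

# Correct membership checks go through .attrs or has_attr():
print("rel" in tag.attrs)  # True
print(tag.has_attr("rel")) # True
```

This is why a filter written as `"rel" in node` silently skips every link tag.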
Observed results of the download after this first change, fixing the CSS resource filtering
```
downloads/active_calc_2e_again_823433/
├── activecalculus.org
│   └── single2e
│       ├── external
│       ├── frontmatter.html
├── fonts.googleapis.com
├── fonts.gstatic.com
│   └── s
│       └── materialsymbolsoutlined
│           └── v290
│               └── kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOel.woff
├── index.css
```
The index.css file that was placed at the root had these contents, meaning it was making a relative path reference outside of the download directory:
```css
@font-face {
  font-family: 'Material Symbols Outlined';
  font-style: normal;
  font-weight: 400;
  src: url("../../fonts.gstatic.com/s/materialsymbolsoutlined/v290/kJF1BvYX7BgnkSrUwT8OhrdQw4oELdPIeeII9v6oDMzByHX9rA6RzaxHMPdY43zj-jCxv3fzvRNU22ZXGJpEpjC_1v-p_4MrImHCIJIZrDCvHOel.woff") format('woff');
}
```
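To make the escape concrete, here is a quick check with `posixpath` showing how that `../../` reference resolves from the two candidate locations (the directory names and font filename are shortened placeholders, not the real paths):

```python
import posixpath

rel = "../../fonts.gstatic.com/s/materialsymbolsoutlined/v290/font.woff"

# index.css wrongly placed at the download root: the ../../ climbs
# entirely out of the download directory before reaching the font.
print(posixpath.normpath(posixpath.join("downloads/site", rel)))
# fonts.gstatic.com/s/materialsymbolsoutlined/v290/font.woff

# index.css nested under the resource's domain and path: the same
# reference now resolves inside the download directory.
print(posixpath.normpath(
    posixpath.join("downloads/site/fonts.googleapis.com/css2", rel)
))
# downloads/site/fonts.gstatic.com/s/materialsymbolsoutlined/v290/font.woff
```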
It seemed like the correct fix was to place this index.css down in the file-system hierarchy based on the resource URL, rather than change how the relative path was being generated. The fix was simple once I found the relevant line of code to change. The diff is currently a little bigger than needed: I assumed I might want to escape the URL to turn it into a file path, but thinking on it longer, it seems reasonably likely that anything that can appear in an HTTP URL is also fine in a unix path (assuming this tool will only be run in unix environments), with slashes used to create sub-directories. I can simplify this down to the one-line version of the fix and remove the TODO if you agree.
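A sketch of the idea behind the fix (hypothetical helper name, not the actual ArchiveDownloader code): derive the on-disk location from the URL's netloc and path, and only fall back to a generated filename like index.css when the last path segment has no extension.

```python
from urllib.parse import urlparse

def derive_file_path(url, default_filename="index.css"):
    """Map a resource URL to a relative file path nested under its domain,
    so a generated filename lands next to the resources it references."""
    parsed = urlparse(url)
    # Split the URL path into components, dropping empty segments;
    # each slash in the URL becomes a sub-directory on disk.
    parts = [p for p in parsed.path.split("/") if p]
    # If the last segment has no extension (e.g. "css2"), treat it as a
    # directory and generate a filename inside it.
    if not parts or "." not in parts[-1]:
        parts.append(default_filename)
    return "/".join([parsed.netloc] + parts)

print(derive_file_path("https://fonts.googleapis.com/css2?family=Foo"))
# fonts.googleapis.com/css2/index.css
```

With the CSS nested two levels deep like this, the `../../fonts.gstatic.com/...` reference it contains stays inside the download directory.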
References
Page I was downloading: https://activecalculus.org/single2e/frontmatter.html
Reviewer guidance
I'm interested in adding an automated test. I don't necessarily want to hot-link to the source site in a test, but as this is just for the scraping library, I assume these tests aren't run that often. I'm considering whether it might be useful to add a test tool that lets us check in a directory of test web resources and "download" them by exposing them through a small local webserver. Happy to hear any thoughts or guidance on testing.
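For concreteness, serving a checked-in fixture directory over localhost only takes the standard library; something along these lines (a sketch, with hypothetical helper and fixture names):

```python
import functools
import http.server
import threading
from urllib.request import urlopen

def serve_directory(directory, port=0):
    """Serve a directory of test web resources on localhost.
    port=0 asks the OS for a free ephemeral port, so parallel test
    runs don't collide. Returns the server and its base URL."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://127.0.0.1:%d" % server.server_address[1]

# Usage in a test (fixture path is hypothetical):
# server, base_url = serve_directory("tests/fixtures/site")
# try:
#     ...run ArchiveDownloader against base_url + "/frontmatter.html"...
# finally:
#     server.shutdown()
```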
@rtibbles I added a test. I did have a thought that this test could be a little more complete: I took a folder of the downloaded result and served it from a local webserver, which means I'm already including relative links in the test "source site". I could be a little more thorough and start up two small servers on different ports to exercise the logic around downloading from different servers/domains.
I'm very likely going to be hanging around in this code more, so I expect to write more tests and can do that as a follow-up. I'm already working on fixing the code for capturing a site by executing javascript in a headless browser. The default PhantomJS browser it invokes when I run it is no longer supported, but swapping in headless Chrome is easy; I'm currently hunting down the further issues that come up when exercising that code.
@rtibbles I fixed a few issues with the test; this should now be good to go. I didn't do what I mentioned in my last comment about serving the HTML and CSS/font from different servers/domains, but I don't think that is really necessary.
Hi @jaltekruse - sorry for the delay in review here. I've taken a quick read through, I'm not seeing anything alarming, and the test coverage looks like it provides a very realistic test! One possibility would have been to mock the requests responses instead of running a full server.
If you can run `pre-commit run --all-files` on the PR, that should fix the linting issues. And lastly, it looks like the paths you've used are not working well on Windows. I can't quite tell from the error traceback whether it's the paths of the files themselves or how they're being composed in the test, but I suspect the former.
Also note that I rebased this onto develop as the tests on develop were broken previously! Feel free to do your own rebase if you like :)