node-website-scraper
Integrity attribute stripping
So I'm scraping a site and generating integrity attributes, but after the parsed body is returned the integrity attributes are being stripped.
I've found that `resource-handler/html/index.js:55` contains the following section:

```javascript
then((updatedText) => {
	el.setData(updatedText);
	el.removeIntegrityCheck();
});
```
Which is the issue. It looks like you check a bunch of child elements to find any other resources to load, and in the process remove the integrity attribute (which makes sense if you change the content of the child resource), but in my use case no child resource ever changes, since all links are relative or external.
I applied the following patch locally to test: https://github.com/almightyju/node-website-scraper/commit/ecc8aacc3a0b3cd7a7c7a992111e82a2b727a7d7
But I'm not convinced it's the right approach, since it doesn't work without a modified `getReference` action: this line in `resource-handler/index.js:61`

```javascript
const { reference } = await self.getReference({parentResource, resource: respondedResource, originalReference: childPath});
```

sets `reference` to `'relative/url'` when the original reference was `'/relative/url'`, which means the content differs in the HTML resource handler and the integrity attribute gets stripped. I've changed the `getReference` function via a plugin for myself, but I'm not sure how badly it would break other cases if this were in your library, and without it my patch is rather obscure to use.
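To illustrate the mismatch with a minimal sketch (this helper is hypothetical, not part of the library): a plain string comparison between the original and generated references reports a change even when only the leading slash differs, whereas a slash-tolerant comparison would not.

```javascript
// Hypothetical helper illustrating the leading-slash mismatch:
// getReference produces 'relative/url' while the original attribute
// was '/relative/url', so a plain string compare reports a change.
function sameReference(originalReference, generatedReference) {
  const strip = (ref) => ref.replace(/^\//, '');
  return strip(originalReference) === strip(generatedReference);
}
```

With this comparison, `sameReference('/relative/url', 'relative/url')` is `true`, so the integrity attribute would not be stripped for a path that only lost its leading slash.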
Hoping there might be an easy way to resolve the leading / being stripped from the originalReference :)
Hey @almightyju 👋
Sorry for the late response.
So here we have 2 things: removing the integrity check, and `getReference`. I'll try to explain why it's implemented this way.
As for the integrity check: it is removed for each resource because even if the path (e.g. `./path/style.css`) was not updated (it is the same on the original website and on the copy), the resource content may be changed (for example, some background image from a CSS file was downloaded to a different file than on the original website). If we leave the integrity check in place, the resource will not be loaded on the copied site.
As for `getReference`: it generates a relative link so that the copied website works locally from its directory with the most simple setup. All you need to do is scrape the website and copy the scraped directory somewhere, and it works. With absolute links it requires additional setup.

So I think if having absolute links is suitable for your case, you can use `getReference` to overwrite the default behavior. As for removing integrity checks: I'm not sure that checking if the path was updated is enough. Maybe we also need to check if the resource content was updated. What do you think?
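For reference, overriding `getReference` via a plugin might look like the sketch below. The action name and the parameter shapes mirror the `getReference` call quoted earlier in this thread, but treat the exact signatures as an assumption rather than documented API.

```javascript
// Sketch of a plugin that returns the reference exactly as it appeared
// in the source document, so unchanged child resources keep their links
// (and, with a matching patch, their integrity attributes) intact.
// The parameter names follow the getReference call quoted above and are
// an assumption about the plugin API, not a verified contract.
class KeepOriginalReferencePlugin {
  apply(registerAction) {
    registerAction('getReference', ({ resource, parentResource, originalReference }) => {
      // Leave the original reference untouched instead of relativizing it.
      return { reference: originalReference };
    });
  }
}
```

The trade-off the maintainer describes still applies: keeping original (possibly absolute) references means the scraped copy may need extra setup to work locally.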
No rush, I've worked around it in my exact use case by using the `saveResource` action to add the integrity stuff back on.
It makes sense why you strip the integrity attribute when you change the scraped content. For me personally, a "preserve integrity attributes" option which simply leaves the attribute alone would work; at least that way the documentation would make it obvious that the attribute is removed by default (I was super confused about how the attribute just went missing).
The better solution would be to check if any content has changed: leave the integrity attribute alone if it hasn't, and if the content has changed and the integrity attribute was originally there, generate a new hash for it (`crypto.createHash('sha256').update(content).digest('base64')` gives you the hash).
But like I said, for my case right now I'm actually generating the integrity for resources on the page so a simple preserve option would do :)
Nice that you found a way to add integrity attributes 👍
I'm not going to add a property like "preserve integrity attribute" for now; I guess in 90% of cases the content will be changed, so the integrity should be removed/updated. Probably I should mention this somewhere in the readme or FAQ.
Leaving this issue open for now so it isn't forgotten. I'll try to find a good way to add integrity checks to downloaded resources. Maybe it makes sense to add a more general `getHtmlAttributes` action, to add not only integrity but also other attributes.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
It's rather odd to do this on script and CSS tags that come from CDNs and will never change. Please disable this behavior, at least on the links that are filtered out by the urlFilter.
@alexivkin it's not called for filtered-out resources (by url filter, by max depth, etc.), only for resources that were actually downloaded: their content was changed, so the integrity attribute is removed.
What I mean is it's stripping the integrity attributes off tags that are in the downloaded file but pointing to the skipped / filtered-out asset. For example, the original `index.html` has `<link rel="stylesheet" href="https://stackpath.bootstrapcdn...` with the appropriate integrity attribute. The scraped/saved `index.html` has that attribute removed from the `<link rel="...`, but that `<link>` is still pointing to the remote CSS on the CDN (as it should). The scraper should keep the integrity attribute on the `<link>` in that case.
@alexivkin could you provide a reproducible example of such behavior, where the link points to a remote CDN and the integrity attribute is removed? Then I can take a closer look at it.
We have a test which checks that the integrity attribute is removed for downloaded elements only: https://github.com/website-scraper/node-website-scraper/blob/c565e14fcea79a983a1e1f1557254d764b121a39/test/unit/resource-handler/html.test.js#L239-L269 Maybe there is some bug in your case.
Is this intentional in the test? `<link href=http: //examlpe.com/style.css"`
Anyhow, here is where it is happening for me. Source html:

```html
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<script src="https://cdnjs.cloudflare.com/ajax/libs/popper.js/1.14.7/umd/popper.min.js" integrity="sha384-UO2eT0CpHqdSJQ6hJty5KVphtPhzWj9WO1clHTMGa3JDZwrnQq4sF86dIHNDz0W1" crossorigin="anonymous"></script>
</head>
```
The integrity attribute on the last two lines disappears in the downloaded version.
It looks like the tags erroneously get processed by `loadResourcesForRules` in the html handler: `el.removeIntegrityCheck();` in the `then` promise of `downloadChildrenPaths` strips the integrity attribute.
Could there be something with the way cheerio parses my html? I've noticed it also removes closing slashes (`<meta name... />` becomes `<meta name=... >`), although I am not too concerned about that part.