Check page URLs for extension before direct fetch attempt
Fixes #829
- Only attempt a direct fetch (non-browser fetch()) for page URLs with known non-HTML extensions; otherwise load the page in the browser. (This could perhaps be optimized further to discover new non-HTML extensions.)
- Also, async fetch dedup: treat an unknown status or 206 the same as 200 for dedup purposes, to avoid duplicate loading.
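For discussion, here is a rough sketch of the two changes described above. It is not the code in this PR: the extension list, the function names (`getExtension`, `shouldDirectFetch`, `statusForDedup`), and representing an unknown status as `undefined` or `0` are all assumptions made for illustration.

```typescript
// Hypothetical allowlist of non-HTML extensions; the real list in the PR may differ.
const DIRECT_FETCH_EXTS = new Set([".pdf", ".zip", ".jpg", ".png", ".mp4"]);

// Return the lowercased extension of the last path segment, or "" if there is none.
function getExtension(url: string): string {
  let pathname: string;
  try {
    pathname = new URL(url).pathname; // query string and fragment are excluded
  } catch (_e) {
    return ""; // not a parseable absolute URL
  }
  const lastSegment = pathname.slice(pathname.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  // dot > 0 skips dotfiles such as ".htaccess"
  return dot > 0 ? lastSegment.slice(dot).toLowerCase() : "";
}

// Direct fetch only when the extension is on the allowlist; everything else goes to the browser.
function shouldDirectFetch(url: string): boolean {
  return DIRECT_FETCH_EXTS.has(getExtension(url));
}

// For dedup purposes, fold an unknown status (assumed undefined or 0) and 206 into 200.
function statusForDedup(status?: number): number {
  if (!status || status === 206) {
    return 200;
  }
  return status;
}
```

With this hypothetical list, `shouldDirectFetch("https://example.com/report.pdf")` is true, while a URL ending in `.csv` would still go through the browser.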
I wonder if it might be better to direct fetch any URL that ends in a file extension (and that's not .html or .htm, since some older sites followed that convention)? I think introducing a relatively short list of acceptable file formats is going to result in us not fetching a lot of files we'd want to. Just off the top of my head, common file extensions that wouldn't get fetched with this implementation would include CSVs, plaintext files, PowerPoint presentations, TIFFs, GIFs, videos in other container formats like .avi/.mov/.mkv, and so on.
If we are going to move forward with an allowlist of extensions, I think we should look for a third party-managed list that would be a bit more comprehensive.
Yeah, maybe that's a smaller list to maintain; it would also include .asp, .php, etc. Another option is to always try a browser load, and then if the result is non-HTML, add the extension to the direct fetch check list for later.
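To make the "shorter list" idea concrete, here is a minimal sketch of the inverse check: direct fetch any URL that has an extension, unless the extension is on a small HTML-like list. The names (`HTML_LIKE_EXTS`, `extensionOf`, `shouldDirectFetchByDenylist`) and the exact list contents are hypothetical, taken only from the extensions mentioned in this thread.

```typescript
// Hypothetical list of extensions that usually serve HTML and should load in the browser.
const HTML_LIKE_EXTS = new Set([".html", ".htm", ".asp", ".php"]);

// Lowercased extension of the last path segment, or "" when there is none.
function extensionOf(url: string): string {
  let pathname: string;
  try {
    pathname = new URL(url).pathname;
  } catch (_e) {
    return "";
  }
  const lastSegment = pathname.slice(pathname.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  return dot > 0 ? lastSegment.slice(dot).toLowerCase() : "";
}

// Direct fetch anything with an extension that is not known to be HTML-like.
// URLs without an extension still go to the browser.
function shouldDirectFetchByDenylist(url: string): boolean {
  const ext = extensionOf(url);
  return ext !== "" && !HTML_LIKE_EXTS.has(ext);
}
```

Under this variant, `.csv`, `.avi`, and `.mkv` URLs would be fetched directly without anyone having to list them, at the cost of keeping the HTML-like list complete.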
Yeah I think we'd be better off avoiding an allowlist altogether. But a shorter "don't direct fetch these extensions" list could work, or going off of your second idea, maybe we just always try browser load and then if it's non-HTML, directly fetch it regardless of extension?
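A sketch of how that second idea could be decided, assuming the browser load exposes the response's Content-Type header. The `LoadPlan` type and `planAfterBrowserLoad` function are made up for illustration, and treating a missing Content-Type as HTML is an assumption, not something settled in this thread.

```typescript
// Hypothetical outcome of inspecting the browser's main response.
type LoadPlan = "keep-browser-result" | "direct-fetch";

// Decide what to do after the browser load, based only on the Content-Type header.
function planAfterBrowserLoad(contentType: string | null): LoadPlan {
  if (!contentType) {
    // Assumption: with no Content-Type, keep the browser result.
    return "keep-browser-result";
  }
  // Drop parameters such as "; charset=utf-8" before comparing.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  const isHtml = mime === "text/html" || mime === "application/xhtml+xml";
  return isHtml ? "keep-browser-result" : "direct-fetch";
}
```

This avoids any extension list, but every non-HTML URL would then be requested twice (once in the browser, once directly), which is the tradeoff against the extension-based checks.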