broken-link-checker
Check local files
I'm not able to check local files using the file:// protocol. I got an error:
ReferenceError: protocol is not defined
Unhandled rejection Error: undefined
Because it hasn't been implemented yet. You can only check local HTML documents with a String or a Stream.
Thank you for the answer. Could you show me the syntax for checking an HTML file with a string or a stream via the command line?
API only. It won't be added to the CLI either because once file:// support is added, there'll be no need for it.
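For example, with the API you could do something like this (a rough sketch only; the file path and base URL are placeholders, and external links will still be checked over the network):

var fs = require('fs');
var blc = require('broken-link-checker');

// Read the local document ourselves and hand the markup to HtmlChecker as a string.
var html = fs.readFileSync('_site/index.html', 'utf8');

var htmlChecker = new blc.HtmlChecker({}, {
  link: function(result) {
    if (result.broken) {
      console.log(result.url.resolved, blc[result.brokenReason]);
    }
  },
  complete: function() {
    console.log('done');
  }
});

// The second argument is the base URL used to resolve relative links;
// file:// links still won't be checked.
htmlChecker.scan(html, 'https://example.com/');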
Thank you for the answer.
I've started implementing this and it's a bit more complicated than I thought it would be. Is there a reason why supporting the file protocol would be necessary when a temporary server could be started in the site's directory?
@stevenvachon
In my case, I'd like to run a gulp task against my _site/**/*.html files after building my site to check for bad links. If a bad link is found, the gulp build task should fail and the build server will notify me. I wrote the following code after reviewing this issue and the API docs, but I seem to be missing something because it doesn't appear to work correctly:
var blc = require('broken-link-checker');
var gulp = require('gulp');

gulp.task('check-links', function() {
  var htmlChecker = new blc.HtmlChecker({
    html: function(tree, robots) {
    },
    junk: function(result) {
      console.log(result);
    },
    link: function(result) {
      if (result.broken) {
        console.log(blc[result.brokenReason]);
      } else if (result.excluded) {
        console.log(blc[result.excludedReason]);
      }
    },
    complete: function() {
      console.log("complete");
    }
  });

  // scan all site html files
  htmlChecker.scan('_site/**/*.html', '/');
});
When I run the task, this is all I see:
$ gulp check-links
[08:01:13] Using gulpfile ~/git/dev-docs/gulpfile.js
[08:01:13] Starting 'check-links'...
[08:01:13] Finished 'check-links' after 11 ms
$
Am I not using the correct syntax?
What you're trying to do will not work until v0.8 is released, unless you only want to check external links, since file:// links are not yet supported.
You could try:
const fs = require('fs');
const glob = require('glob');
const files = glob.sync('_site/**/*.html');
htmlChecker.scan(fs.createReadStream(files[0]), 'file://' + files[0]);
But you'll need to write your own queue logic that's performed on each complete.
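Roughly something like this, for example (an untested sketch; it assumes the glob package for listing files and that scan() accepts a readable stream, as above):

var fs = require('fs');
var glob = require('glob');
var blc = require('broken-link-checker');

var files = glob.sync('_site/**/*.html');
var index = 0;

var htmlChecker = new blc.HtmlChecker({}, {
  link: function(result) {
    if (result.broken) {
      console.log(result.url.resolved, blc[result.brokenReason]);
    }
  },
  complete: function() {
    // Simple queue: start the next file once the current scan has finished.
    scanNext();
  }
});

function scanNext() {
  if (index < files.length) {
    var file = files[index++];
    htmlChecker.scan(fs.createReadStream(file), 'file://' + file);
  }
}

scanNext();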
OK, thanks!
@jeff-matthews did you get that working?
I'm trying to do this, either with file:// from the CLI or through the API. Is this close?
@adamwolf no, not with broken-link-checker. I ended up using html-proofer instead.
For now, you're better off setting up and using a temporary local server to host your HTML files for checking.
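For example, something along these lines (a rough sketch only, not a drop-in solution; the port and directory are placeholders, it does no path sanitization, and it assumes blc's SiteChecker API with an end handler and enqueue()):

var fs = require('fs');
var http = require('http');
var path = require('path');
var blc = require('broken-link-checker');

var root = path.resolve('_site');

// Minimal static file server for the built site; local use only.
var server = http.createServer(function(req, res) {
  var urlPath = req.url.split('?')[0];
  if (urlPath.slice(-1) === '/') urlPath += 'index.html';
  var filePath = path.join(root, urlPath);

  fs.readFile(filePath, function(err, data) {
    if (err) {
      res.statusCode = 404;
      res.end('Not found');
    } else {
      if (path.extname(filePath) === '.html') {
        res.setHeader('Content-Type', 'text/html');
      }
      res.end(data);
    }
  });
});

server.listen(8080, function() {
  var siteChecker = new blc.SiteChecker({}, {
    link: function(result) {
      if (result.broken) {
        console.log(result.url.resolved, blc[result.brokenReason]);
      }
    },
    end: function() {
      console.log('done');
      server.close();
    }
  });

  siteChecker.enqueue('http://localhost:8080/');
});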
Sad to see that this feature wasn't implemented.
What issues did you face? Maybe we can help somehow?
I just haven't had time to finish it yet. The main difficulty is likely handling CORS.
Hm, I can't figure out how CORS is involved with local resources. When you work with local files, you just read them from the filesystem... that's all, isn't it?
How about a remote page that has a file:// link, or a local page that references a parent path?
file:// can be replaced by __dirname, I believe, turning it into a direct absolute filesystem path for Node.
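For illustration, a tiny sketch of what I mean (using Node's built-in url and path modules; it ignores Windows drive letters and percent-encoding):

var url = require('url');
var path = require('path');

// A file:// URL boils down to an absolute filesystem path...
var filePath = url.parse('file:///usr/local/index.html').pathname; // '/usr/local/index.html'

// ...and a relative href can be resolved against the current document's directory.
var sibling = path.resolve(__dirname, './about.html');

console.log(filePath, sibling);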
a local page that references a parent path
Well, if it's turned into a regular Node fs read with absolute filesystem paths, this shouldn't be an issue at all, since Node can access anything in the filesystem; it doesn't have restrictions.
For some cases it might be worth defining a cwd for broken-link-checker and ignoring any path that goes beyond it. But so far I don't see why that would be required.
I understand that my statements are quite naive, since I haven't dealt with such scenarios, but so far it seems like this should work...
It's not a question of whether it can, but whether it should. It's a security concern.
Right, that's exactly what I wanted to add.
I think that going beyond the defined domain sounds dangerous in any situation. And when doing local checking, we should consider the filesystem our domain and not follow remote pages. At best, we should check the existence of a remote page, but not scrape it.
Shouldn't that address our security concerns, since situations like a remote page pointing to file:// won't occur in the first place?
I think that applying CORS rules to local scraping won't work anyway, since CORS was designed to address quite different security issues.
No, it should absolutely not check existence in all circumstances. Imagine a page at /path/to/index.html with a link to file:///usr/local/.
It is not impossible for remote pages to have file:// links.
Well, this could be made optional. After all, when users run the lib against their own codebase, they have some sense of trust and can opt to enable full scraping of local resources, or to disable following file:// links. Why should we decide for the user?
Imagine a page at /path/to/index.html with a link to file:///usr/local/.
So far I don't see what harm it brings, to be honest.
It is not impossible for remote pages to have file:// links.
I'm starting to feel that I'm missing the point. What exactly do you mean by a remote page?
/path/to/index.html from your example seems to be a local resource.
Besides, if we only check the existence of a remote resource and stop scraping right there, does it make any difference whether the remote resource has file:// links or not?
I've seen this implemented as https://www.npmjs.com/package/broken-link-checker-local