broken-link-checker

Check local files

Open fbrzvnrnd opened this issue 9 years ago • 21 comments

I'm not able to check a local file using the file:// protocol. I get this error:

ReferenceError: protocol is not defined
Unhandled rejection Error: undefined

fbrzvnrnd avatar Aug 18 '16 10:08 fbrzvnrnd

Because it hasn't been implemented yet. You can only check local HTML documents with a String or a Stream.

stevenvachon avatar Aug 18 '16 11:08 stevenvachon

Thank you for the answer. Could you tell me the syntax for checking an HTML file as a string or stream from the command line?

fbrzvnrnd avatar Aug 18 '16 20:08 fbrzvnrnd

API only. It won't be added to the CLI either because once file:// support is added, there'll be no need for it.

stevenvachon avatar Aug 18 '16 20:08 stevenvachon

Thank you for the answer.

fbrzvnrnd avatar Aug 18 '16 22:08 fbrzvnrnd

I've started implementing this and it's a slight bit more complicated than I thought it'd be. Is there a reason why supporting the file protocol would be necessary when a temporary server could be started in the site's directory?

stevenvachon avatar Oct 06 '16 23:10 stevenvachon

@stevenvachon

In my case, I'd like to run a gulp task in my _site/**/*.html directory after building my site to check for bad links. If a bad link is found, the gulp build task will fail and the build server will notify me. I wrote the following code after reviewing this issue and the API docs, but I seem to be missing something because it doesn't appear to work correctly:

var blc  = require('broken-link-checker');
var gulp = require('gulp');

gulp.task('check-links', function() {

  var htmlChecker = new blc.HtmlChecker( {
    html: function(tree, robots) {
    },
    junk: function(result) {
      console.log(result);
    },
    link: function(result) {
      if (result.broken) {
        console.log(blc[result.brokenReason]);
      } else if (result.excluded) {
        console.log(blc[result.excludedReason]);
      }
    },
    complete: function() {
      console.log("complete");
    }
  });

  //scan all site html files
  htmlChecker.scan('_site/**/*.html', '/');

});

When I run the task, this is all I see:

$ gulp check-links
[08:01:13] Using gulpfile ~/git/dev-docs/gulpfile.js
[08:01:13] Starting 'check-links'...
[08:01:13] Finished 'check-links' after 11 ms
$

Am I not using the correct syntax?

jeff-matthews avatar Nov 30 '16 14:11 jeff-matthews

What you're trying to do will not work until v0.8 is released, unless you only want to check external links, since file:// links are not yet supported.

You could try:

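const fs = require('fs');
const glob = require('glob');
const files = glob.sync('_site/**/*.html'); // expands the pattern into file paths (minimatch only tests a path against a pattern)
htmlChecker.scan( fs.createReadStream(files[0]), 'file://'+files[0] );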

But you'll need to write your own queue logic that's performed on each complete.
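
The queue logic mentioned above could look roughly like the following sketch. The names `scanSequentially` and `startScan` are hypothetical and not part of broken-link-checker; the only idea shown is that each scan's completion triggers the next file.

```javascript
// Hypothetical helper (not part of broken-link-checker): process files one at a
// time, starting the next scan only after the previous one reports completion.
function scanSequentially(files, startScan, done) {
  let index = 0;

  function next() {
    if (index >= files.length) {
      done();
      return;
    }
    const file = files[index];
    index += 1;
    // startScan must call the supplied callback exactly once when its scan completes.
    startScan(file, next);
  }

  next();
}
```

With HtmlChecker, `startScan` would presumably keep the callback somewhere the `complete` handler can reach, then call `htmlChecker.scan(fs.createReadStream(file), 'file://' + file)` as in the snippet above.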

stevenvachon avatar Nov 30 '16 17:11 stevenvachon

OK, thanks!

jeff-matthews avatar Nov 30 '16 18:11 jeff-matthews

@jeff-matthews did you get that working?

I'm trying to do this, either file:// from the CLI or through the API. Is this close?

adamwolf avatar Dec 14 '16 17:12 adamwolf

@adamwolf no, not with broken-link-checker. I ended up using html-proofer instead.

jeff-matthews avatar Dec 14 '16 17:12 jeff-matthews

For now, you're better off setting up and using a temporary local server to host your HTML files for checking.
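
A temporary server along those lines can be sketched with Node's built-in modules only. The helper names `resolveRequestPath` and `createSiteServer` are made up for illustration, and the content-type handling is deliberately minimal; treat it as a starting point under those assumptions rather than a finished static server.

```javascript
const http = require('http');
const fs = require('fs');
const path = require('path');

// Hypothetical helper: map a request URL onto a file under rootDir.
// Returns null when the URL is malformed or would escape rootDir.
function resolveRequestPath(rootDir, requestUrl) {
  const root = path.resolve(rootDir);
  let pathname;
  try {
    pathname = decodeURIComponent(requestUrl.split('?')[0]);
  } catch (err) {
    return null; // malformed percent-encoding
  }
  if (pathname.endsWith('/')) {
    pathname += 'index.html'; // serve directory indexes
  }
  const target = path.resolve(root, '.' + pathname);
  const inside = target === root || target.startsWith(root + path.sep);
  return inside ? target : null;
}

// Hypothetical helper: a minimal static server for the built site.
function createSiteServer(rootDir) {
  return http.createServer((req, res) => {
    const filePath = resolveRequestPath(rootDir, req.url);
    if (filePath === null) {
      res.writeHead(403);
      res.end('Forbidden');
      return;
    }
    fs.readFile(filePath, (err, data) => {
      if (err) {
        res.writeHead(404);
        res.end('Not found');
        return;
      }
      const type = path.extname(filePath) === '.html'
        ? 'text/html'
        : 'application/octet-stream';
      res.writeHead(200, { 'Content-Type': type });
      res.end(data);
    });
  });
}
```

For example, `createSiteServer('_site').listen(8080)` would serve the build output, the checker can then be pointed at `http://localhost:8080/`, and the server can be closed once the check completes. The port number is arbitrary.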

stevenvachon avatar Dec 14 '16 17:12 stevenvachon

Sad to see that this feature wasn't implemented.

What issues did you face? Maybe we can help somehow?

ArmorDarks avatar Jul 19 '17 19:07 ArmorDarks

I just haven't had time to finish it yet. The main difficulty in it is likely handling CORS.

stevenvachon avatar Jul 19 '17 20:07 stevenvachon

Hm, I can't figure out how CORS is involved with local resources. When you work with local files, you just read them with fs... that's all, isn't it?

ArmorDarks avatar Jul 19 '17 20:07 ArmorDarks

How about a remote page that has a file:// link, or a local page that references a parent path?

stevenvachon avatar Jul 19 '17 20:07 stevenvachon

I believe file:// can be replaced by __dirname, turning it into a direct absolute filesystem path for Node.

"a local page that references a parent path"

Well, if it is turned into a regular Node fs read with absolute filesystem paths, this shouldn't be an issue at all, since Node can access anything in the filesystem; it doesn't have restrictions.

In some cases it might be worth defining a cwd for broken-link-checker and ignoring any path that goes beyond it. But so far I don't see why it would be required.

I understand that my statements are quite amateurish, since I haven't dealt with such scenarios, but so far it seems like it should work...
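
The cwd restriction suggested above could be sketched as follows. `isInsideRoot` is a hypothetical helper, not something broken-link-checker provides; it only assumes Node's `url.fileURLToPath` and `path.relative`, and it does not resolve symlinks.

```javascript
const path = require('path');
const { fileURLToPath } = require('url');

// Hypothetical helper: decide whether a file:// link stays inside rootDir.
function isInsideRoot(fileUrl, rootDir) {
  let target;
  try {
    target = fileURLToPath(fileUrl);
  } catch (err) {
    return false; // not a usable file:// URL
  }
  const relative = path.relative(path.resolve(rootDir), target);
  // An empty string means the root itself; a leading ".." segment (or an
  // absolute result, e.g. another drive on Windows) means the target is outside.
  return relative === '' ||
    (relative.split(path.sep)[0] !== '..' && !path.isAbsolute(relative));
}
```

A checker using this kind of guard could skip or flag any file:// link for which it returns false instead of touching the filesystem.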

ArmorDarks avatar Jul 19 '17 20:07 ArmorDarks

It's not a question of whether it can, but whether it should. It's a security concern.

stevenvachon avatar Jul 19 '17 20:07 stevenvachon

Right, that is exactly what I wanted to add.

I think that going beyond the defined domain sounds dangerous in any situation. When checking locally, we should consider the filesystem our domain and not follow remote pages. At most, we should check that a remote page exists, but not scrape it.

Shouldn't that address our security concerns, since situations such as a remote page pointing to file:// won't occur in the first place?

I think that applying CORS rules to local scraping won't work anyway, since they were designed to address quite different security issues.

ArmorDarks avatar Jul 19 '17 20:07 ArmorDarks

No, it should absolutely not check existence in all circumstances. Imagine a page at /path/to/index.html with a link to file:///usr/local/.

It is not impossible for remote pages to have file:// links.

stevenvachon avatar Jul 19 '17 20:07 stevenvachon

Well, this can be made optional. After all, when users run the lib against their own codebase, they have some level of trust in it and can opt to enable full scraping of local resources, or disable following file:// links. Why should we decide for the user?

"Imagine a page at /path/to/index.html with a link to file:///usr/local/."

So far I don't see what harm it does, to be honest.

"It is not impossible for remote pages to have file:// links."

I'm starting to feel that I'm missing the point. What exactly do you mean by a remote page?

/path/to/index.html from your example seems to be a local resource.

Besides, if we only check that a remote resource exists and stop scraping right there, does it make any difference whether the remote resource has file:// links or not?

ArmorDarks avatar Jul 19 '17 20:07 ArmorDarks

I've seen this implemented as https://www.npmjs.com/package/broken-link-checker-local

honzajavorek avatar Jun 29 '20 13:06 honzajavorek