Unicode handling of --include and --exclude

Open mr-bo-jangles opened this issue 3 years ago • 8 comments

So my specific use case here is attempting to mirror a site with a lot of directories for various languages, while skipping the static files at a higher level.

Example Folder Structure

/Static/<collection of unwanted static files>
/Assets/<collection of unwanted static files>
/Books/
      ./ -> /Books/
      ../ -> /
      ===/<directory tree of unwanted static files>
      121/<directory tree of static files>
      Help/<directory tree of static files>
      مساعدة/<directory tree of static files>
      Помощь/<directory tree of static files>

I want to be sure that by running a command similar to `suckit https://domain.tld -i "/Books/[a-Z0-9]+/"` I will download the tree under `/Books/` while excluding anything under `./`, `../`, and `===/`.
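One way to check this before running a full mirror is to test the pattern against sample paths offline. A minimal Python sketch (suckit takes Rust regex patterns, but Python's `re` behaves the same way on these points: both reject the reversed `a-Z` range, and both treat `\w` as Unicode-aware by default; the paths are taken from the example tree above):

```python
import re

# The posted class `[a-Z0-9]` contains the reversed range `a-Z` and is
# rejected by Python's re; the Rust regex crate also errors on it.
try:
    re.compile(r"/Books/[a-Z0-9]+/")
except re.error as err:
    print("pattern rejected:", err)

# ASCII-only vs. Unicode-aware alternatives.
ascii_pat = re.compile(r"/Books/[A-Za-z0-9]+/")
word_pat = re.compile(r"/Books/\w+/")

for path in ["/Books/121/x", "/Books/Help/x", "/Books/مساعدة/x",
             "/Books/Помощь/x", "/Books/===/x"]:
    print(path, bool(ascii_pat.search(path)), bool(word_pat.search(path)))
```

With `\w+` the Unicode directories match while `===/` (and `./`, `../`) do not; the ASCII-only class would silently skip مساعدة and Помощь.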

mr-bo-jangles avatar Jul 09 '21 08:07 mr-bo-jangles

This looks correct. The best way to know is by testing it, and I would love to see the result of such a test. If you can build this directory tree, just serve it with a webserver and try running suckit against localhost.

Skallwar avatar Jul 09 '21 12:07 Skallwar

@mr-bo-jangles Did it work?

Skallwar avatar Aug 31 '21 21:08 Skallwar

Maybe we can add an option to output URL filtering information to stdout or a file, e.g., whether the include or exclude regex matched? I think this would give more transparency about what suckit is doing. I also plan to implement functionality to rewrite the local URLs that could benefit from this debug feature.
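A sketch of what such a filter-with-logging could look like (the function name, parameter names, and log format here are all hypothetical, not suckit's actual interface):

```python
import re
import sys

def should_visit(url, include=None, exclude=None, log=sys.stderr):
    """Hypothetical URL filter that reports every decision, so the user
    can see why each discovered link was kept or skipped."""
    if include and not re.search(include, url):
        print(f"skip (include did not match): {url}", file=log)
        return False
    if exclude and re.search(exclude, url):
        print(f"skip (exclude matched): {url}", file=log)
        return False
    print(f"keep: {url}", file=log)
    return True
```

Writing the decisions to stderr (or a file) keeps them out of any piped stdout output while still making the filtering visible.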

raphCode avatar Mar 13 '22 20:03 raphCode

> Maybe we can add an option to output URL filtering information to stdout or a file

Good idea

> I also plan to implement functionality to rewrite the local URLs that could benefit from this debug feature.

What do you mean?

Skallwar avatar Mar 17 '22 22:03 Skallwar

> What do you mean?

To download a phpBB forum, I added a hack to rewrite some URLs, namely to remove a `?sid=<hash>` parameter. Otherwise the same pages get downloaded over and over again with different sid hashes. If you want to take a look: https://github.com/raphCode/suckit/blob/fusornet_hack/src/scraper.rs#L191

I originally planned to flesh this out into a dedicated feature / command-line option, but eventually didn't: I had already achieved my goal, and I could not figure out a way to do it properly.
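The rewrite boils down to dropping one query parameter before the URL is deduplicated and fetched. A rough stdlib-Python equivalent of the idea (the linked hack does this in Rust inside scraper.rs; the function name and example URL here are illustrative):

```python
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def strip_param(url, name):
    """Return the URL with one query parameter removed, leaving the
    rest of the query string intact."""
    parts = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k != name]
    return urlunparse(parts._replace(query=urlencode(kept)))

print(strip_param("https://forum.example/viewtopic.php?t=42&sid=abc123", "sid"))
# -> https://forum.example/viewtopic.php?t=42
```

Applied to every link before it enters the download queue, this collapses all the `sid` variants of a page into one canonical URL.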

raphCode avatar Mar 22 '22 19:03 raphCode

The problem with removing parameters such as `?sid` is that they might change the content of the requested page. If you remove them, two links that are identical except for their parameters will end up sharing a single page downloaded by suckit, when they should get two different pages.

Skallwar avatar May 02 '22 09:05 Skallwar

In general you are correct, but in the specific case of phpBB the content is always the same, no matter the `?sid` parameter value. One solution would be to just ignore all links with this parameter, as suggested here, but this may create a swath of broken links. Instead, I just removed the parameter from the URL and collapsed all links into their "canonical" form without the session id parameter.

I actually just found a different solution, namely sending session cookies, which avoids `?sid` parameters being appended to links in the first place.

raphCode avatar May 02 '22 15:05 raphCode

We could imagine a solution where you would have a list of tuples, each pairing a regex with the list of parameters to remove:

`Vec<(regex, Vec<parameter>)>`

But it might be really costly.
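In Python terms, the proposed `Vec<(regex, Vec<parameter>)>` could be sketched like this (the rule contents are made up for illustration; only parameters listed for a matching rule get dropped):

```python
import re
from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

# Hypothetical rule table mirroring Vec<(regex, Vec<parameter>)>:
# if a URL's path matches the regex, the listed parameters are removed.
RULES = [
    (re.compile(r"viewtopic\.php"), ["sid"]),
]

def canonicalize(url, rules=RULES):
    parts = urlparse(url)
    drop = {p for pattern, params in rules
            if pattern.search(parts.path) for p in params}
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunparse(parts._replace(query=urlencode(kept)))
```

As for cost: every discovered URL is matched against every rule's regex, so this stays cheap for a handful of rules and only becomes a concern if the table grows large.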

Skallwar avatar May 03 '22 11:05 Skallwar