browsertrix-crawler Exclusion rules for browser behaviors

Exclusion rules for browser behaviors

Open pato-pan opened this issue 1 year ago • 0 comments

I don't think there is a way to exclude files or URLs that are fetched by browser behaviors. The only workaround I know of is to disable autoplay by passing --behaviors autoscroll,autofetch,siteSpecific, but that could also drop other elements on the page that you do want to capture.

I made an example page, which can be seen here: https://patotester14.blogspot.com

Below is what the log shows when the autoplay behavior is enabled and the crawler is allowed to download these elements:

{"timestamp":"2023-10-24T13:25:48.339Z","logLevel":"debug","context":"behaviorScript","message":"Starting behavior: Autoplay","details":{"page":"https://patotester14.blogspot.com/","workerid":0}}
{"timestamp":"2023-10-24T13:25:48.341Z","logLevel":"debug","context":"behaviorScript","message":"processing media element: <audio controls=\"\" autoplay=\"\" loop=\"\" controlslist=\"nodownload\">\n<source src=\"https://drive.google.com/uc?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&amp;export=download\">\n</audio>","details":{"page":"https://patotester14.blogspot.com/","workerid":0}}
{"timestamp":"2023-10-24T13:25:48.344Z","logLevel":"debug","context":"behaviorScript","message":"fetch media source URL: https://drive.google.com/uc?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download","details":{"page":"https://patotester14.blogspot.com/","workerid":0}}
{"timestamp":"2023-10-24T13:25:48.344Z","logLevel":"debug","context":"behaviorScript","message":"media URL found, pausing playback","details":{"page":"https://patotester14.blogspot.com/","workerid":0}}

To exclude this file, I tried the rules below. I think the first rule alone is enough to show that the crawler is unable to exclude the element.

    exclude:
      - .*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*
      - .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
      - .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
    blockRules:
      - url: .*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*
      - url: .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
      - url: .*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download
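
As a sanity check that the patterns themselves are not the problem, they can be tested against the URL from the log. This is a rough sketch using Python's `re` module. The crawler evaluates its rules as JavaScript regular expressions, so treating Python as equivalent is an assumption, although for patterns this simple the two should behave the same. The helper name `matching_patterns` is hypothetical.

```python
import re

# URL exactly as it appears in the "fetch media source URL" log line.
MEDIA_URL = "https://drive.google.com/uc?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download"

# The two distinct patterns from the exclude / blockRules config above.
PATTERNS = [
    r".*1UBgLcRFGNatNULprWqn6SWFQu26kCUuO.*",
    r".*drive.google.com/uc\?id=1UBgLcRFGNatNULprWqn6SWFQu26kCUuO&export=download",
]


def matching_patterns(url, patterns):
    """Return the patterns that match somewhere in the given URL."""
    return [p for p in patterns if re.search(p, url)]


if __name__ == "__main__":
    for pattern in matching_patterns(MEDIA_URL, PATTERNS):
        print("matches:", pattern)
```

Both patterns match the logged URL here, which suggests the rules are not being applied to behavior-initiated fetches rather than the regexes being wrong.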

This would be needed whenever you want to download every media element except one that is too large. My own use case is different: I am using Google Drive, and Google Drive always provides a different download link for the same file, so the file gets redownloaded a ridiculous number of times, enough to turn a 100 MB archive into a 1 GB archive. You may see this happen with my example, but to a lesser degree, since the example only has a few pages and I made sure to choose a small audio file.

Oct 24 '23 15:10 pato-pan
