brozzler

How to add behaviors?

Open sepastian opened this issue 6 years ago • 6 comments

Does brozzler have support for adding new, or customizing existing behaviors?

From what I understood, this requires both a yaml file matching urls to behaviors, and the actual behaviors in js files.

If there's no support currently, how about adding one or more flags that allow specifying additional yaml and js files, or directories containing these? Where would be a good place to implement this, so it could become an official feature of brozzler?

sepastian avatar Nov 27 '18 08:11 sepastian

Hi, @sepastian !

Yes, brozzler supports adding new, and customizing existing, behaviors.

brozzler's yaml file matching urls to behaviors is here: https://github.com/internetarchive/brozzler/blob/master/brozzler/behaviors.yaml

brozzler currently looks for behavior js (often as jinja2 templates) in this directory: https://github.com/internetarchive/brozzler/tree/master/brozzler/js-templates

galgeek avatar Nov 27 '18 17:11 galgeek

Hello @sepastian, what @galgeek said. The code that loads behaviors is in brozzler/__init__.py: https://github.com/internetarchive/brozzler/blob/master/brozzler/__init__.py#L97

Maybe you could add support for an environment variable or command line option to point brozzler at a different or additional behaviors.yaml. I'm not sure right now what makes the most sense. Can you say a little bit more about your use case?

nlevitt avatar Nov 27 '18 18:11 nlevitt

Thanks for your replies @galgeek and @nlevitt.

I found where brozzler is loading behaviors.

As for my use case: we are crawling various newspapers for a research project. One newspaper, for example, redirects to the start page when scrolling to the bottom; this leads to the crawl never finishing, because brozzler never reaches the bottom of the page. (This is my interpretation of what's going on; I don't have much experience with brozzler, though. Is that possible?) This is why new behaviors need to be added.

Now, we could clone the brozzler repository and edit the files defining behaviors there. But I thought it would be easier to install brozzler using pip, and still be able to define behaviors. That is, we would like to define behaviors without cloning the brozzler repository.

As I see it, it would be enough to define the location of behaviors.yaml; this file would then reference JS templates at various locations.

Your suggestion of specifying the location of behaviors.yaml through an env variable makes sense. In addition, that location could point to a directory, e.g. behaviors.d/, containing behaviors.d/some_behavior.yaml, behaviors.d/another_behavior.yaml and so on. This way, one could add new behaviors in separate files, similar to how configuration is often done in Linux, compare for example /etc/rsyslog.d/. Then, when referencing a JS template inside any YAML file, that JS template would be searched for relative to the location of the YAML file. This way, it won't be necessary to specify the base path of JS templates explicitly.
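
To make the idea concrete, here is a minimal Python sketch of the lookup described above. It is not existing brozzler code: the environment variable name BROZZLER_BEHAVIORS and both function names are hypothetical, chosen only for illustration. It shows how a location could name either a single YAML file or a behaviors.d/ directory, and how a JS template reference could be resolved relative to the YAML file that mentions it.

```python
import os
from pathlib import Path

# Hypothetical variable name; brozzler does not define it.
ENV_VAR = "BROZZLER_BEHAVIORS"


def behavior_files(environ=os.environ):
    """Return the YAML files named by ENV_VAR.

    The value may point at one file, or at a directory whose *.yaml
    files are all returned in sorted order.
    """
    location = environ.get(ENV_VAR)
    if not location:
        return []
    path = Path(location)
    if path.is_dir():
        return sorted(path.glob("*.yaml"))
    return [path]


def resolve_template(yaml_file, template_ref):
    """Resolve a JS template reference relative to the YAML file that mentions it."""
    return (Path(yaml_file).parent / template_ref).resolve()
```

With this, a reference such as ../js-templates/template1.js inside behaviors.d/site1.yaml resolves to the js-templates directory next to behaviors.d, so no base path for templates has to be configured.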

Finally, brozzler could search for behaviors in standard locations, such as <FILEPATH>/behaviors.d/*.yaml and $HOME/.brozzler/behaviors.d/*.yaml (which maps to ~/.brozzler/behaviors.d/ on Linux, and to some path under \Users on Windows). Here, FILEPATH is the directory containing the script currently executing. These locations would be searched in order; the default/fallback location would be the behaviors.yaml file that currently ships with brozzler. Behaviors from all locations would be merged, i.e. the behaviors shipping with brozzler would still be available, but could be overridden by another definition.
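
The merge step could look roughly like the following Python sketch. It assumes each behavior is a mapping identified by a url_regex key and carrying a behavior_js_template key, which is my reading of the shipped behaviors.yaml and should be checked against the real file. The function name and the template names are hypothetical.

```python
def merge_behaviors(*sources):
    """Merge behavior lists given from highest to lowest priority.

    The first definition seen for a url_regex wins, so earlier sources
    override later ones; entries unique to any source are kept.
    """
    merged = {}
    for behaviors in sources:
        for behavior in behaviors:
            # setdefault keeps the entry from the higher-priority source.
            merged.setdefault(behavior["url_regex"], behavior)
    return list(merged.values())


# Hypothetical data: a project-level override plus the shipped defaults.
project = [
    {"url_regex": r"^https?://example\.com/.*$", "behavior_js_template": "template1.js"},
]
shipped = [
    {"url_regex": r"^https?://example\.com/.*$", "behavior_js_template": "default.js"},
    {"url_regex": r"^.*$", "behavior_js_template": "fallback.js"},
]
merged = merge_behaviors(project, shipped)
```

Because dictionaries preserve insertion order, entries from the higher-priority location also come first in the merged list, which matters if behaviors are tried in list order.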

With these changes, it would be possible to maintain behaviors like so.

my-brozzler-project
  behaviors.d
    site1.yaml # references ../js-templates/template1.js
    site2.yaml # references ../js-templates/template2.js
  js-templates
    template1.js
    template2.js

When executing cd my-brozzler-project && brozzler-easy, behaviors would be picked up from the standard location <FILEPATH>/behaviors.d/*.yaml; the JS templates referenced inside the behavior files would be loaded relative to the location of the YAML files where they are referenced.

This could be achieved just by extending how behaviors are currently loaded in https://github.com/internetarchive/brozzler/blob/master/brozzler/__init__.py#L97.

What do you think?

sepastian avatar Nov 28 '18 09:11 sepastian

Hi, @sepastian !

Which newspaper site never finishes crawling? I'd like to take a look and see if we can fix this.

galgeek avatar Nov 30 '18 05:11 galgeek

Hi @galgeek, sorry for the wait.

It's https://www.sueddeutsche.de/thema/Europawahl. This page lists articles concerning the upcoming election for the European Parliament. Our goal is to archive that page, plus all articles referenced on that page.

I don't know much about brozzler yet, but I suspect it stalls because the job file defines max_hops: 1. When crawling an article one level below the seed page, the redirect occurs when scrolling to the end of the page. It seems that brozzler never reaches the end of the page, because the redirect occurs first, so the next page gets loaded (two levels below the seed), and brozzler waits forever for the crawl to finish. Could that be the problem?

The job file:

id: sueddeutsche.de
warcprox_meta:
  warc-prefix: sueddeutsche_de
  stats:
    buckets:
      - sueddeutsche-de-stats
seeds:
- url: https://www.sueddeutsche.de/thema/Europawahl
  time_limit: 60
  scope:
    accepts:
    # Article links consist of name + id, e.g. /abc-cde-1.123
    # Ignore pages names /1.123 (without name).
    - regex: ^.+-\d\.\d+$
    # Include sibling pages.
    - regex: ^.+Europawahl-\d+$
    blocks:
    - substring: news/homepagefeed/count
    - regex: ^https?://www.sueddeutsche.de$
    - domain: adsafeprotected.com
    - domain: contentinsights.com
    - domain: lp4.io
    - domain: moatads.com
    max_hops: 1
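
As a quick sanity check of the regex rules above, the following Python sketch applies them to a few made-up URLs with the standard re module. It covers only the regex rules, not the substring or domain rules, and it is not how brozzler evaluates scope: brozzler may canonicalize URLs and match them differently, so treat the results as an approximation. The example URLs and the helper name are hypothetical.

```python
import re

# Regex rules copied from the job file above.
ACCEPTS = [r"^.+-\d\.\d+$", r"^.+Europawahl-\d+$"]
BLOCKS = [r"^https?://www.sueddeutsche.de$"]


def matches_any(patterns, url):
    """Return True if any pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical URLs shaped like the ones described in the comments.
article = "https://www.sueddeutsche.de/politik/example-article-1.1234567"
unnamed = "https://www.sueddeutsche.de/1.1234567"
sibling = "https://www.sueddeutsche.de/thema/Europawahl-2"

accepted = {url: matches_any(ACCEPTS, url) for url in (article, unnamed, sibling)}

# The block pattern ends right after the host name, so the same URL
# written with a trailing slash does not match it.
blocks_bare = matches_any(BLOCKS, "https://www.sueddeutsche.de")
blocks_slash = matches_any(BLOCKS, "https://www.sueddeutsche.de/")
```

Here the named article and the sibling page are accepted, the unnamed /1.1234567 page is not, and the start page is blocked only when it is written without a trailing slash.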

sepastian avatar Dec 04 '18 12:12 sepastian

@sepastian something like what you propose should be fine, though I'm not sure about the details at this moment. I think it's best to focus on identifying the issue with your particular crawl first.

nlevitt avatar Dec 05 '18 00:12 nlevitt