4cat Added Selenium URL scraper as new datasource; modified column filter to allow detailed matching information

Open dale-wahl opened this issue 3 years ago • 3 comments

Added selenium_scraper as a new Search class to be used in creating new datasources.
Created the url_scraper datasource, which allows a user to scrape a list of URLs and up to 5 subpages on the host.
Modified the column-filter processor to provide detailed output showing which matches were found in a given column (this can mimic Tracker Tracker by searching for substrings within HTML and noting which were found/not found).
Updated the Search base class to allow an after_search_completed method; this was necessary to ensure the Selenium webdriver and Chrome browser are properly closed, and it also works with get_items generators.
Added a validate_url helper function to helpers.py.
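
As a rough illustration of the detailed matching output described for the column filter, a minimal sketch might look like the following. The function name `match_details`, the case-insensitive comparison, and the sample tracker list are assumptions made for illustration; this is not the processor's actual code.

```python
def match_details(value, substrings):
    """Report which substrings occur in a column value and which do not.

    Hypothetical helper: the comparison is case-insensitive by assumption.
    """
    haystack = (value or "").lower()
    found = [s for s in substrings if s.lower() in haystack]
    missing = [s for s in substrings if s.lower() not in haystack]
    return {"found": found, "missing": missing}


# Example: look for tracker hostnames inside a scraped HTML column
html = '<script src="https://www.google-analytics.com/analytics.js"></script>'
trackers = ["google-analytics.com", "connect.facebook.net"]
details = match_details(html, trackers)
print(details)
```

Keeping both the found and the missing substrings per row is what would make a Tracker Tracker style comparison possible across many pages.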

@stijn-uva I have tested this and not found any issues, but did want you to particularly review the changes to Search.
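
For reviewers, the idea behind the after_search_completed hook can be sketched roughly as below. The class bodies, the `search` wrapper, and `FakeBrowserSearch` are hypothetical stand-ins, not the actual 4CAT code; the sketch only shows how a `try`/`finally` around a generator lets cleanup run once the items are exhausted or the generator is closed.

```python
class Search:
    """Hypothetical stand-in for the Search base class."""

    def get_items(self, query):
        raise NotImplementedError

    def after_search_completed(self):
        """Cleanup hook; a no-op unless a subclass overrides it."""
        pass

    def search(self, query):
        # Because this is a generator, the finally block runs when the
        # items are exhausted, when the generator is closed, or on error.
        try:
            yield from self.get_items(query)
        finally:
            self.after_search_completed()


class FakeBrowserSearch(Search):
    """Simulates a Selenium-backed search without a real browser."""

    def __init__(self):
        self.browser_open = False

    def get_items(self, query):
        self.browser_open = True  # stands in for starting a webdriver
        for url in query["urls"]:
            yield {"url": url}

    def after_search_completed(self):
        self.browser_open = False  # stands in for quitting the webdriver


searcher = FakeBrowserSearch()
items = list(searcher.search({"urls": ["https://example.com", "https://example.org"]}))
print(len(items), searcher.browser_open)
```

One consequence of this design is that a caller who stops iterating early should close the generator explicitly so the browser is released promptly.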

Nov 29 '21 13:11 dale-wahl

I need to update the setup (Docker and manual). Some references I use to get Firefox working properly: https://takac.dev/example-of-selenium-with-python-on-docker-with-latest-firefox/ and https://github.com/mozilla/geckodriver/releases/tag/v0.30.0. Google Chrome can also be removed.

Apr 29 '22 12:04 dale-wahl

Updated this branch to work with the new config_manager/database settings. Also updated Selenium and made some minor bug fixes. Finally, created a separate installation that only runs when the Selenium settings are entered in the 4CAT settings (automatically in Docker when the backend container is restarted).

The only "to-do" left is a quality-of-life update to the frontend to separate out these specific datasources, since they do not conform to the social media paradigm of other datasources.

Aug 02 '22 11:08 dale-wahl

Merged master into this branch and fixed all the conflicts. I had updated the column_filter, but those merges were a mess (both you and I made changes in the master that conflicted) so I left it as is. I may revisit or think up a better way for users to search for specific bugs/trackers.

I tested installation and both the URL and Screenshot processors work. There is an install script for Docker that automatically runs if you add Firefox to the 4CAT settings (you also need to enable the desired datasources). Right now the install runs when restarting the Docker container. I will try to revisit that and see if I can get it to work with the interactive restart.

Jan 17 '23 12:01 dale-wahl
