4cat
Added Selenium URL scraper as new datasource; modified column filter to allow detailed matching information
- Added `selenium_scraper` as a new Search class to be used in creating new datasources.
- Created `url_scraper` datasource which allows a user to scrape a list of URLs and up to 5 subpages on the host
- Modified the `column-filter` processor to provide detail output showing which matches were found in a given column (this can mimic Tracker Tracker by searching for substrings within HTML and noting which were found/not found in the HTML)
- Updated Search base class to allow `after_search_completed` method; this was necessary to ensure Selenium webdriver and Chrome browser are properly closed and also works with `get_items` generators.
- `validate_url` helper function added to `helpers.py`
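
For reviewers, here is a minimal sketch of the cleanup-hook idea described above. It is not the code in this branch. The `Search` stand-in, its `search` method, `FakeBrowser` and `UrlScraper` are hypothetical names used only for illustration, and the real base class may call the hook differently. Only `get_items` and `after_search_completed` are names taken from the description. The sketch shows one way a hook can be made to run even when `get_items` is a lazy generator.

```python
class Search:
    """Hypothetical stand-in for the Search base class (not 4CAT's code)."""

    def get_items(self, query):
        raise NotImplementedError

    def after_search_completed(self):
        """Optional cleanup hook; does nothing unless a subclass overrides it."""
        pass

    def search(self, query):
        # try/finally ensures the hook runs once iteration ends, whether
        # get_items returns a list or a generator, and also when iteration
        # raises or the consumer closes the generator early.
        try:
            yield from self.get_items(query)
        finally:
            self.after_search_completed()


class FakeBrowser:
    """Assumed placeholder for a Selenium webdriver, so the sketch runs offline."""

    def __init__(self):
        self.open = True

    def fetch(self, url):
        return {"url": url, "html": "<html></html>"}

    def quit(self):
        self.open = False


class UrlScraper(Search):
    def __init__(self):
        self.browser = FakeBrowser()

    def get_items(self, query):
        # Generator: pages are only fetched as the caller iterates.
        for url in query["urls"]:
            yield self.browser.fetch(url)

    def after_search_completed(self):
        # Close the browser once all items have been consumed.
        self.browser.quit()


scraper = UrlScraper()
items = list(scraper.search({"urls": ["https://example.com"]}))
```

Placing the hook in a `finally` clause around the iteration, instead of calling it right after `get_items` returns, matters for generators: at the moment `get_items` returns a generator object, no page has been fetched yet, so closing the browser at that point would be too early.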
@stijn-uva I have tested this and not found any issues, but did want you to particularly review the changes to Search.
I need to update the setup (Docker and manual). Some references I used to get Firefox working properly: https://takac.dev/example-of-selenium-with-python-on-docker-with-latest-firefox/ and https://github.com/mozilla/geckodriver/releases/tag/v0.30.0. Google Chrome can also be removed.
Updated this branch to work with the new config_manager/database settings. Also updated Selenium and made some minor bug fixes. Finally, created a separate installation that only runs when the Selenium settings are entered in the 4CAT settings (automatically in Docker when the backend container is restarted).
The only "to-do" left is a quality-of-life update to the frontend to separate out these specific datasources, since they do not conform to the social media paradigm of the other datasources.
Merged master into this branch and fixed all the conflicts. I had updated the column_filter, but those merges were a mess (both you and I made changes in the master that conflicted) so I left it as is. I may revisit or think up a better way for users to search for specific bugs/trackers.
I tested the installation, and both the URL and Screenshot processors work. There is an install script for Docker that runs automatically if you add firefox to the 4CAT settings (you also need to enable the desired datasources). Right now the install runs when the Docker container is restarted. I will try to revisit that and see if I can get it to work with the interactive restart.