
Add data source that interfaces with a TCAT tweet database

Open stijn-uva opened this issue 4 years ago • 3 comments

In previous data sprints people have uploaded tweet dumps to 4CAT for analysis, so it seems to have some utility for analysis of that kind of data. In case a TCAT database is running on the same server, or accessible remotely, a data source that gets tweets from it according to parameters could be quite useful.

stijn-uva avatar Jan 18 '21 12:01 stijn-uva

We have a data source for the v2 API now. Since TCAT uses the v1 API, the straightforward approach would be to add a separate data source that yields v1 data. This also signals to users that these tweets come from a different (if similar) source. If the data maps to the same basic format as the v2 datasource's `map_item()` provides, most if not all processors should remain cross-compatible.

stijn-uva avatar Dec 17 '21 14:12 stijn-uva

I wanted to record my notes/progress on connecting directly to a TCAT database.

The first step is opening up a TCAT instance to remote connections.

  1. Comment out `skip-networking` (if active) and `bind-address`. I found `bind-address` in `/etc/mysql/mariadb.conf.d/50-server.cnf` on tcat4.
  2. Restart MySQL: `sudo systemctl restart mariadb.service`
     • I first checked both `sudo systemctl status mariadb.service` and `SHOW PROCESSLIST` in MySQL (tcat4 hasn't been collecting for a while now).
  3. Add a new user with remote access: `GRANT SELECT ON twittercapture.* TO newuser@'1.2.3.4' IDENTIFIED BY 'newpassword';`
     • Wildcards are allowed in IP addresses, so in principle something like `192.168.%` could give access across our network.
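The `GRANT` step above can be sketched as a small helper that builds the statement for a given client address (a sketch only; `newuser`, `newpassword`, and the addresses are the placeholder values from the notes above):

```python
def grant_statement(user: str, password: str, client_ip: str,
                    database: str = "twittercapture") -> str:
    """Build the GRANT statement giving read-only remote access.

    client_ip may contain MySQL wildcards, e.g. '192.168.%' to allow
    connections from anywhere on the local network.
    """
    return (
        f"GRANT SELECT ON {database}.* "
        f"TO {user}@'{client_ip}' IDENTIFIED BY '{password}';"
    )

print(grant_statement("newuser", "newpassword", "192.168.%"))
```

Granting only `SELECT` keeps the remote user read-only, which is all 4CAT needs for dataset creation.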

The second step was even easier, since 4CAT already has a MySQL database class.

```python
from backend.lib.database_mysql import MySQLDatabase
import logging

db_name = 'twittercapture'          # TCAT stores all bins in this database
db_user = 'newuser'                 # the user created in step one
db_password = 'newpassword'
db_host = 'tcat4_local_network_ip'
db_port = 3306                      # MariaDB default

db = MySQLDatabase(logging, db_name, db_user, db_password, db_host, db_port)
```

We could provide some basic bin metadata by running a few queries and showing the results in `UserInput.OPTION_INFO` blocks: bin names, number of tweets, and perhaps phrases/users tracked. This proved pretty quick, but it might be nice to store this info someplace (metrics table?) and have a mini worker that queries it daily.
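The per-bin overview queries could look something like the sketch below. The `<binname>_tweets` table pattern and the `created_at` column follow TCAT's usual schema, but treat them as assumptions; note the bin name is interpolated into the SQL, so it must come from a trusted bin list, never raw user input.

```python
def bin_overview_queries(bin_name: str) -> dict:
    """Build SQL to summarise one TCAT bin: tweet count and time range.

    Assumes the TCAT convention of one <bin_name>_tweets table per bin
    with a created_at column. bin_name must be a known, trusted bin name
    (it is interpolated directly into the query).
    """
    tweets_table = f"{bin_name}_tweets"
    return {
        "tweet_count": f"SELECT COUNT(*) FROM {tweets_table}",
        "time_range": (
            f"SELECT MIN(created_at), MAX(created_at) FROM {tweets_table}"
        ),
    }

queries = bin_overview_queries("examplebin")
print(queries["tweet_count"])
```

A daily mini worker could run these for every bin and cache the results in a metrics table, so the frontend never queries TCAT directly while rendering options.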

I thought the benefit of a direct connection was access to the different tables. We could easily have 4CAT create datasets from any query (just save `fetch_all` into a CSV), but that would create very odd datasets by 4CAT standards. So if we prefer the direct database connection over the approach you've already created, we could recreate `map_item` from the twitterv2 datasource as a "basic usage" option and provide an "advanced usage" option that allows more robust querying of TCAT. Each bin consists of seven tables (`binname_hashtags`, `binname_media`, `binname_mentions`, `binname_place`, `binname_tweets`, `binname_urls`, `binname_withheld`), so queries can get quite complex. Some of those tables would be very useful for certain processors, but we'd then be creating processors specifically for this type of data source (which may or may not be the goal).
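For the "basic usage" option, the row-to-item mapping could be sketched as below. The input column names (`id`, `in_reply_to_status_id`, `created_at`, `from_user_name`, `text`) follow TCAT's v1 tweets-table schema, and the output keys are illustrative only, not the actual twitterv2 `map_item()` format:

```python
def map_tcat_row(row: dict) -> dict:
    """Map one row from a <binname>_tweets table to a flat dict, loosely
    mirroring what the twitterv2 datasource's map_item() yields so that
    existing processors stay compatible.

    Column names are TCAT's v1 schema; output keys are illustrative.
    """
    return {
        "id": row["id"],
        # TCAT stores 0 for non-replies; fall back to the tweet's own id
        "thread_id": row.get("in_reply_to_status_id") or row["id"],
        "timestamp": row["created_at"],
        "author": row["from_user_name"],
        "body": row["text"],
    }
```

Joining in the `binname_hashtags` or `binname_mentions` tables would enrich this further, at the cost of one extra query (or JOIN) per auxiliary table.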

dale-wahl avatar Apr 06 '22 12:04 dale-wahl

Pushed a dmi-tcatv2 datasource to the pull request mentioned earlier. The "basic" query produces a result based on the `map_item` of the twitterv2 datasource. The advanced query lets you run any query you like; it does not modify the query itself (meaning the other user-selection options are not relevant, aside from needing to select at least one bin corresponding with the relevant TCAT instance).

dale-wahl avatar Apr 07 '22 15:04 dale-wahl