Datasource that interfaces with a TCAT instance

Open stijn-uva opened this issue 2 years ago • 6 comments

It works, and arguably fixes #117, but:

  • The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?
  • The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.
  • The list of bins is loaded synchronously whenever get_options() is run. The result should probably be cached or updated in the background (with a separate worker...?)
  • The data format now follows that of twitterv2's map_item(), but there is quite a bit more data in the TCAT export that we could include.

stijn-uva, Jan 26 '22 15:01

To enable this, in config.py:

DATASOURCES = {
	"dmi-tcat": {
		"instances": ["http://tcat7.digitalmethods.net"]
	}
}

(for example)

stijn-uva, Jan 26 '22 15:01

* The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?

We could hide some of them under an "Advanced Options" section, since many are unlikely to be used frequently. I already ordered the fields with that in mind, but it would be better to collapse the "Advanced Options" section behind a button or something similar.

* The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.

I made some changes to what it displays (bin_name: num tweets from date to date), but I am not sure of the best way to organize it. We could break it out by TCAT instance, though that doesn't seem relevant to the users. Right now the bins are ordered by instance and then by bin name; I could at least order them by bin name alone so it is easier to find the one we want.
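A rough sketch of what ordering by bin name could look like. This is not the datasource's actual code: the dictionary keys (`instance`, `name`, `tweets`, `first`, `last`), the function name, and the `name@instance` option key are all made up for illustration; only the label format follows the one described above.

```python
def format_bin_options(bins):
    """Build an ordered {option_key: label} mapping, sorted by bin name.

    `bins` is a list of dicts; all key names here are hypothetical.
    """
    # Sort case-insensitively by bin name; the instance only breaks ties
    ordered = sorted(bins, key=lambda b: (b["name"].lower(), b["instance"]))

    options = {}
    for b in ordered:
        key = f'{b["name"]}@{b["instance"]}'
        options[key] = (
            f'{b["name"]}: {b["tweets"]} tweets '
            f'from {b["first"]} to {b["last"]}'
        )
    return options


# Made-up example data, not real TCAT bins
example_bins = [
    {"instance": "tcat-b.example", "name": "vaccines", "tweets": 1200,
     "first": "2021-01-01", "last": "2021-06-30"},
    {"instance": "tcat-a.example", "name": "Elections", "tweets": 98000,
     "first": "2020-09-01", "last": "2020-11-30"},
]

bin_options = format_bin_options(example_bins)
```

Since dicts keep insertion order, the form would list "Elections" before "vaccines" regardless of which instance each bin lives on.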

* The list of bins is loaded synchronously whenever `get_options()` is run. The result should probably be cached or updated in the background (with a separate worker...?)

From what I can tell, get_options() only runs when you select DMI-TCAT as the datasource type. I am not sure whether that is "too much" (with many TCAT instances it probably would be), but the information is dynamic (more so now that I added datetimes to the bins). I think you are proposing a worker that runs periodically and caches this data in a database or some sort of background dataset? That would follow easily if we set up the database to store options/settings.
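Short of a separate worker, a simple time-based cache would already avoid querying every instance on each call. A minimal sketch, assuming the bin list can be fetched by some callable; `BinListCache`, `fetch`, and the one-hour default are all hypothetical and not part of 4CAT:

```python
import time


class BinListCache:
    """Cache the result of an expensive fetch for a fixed number of seconds."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch          # callable that queries the TCAT instances
        self._ttl = ttl_seconds      # how long a fetched result stays valid
        self._clock = clock          # injectable so the cache can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = (
            self._fetched_at is None
            or now - self._fetched_at >= self._ttl
        )
        if expired:
            # Only hit the TCAT instances when the cached copy is stale
            self._value = self._fetch()
            self._fetched_at = now
        return self._value
```

get_options() could then call `cache.get()` instead of querying the instances directly. The trade-off is that bin counts and date ranges may be up to `ttl_seconds` out of date.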

* The data format now follows that of `twitterv2`'s `map_item()`, but there is quite a bit more data in the TCAT export that we could include.

Mapped the rest of the TCAT data to the output. There was one oddity: thread_id. Technically a tweet can have both a reply_id and a quote_id (since you can retweet a reply or reply to a retweet). I wasn't sure how to prioritize them, but ultimately either will lead you to the correct "thread". Ideally, we'd find the original tweet and use that as the thread_id.
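To make the trade-off concrete, here is a sketch of both approaches. The field names `reply_id` and `quote_id` are the ones mentioned above; the `id` key, both function names, the `parents` mapping, and the choice to prefer replies over quotes are assumptions made for illustration only.

```python
def pick_thread_id(tweet):
    """Pick a thread id from the first parent reference that is set.

    The reply-before-quote order is an arbitrary choice for this sketch.
    """
    for field in ("reply_id", "quote_id"):
        value = tweet.get(field)
        if value:
            return str(value)
    # No parent at all: the tweet starts its own thread
    return str(tweet["id"])


def find_thread_root(tweet_id, parents):
    """Follow parent links to the earliest known ancestor.

    `parents` maps a tweet id to its parent id (or None). The walk stops
    at a tweet without a parent, at a parent that is not in the mapping,
    or when a cycle is detected.
    """
    seen = set()
    current = tweet_id
    while current in parents and parents[current] and current not in seen:
        seen.add(current)
        current = parents[current]
    return current
```

`find_thread_root` only gets as far as the collected data allows: if the original tweet is not in the bin, the result is the id of the earliest ancestor that is referenced.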

dale-wahl, Feb 22 '22 16:02

Fixed dates as well as the AND/OR query in the dmi-tcatv2 datasource.

dale-wahl, Apr 12 '22 12:04

The dmi-tcatv2 datasource has been tested and I am happy with the results. The basic query should return the expected results in the same format as twitterv2 (and is robust enough to also return any additional TCAT data). It could possibly be improved to better utilize some of TCAT's other tables, but I am not sure there is much additional value for most users.

Additionally, there is the advanced query option. This allows a user to directly query any table in the specific TCAT instance/database. It requires knowledge of the TCAT database structure, which may not be readily available, but you could actually query for it if you like (e.g. SHOW TABLES).

dale-wahl, Apr 13 '22 10:04

Oh, one oddity that I was not sure how to resolve: in the collect_tcat_metadata class method, I needed a logger for the MySQLDatabase class we built. I could not figure out how to access the existing logger instance and ended up creating a new one. This currently has the unintended consequence of adding logger instances and writing duplicate log entries. Definitely needs a fix!
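For reference, this is how the standard library's `logging` module avoids that kind of duplication. It is only a sketch: 4CAT uses its own logger class, so this may not map onto it directly, and it assumes the duplicate entries come from a new handler being attached on every call. The logger name and function name are made up.

```python
import logging


def get_shared_logger(name="tcat-metadata"):
    """Return a named logger, attaching a handler only the first time."""
    # logging.getLogger() returns the same object for the same name
    logger = logging.getLogger(name)

    # Without this guard, every call would add another handler and each
    # message would be written once per attached handler
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

The cleaner fix is probably still to pass the existing logger instance into the class method rather than creating one there.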

dale-wahl, Apr 13 '22 10:04

Updated the TCAT datasources to work with newer 4CAT changes (e.g. the config database).

dale-wahl, Jul 26 '22 14:07