
Function "Merge post texts" in 4CAT should merge full text bodies, not the shortened version?

Open hash00x1 opened this issue 3 years ago • 4 comments

Dear Stijn and Sal,

Just a quick question: I was looking through the 4CAT code for the v2 Twitter API access. It's great work, thanks for that! I was wondering whether, for the "Merge post texts" function in the 4CAT web interface, it would be more helpful for actual text analysis if the script merged the full text of each tweet. As of now, for retweets it only merges the shortened text field that Twitter provides under the JSON key tweet['text']. This is also where the map_item function in search_twitter.py takes the value for the field "body", which in turn seems to be used by the stringify.py script.

However, the full text for retweets is actually stored under the JSON key tweet['referenced_tweets']['text']. Is it on purpose that the stringify script currently pulls the shortened text?

If so, could I request a function (or a text field in the CSV conversion) that also provides access to the full text?

Something like:

return {
            "id": tweet["id"],
            "thread_id": tweet.get("conversation_id", tweet["id"]),
            "timestamp": tweet["created_at"].replace("T", " ").replace(".000Z", ""),
            "unix_timestamp": int(datetime.datetime.strptime(tweet["created_at"], "%Y-%m-%dT%H:%M:%S.000Z").timestamp()),
            "subject": "",
            "body": tweet["text"],
            "body_full": tweet["referenced_tweets"]["text"],  # dummy code: the full-text value is nested in an array
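Since the full-text value is nested in an array, the dummy line cannot index it directly. A minimal sketch of the lookup, assuming each entry of tweet["referenced_tweets"] is a dict carrying a "type" and a "text" key as described in this post (the helper name get_full_text is hypothetical and not part of 4CAT):

```python
def get_full_text(tweet: dict) -> str:
    """Return the retweeted tweet's full text if available, else the tweet's own text."""
    # Assumption: referenced tweets are stored as a list of dicts that
    # include the referenced tweet's text, e.g. {"type": "retweeted", "text": "..."}.
    for ref in tweet.get("referenced_tweets") or []:
        if ref.get("type") == "retweeted" and "text" in ref:
            return ref["text"]
    # Not a retweet (or no text stored for it): fall back to the tweet itself.
    return tweet["text"]


# Made-up example data, for illustration only.
retweet = {
    "id": "2",
    "text": "RT @userB: a long tweet that gets cut o…",
    "referenced_tweets": [
        {"type": "retweeted", "id": "1", "text": "a long tweet that gets cut off in the retweet"}
    ],
}
original = {"id": "1", "text": "a long tweet that gets cut off in the retweet"}

full_retweet_text = get_full_text(retweet)
full_original_text = get_full_text(original)
```

Quoted tweets and replies are deliberately left alone here, because their own text field is not a shortened copy of the referenced tweet.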

Best, Lukas

hash00x1 • Mar 19 '21 18:03

Thanks for noticing this! This is indeed an issue, though it is a bit tricky to fix robustly. I see four options:

  1. Replace retweeted tweets with the original tweets. Disadvantage: datasets could contain tweets not matching the query, e.g. if the dataset is based on a query from:userA and userA has retweeted a tweet by userB.
  2. Replace body column for retweeted tweets with the retweeted tweet's text. Disadvantage: It would seem like the user that retweeted wrote the retweeted tweet, which is not the case.
  3. Make the column used by the 'stringify' processor a configurable parameter, and add retweet text in a separate column. Disadvantage: many other processors use the body column, so it would be confusing if only this one could be configured. Do we want to make this a parameter for all processors?
  4. Always use the body_full column instead of body, if it exists. Disadvantage: this is hard to make clear to the user via the interface, and could be confusing if they expect the body column to be used as happens everywhere else.
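To make options 3 and 4 a bit more concrete, here is a rough sketch of what column selection could look like. It is an illustration only, not the stringify processor's actual code: the function name select_text, the prefer_full flag and the sample rows are all hypothetical, and body_full is the column name proposed earlier in this thread.

```python
def select_text(row: dict, column: str = "body", prefer_full: bool = False) -> str:
    """Pick the text to merge for one post.

    column      -- option 3: the column to read is a parameter
    prefer_full -- option 4: use body_full whenever it exists and is non-empty
    """
    if prefer_full and row.get("body_full"):
        return row["body_full"]
    return row.get(column, "")


# Made-up rows: one retweet with a full-text column, one ordinary tweet.
rows = [
    {"body": "RT @userB: a long tweet that gets cut o…",
     "body_full": "a long tweet that gets cut off in the retweet"},
    {"body": "an ordinary tweet"},
]

merged_default = "\n".join(select_text(row) for row in rows)
merged_full = "\n".join(select_text(row, prefer_full=True) for row in rows)
```

With the defaults the behaviour is unchanged, which is the main argument for making the full text opt-in rather than silent.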

As you rightly point out, this is pretty trivial to fix on a technical level, but I have to think a bit about what the most transparent and robust solution is. To be continued...

stijn-uva • Mar 24 '21 14:03

Hey Stijn, just a quick thought: how about adding an extra column for body_full, as I suggested in the sample code above, plus an additional function that checks whether a text is a retweet? If it's a retweet, it adds an "RE:" prefix in front. Best wishes

hash00x1 • Apr 03 '21 13:04

Hi @LukasHyde, in the commit referenced above I have at least made it expand the body to include the full retweet, as RT @username: [full tweet] (instead of just an excerpt, as before).
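Roughly, the expansion amounts to something like the sketch below. This is a simplified illustration rather than the code from the commit; the function name expand_retweet_body and its parameters are hypothetical.

```python
from typing import Optional


def expand_retweet_body(body: str,
                        retweeted_username: Optional[str],
                        retweeted_text: Optional[str]) -> str:
    """Return 'RT @username: [full tweet]' for retweets, the original body otherwise."""
    if retweeted_username and retweeted_text:
        # Rebuild the body from the full retweeted text instead of the excerpt.
        return f"RT @{retweeted_username}: {retweeted_text}"
    return body


# Made-up example values.
expanded = expand_retweet_body(
    "RT @userB: a long tweet that gets cut o…",
    "userB",
    "a long tweet that gets cut off in the retweet",
)
unchanged = expand_retweet_body("an ordinary tweet", None, None)
```

Keeping the RT @username: prefix means the retweeting user is not presented as the author of the text, which was the concern with option 2.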

Adding a body_retweeted column would be possible, but I wonder what your use case for this would be, since it would not be picked up by processors (currently). Is the goal here to process it further offline?

stijn-uva • Apr 21 '21 15:04

Hi @stijn-uva, as you already guessed, my use case would have been the subsequent offline processing of full-text tweets. If I remember correctly, the main reasons were to improve the legibility of the dataset (in large datasets it would take a long time to find the original tweet) and to include the raw data, for example for NLP applications.

hash00x1 • May 17 '21 13:05