twitter-archive-parser
Feature request: Parse DMs, add user names and handles
The current Twitter archive download omits all user names and handles. It only contains the IDs of the accounts that someone interacted with, so the archive loses context, especially for DMs and replies.
I see lots of usernames and handles in both the twitter archive and the markdown produced by this script. Can you give an example of where data is missing?
The twitter archive does lose a lot of context - it only contains your tweets and replies, not entire threads.
This script doesn't currently attempt to parse the DMs.
Same here, I would love to see handles/usernames in the DM section.
I'm struggling to understand what's being asked here.
If this is a bug report: Please be precise about what the script did and what you were expecting it to do instead. It sounds like you are talking about missing data in Twitter's archive? If so then that's a bug for Twitter I would have thought?
If this is a feature request: Please give more details on what you would like the script to do. Currently it doesn't do anything with DMs.
I can only speak for myself, but from my POV this is a feature request. The JSON for the direct messages has only an ID; it would be great if the handle and name (and maybe even the picture) could be resolved and saved. Maybe this isn't in scope and a dedicated script would be better, idk.
OK. I have changed the title to reflect my understanding. I don't have any immediate plans to address this but maybe someone else would want to take a look. There are many other twitter archive parsers out there that may well already do this.
Great :) Any recommendations for one that can do this? I searched but couldn't find any :(
@duracell Here are some tools that I found on github or people have pointed me at:
- https://github.com/selfawaresoup/twitter-tools
- https://github.com/roobottom/twitter-archive-to-markdown-files
- https://gist.github.com/divyajyotiuk/9fb29c046e1dfcc8d5683684d7068efe#file-get_twitter_bookmarks_v3-py
- https://archive.alt-text.org/
- https://observablehq.com/@enjalot/twitter-archive-tweets
- https://github.com/woluxwolu/twint
- https://github.com/jarulsamy/Twitter-Archive
- https://sk22.github.io/twitter-archive-browser/
- https://pypi.org/project/pleroma-bot/
- https://github.com/mshea/Parse-Twitter-Archive
- https://github.com/dangoldin/twitter-archive-analysis
I made a tool to turn those archive IDs into name, bio, and real URL: https://gist.github.com/n1ckfg/df70c6fa1dabac4fe55cb551364adcc5
I made a script to parse user IDs and map them to handles. It is different from the scripts linked above in that it doesn't need login or access to Twitter's API, because it uses the TweeterID web service to look up the handles. It also finds some of the handles in the archive itself (looking in mentions and retweets). Sometimes it also finds display names and links, but it can't look up the bio or profile picture yet.
Currently, it just writes the mappings into a JSON file, but you might already want to use it anyway, in case Twitter goes down even faster than expected...
The script is available in the userids branch in my fork of this project: https://github.com/flauschzelle/twitter-archive-parser/tree/userids
@lenaschimmel and I are working on integrating it into the main parser script and will probably make a pull request to the main project here later. But integrating it properly might take a few days, so if you're in a hurry, feel free to use my version in the meantime :)
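For readers who just want the gist of the in-archive part: this is not the code from the userids branch, only a minimal sketch under the assumption that the archive's tweet objects carry an `entities.user_mentions` list with `id_str` and `screen_name` fields (the function and file names are illustrative).

```python
import json

def collect_handles_from_mentions(tweets):
    """Harvest user id -> handle pairs from the mention metadata that the
    archive's tweet objects already contain (no login or API access needed)."""
    users = {}
    for tweet in tweets:
        for mention in tweet.get('entities', {}).get('user_mentions', []):
            users[mention['id_str']] = mention['screen_name']
    return users

# Example: dump the mapping to a JSON file
# with open('parsed_users.json', 'w', encoding='utf-8') as f:
#     json.dump(collect_handles_from_mentions(tweets), f, indent=2)
```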
@flauschzelle Thanks for looking into this. I was just looking at the JSON for this myself:
```python
# Collect id -> handle pairs from reply metadata while iterating over tweets
if 'in_reply_to_user_id' in tweet and 'in_reply_to_screen_name' in tweet:
    user_id_to_handle[tweet['in_reply_to_user_id']] = tweet['in_reply_to_screen_name']
```
For my archive this gives me 234 handles and is enough to make a start on parsing DMs and followers/followings.
Maybe we should get that basic functionality working and then add the lookup feature afterwards?
I'm trying to understand what you are currently doing and if / how much it overlaps with what @flauschzelle and I have already done / are about to do...
So this is already done now by @flauschzelle:
- Collect known and missing names from local archive data (tweets, mentions, DMs, group DMs, followers and followings)
- Load known and missing names from previous runs (`/data/parsed_users.json`)
- Check if anything is actually missing
- Make a list of user IDs to look up
- Look them up (with TweeterID)
- Write results to the file (a rough sketch of this flow follows below)
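For context, here is a rough sketch of that flow (not the actual branch code); the remote lookup, e.g. against the TweeterID web service, is left as a placeholder callable because its request format isn't shown in this thread, and the file and function names are illustrative:

```python
import json
import os

def resolve_user_ids(known_ids, archive_users, cache_path='parsed_users.json', lookup=None):
    """Merge cached and in-archive id -> handle mappings, look up only the
    ids that are still missing, then persist the result to cache_path.
    `lookup` stands in for the remote id -> handle lookup (e.g. via the
    TweeterID web service) and is not implemented here."""
    users = {}
    if os.path.exists(cache_path):
        # Load known names from previous runs
        with open(cache_path, encoding='utf-8') as f:
            users = json.load(f)
    # Add whatever the archive data itself already told us
    users.update(archive_users)
    # Check what is actually missing and make a list of ids to look up
    missing = [uid for uid in known_ids if uid not in users]
    if lookup is not None:
        for uid in missing:
            handle = lookup(uid)
            if handle:
                users[uid] = handle
    # Write results back to the file
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(users, f, indent=2)
    return users
```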
Currently working on:
- Integrating your `parser.py` and @flauschzelle's `user_id_parser.py` (currently in my fork here, though we are not sure if future work will happen primarily in my fork or @flauschzelle's fork)
Things I/we still plan to do:
- unify coding style
- unify style of log / user output
- unify the approach for retries of failed requests (one possible shape is sketched after this list)
- merge code into a single file so that the simple setup guide still works (Right-click this link parser.py and select "Save Link as"...)
- add the approach by @n1ckfg as additional option, since it has both advantages and disadvantages
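On the retries item above, one possible unified shape (purely illustrative, not the project's actual retry code) could look like this:

```python
import time

def with_retries(request_fn, max_attempts=3, backoff_seconds=2.0):
    """Call request_fn until it succeeds or the attempt budget runs out,
    sleeping a little longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as error:  # narrow this to the real request errors
            if attempt == max_attempts:
                raise
            print(f'Request failed ({error}), retrying in {backoff_seconds:.0f}s...')
            time.sleep(backoff_seconds)
            backoff_seconds *= 2
```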
Things that seem useful, but that I didn't really look into:
- use the resolved data to put it into some human-readable output of DMs
- also use it for followers and followings, which would solve #70
@lenaschimmel Yes, there was some overlap. The branch looks good. To avoid calamity let's tackle it in small PRs:
- existing `convert_tweet()` function appends whatever id:handle connections it finds to a data structure of Users
- new functionality to parse DMs, using Users, output in some simple way (a rough sketch follows this list)
- new functionality to parse followers/followings, using Users, output in some simple way
- new functionality to do remote lookup on Users, to improve the output of the above
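For the DM step, here is a rough sketch of the "output in some simple way" idea; `users` is assumed to map user id strings to handles, and the `dmConversation`/`messageCreate` layout of the archive's `direct-messages.js` is an assumption that may differ between archive versions:

```python
def render_dms(dm_conversations, users):
    """Turn DM conversations into simple text lines, replacing sender ids
    with handles wherever the Users mapping (id string -> handle) knows them.
    Assumes each conversation looks like the archive's direct-messages.js
    entries: a 'dmConversation' with a 'messages' list whose items wrap a
    'messageCreate' carrying senderId/text/createdAt."""
    lines = []
    for conversation in dm_conversations:
        for message in conversation.get('dmConversation', {}).get('messages', []):
            create = message.get('messageCreate')
            if not create:
                continue
            sender = users.get(create['senderId'], create['senderId'])
            lines.append(f"{create['createdAt']} @{sender}: {create['text']}")
    return '\n'.join(lines)
```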
I've just made a note about getting full user data from the API (without a key!) on the followers issue: https://github.com/timhutton/twitter-archive-parser/issues/70#issuecomment-1320721365
Updated roadmap with current progress:
- existing `convert_tweet()` function appends whatever id:handle connections it finds to a data structure of Users:
  - In progress: #76
- new functionality to parse DMs, using Users, output in some simple way:
  - In progress: #78
- new functionality to parse followers/followings, using Users, output in some simple way:
  - In progress: #77
- new functionality to do remote lookup on Users, to improve the output of the above:
  - Issue created for discussion of proposals: #79
- improve DMs by adding images, expanding links, etc.:
  - Issue created: #80
Any way to add code to pull deleted DMs from your own personal account?