twitter-archive-parser
Feature request: Parse DMs, add user names and handles
The current Twitter archive download omits all user names and handles. It only contains the IDs of the accounts that someone interacted with, so the archive loses context, especially for DMs and replies.
I see lots of usernames and handles in both the twitter archive and the markdown produced by this script. Can you give an example of where data is missing?
The twitter archive does lose a lot of context - it only contains your tweets and replies, not entire threads.
This script doesn't currently attempt to parse the DMs.
Same here, I would love to see handles/usernames in the DM section.
I'm struggling to understand what's being asked here.
If this is a bug report: Please be precise about what the script did and what you were expecting it to do instead. It sounds like you are talking about missing data in Twitter's archive? If so then that's a bug for Twitter I would have thought?
If this is a feature request: Please give more details on what you would like the script to do. Currently it doesn't do anything with DMs.
I can only speak for myself, but from my POV this is a feature request. The JSON for the direct messages has only an ID; it would be great if the handle and name (and maybe even the picture) could be resolved and saved. Maybe this isn't in scope and a dedicated script would be better, idk.
OK. I have changed the title to reflect my understanding. I don't have any immediate plans to address this but maybe someone else would want to take a look. There are many other twitter archive parsers out there that may well already do this.
Great :) Any recommendations for one that can do this? I searched but couldn't find any :(
@duracell Here are some tools that I found on github or people have pointed me at:
- https://github.com/selfawaresoup/twitter-tools
- https://github.com/roobottom/twitter-archive-to-markdown-files
- https://gist.github.com/divyajyotiuk/9fb29c046e1dfcc8d5683684d7068efe#file-get_twitter_bookmarks_v3-py
- https://archive.alt-text.org/
- https://observablehq.com/@enjalot/twitter-archive-tweets
- https://github.com/woluxwolu/twint
- https://github.com/jarulsamy/Twitter-Archive
- https://sk22.github.io/twitter-archive-browser/
- https://pypi.org/project/pleroma-bot/
- https://github.com/mshea/Parse-Twitter-Archive
- https://github.com/dangoldin/twitter-archive-analysis
I made a tool to turn those archive IDs into name, bio, and real URL: https://gist.github.com/n1ckfg/df70c6fa1dabac4fe55cb551364adcc5
I made a script to parse user IDs and map them to handles. It is different from the scripts linked above in that it doesn't need login or access to Twitter's API, because it uses the TweeterID web service to look up the handles. It also finds some of the handles in the archive itself (looking in mentions and retweets). Sometimes it also finds display names and links, but it can't look up the bio or profile picture yet.
Currently, it just writes the mappings into a JSON file, but you might already want to use it anyway, in case Twitter goes down even faster than expected...
The script is available in the userids branch in my fork of this project: https://github.com/flauschzelle/twitter-archive-parser/tree/userids
@lenaschimmel and I are working on integrating it into the main parser script and will probably make a pull request to the main project here later. But integrating it properly might take a few days, so if you're in a hurry, feel free to use my version in the meantime :)
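For readers who just want the gist of the in-archive part: this is not the code from the userids branch, only a minimal sketch under the assumption that the archive's tweet objects carry an `entities.user_mentions` list with `id_str` and `screen_name` fields (the function and file names are illustrative).

```python
import json

def collect_handles_from_mentions(tweets):
    """Harvest user id -> handle pairs from the mention metadata that the
    archive's tweet objects already contain (no login or API access needed)."""
    users = {}
    for tweet in tweets:
        for mention in tweet.get('entities', {}).get('user_mentions', []):
            users[mention['id_str']] = mention['screen_name']
    return users

# Example: dump the mapping to a JSON file
# with open('parsed_users.json', 'w', encoding='utf-8') as f:
#     json.dump(collect_handles_from_mentions(tweets), f, indent=2)
```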
@flauschzelle Thanks for looking into this. I was just looking at the JSON for this myself:
```python
# Collect id -> handle pairs from reply metadata while iterating over tweets
if 'in_reply_to_user_id' in tweet and 'in_reply_to_screen_name' in tweet:
    user_id_to_handle[tweet['in_reply_to_user_id']] = tweet['in_reply_to_screen_name']
```
For my archive this gives me 234 handles and is enough to make a start on parsing DMs and followers/followings.
Maybe we should get that basic functionality working and then add the lookup feature afterwards?
I'm trying to understand what you are currently doing and if / how much it overlaps with what @flauschzelle and I have already done / are about to do...
So this is already done now by @flauschzelle:
- Collect known and missing names from local archive data (tweets, mentions, DMs, group DMs, followers and followings)
- Load known and missing names from previous runs (`/data/parsed_users.json`)
- Check if anything is actually missing
- Make a list of user IDs to look up
- Look them up (with TweeterID)
- Write results to the file (a rough sketch of this flow follows below)
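For context, here is a rough sketch of that flow (not the actual branch code); the remote lookup, e.g. against the TweeterID web service, is left as a placeholder callable because its request format isn't shown in this thread, and the file and function names are illustrative:

```python
import json
import os

def resolve_user_ids(known_ids, archive_users, cache_path='parsed_users.json', lookup=None):
    """Merge cached and in-archive id -> handle mappings, look up only the
    ids that are still missing, then persist the result to cache_path.
    `lookup` stands in for the remote id -> handle lookup (e.g. via the
    TweeterID web service) and is not implemented here."""
    users = {}
    if os.path.exists(cache_path):
        # Load known names from previous runs
        with open(cache_path, encoding='utf-8') as f:
            users = json.load(f)
    # Add whatever the archive data itself already told us
    users.update(archive_users)
    # Check what is actually missing and make a list of ids to look up
    missing = [uid for uid in known_ids if uid not in users]
    if lookup is not None:
        for uid in missing:
            handle = lookup(uid)
            if handle:
                users[uid] = handle
    # Write results back to the file
    with open(cache_path, 'w', encoding='utf-8') as f:
        json.dump(users, f, indent=2)
    return users
```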
Currently working on:
- Integrating your `parser.py` and @flauschzelle's `user_id_parser.py` (currently in my fork here, though we are not sure if future work will happen primarily in my fork or @flauschzelle's fork)
Things I/we still plan to do:
- unify coding style
- unify style of log / user output
- unify the approach for retries of failed requests (one possible shape is sketched after this list)
- merge code into a single file so that the simple setup guide still works (Right-click this link parser.py and select "Save Link as"...)
- add the approach by @n1ckfg as additional option, since it has both advantages and disadvantages
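On the retries item above, one possible unified shape (purely illustrative, not the project's actual retry code) could look like this:

```python
import time

def with_retries(request_fn, max_attempts=3, backoff_seconds=2.0):
    """Call request_fn until it succeeds or the attempt budget runs out,
    sleeping a little longer after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except Exception as error:  # narrow this to the real request errors
            if attempt == max_attempts:
                raise
            print(f'Request failed ({error}), retrying in {backoff_seconds:.0f}s...')
            time.sleep(backoff_seconds)
            backoff_seconds *= 2
```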
Things that seem useful, but that I didn't really look into:
- use the resolved data to put it into some human-readable output of DMs
- also use it for followers and followings, which would solve #70
@lenaschimmel Yes, there was some overlap. The branch looks good. To avoid calamity let's tackle it in small PRs:
- existing `convert_tweet()` function appends whatever id:handle connections it finds to a data structure of Users
- new functionality to parse DMs, using Users, output in some simple way (a rough sketch follows this list)
- new functionality to parse followers/followings, using Users, output in some simple way
- new functionality to do remote lookup on Users, to improve the output of the above
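For the DM step, here is a rough sketch of the "output in some simple way" idea; `users` is assumed to map user id strings to handles, and the `dmConversation`/`messageCreate` layout of the archive's `direct-messages.js` is an assumption that may differ between archive versions:

```python
def render_dms(dm_conversations, users):
    """Turn DM conversations into simple text lines, replacing sender ids
    with handles wherever the Users mapping (id string -> handle) knows them.
    Assumes each conversation looks like the archive's direct-messages.js
    entries: a 'dmConversation' with a 'messages' list whose items wrap a
    'messageCreate' carrying senderId/text/createdAt."""
    lines = []
    for conversation in dm_conversations:
        for message in conversation.get('dmConversation', {}).get('messages', []):
            create = message.get('messageCreate')
            if not create:
                continue
            sender = users.get(create['senderId'], create['senderId'])
            lines.append(f"{create['createdAt']} @{sender}: {create['text']}")
    return '\n'.join(lines)
```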
I've just made a note about getting full user data from the API (without a key!) on the followers issue: https://github.com/timhutton/twitter-archive-parser/issues/70#issuecomment-1320721365
Updated roadmap with current progress:
- existing `convert_tweet()` function appends whatever id:handle connections it finds to a data structure of Users:
  - In progress: #76
- new functionality to parse DMs, using Users, output in some simple way:
  - In progress: #78
- new functionality to parse followers/followings, using Users, output in some simple way:
  - In progress: #77
- new functionality to do remote lookup on Users, to improve the output of the above:
  - Issue created for discussion of proposals: #79
- improve DMs by adding images, expanding links, etc.:
  - Issue created: #80
Any way to add code to pull deleted DMs from your own personal account?