rtweet
Reduce tweet fields returned by default
Currently, search_tweets() and friends return a data frame with 73 columns:
[1] "status_id" "created_at" "user_id"
[4] "screen_name" "text" "source"
[7] "display_text_width" "reply_to_status_id" "reply_to_user_id"
[10] "reply_to_screen_name" "is_quote" "is_retweet"
[13] "favorite_count" "retweet_count" "quote_count"
[16] "reply_count" "hashtags" "symbols"
[19] "urls_url" "urls_t.co" "urls_expanded_url"
[22] "media_url" "media_t.co" "media_expanded_url"
[25] "media_type" "ext_media_url" "ext_media_t.co"
[28] "ext_media_expanded_url" "ext_media_type" "ext_alt_text"
[31] "mentions_user_id" "mentions_screen_name" "lang"
[34] "quoted_status_id" "quoted_text" "quoted_created_at"
[37] "quoted_source" "quoted_favorite_count" "quoted_retweet_count"
[40] "quoted_user_id" "quoted_screen_name" "quoted_name"
[43] "quoted_followers_count" "quoted_friends_count" "quoted_statuses_count"
[46] "quoted_location" "quoted_description" "quoted_verified"
[49] "retweet_status_id" "retweet_text" "retweet_created_at"
[52] "retweet_source" "retweet_favorite_count" "retweet_retweet_count"
[55] "retweet_user_id" "retweet_screen_name" "retweet_name"
[58] "retweet_followers_count" "retweet_friends_count" "retweet_statuses_count"
[61] "retweet_location" "retweet_description" "retweet_verified"
[64] "place_url" "place_name" "place_full_name"
[67] "place_type" "country" "country_code"
[70] "geo_coords" "coords_coords" "bbox_coords"
[73] "status_url"
I'd suggest that we return fewer, more complex columns by default, and instead provide some helpers to access their contents when needed. For example, we could keep media_*, quoted_* and retweet_* in media, quoted and retweet columns and provide helpers to expand them out when needed.
IOW, I'm suggesting that the data frame unpacking that currently occurs in tweets_to_tbl_() should be postponed until the user requests it.
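The proposal above can be illustrated with a minimal base R sketch. It is not rtweet's implementation: the toy data, the field names inside the quoted column, and the helper expand_nested() are all hypothetical, chosen only to show a nested column being kept by default and expanded on request.

```r
# Toy stand-in for a tweets data frame; not real rtweet output.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("hello", "RT world"),
  stringsAsFactors = FALSE
)

# Keep the quoted-tweet details in one list-column instead of many quoted_* columns.
tweets$quoted <- list(
  list(status_id = "10", text = "original", screen_name = "alice"),
  NULL  # the second tweet quotes nothing
)

# Hypothetical helper: expand one nested column into prefixed flat columns on request.
expand_nested <- function(df, col) {
  # Collect every field name that appears in any row of the nested column.
  fields <- unique(unlist(lapply(df[[col]], names)))
  for (f in fields) {
    df[[paste0(col, "_", f)]] <- vapply(
      df[[col]],
      function(x) if (is.null(x[[f]])) NA_character_ else as.character(x[[f]]),
      character(1)
    )
  }
  df[[col]] <- NULL  # drop the nested column once it has been expanded
  df
}

wide <- expand_nested(tweets, "quoted")
names(wide)
```

With this arrangement the default result stays narrow, and only users who need the quoted-tweet fields pay the cost of the extra columns.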
Is it intentional/temporary that since #572 get_timeline() and lists_statuses() do not return things like the screen_name of the sender anymore?
This is an error I didn't detect when I merged the PR. Thanks for asking, @simonheb! The user data is processed incorrectly and lost later on (but it should be available via user(search_tweets("bla"))).
Sorry @simonheb, I looked into the issue further and it turned out I had used incorrect code: user is internal to rtweet and shouldn't be used externally.
The correct function to retrieve the screen name is users_data, which returns all the information the API provides about the user. This is what you should use, and it is also documented in the search_tweets example.
Ok thanks.
But this is just a quick fix, no? In the long run lists_statuses, etc. should also return user data, no?
They already return this data, but it is stored in an attribute. It is not a quick fix; I did not have to change anything between the comments for this to work. I agree with Hadley that having 70 columns was not practical.
At the moment I don't plan to change the columns or how the information is returned anytime soon. Perhaps it needs its own class and/or printing method. I might add some functions to access some of the nested lists within the object, but I am not going back to one big 73-column-wide data.frame. I understand that this is one more breaking change, but I think that in general it makes working with the output easier.
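The "data stored in an attribute" arrangement described above can be sketched in base R. This is an illustration under assumptions, not rtweet's code: the attribute name "users", the toy columns, and the accessor get_users() are hypothetical; in the package itself the accessor named in this thread is users_data().

```r
# Toy objects; the real structure returned by rtweet may differ.
tweets <- data.frame(
  id = c("1", "2"),
  text = c("hello", "world"),
  stringsAsFactors = FALSE
)
users <- data.frame(
  id = c("u1", "u2"),
  screen_name = c("alice", "bob"),
  stringsAsFactors = FALSE
)

# Attach the user data as an attribute (the attribute name "users" is an assumption).
attr(tweets, "users") <- users

# Hypothetical accessor, playing the role that users_data() plays in the thread.
get_users <- function(x) attr(x, "users", exact = TRUE)

# Assuming one user row per tweet row, the sender can be put back next to the text.
with_sender <- cbind(tweets, screen_name = get_users(tweets)$screen_name)
names(with_sender)
```

The tweet table itself stays narrow, while the user information travels with it and can be joined back only when an analysis needs it.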
These changes look great! It's a good idea [and more sustainable] to more closely mirror actual API data structures. I'm excited to see [and hopefully contribute again to] future changes in the pkg as well! Thank you @llrs and @hadley for all the hard but excellent work!
Hi @mkearney, many thanks for your encouraging words. Sorry for the surprise when you installed the development version of the package and it broke your scripts. That was not my intention when I offered to help maintain the package.
I am aware that changing the column names will break scripts and other packages; that's one of the reasons why it will take some more time before I think about sending the package to CRAN. There may be one more breaking change before then, as we were considering renaming the functions. Additionally, there are still some bugs we have introduced that I would like to fix, and I want to make it easier to transition from version 0.7.0 to this one. One way might be to add some helpers to extract columns and rename them. All feedback is welcome, especially if we made something harder or you have other suggestions to improve the package.
While we've tried to mirror the actual API data structure more closely, one of the main reasons for the package's success is the flattening of the data that it does. There is still some work to do on that front, as some of the columns now have a nested structure (and save_as_csv and write_as_csv have not been adapted yet), but flattening is important for analysis and we'll keep it.
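To illustrate why nested columns need adapting before export, here is a minimal base R sketch that collapses list-columns into plain strings before writing a CSV. It is not how save_as_csv or write_as_csv work; the helper flatten_list_cols() and the toy hashtags column are hypothetical.

```r
# Toy data frame with one nested (list) column; not real rtweet output.
tweets <- data.frame(id = c("1", "2"), stringsAsFactors = FALSE)
tweets$hashtags <- list(c("rstats", "rtweet"), character(0))

# Hypothetical helper: collapse every list-column into a single string per row.
flatten_list_cols <- function(df, sep = " ") {
  is_list <- vapply(df, is.list, logical(1))
  df[is_list] <- lapply(df[is_list], function(col) {
    vapply(col, function(x) {
      # Empty entries become NA; everything else is pasted into one string.
      if (length(x) == 0) NA_character_ else paste(unlist(x), collapse = sep)
    }, character(1))
  })
  df
}

flat <- flatten_list_cols(tweets)

# The flattened copy contains only atomic columns, so base write.csv() can handle it.
path <- tempfile(fileext = ".csv")
write.csv(flat, path, row.names = FALSE)
```

Collapsing to a separator-joined string loses the nesting, so a real implementation would have to choose a separator that cannot occur in the data or a format that preserves structure.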
You don't need to apologize, @llrs! You've been doing all of the [great] work here, so the last thing you should worry about is my old scripts working. It seems you've been very thoughtful about everything, so in the meantime, I'll try to give this (how to ease the transition for users) some thought and see if I can't help make this happen!
@mkearney I've sent you an email to the address listed in the DESCRIPTION, but I'm not sure whether you still receive mail there... In case you no longer do, I wanted you to know that some users recently asked for improvements to rtweet and suggested holding a day-long rtweet hackathon to squash bugs and build in support for Twitter's new API v2.
I suggested holding the hackathon on the 27th of November or the 4th of December. See this thread on the package-maintenance channel of the rOpenSci Slack. What do you think? Would you like to join the conversation there?