Reduce tweet fields returned by default

Open hadley opened this issue 3 years ago • 10 comments

Currently, search_tweets() and friends return a data frame with 73 columns:

 [1] "status_id"               "created_at"              "user_id"                
 [4] "screen_name"             "text"                    "source"                 
 [7] "display_text_width"      "reply_to_status_id"      "reply_to_user_id"       
[10] "reply_to_screen_name"    "is_quote"                "is_retweet"             
[13] "favorite_count"          "retweet_count"           "quote_count"            
[16] "reply_count"             "hashtags"                "symbols"                
[19] "urls_url"                "urls_t.co"               "urls_expanded_url"      
[22] "media_url"               "media_t.co"              "media_expanded_url"     
[25] "media_type"              "ext_media_url"           "ext_media_t.co"         
[28] "ext_media_expanded_url"  "ext_media_type"          "ext_alt_text"           
[31] "mentions_user_id"        "mentions_screen_name"    "lang"                   
[34] "quoted_status_id"        "quoted_text"             "quoted_created_at"      
[37] "quoted_source"           "quoted_favorite_count"   "quoted_retweet_count"   
[40] "quoted_user_id"          "quoted_screen_name"      "quoted_name"            
[43] "quoted_followers_count"  "quoted_friends_count"    "quoted_statuses_count"  
[46] "quoted_location"         "quoted_description"      "quoted_verified"        
[49] "retweet_status_id"       "retweet_text"            "retweet_created_at"     
[52] "retweet_source"          "retweet_favorite_count"  "retweet_retweet_count"  
[55] "retweet_user_id"         "retweet_screen_name"     "retweet_name"           
[58] "retweet_followers_count" "retweet_friends_count"   "retweet_statuses_count" 
[61] "retweet_location"        "retweet_description"     "retweet_verified"       
[64] "place_url"               "place_name"              "place_full_name"        
[67] "place_type"              "country"                 "country_code"           
[70] "geo_coords"              "coords_coords"           "bbox_coords"            
[73] "status_url"  

I'd suggest that we return fewer, more complex columns by default, and instead provide some helpers to access their contents when needed. For example, we could keep media_*, quoted_* and retweet_* in media, quoted and retweet columns and provide helpers to expand them out when needed.
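As a rough illustration of the idea (a minimal base R sketch, not rtweet code): the quoted_* fields live in a single list-column, and a helper flattens them only on request. The toy data, the field names inside the list-column, and the helper name expand_column() are all assumptions made up for this example.

```r
# Toy data standing in for a tweets data frame; all values are made up.
tweets <- data.frame(
  status_id = c("1", "2"),
  text = c("first tweet", "second tweet"),
  stringsAsFactors = FALSE
)

# One list-column instead of many quoted_* columns (NULL = nothing quoted).
tweets$quoted <- list(
  list(status_id = "10", text = "a quoted tweet", screen_name = "someone"),
  NULL
)

# Hypothetical helper (not part of rtweet): unpack a list-column into
# flat, prefixed character columns and drop the original list-column.
expand_column <- function(df, col) {
  fields <- unique(unlist(lapply(df[[col]], names)))
  for (field in fields) {
    df[[paste0(col, "_", field)]] <- vapply(
      df[[col]],
      function(x) {
        if (is.null(x[[field]])) NA_character_ else as.character(x[[field]])
      },
      character(1)
    )
  }
  df[[col]] <- NULL
  df
}

flat <- expand_column(tweets, "quoted")
names(flat)
```

The default result stays at three columns, and the flat quoted_* columns only appear once expand_column() is called.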

hadley avatar Apr 05 '21 13:04 hadley

IOW, I'm suggesting that the data frame unpacking that currently occurs in tweets_to_tbl_() should be postponed until the user requests it.

hadley avatar Apr 05 '21 13:04 hadley

Is it intentional/temporary that since #572 get_timeline() and lists_statuses() do not return things like the screen_name of the sender anymore?

simonheb avatar Oct 04 '21 16:10 simonheb

It is an error I didn't detect when I merged the PR. Thanks @simonheb for asking!! The user data is processed incorrectly and lost later on (but it should be available via user(search_tweets("bla"))).

llrs avatar Oct 04 '21 18:10 llrs

Sorry @simonheb, I looked further into the issue and it turns out I gave you incorrect code: user is internal to rtweet and shouldn't be used externally.

The correct function to retrieve the screen name is users_data, which returns all the information the API provides about the user. This is what you should use, and it is also documented in the search_tweets example.

llrs avatar Oct 04 '21 21:10 llrs

Ok thanks.

But this is just a quick fix, no? In the long run lists_statuses, etc. should also return user data, no?

simonheb avatar Oct 07 '21 09:10 simonheb

They already return this data, but it is stored in an attribute. It is not a quick fix: I did not have to change anything between my comments for this to work. I agree with Hadley that having 70 columns was not practical.
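To make "stored in an attribute" concrete, here is a minimal base R sketch of the general pattern. It uses a toy data frame, not real rtweet output, and the attribute name "users" is an assumption chosen for this example.

```r
# Toy data frame standing in for the tweets returned by the package.
tweets <- data.frame(
  id_str = c("1", "2"),
  text = c("a", "b"),
  stringsAsFactors = FALSE
)

# Attach the per-tweet user data as an attribute rather than as columns.
# "users" is an assumed attribute name used only in this sketch.
attr(tweets, "users") <- data.frame(
  screen_name = c("alice", "bob"),
  followers_count = c(10L, 20L),
  stringsAsFactors = FALSE
)

# The visible data frame stays narrow ...
ncol(tweets)

# ... and the user data is still there when asked for.
users <- attr(tweets, "users")
users$screen_name
```

In practice you would go through the users_data accessor mentioned above rather than reading the attribute directly.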

At the moment I don't plan to change the columns or how the information is returned anytime soon. Perhaps it needs its own class and/or printing method. I might add some functions to access some of the nested lists within the object, but I am not going back to one big data.frame that is 73 columns wide. I understand that this is one more breaking change, but I think that in general it makes working with the output easier.

llrs avatar Oct 07 '21 12:10 llrs

These changes look great! It's a good idea [and more sustainable] to more closely mirror actual API data structures. I'm excited to see [and hopefully contribute again to] future changes in the pkg as well! Thank you @llrs and @hadley for all the hard but excellent work!

mkearney avatar Oct 20 '21 17:10 mkearney

Hi @mkearney, many thanks for your encouraging words. Sorry for the surprise when you installed the development version of the package and it broke your scripts. That was not my intention when I offered to help maintain the package.

I am aware that changing the column names will break scripts and other packages; that's one of the reasons why it will take some more time until I think about sending the package to CRAN. Perhaps there will be one other breaking change before sending it to CRAN, as we were considering renaming the functions. Additionally, there are still some bugs we have introduced that I would like to fix, and I want to make it easier to transition from version 0.7.0 to this one. Perhaps one way might be adding some helpers to extract columns and rename them. All feedback is welcome, especially if we made something harder or you have other comments to improve the package.

While we've tried to mirror the actual API data structure more closely, one of the main reasons for the package's success is the flattening of the data it does. There is still some work to do on that front, as some of the columns now have a nested structure (and save_as_csv and write_as_csv have not been adapted yet), but flattening is important for analysis and we'll keep it.

llrs avatar Oct 21 '21 17:10 llrs

You don't need to apologize, @llrs! You've been doing all of the [great] work here, so the last thing you should worry about is my old scripts working. It seems you've been very thoughtful about everything, so in the meantime, I'll try to give this (how to ease the transition for users) some thought and see if I can't help make this happen!

mkearney avatar Oct 22 '21 15:10 mkearney

@mkearney I've sent you an email to the address listed on the description, but I'm not sure if you still receive mail there... In case you no longer do, I wanted you to know that some users recently asked for improvements to rtweet and suggested holding a day-long rtweet hackathon to squash bugs and build in support for Twitter's new API v2.

I suggested having the hackathon on the 27th of November or the 4th of December. See this thread in the package-maintenance channel of the rOpenSci Slack. What do you think? Would you like to join the conversation there?

llrs avatar Nov 17 '21 20:11 llrs