
Error "duplicate 'row.names' are not allowed" while using search_fullarchive

Open KeisukeNish opened this issue 1 year ago • 17 comments

Problem

When using search_fullarchive, I got an error message: Error in .rowNamesDF<-(x, value = value): duplicate 'row.names' are not allowed

I got the following traceback (I replaced the id_str and other values with "...", and the search words with "TERM_X"):

12. stop("duplicate 'row.names' are not allowed")
11. `.rowNamesDF<-`(x, value = value)
10. `row.names<-.data.frame`(`*tmp*`, value = value)
9. `row.names<-`(`*tmp*`, value = value)
8. `rownames<-`(`*tmp*`, value = `*vtmp*`)
7. rbind(deparse.level, ...)
6. rbind(structure(list(id = ..., id_str = "...", name = "...",
       screen_name = "...", location = "...", url = NA_character_,
       description = "...",
       derived = structure(list(locations = list(structure(list( ...
5. do.call("rbind", tweets[["user"]])
4. tweets_with_users(result)
3. search_premium("fullarchive", q = q, n = n, fromDate = fromDate,
       toDate = toDate, env_name = env_name, continue = continue,
       safedir = safedir, parse = parse, token = token)
2. search_fullarchive(q = q, n = n, fromDate = fromDate, toDate = toDate,
       env_name = env_name, token = token)
1. tweet.archive.search(q = "TERM_A TERM_B -is:retweet OR TERM_A TERM_C -is:retweet",
       n = 500, full = 500, fromDate = 201701010000, toDate = 202112312359,
       env_name = "archive", token = twitter_token)

Versions etc.

Twitter API: Premium (paid). rtweet version: 1.0.4. R version: 4.2.1. RStudio version: 2022.07.1 Build 554. Sorry, but I can't show you the session info because I don't have the PC I ran the code on right now.

I think this is really similar to this issue. I didn't encounter this error while using rtweet 0.7.0 on another PC; the R and RStudio versions there were the same as above.

KeisukeNish avatar Aug 05 '22 09:08 KeisukeNish

Hi,

Thanks for the report. I see the rtweet version you are using is not one I released, as I never released an rtweet 1.0.4. Is that a typo, or are there modifications done by someone else?

Unfortunately this kind of problem is hard for me to reproduce unless I know in which tweets or users it happens.
As you found, this is similar to #648, but there I was able to fix it quickly because the report provided enough information to reproduce the problem and test that the fix worked correctly. I don't need the terms you are searching; it would be enough if I could run lookup_tweets(id1) and trigger the same parsing problems.
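For illustration, a minimal sketch of that kind of reproduction (the id below is a placeholder, not one from this issue):

library(rtweet)
# Placeholder id: substitute the id_str of a tweet that fails to parse.
lookup_tweets("1234567890123456789")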

llrs avatar Aug 05 '22 09:08 llrs

Thanks for the quick response. And sorry for the typo, it is 1.0.2.

I couldn't download the tweets, so I wonder what I can do to help you reproduce it.

What about the terms? They were "美保関 観光 -is:retweet OR 美保関 旅 -is:retweet". These Chinese characters represent Japanese words (美保関 is Mihonoseki, a place name; 観光 means sightseeing; 旅 means trip).

It might be irrelevant, but while I was trying to solve this, the function made many requests.

KeisukeNish avatar Aug 05 '22 10:08 KeisukeNish

You don't need to store the downloaded tweets (you could if you specify parse = FALSE); I need to know which tweets fail to parse. That's why I asked about the id_str you redacted. Please provide the number and I might be able to fix it. The search terms are good to have, but I will get a different result, as there is no guarantee that the same search returns the same results.

The number of requests might be a problem for you, as it might have exhausted the number of queries you can make.

llrs avatar Aug 05 '22 11:08 llrs

Thanks again and sorry that I didn't get your point. The traceback includes id_str = "15426138".

KeisukeNish avatar Aug 06 '22 01:08 KeisukeNish

Sorry, I cannot reproduce the problem. I didn't find a tweet with id 15426138, and I didn't have a problem with rtweet parsing the user with the id 15426138. Searching for those terms didn't result in any error. I could not try the premium search because I ran out of API calls.

You'll need to provide a reproducible example of the error that contains the id of the user or tweet that causes rtweet to fail.

llrs avatar Aug 06 '22 08:08 llrs

Thank you so much for everything.

I will retry the function, hopefully within the next two days. If you need information other than

  1. traceback
  2. session info

for a reproducible example, could you tell me about it?

And if you have any ideas about what triggers the error, and there's something I can do to avoid it, I'd like to try it.

KeisukeNish avatar Aug 08 '22 00:08 KeisukeNish

I need to know the id_str of the tweets that are causing problems; without that, even if you provide the traceback and the session info, I won't be able to fix it.

To avoid triggering the error you could use parse = FALSE and deal with the parsing yourself; this would also help identify the ids of the tweets that are causing this problem.
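For example, a minimal sketch, assuming (as reported later in this thread) that the unparsed result is a list whose non-empty elements are data.frames of tweets:

raw <- search_fullarchive(q = q, n = 500, fromDate = fromDate, toDate = toDate,
                          env_name = env_name, token = token, parse = FALSE)
# Keep the non-empty pages, then collect the ids of every tweet returned.
pages <- Filter(function(p) is.data.frame(p) && nrow(p) > 0, raw)
ids <- unlist(lapply(pages, function(p) p$id_str))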

llrs avatar Aug 08 '22 09:08 llrs

As you commented, parse = FALSE avoids the error.

But I noticed one really odd thing: search_fullarchive seems to download fewer tweets than expected each time I run code like the following:

rt_arch_mihonoseki02 <- search_fullarchive(q = "美保関 観光 -is:retweet OR 美保関 旅 -is:retweet",
                                           n = 500, premium = TRUE,
                                           fromDate = 202111010000, toDate = 202111302359,
                                           env_name = "hoge", token = twitter_token, parse = FALSE)

There I got a large list containing 17 lists with data and 3 empty lists. The 17 lists are data.frames of 31 rows and 35 columns.

Could you give me some advice on how to download up to 500 tweets each time? (Or should I open another issue?)

KeisukeNish avatar Aug 09 '22 02:08 KeisukeNish

Hi, now that you have the list with the output, could you upload the file here so that I can try to solve this bug?

That the output is simplified to a list of 17 lists with data.frames doesn't mean that there isn't information about 500 tweets (17 * 31 = 527 > 500). I don't remember the details now, but all the information should be there.
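A quick way to check (a sketch, assuming rt_arch_mihonoseki02 is the parse = FALSE output above):

# Count the tweets across all non-empty pages of the unparsed result.
pages <- Filter(is.data.frame, rt_arch_mihonoseki02)
sum(vapply(pages, nrow, integer(1)))  # 17 pages * 31 rows = 527 >= 500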

llrs avatar Aug 09 '22 09:08 llrs

Thanks. I made a JSON file from rt_arch_mihonoseki02 with write_json from jsonlite. If you need anything more, please let me know.

miho.zip

KeisukeNish avatar Aug 10 '22 07:08 KeisukeNish

Thanks for providing the data.

I'm not able to trigger the same error with the data provided. I think the serialization to JSON differs from what is expected: internally rtweet uses jsonlite::fromJSON, which simplifies the data, while write_json doesn't, so reading your file back doesn't reproduce the original object.
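A minimal sketch of the mismatch (illustrative only, not rtweet's internals):

library(jsonlite)
x <- list(list(id_str = "1"), list(id_str = "2"))  # nested list, like the raw API output
j <- fromJSON(toJSON(x, auto_unbox = TRUE))        # fromJSON simplifies this to a data.frame
identical(x, j)                                    # FALSE: the structure changed
saveRDS(x, f <- tempfile(fileext = ".RDS"))        # saveRDS preserves the object exactly
identical(x, readRDS(f))                           # TRUE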

Could you save it following this approach?

  1. Check that using tweets_with_users(rt_arch_mihonoseki02) produces the same error as initially reported.
  2. Save the data in .RDS format via saveRDS, for example saveRDS(rt_arch_mihonoseki02, "miho.RDS"), and upload the file miho.RDS.

llrs avatar Aug 11 '22 11:08 llrs

Thanks a lot.

The first approach didn't return any errors. I uploaded a zip containing 4 RDS files. miho.RDS is from rt_arch_mihonoseki02. matsue01-04.RDS are from rt_arch_matsuecity through rt_arch_matsuecity04, made from the following calls.

rt_arch_matsuecity <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
                                         n = 500, premium = TRUE,
                                         fromDate = 201801010000, toDate = 202112312359,
                                         env_name = "archive", token = twitter_token, parse = FALSE)

rt_arch_matsuecity02 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
                                           n = 500, premium = TRUE,
                                           fromDate = 201801010000, toDate = 202109150108,
                                           env_name = "archive", token = twitter_token, parse = FALSE)

rt_arch_matsuecity03 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
                                           n = 500, premium = TRUE,
                                           fromDate = 201801010000, toDate = 202109050919,
                                           env_name = "archive", token = twitter_token, parse = FALSE)

rt_arch_matsuecity04 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
                                           n = 500, premium = TRUE,
                                           fromDate = 201801010000, toDate = 201807151632,
                                           env_name = "archive", token = twitter_token, parse = FALSE)

I hope this helps.

miho0812.zip

KeisukeNish avatar Aug 12 '22 05:08 KeisukeNish

Now I can reproduce the error with the matsue01.RDS file. Thanks for providing the data.

Apparently some tweets have a derived field which contains a locations field (a place type of data), but others don't. This mix of data results in the error you are seeing.
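The error message itself comes from base R's row-name check; a minimal illustration (not rtweet code):

df <- data.frame(x = 1:2)
rownames(df) <- c("a", "a")
# Error in `.rowNamesDF<-`(x, value = value) :
#   duplicate 'row.names' are not allowed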

llrs avatar Aug 12 '22 19:08 llrs

Thank you so much. Then, I'll use search_fullarchive with parse = FALSE for the time being.

I'm not sure whether it's related to this bug, but it seems the function downloaded 100 tweets (or fewer) per request, although I specified premium = TRUE. I'll try it again, and if something is wrong, I'll report it.

(sorry, I commented on another account, so I reposted this.)

KeisukeNish avatar Aug 12 '22 21:08 KeisukeNish

Indeed, the premium argument is not working properly (it is not passed to the internal function). I'll fix that too.
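To illustrate the bug class (a hypothetical sketch, not rtweet's actual code; search_wrapper, do_request, and the page sizes are assumptions):

search_wrapper <- function(q, n, premium = FALSE, ...) {
  # The flag is accepted here but must also be forwarded; otherwise the
  # request always uses the smaller sandbox page size.
  page_size <- if (premium) 500 else 100
  do_request(q = q, maxResults = page_size, ...)  # do_request is made up
}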

llrs avatar Aug 13 '22 09:08 llrs

Investigating a bit more, it seems that this location field is only present if you use enterprise access: https://developer.twitter.com/en/docs/twitter-api/enterprise/enrichments/overview/profile-geo This requires a special function to handle the format of this location/geo data, and I cannot reuse existing functions. I am focusing on fixing issues affecting more users.

However, I fixed the issue with the premium parameter, as this was easier. This is now fixed in version 1.0.2.9002 in the devel branch.

llrs avatar Aug 13 '22 12:08 llrs

Thank you so much for everything.

Indeed, I'm a premium user, not an enterprise one. So it seems the location field is also provided on the paid premium plan.

Still, I can manage the issue with parse = FALSE, premium = TRUE.

KeisukeNish avatar Aug 13 '22 14:08 KeisukeNish