rtweet
Error "duplicate 'row.names' are not allowed" while using search_fullarchive
Problem
When using search_fullarchive, I got an error message:
Error in .rowNamesDF<-(x, value = value): duplicate 'row.names' are not allowed
I got the following traceback (I replaced the id_str and other values with "..." and the search words with "TERM_X").
12. stop("duplicate 'row.names' are not allowed")
11. `.rowNamesDF<-`(x, value = value)
10. `row.names<-.data.frame`(`*tmp*`, value = value)
9. `row.names<-`(`*tmp*`, value = value)
8. `rownames<-`(`*tmp*`, value = `*vtmp*`)
7. rbind(deparse.level, ...)
6. rbind(structure(list(id = ..., id_str = "...", name = "...", screen_name = "...", location = "...", url = NA_character_, description = "...", derived = structure(list(locations = list(structure(list( ...
5. do.call("rbind", tweets[["user"]])
4. tweets_with_users(result)
3. search_premium("fullarchive", q = q, n = n, fromDate = fromDate, toDate = toDate, env_name = env_name, continue = continue, safedir = safedir, parse = parse, token = token)
2. search_fullarchive(q = q, n = n, fromDate = fromDate, toDate = toDate, env_name = env_name, token = token)
1. tweet.archive.search(q = "TERM_A TERM_B -is:retweet OR TERM_A TERM_C -is:retweet", n = 500, full = 500, fromDate = 201701010000, toDate = 202112312359, env_name = "archive", token = twitter_token)
Versions etc.
Twitter API: Premium (paid)
rtweet version: 1.0.4
R version: 4.2.1
RStudio version: 2022.07.1 Build 554
Sorry, but I can't show you the session info because I don't have access to the PC I ran the code on right now.
I think this is really similar to this issue.
I didn't encounter this error while using rtweet 0.7.0 on another PC. The R and RStudio versions were the same as the ones above.
Hi,
Thanks for the report. I see the rtweet version you are using is not one I released, as I never upgraded rtweet to 1.0.4. Is that a typo or are there some modifications done by someone else?
Unfortunately, this kind of problem is hard for me to reproduce unless I know which tweets or users trigger it.
As you found this is similar to #648 but there I was able to fix it quickly as the report provided enough information to reproduce the problem and test that the fix worked correctly.
I don't need the terms you are searching; it would be enough if I could run lookup_tweets(id1) and trigger the same parsing problems.
Thanks for the quick response.
And sorry for the typo; it is 1.0.2.
I couldn't download the tweets, so I wonder what I can do to help you reproduce it.
What about the terms? They were "美保関 観光 -is:retweet OR 美保関 旅 -is:retweet" (Japanese: 美保関 is Mihonoseki, a place name; 観光 means sightseeing; 旅 means trip).
It might be irrelevant, but while trying to solve this, the function made many requests.
You don't need to store the downloaded tweets (you could if you specify parse = FALSE); I need to know which tweets fail to parse. That's why I asked about the id_str you redacted. Please provide the number and I might be able to fix it.
These search terms are good, but I will get a different result, as there is no guarantee the same search returns the same results.
The number of requests might be a problem for you, as it might have exhausted the query quota available to you.
Thanks again and sorry that I didn't get your point.
The traceback includes id_str = "15426138".
Sorry, I cannot reproduce the problem. I didn't find a tweet with id 15426138. I didn't have a problem with rtweet parsing the user with the id 15426138. Searching for those terms didn't result in any error. I could not try the premium search because I ran out of API calls.
You'll need to provide a reproducible example with the error that contains the id of the user or tweet that causes rtweet to fail.
Thank you so much for everything.
I will re-try the function, hopefully within the next two days. If you want information other than
- traceback
- session info
for a reproducible example, could you tell me about it?
And if you have any ideas about what triggers the error, and there's something I can do to avoid it, I'd like to try.
I need to know the id_str of the tweets that are causing problems; without that, even if you provide the traceback and the session info, I won't be able to fix it.
To avoid triggering the error you could use parse = FALSE and deal with the parsing yourself; this would also help identify the ids of the tweets that are causing this problem.
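As a sketch of that approach: the field names and structure below are assumptions about the unparsed output, not rtweet's documented format (inspect str(your_result[[1]]) to confirm before relying on them). The mock `pages` object stands in for the real result of search_fullarchive(..., parse = FALSE).

```r
# Mock of what search_fullarchive(..., parse = FALSE) is assumed to return:
# a list of pages, some holding a `user` data.frame, some empty.
pages <- list(
  list(user = data.frame(id_str = c("15426138", "123"), stringsAsFactors = FALSE)),
  list()  # empty page, as seen in the reported output
)

# Collect the user id_str values so the failing records can be reported.
ids <- unlist(lapply(pages, function(p) {
  if (is.data.frame(p$user)) p$user$id_str else NULL
}))
unique(ids)
```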
As you commented, parse = FALSE avoids the error.
But I noticed one really odd thing.
search_fullarchive seems to download fewer tweets than expected each time I run this code:
rt_arch_mihonoseki02 <- search_fullarchive(q = "美保関 観光 -is:retweet OR 美保関 旅 -is:retweet",
n = 500, premium = TRUE,
fromDate = 202111010000, toDate = 202111302359,
env_name = "hoge", token = twitter_token, parse = FALSE)
There I got a large list of 17 lists with data and 3 empty lists. Each of the 17 lists is a data.frame of 31 rows and 35 columns.
Could you give me advice on how to download up to 500 tweets each time? (Or should I open another issue?)
Hi, now that you have the list with the output, could you upload the file here so that I can try to solve this bug?
The fact that the output is simplified to a list of 17 lists of data.frames doesn't mean there isn't information about 500 tweets (17 * 31 > 500). I don't remember the details now, but all the information should be there.
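A quick way to check that count, sketched with mock pages shaped like the output described above (17 data.frames of 31 rows each, plus 3 empty lists, standing in for the real unparsed result):

```r
# Mock pages standing in for the unparsed search_fullarchive() output.
pages <- c(
  replicate(17, data.frame(id_str = as.character(seq_len(31))), simplify = FALSE),
  replicate(3, list(), simplify = FALSE)
)

chunks <- Filter(is.data.frame, pages)          # keep only pages with data
total <- sum(vapply(chunks, nrow, integer(1)))  # tweet rows across all pages
total  # 17 * 31 = 527, so more than 500 tweets are covered
```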
Thanks.
I made a json file from rt_arch_mihonoseki02 with write_json from jsonlite.
If you need some more, please let me know.
Thanks for providing the data.
I'm not able to trigger the same error with the data provided. I think the serialization to json differs from what is expected: internally rtweet uses jsonlite::fromJSON, which simplifies the data, while write_json doesn't.
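As a small illustration of that difference (this is generic jsonlite behavior, not rtweet's internal code): fromJSON simplifies a JSON array of records into a data.frame by default, so a round-trip through write_json does not reproduce the raw structure the parser expects.

```r
library(jsonlite)

# A list of records, like a raw API payload.
x <- list(list(a = 1), list(a = 2))
json <- toJSON(x, auto_unbox = TRUE)

fromJSON(json)                          # simplified into a data.frame
fromJSON(json, simplifyVector = FALSE)  # kept as a list of lists, unsimplified
```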
Could you save it following this approach?
- Check that using tweets_with_users(rt_arch_mihonoseki02) produces the same error initially reported.
- Save the data in .RDS format via saveRDS, for example saveRDS(rt_arch_mihonoseki02, "mhio.RDS"), and upload the file mhio.RDS.
Thanks a lot.
The first approach didn't return any errors.
I uploaded a zip containing 4 RDS files.
miho.RDS is from rt_arch_mihonoseki02.
matsue01.RDS to matsue04.RDS are from rt_arch_matsuecity to rt_arch_matsuecity04, made from the following calls.
rt_arch_matsuecity <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
n = 500, premium = TRUE,
fromDate = 201801010000, toDate = 202112312359,
env_name = "archive", token = twitter_token, parse = FALSE)
rt_arch_matsuecity02 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
n = 500,
fromDate = 201801010000, toDate = 202109150108, premium = TRUE,
env_name = "archive", token = twitter_token, parse = FALSE)
rt_arch_matsuecity03 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
n = 500, premium = TRUE,
fromDate = 201801010000, toDate = 202109050919,
env_name = "archive", token = twitter_token, parse = FALSE)
rt_arch_matsuecity04 <- search_fullarchive(q = "松江 観光 -is:retweet OR 松江 旅 -is:retweet",
n = 500, premium = TRUE,
fromDate = 201801010000, toDate = 201807151632,
env_name = "archive", token = twitter_token, parse = FALSE)
I hope this helps.
Now I can reproduce the error with the matsue01.RDS file. Thanks for providing the data.
Apparently some tweets have a derived field which contains a location field (a place type of data) but others don't. This mix of data results in the error you are seeing.
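This isn't rtweet's exact code path (the error message it produces differs from the one reported), but a minimal sketch of how user records with and without a nested derived column can break rbind():

```r
# One user record carries a `derived` column with nested location data,
# another doesn't, and rbind() refuses to combine them.
u1 <- data.frame(id_str = "1", name = "a", stringsAsFactors = FALSE)
u1$derived <- list(data.frame(locality = "Matsue"))  # nested place-type data
u2 <- data.frame(id_str = "2", name = "b", stringsAsFactors = FALSE)

res <- try(rbind(u1, u2), silent = TRUE)
inherits(res, "try-error")  # TRUE: mismatched columns break rbind()

u1$derived <- NULL          # dropping the extra field as a workaround...
nrow(rbind(u1, u2))         # ...lets rbind() succeed with 2 rows
```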
Thank you so much.
Then I'll use search_fullarchive with parse = FALSE for the time being.
I'm not sure whether it's related to this bug, but it seems the function downloaded 100 tweets (or fewer) per request, although I specified premium = TRUE.
I'll try it again, and if something is wrong, I'll report it.
(sorry, I commented on another account, so I reposted this.)
Indeed the premium argument is not working properly (as it is not used in the internal function). I'll fix that too.
Investigating a bit more, it seems that this location field is only present if you use enterprise access: https://developer.twitter.com/en/docs/twitter-api/enterprise/enrichments/overview/profile-geo Handling the format of this location/geo data requires a special function, and I cannot reuse existing functions. I am focusing on fixing issues affecting more users.
However, I fixed the issue with the premium parameter, as this was easier. This is now fixed in version 1.0.2.9002 in the devel branch.
Thank you so much for everything.
Indeed, I'm a premium user, not an enterprise one. So it seems the location field is also provided on the paid premium plan.
Still, I can manage the issue with parse = FALSE, premium = TRUE.