academictwitteR
[FR] Next generation archiving architecture
Describe the solution you'd like
The current mechanism for generating the file names of the json files is the root cause of #304, #305, and #307. Currently, the name is based on the last id of the `data` slot in the input list `df`.
https://github.com/cjbarrie/academictwitteR/blob/2809432aaea388e7bb016a1f15f24787e8d05586/R/utils.R#L152-L163
And this is not robust at all, especially when there is no `data` slot. When there is no `data` slot but there are other slots (e.g. #307), it produces the "dirty road" problem.
A better idea is to use a hash function provided by the digest package, e.g. MD5, to hash the input `df` and name the json files according to the hash. We are not creating an NSA-certified program, so MD5's collision problem is not a big issue. If it ever becomes an issue (or we are actually creating an NSA-certified program), we can switch to something else, e.g. SHA-3.
require(academictwitteR)
#> Loading required package: academictwitteR
emptydir <- academictwitteR:::.gen_random_dir()
df_to_json2 <- function(df, data_path, errors = FALSE) {
  df_hash <- digest::digest(df)
  jsonlite::write_json(df$data,
                       file.path(data_path, paste0("data_", df_hash, ".json")))
  jsonlite::write_json(df$includes,
                       file.path(data_path, paste0("users_", df_hash, ".json")))
  if (errors) { # also write the errors slot, if requested
    jsonlite::write_json(df$errors,
                         file.path(data_path, paste0("errors_", df_hash, ".json")))
  }
}
academictwitteR:::create_data_dir(emptydir, verbose = FALSE)
whatever <- list()
df_to_json2(df = whatever, data_path = emptydir)
list.files(emptydir)
#> [1] "data_db41907f4f43686fe19edc3d7eb61082.json"
#> [2] "users_db41907f4f43686fe19edc3d7eb61082.json"
whatever2 <- list()
df_to_json2(df = whatever2, data_path = emptydir)
## it is okay to overwrite as whatever and whatever2 are the same, thus have the same hash
list.files(emptydir)
#> [1] "data_db41907f4f43686fe19edc3d7eb61082.json"
#> [2] "users_db41907f4f43686fe19edc3d7eb61082.json"
whatever$data <- iris
df_to_json2(df = whatever, data_path = emptydir)
list.files(emptydir)
#> [1] "data_9292a802ffcf85ea84ead43bdd68f942.json"
#> [2] "data_db41907f4f43686fe19edc3d7eb61082.json"
#> [3] "users_9292a802ffcf85ea84ead43bdd68f942.json"
#> [4] "users_db41907f4f43686fe19edc3d7eb61082.json"
whatever2$errors <- LETTERS
df_to_json2(df = whatever2, data_path = emptydir, errors = TRUE)
list.files(emptydir)
#> [1] "data_9292a802ffcf85ea84ead43bdd68f942.json"
#> [2] "data_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [3] "data_db41907f4f43686fe19edc3d7eb61082.json"
#> [4] "errors_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [5] "users_9292a802ffcf85ea84ead43bdd68f942.json"
#> [6] "users_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [7] "users_db41907f4f43686fe19edc3d7eb61082.json"
whatever3 <- list()
whatever3$errors <- letters
df_to_json2(df = whatever3, data_path = emptydir, errors = TRUE)
## no dirty road
list.files(emptydir)
#> [1] "data_9292a802ffcf85ea84ead43bdd68f942.json"
#> [2] "data_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [3] "data_db41907f4f43686fe19edc3d7eb61082.json"
#> [4] "data_f9dc01c8ffbfbdda3e3070f07b3eb487.json"
#> [5] "errors_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [6] "errors_f9dc01c8ffbfbdda3e3070f07b3eb487.json"
#> [7] "users_9292a802ffcf85ea84ead43bdd68f942.json"
#> [8] "users_b832bd373d8d0d3bbf2919d0d420fcd1.json"
#> [9] "users_db41907f4f43686fe19edc3d7eb61082.json"
#> [10] "users_f9dc01c8ffbfbdda3e3070f07b3eb487.json"
res <- purrr::map(list.files(emptydir, full.names = TRUE),
~jsonlite::read_json(., simplifyVector = TRUE))
filenames <- list.files(emptydir)
names(res) <- filenames
res
#> $data_9292a802ffcf85ea84ead43bdd68f942.json
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3.0 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5.0 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5.0 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> 11 5.4 3.7 1.5 0.2 setosa
#> 12 4.8 3.4 1.6 0.2 setosa
#> 13 4.8 3.0 1.4 0.1 setosa
#> 14 4.3 3.0 1.1 0.1 setosa
#> 15 5.8 4.0 1.2 0.2 setosa
#> 16 5.7 4.4 1.5 0.4 setosa
#> 17 5.4 3.9 1.3 0.4 setosa
#> 18 5.1 3.5 1.4 0.3 setosa
#> 19 5.7 3.8 1.7 0.3 setosa
#> 20 5.1 3.8 1.5 0.3 setosa
#> 21 5.4 3.4 1.7 0.2 setosa
#> 22 5.1 3.7 1.5 0.4 setosa
#> 23 4.6 3.6 1.0 0.2 setosa
#> 24 5.1 3.3 1.7 0.5 setosa
#> 25 4.8 3.4 1.9 0.2 setosa
#> 26 5.0 3.0 1.6 0.2 setosa
#> 27 5.0 3.4 1.6 0.4 setosa
#> 28 5.2 3.5 1.5 0.2 setosa
#> 29 5.2 3.4 1.4 0.2 setosa
#> 30 4.7 3.2 1.6 0.2 setosa
#> 31 4.8 3.1 1.6 0.2 setosa
#> 32 5.4 3.4 1.5 0.4 setosa
#> 33 5.2 4.1 1.5 0.1 setosa
#> 34 5.5 4.2 1.4 0.2 setosa
#> 35 4.9 3.1 1.5 0.2 setosa
#> 36 5.0 3.2 1.2 0.2 setosa
#> 37 5.5 3.5 1.3 0.2 setosa
#> 38 4.9 3.6 1.4 0.1 setosa
#> 39 4.4 3.0 1.3 0.2 setosa
#> 40 5.1 3.4 1.5 0.2 setosa
#> 41 5.0 3.5 1.3 0.3 setosa
#> 42 4.5 2.3 1.3 0.3 setosa
#> 43 4.4 3.2 1.3 0.2 setosa
#> 44 5.0 3.5 1.6 0.6 setosa
#> 45 5.1 3.8 1.9 0.4 setosa
#> 46 4.8 3.0 1.4 0.3 setosa
#> 47 5.1 3.8 1.6 0.2 setosa
#> 48 4.6 3.2 1.4 0.2 setosa
#> 49 5.3 3.7 1.5 0.2 setosa
#> 50 5.0 3.3 1.4 0.2 setosa
#> 51 7.0 3.2 4.7 1.4 versicolor
#> 52 6.4 3.2 4.5 1.5 versicolor
#> 53 6.9 3.1 4.9 1.5 versicolor
#> 54 5.5 2.3 4.0 1.3 versicolor
#> 55 6.5 2.8 4.6 1.5 versicolor
#> 56 5.7 2.8 4.5 1.3 versicolor
#> 57 6.3 3.3 4.7 1.6 versicolor
#> 58 4.9 2.4 3.3 1.0 versicolor
#> 59 6.6 2.9 4.6 1.3 versicolor
#> 60 5.2 2.7 3.9 1.4 versicolor
#> 61 5.0 2.0 3.5 1.0 versicolor
#> 62 5.9 3.0 4.2 1.5 versicolor
#> 63 6.0 2.2 4.0 1.0 versicolor
#> 64 6.1 2.9 4.7 1.4 versicolor
#> 65 5.6 2.9 3.6 1.3 versicolor
#> 66 6.7 3.1 4.4 1.4 versicolor
#> 67 5.6 3.0 4.5 1.5 versicolor
#> 68 5.8 2.7 4.1 1.0 versicolor
#> 69 6.2 2.2 4.5 1.5 versicolor
#> 70 5.6 2.5 3.9 1.1 versicolor
#> 71 5.9 3.2 4.8 1.8 versicolor
#> 72 6.1 2.8 4.0 1.3 versicolor
#> 73 6.3 2.5 4.9 1.5 versicolor
#> 74 6.1 2.8 4.7 1.2 versicolor
#> 75 6.4 2.9 4.3 1.3 versicolor
#> 76 6.6 3.0 4.4 1.4 versicolor
#> 77 6.8 2.8 4.8 1.4 versicolor
#> 78 6.7 3.0 5.0 1.7 versicolor
#> 79 6.0 2.9 4.5 1.5 versicolor
#> 80 5.7 2.6 3.5 1.0 versicolor
#> 81 5.5 2.4 3.8 1.1 versicolor
#> 82 5.5 2.4 3.7 1.0 versicolor
#> 83 5.8 2.7 3.9 1.2 versicolor
#> 84 6.0 2.7 5.1 1.6 versicolor
#> 85 5.4 3.0 4.5 1.5 versicolor
#> 86 6.0 3.4 4.5 1.6 versicolor
#> 87 6.7 3.1 4.7 1.5 versicolor
#> 88 6.3 2.3 4.4 1.3 versicolor
#> 89 5.6 3.0 4.1 1.3 versicolor
#> 90 5.5 2.5 4.0 1.3 versicolor
#> 91 5.5 2.6 4.4 1.2 versicolor
#> 92 6.1 3.0 4.6 1.4 versicolor
#> 93 5.8 2.6 4.0 1.2 versicolor
#> 94 5.0 2.3 3.3 1.0 versicolor
#> 95 5.6 2.7 4.2 1.3 versicolor
#> 96 5.7 3.0 4.2 1.2 versicolor
#> 97 5.7 2.9 4.2 1.3 versicolor
#> 98 6.2 2.9 4.3 1.3 versicolor
#> 99 5.1 2.5 3.0 1.1 versicolor
#> 100 5.7 2.8 4.1 1.3 versicolor
#> 101 6.3 3.3 6.0 2.5 virginica
#> 102 5.8 2.7 5.1 1.9 virginica
#> 103 7.1 3.0 5.9 2.1 virginica
#> 104 6.3 2.9 5.6 1.8 virginica
#> 105 6.5 3.0 5.8 2.2 virginica
#> 106 7.6 3.0 6.6 2.1 virginica
#> 107 4.9 2.5 4.5 1.7 virginica
#> 108 7.3 2.9 6.3 1.8 virginica
#> 109 6.7 2.5 5.8 1.8 virginica
#> 110 7.2 3.6 6.1 2.5 virginica
#> 111 6.5 3.2 5.1 2.0 virginica
#> 112 6.4 2.7 5.3 1.9 virginica
#> 113 6.8 3.0 5.5 2.1 virginica
#> 114 5.7 2.5 5.0 2.0 virginica
#> 115 5.8 2.8 5.1 2.4 virginica
#> 116 6.4 3.2 5.3 2.3 virginica
#> 117 6.5 3.0 5.5 1.8 virginica
#> 118 7.7 3.8 6.7 2.2 virginica
#> 119 7.7 2.6 6.9 2.3 virginica
#> 120 6.0 2.2 5.0 1.5 virginica
#> 121 6.9 3.2 5.7 2.3 virginica
#> 122 5.6 2.8 4.9 2.0 virginica
#> 123 7.7 2.8 6.7 2.0 virginica
#> 124 6.3 2.7 4.9 1.8 virginica
#> 125 6.7 3.3 5.7 2.1 virginica
#> 126 7.2 3.2 6.0 1.8 virginica
#> 127 6.2 2.8 4.8 1.8 virginica
#> 128 6.1 3.0 4.9 1.8 virginica
#> 129 6.4 2.8 5.6 2.1 virginica
#> 130 7.2 3.0 5.8 1.6 virginica
#> 131 7.4 2.8 6.1 1.9 virginica
#> 132 7.9 3.8 6.4 2.0 virginica
#> 133 6.4 2.8 5.6 2.2 virginica
#> 134 6.3 2.8 5.1 1.5 virginica
#> 135 6.1 2.6 5.6 1.4 virginica
#> 136 7.7 3.0 6.1 2.3 virginica
#> 137 6.3 3.4 5.6 2.4 virginica
#> 138 6.4 3.1 5.5 1.8 virginica
#> 139 6.0 3.0 4.8 1.8 virginica
#> 140 6.9 3.1 5.4 2.1 virginica
#> 141 6.7 3.1 5.6 2.4 virginica
#> 142 6.9 3.1 5.1 2.3 virginica
#> 143 5.8 2.7 5.1 1.9 virginica
#> 144 6.8 3.2 5.9 2.3 virginica
#> 145 6.7 3.3 5.7 2.5 virginica
#> 146 6.7 3.0 5.2 2.3 virginica
#> 147 6.3 2.5 5.0 1.9 virginica
#> 148 6.5 3.0 5.2 2.0 virginica
#> 149 6.2 3.4 5.4 2.3 virginica
#> 150 5.9 3.0 5.1 1.8 virginica
#>
#> $data_b832bd373d8d0d3bbf2919d0d420fcd1.json
#> named list()
#>
#> $data_db41907f4f43686fe19edc3d7eb61082.json
#> named list()
#>
#> $data_f9dc01c8ffbfbdda3e3070f07b3eb487.json
#> named list()
#>
#> $errors_b832bd373d8d0d3bbf2919d0d420fcd1.json
#> [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
#> [20] "T" "U" "V" "W" "X" "Y" "Z"
#>
#> $errors_f9dc01c8ffbfbdda3e3070f07b3eb487.json
#> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
#> [20] "t" "u" "v" "w" "x" "y" "z"
#>
#> $users_9292a802ffcf85ea84ead43bdd68f942.json
#> named list()
#>
#> $users_b832bd373d8d0d3bbf2919d0d420fcd1.json
#> named list()
#>
#> $users_db41907f4f43686fe19edc3d7eb61082.json
#> named list()
#>
#> $users_f9dc01c8ffbfbdda3e3070f07b3eb487.json
#> named list()
unlink(emptydir)
Created on 2022-03-10 by the reprex package (v2.0.1)
However, it has been 5 CRAN releases already. Changing this probably won't create reproducibility issues at the data level, but it will surely create issues for the reproducibility of the directory structure (which users probably won't notice).
Anything else?
@cjbarrie @justinchuntingho @TimBMK
Your thoughts?
Another way to deal with it is to save the `r` here instead. It would be what users of twarc2 would expect, and it would save a lot of time on json->list / list->json conversions too. It also makes the implementation of #296 easier, as all functions use `make_query` under the hood. But this would create an even bigger backward compatibility issue than the solution above. Also, `r` might contain sensitive information, making the directory not sharable. (Well, the directory SHOULD NOT be shared in the first place per the Twitter API agreement.)
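On the sensitive-information point: an httr response object typically carries the original request, including its Authorization header, in `r$request`. A minimal sketch of what a redaction step before archiving could look like; the helper name `redact_response` is made up here, and the exact header location is an assumption about httr's internal structure:

```r
# Hypothetical helper (not part of academictwitteR): blank out the
# Authorization header before the response object is written to disk,
# so archived files don't leak the bearer token. Assumes the httr
# convention that the original request lives in r$request.
redact_response <- function(r) {
  if (!is.null(r$request$headers) &&
      "Authorization" %in% names(r$request$headers)) {
    r$request$headers[["Authorization"]] <- "<redacted>"
  }
  r
}

# Works the same on a plain list shaped like an httr response:
fake_r <- list(request = list(headers = c(Authorization = "bearer secret")))
redacted <- redact_response(fake_r)
```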
https://github.com/cjbarrie/academictwitteR/blob/2809432aaea388e7bb016a1f15f24787e8d05586/R/utils.R#L33
@cjbarrie @justinchuntingho Any thoughts?
#304, #305, and #307 can't be solved without handling this one first.
I will review this properly by the end of the week. Thank you for your work in detailing options @TimBMK and @chainsawriot
Sorry for being slow on this. I think that the second solution you offer is likely the most sensible, as it solves the most problems at once.
I'm also not too worried about backward compatibility when it comes to raw file directories. As you say, there are limitations on sharing these anyway. Additionally, a historical query run now will, a lot of the time, by definition return different data than the same query run ten minutes hence. Complete reproducibility is therefore likely not possible.
To make sure I follow the second solution, the flow in brief would be:
my_query <- "#BLM lang:EN"
endpoint_url <- "https://api.twitter.com/2/tweets/search/all"
params <- list(
  "query" = my_query,
  "start_time" = "2021-01-01T00:00:00Z",
  "end_time" = "2021-07-31T23:59:59Z",
  "max_results" = 20
)
r <- httr::GET(url = endpoint_url,
               httr::add_headers(
                 Authorization = paste0("bearer ", Sys.getenv("TWITTER_BEARER"))),
               query = params)
# the change is then here
saving_here <- jsonlite::fromJSON(httr::content(r, "text"))
And how would we then be storing the output saved from here?
Regarding backward compatibility, I would suggest adding an option (or a legacy function) to retain the old functionality. Not only for reproducibility, but also for revisiting old data. Imagine, for example, a sample of 10 million Tweets stored as .json. You may have read these into R and saved the combined data as an .RDa file at some point. Chances are, you used bind_tweets()'s tidy or native format. Both of these, however, automatically drop certain variables present in the .jsons. If you want to check back on that data - say, to see if Twitter produced any usable named entities - you would need to read the data in with a different format, such as raw or the tidy2 format I've been working on. Chances are, you may not want to spend the time and resources to re-scrape the full data from Twitter for this.
Furthermore, it is true that, when re-scraping data, you will not be able to reproduce the same results. Most likely, you will get less data due to deleted content. This poses a problem for reproducibility, occasionally requiring researchers to use the original data - for example, when trying out new research methods or during a peer review process.
I think that denying some sort of backward compatibility for the old data would make all of these processes unnecessarily complicated.
@cjbarrie
Simply save this guy: `r`.
If you need it to be json:
response <- jsonlite::fromJSON(httr::content(r, "text"))
response_hash <- digest::digest(response)
jsonlite::write_json(response,
                     file.path(data_path, paste0("response_", response_hash, ".json")))
If you don't care about "interop" (I don't care):
response_hash <- digest::digest(r)
saveRDS(r, file.path(data_path, paste0("response_", response_hash, ".RDS")))
And from this point on, we detect file names matching "^response_" as the new format, and maintain the infrastructure for the old format (i.e. "data_", "users_", "errors_") at least for a while.
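The prefix check could be as simple as the sketch below. The helper name `detect_format` is made up for illustration; only the "^response_" naming convention comes from the proposal:

```r
# Hypothetical dispatcher: classify an archived file by its name.
# "response_<hash>.RDS" is the proposed new format; the legacy format
# uses the "data_" / "users_" / "errors_" prefixes.
detect_format <- function(fname) {
  if (grepl("^response_", basename(fname))) {
    "rds"    # new single-file-per-response format
  } else {
    "json"   # legacy data_/users_/errors_ triplet
  }
}
```

A reader function could then branch on the result, calling readRDS() for "rds" files and the existing jsonlite-based path for everything else.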
Aha I hadn't recognized the exact issue. @TimBMK you're right re binding old collections. We'll want to retain functionality for this.
Is the best middle option to retain functionality for binding json-collected tweets, but from now on plan to output as individual .rds files? I'm happy with dropping our current json output and opting instead for .rds.
I also think the hashing approach is neat
I think saving them as .RDS in the future would be a great solution, as .jsons are very large and big collections take an immense amount of storage. If you need to export to another format for interoperability, you can always do so later. I think it would be critical, however, to keep any and all data returned by the API, as I think it very possible that Twitter will provide additional data in the future (as they did when they added context annotations), or change the data structure. But I believe that is the case with the approach outlined above?
@TimBMK Yes, Twitter changes the data returned by the API all the time and we are always in the passive position. The core data structure wouldn't change a lot, though. I think retaining the entire response from Twitter (i.e. `r`) is a good move, and it is the same strategy taken by twarc2. (If we like, we can take twarc2's modular approach too, i.e. twarc2 only archives the responses and then there are plugins to handle the data conversion. But let's not overengineer it now.)
Should I consider this position confirmed? Do we need to wait for @justinchuntingho ?
If it is confirmed, I will withdraw PR #306. And I think we should do this together with #234.
Sorry for my very late response. I think saving the whole `r` would be the better option.
And I agree with @TimBMK that we should retain functionality for backward compatibility.
Stage 1
master branch
- [ ] Modularize the current functionalities to handle the json-based format in
  - [ ] bind_tweets
  - [ ] resume_collection
  - [ ] update_collection
- [ ] Make sure that the modularization won't change the current behaviors
Stage 2
dev branch
- [ ] Implement the RDS-based format here
https://github.com/cjbarrie/academictwitteR/blob/2809432aaea388e7bb016a1f15f24787e8d05586/R/utils.R#L48-L51
- [ ] Update the query file format
https://github.com/cjbarrie/academictwitteR/blob/2809432aaea388e7bb016a1f15f24787e8d05586/R/utils.R#L168-L172
- [ ] Make sure the integration works in
  - [ ] get_all_tweets
  - [ ] hydrate_tweets
  - [ ] bind_tweets
  - [ ] resume_collection
  - [ ] update_collection
Stage 3
dev branch
- [ ] Make sure the coexistence of the json-based and RDS-based functionalities won't interfere with each other
Stage 4
- [ ] Merge the dev branch to master branch
Short remark regarding the saved format: I've started using vroom to read in .csv / tarballed .csv for large data. This is not only faster than reading .RDS, it also ensures interoperability. You only need to make sure you read in the correct column types, especially character for any IDs, so as not to lose data. Packing the .csv as tar.gz also yields about the same file size as .RDS and might therefore be an alternative to .RDS if you care about interoperability.
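The column-type caveat matters because tweet IDs are 64-bit integers that overflow double precision. A base-R sketch of the failure mode (vroom's col_types argument addresses the same issue; the example ID is made up):

```r
# Tweet IDs exceed the 2^53 exact-integer range of doubles; reading
# them as numeric silently rounds them. Force character to keep them
# intact.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = "1511231234567890123", text = "hello"),
          tmp, row.names = FALSE)

bad  <- read.csv(tmp)                                   # id parsed as numeric
good <- read.csv(tmp, colClasses = c(id = "character")) # id kept verbatim

is.numeric(bad$id)                         # TRUE: precision already lost
identical(good$id, "1511231234567890123")  # TRUE: exact ID preserved
```

With vroom the equivalent safeguard would be passing the ID columns as character via col_types when reading the .csv back in.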