Duplicates returned when listing nodes files
This was originally reported in https://github.com/ropensci/osfr/issues/150 by @doomlab.
The reprex here shows that the same file, 421_Lu.pdf, is returned twice when listing files in the Local IRB directory within this project.
I've confirmed that the duplicate entries are coming from the OSF API, across different pages of results:
#!/usr/bin/env bash
set -e
TOKEN="$OSF_PAT"
NODE="ycn7z"
ID="6113d75ae3801305b39612a8"
LIMIT=2
# Retrieve name and path attributes from JSON response
JQ_FILTER='.data[].attributes | "\(.name) \(.path)"'
for i in $(seq 1 $LIMIT); do
  echo "Retrieving page $i"
  curl --silent \
    "https://api.osf.io/v2/nodes/$NODE/files/osfstorage/$ID/?page=$i" \
    -H "Authorization: Bearer $TOKEN" \
    -H 'Accept: application/vnd.api+json' \
    -H 'Content-Type: application/json; charset=utf-8' \
    | jq "$JQ_FILTER"  # quoted so the filter reaches jq as a single argument
done
## Retrieving page 1
## "97_Pfuhl.pdf /6163f0e5fd5b230191983824"
## "1897_Parker.pdf /616440dfc5565801d34b71bf"
## "1698_Butt.pdf /616513bbc5565802014b9ae6"
## "1970_Pavlović.pdf /617436dae572ea00b13a7285"
## "1560_Irrazabal.pdf /618281a0a30f8100cdaa071d"
## "1867_Oner.pdf /6184db04bfb47d00a3ef50dd"
## "169_Montefinese.pdf /6186148c25f90a004a0f6aa6"
## "87_Vaughn.docx /619548800b0c1e01a27fdae5"
## "35_Stewart.pdf /6197ca37ef62980009f5c789"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3" <-- copy 1
##
## Retrieving page 2
## "423_Arriaga.pdf /619d017da83c2001650e8e53"
## "761_Papadatou-Pastou.pdf /619df2886977cd010f496498"
## "712_Davis.pdf /61a7d30d4d4ce5018476e569"
## "1574_Al-Hoorie.pdf /61b89ac6da0b1b0488d05546"
## "206_Ergiyen.pdf /61cc42f3da632006e1fe6f4a"
## "437_Peker.pdf /61fc2630370e6c002bf3d6cc"
## "104_Stieger.pdf /620e3a2511da1c05cdf57647"
## "238_Martínez.pdf /620f7666d9b6cf0144b90449"
## "1052_Parzuchowski.pdf /6220fbccc064270378d90ce5"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3" <-- copy 2
The WaterButler IDs are identical, so this does seem like a possible bug.
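To check for repeats without reading the listing by eye, the combined output of all pages can be piped through a small filter. This is only a sketch: `find_duplicate_paths` is a hypothetical helper name, and it assumes each line has the form `name path` with the path as the last whitespace-separated field (as produced by the jq filter above when run with `jq -r`). The sample lines are taken from the output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name path" lines on stdin and print every
# path (last field) that occurs more than once.
find_duplicate_paths() {
  awk '{ print $NF }' | sort | uniq -d
}

# Sample lines copied from pages 1 and 2 of the listing above.
printf '%s\n' \
  '35_Stewart.pdf /6197ca37ef62980009f5c789' \
  '421_Lu.pdf /6161fcd9fd5b2301429849b3' \
  '423_Arriaga.pdf /619d017da83c2001650e8e53' \
  '421_Lu.pdf /6161fcd9fd5b2301429849b3' \
  | find_duplicate_paths
# prints: /6161fcd9fd5b2301429849b3
```

An empty result would mean no path was returned more than once across the pages checked.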
Let me know if you need any more information.
Thanks @aaronwolen - I will note that when I run the same code I get a different file duplicated. And the duplicated file sometimes changes, usually when I update/upload a new file. You can see my reprex here.
I was able to reproduce the error, and it is indeed coming from the API: the default sorting logic for this particular endpoint appears to be broken. As a temporary workaround, you can supply your own sorting criteria to prevent the error. For example, https://api.osf.io/v2/nodes/ycn7z/files/osfstorage/6113d75ae3801305b39612a8/?sort=id will sort correctly by id, removing all duplicates; the same applies to name, etc. We will resolve the issue eventually, but I recommend using the workaround for now. Thanks for your interest. I'll close this issue when we've resolved this bug, if you have no further questions or comments.
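Applied to the paging loop from the original report, the workaround might look like the sketch below. The helper names `build_page_url` and `list_sorted_pages` are hypothetical, and combining `sort` with `page` in one query string is an assumption; only `?sort=id` on its own is shown in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a folder-listing URL with an explicit sort key.
# Assumes the sort and page query parameters can be combined.
build_page_url() {
  local node="$1" folder_id="$2" page="$3" sort_key="${4:-id}"
  printf 'https://api.osf.io/v2/nodes/%s/files/osfstorage/%s/?sort=%s&page=%s\n' \
    "$node" "$folder_id" "$sort_key" "$page"
}

# Hypothetical helper: print "name path" for each file on pages 1..N.
# Expects a personal access token in the OSF_PAT environment variable.
list_sorted_pages() {
  local node="$1" folder_id="$2" pages="$3" sort_key="${4:-id}" page
  for page in $(seq 1 "$pages"); do
    curl --silent \
      "$(build_page_url "$node" "$folder_id" "$page" "$sort_key")" \
      -H "Authorization: Bearer $OSF_PAT" \
      | jq -r '.data[].attributes | "\(.name) \(.path)"'
  done
}

# Show the URL that would be requested for page 2, sorted by id.
build_page_url ycn7z 6113d75ae3801305b39612a8 2
```

With a token exported, `list_sorted_pages ycn7z 6113d75ae3801305b39612a8 2` would fetch the same two pages as the original script, this time with `sort=id`.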
Thanks for the quick response, @Johnetordoff! A couple follow-up questions for you:
- I didn't even know about the sort param. Is it documented somewhere and I missed it?
- What other attributes can we sort on?
- Will sorting on any attribute solve the issue?
Thanks @Johnetordoff - I have updated my code and got the appropriate output. Appreciate the workaround.
@aaronwolen
I didn't even know about the sort param. Is it documented somewhere and I missed it?
It is not documented. Unfortunately, this param is not implemented consistently across all the endpoints it applies to. Some queries, legacy endpoints, and attributes haven't been QA'ed for accurate sorting, so they remain undocumented.
What other attributes can we sort on?
The default behavior for list views is to allow the user to sort on any of the attributes returned in the JSON payload. For example, https://api.osf.io/v2/users/ allows you to sort on full_name, given_name, middle_names, family_name, suffix, date_registered, active, timezone, locale, social, employment, and education.
Will sorting on any attribute solve the issue?
I did not check. As I wrote above, this behavior is not guaranteed to be accurate or consistent.
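Because sorting is not guaranteed to be accurate on every attribute, a client could also drop repeated entries itself after collecting all pages. The following is a minimal sketch of that idea, not something the API or osfr is confirmed to do: `dedupe_by_path` is a hypothetical helper, and it assumes `name path` lines with the path as the last field.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the first line seen for each path
# (last whitespace-separated field), preserving the input order.
dedupe_by_path() {
  awk '!seen[$NF]++'
}

# Sample lines copied from the listing earlier in this thread.
printf '%s\n' \
  '421_Lu.pdf /6161fcd9fd5b2301429849b3' \
  '423_Arriaga.pdf /619d017da83c2001650e8e53' \
  '421_Lu.pdf /6161fcd9fd5b2301429849b3' \
  | dedupe_by_path
```

Note that this only hides repeated rows. If a broken sort order also causes a file to be skipped on every page, client-side filtering will not bring it back.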