Duplicates returned when listing nodes files

Open aaronwolen opened this issue 2 years ago • 5 comments

This was originally reported in https://github.com/ropensci/osfr/issues/150 by @doomlab.

The reprex here shows that the same file, 421_Lu.pdf, is returned twice when listing files in the Local IRB directory within this project.

I've confirmed that the duplicate entries are coming from the OSF API, across different pages of results:

#!/usr/bin/env bash

set -e

TOKEN="$OSF_PAT"
NODE="ycn7z"
ID="6113d75ae3801305b39612a8"
LIMIT=2

# Retrieve name and path attributes from JSON response
JQ_FILTER='.data[].attributes | "\(.name) \(.path)"'

for i in $(seq 1 $LIMIT); do
  echo "Retrieving page $i"
  # The jq filter is quoted so the shell passes it as a single argument
  curl --silent \
    "https://api.osf.io/v2/nodes/$NODE/files/osfstorage/$ID/?page=$i" \
    -H "Authorization: Bearer $TOKEN" \
    -H 'Accept: application/vnd.api+json' \
    -H 'Content-Type: application/json; charset=utf-8' \
    | jq "$JQ_FILTER"
done

## Retrieving page 1
## "97_Pfuhl.pdf /6163f0e5fd5b230191983824"
## "1897_Parker.pdf /616440dfc5565801d34b71bf"
## "1698_Butt.pdf /616513bbc5565802014b9ae6"
## "1970_Pavlović.pdf /617436dae572ea00b13a7285"
## "1560_Irrazabal.pdf /618281a0a30f8100cdaa071d"
## "1867_Oner.pdf /6184db04bfb47d00a3ef50dd"
## "169_Montefinese.pdf /6186148c25f90a004a0f6aa6"
## "87_Vaughn.docx /619548800b0c1e01a27fdae5"
## "35_Stewart.pdf /6197ca37ef62980009f5c789"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3"                  <-- copy 1
##
## Retrieving page 2
## "423_Arriaga.pdf /619d017da83c2001650e8e53"
## "761_Papadatou-Pastou.pdf /619df2886977cd010f496498"
## "712_Davis.pdf /61a7d30d4d4ce5018476e569"
## "1574_Al-Hoorie.pdf /61b89ac6da0b1b0488d05546"
## "206_Ergiyen.pdf /61cc42f3da632006e1fe6f4a"
## "437_Peker.pdf /61fc2630370e6c002bf3d6cc"
## "104_Stieger.pdf /620e3a2511da1c05cdf57647"
## "238_Martínez.pdf /620f7666d9b6cf0144b90449"
## "1052_Parzuchowski.pdf /6220fbccc064270378d90ce5"
## "421_Lu.pdf /6161fcd9fd5b2301429849b3"                  <-- copy 2

The waterbutler IDs are identical so this does seem like a possible bug.
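
A quick way to check for this across any number of pages is to collect the path from every entry and look for repeats. This is a minimal sketch, assuming the listing has been printed one "name path" entry per line (for example with jq -r); the helper name find_duplicate_paths is hypothetical and not part of any OSF tooling:

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "name path" lines on stdin and prints every
# path that occurs more than once. The path is taken as the last
# whitespace-separated field, so file names containing spaces still work.
find_duplicate_paths() {
  awk '{ print $NF }' | sort | uniq -d
}
```

Piping the combined output of all pages into find_duplicate_paths prints nothing when every path is unique, and prints each repeated path once otherwise.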

Let me know if you need any more information.

aaronwolen avatar Oct 22 '22 13:10 aaronwolen

Thanks @aaronwolen - I will note that when I run the same code I get a different file duplicated. And the duplicated file sometimes changes, usually when I update/upload a new file. You can see my reprex here.

doomlab avatar Oct 22 '22 14:10 doomlab

I was able to reproduce the error, and it is indeed coming from the API: the default sorting logic for this particular endpoint appears to be broken. As a temporary workaround, you can supply your own sorting criterion to prevent the error. For example, https://api.osf.io/v2/nodes/ycn7z/files/osfstorage/6113d75ae3801305b39612a8/?sort=id sorts correctly by id and removes the duplicates; sorting by name and other attributes works similarly. We will resolve the issue eventually, but I recommend using the workaround for now. Thanks for your interest. I'll close this issue once we've resolved the bug, if you have no further questions or comments.
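
For anyone scripting against the endpoint, the workaround could be applied per page roughly as follows. This is a sketch only: the helper names sorted_page_url and fetch_sorted_page are hypothetical, it assumes the sort parameter can be combined with page in the same query string, and it assumes OSF_PAT holds a personal access token as in the original script:

```shell
#!/usr/bin/env bash

# Hypothetical helper: build the listing URL for one page with an explicit sort.
# Arguments: node id, folder id, page number, sort attribute (defaults to "id").
sorted_page_url() {
  local node="$1" folder="$2" page="$3" sort_by="${4:-id}"
  printf 'https://api.osf.io/v2/nodes/%s/files/osfstorage/%s/?sort=%s&page=%s\n' \
    "$node" "$folder" "$sort_by" "$page"
}

# Hypothetical helper: fetch one sorted page and print "name path" per file.
# Takes the same arguments as sorted_page_url and requires curl and jq.
fetch_sorted_page() {
  curl --silent --fail \
    -H "Authorization: Bearer $OSF_PAT" \
    "$(sorted_page_url "$@")" \
    | jq -r '.data[].attributes | "\(.name) \(.path)"'
}
```

Calling fetch_sorted_page ycn7z 6113d75ae3801305b39612a8 1 and then again with page 2 should list each file once, provided sorting by id behaves as described above.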

Johnetordoff avatar Oct 22 '22 22:10 Johnetordoff

Thanks for the quick response, @Johnetordoff! A couple follow-up questions for you:

  • I didn't even know about the sort param. Is it documented somewhere and I missed it?
  • What other attributes can we sort on?
  • Will sorting on any attribute solve the issue?

aaronwolen avatar Oct 23 '22 15:10 aaronwolen

Thanks @Johnetordoff - I have updated my code and got the appropriate output. Appreciate the workaround.

doomlab avatar Oct 23 '22 22:10 doomlab

@aaronwolen

I didn't even know about the sort param. Is it documented somewhere and I missed it?

It is not documented. Unfortunately, this param is not implemented consistently across all the endpoints it applies to. Some queries, legacy endpoints, and attributes haven't been QA'ed for accurate sorting, so they remain undocumented.

What other attributes can we sort on?

The default sorting behavior for list views is to allow the user to sort on any of the attributes returned in the JSON payload. For example, https://api.osf.io/v2/users/ allows you to sort on full_name, given_name, middle_names, family_name, suffix, date_registered, active, timezone, locale, social, employment, and education.

Will sorting on any attribute solve the issue?

I did not check. As I wrote above, this behavior is not guaranteed to be accurate or consistent.

Johnetordoff avatar Oct 24 '22 13:10 Johnetordoff