
rclone doesn't use the internal pCloud hash; this is why rclone dedupe takes so much time: it fetches SHA-1 hashes individually instead of using the already provided pCloud hash

Open masrlinu opened this issue 1 year ago • 4 comments

What is the problem you are having with rclone?

The command

rclone dedupe --dry-run --fast-list --by-hash -vv pcloud:

takes a very long time to find anything! I stopped it after an hour.

The reason is that it fetches the SHA-1 hashes individually. But it doesn't need to, because pCloud can deliver all folders and files recursively, with its own pCloud hash, in a single request. Here are the docs.
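For reference, the single-request approach looks roughly like this. This is a sketch against pCloud's public `listfolder` API: the endpoint, parameter names, and metadata fields are taken from those docs and should be double-checked, and `token` is assumed to be a valid auth token.

```python
import json
import urllib.parse
import urllib.request

def fetch_recursive_listing(token: str) -> dict:
    """One request for the whole folder tree (endpoint and parameter
    names assumed from pCloud's public listfolder documentation)."""
    query = urllib.parse.urlencode({"auth": token, "folderid": 0, "recursive": 1})
    with urllib.request.urlopen(f"https://api.pcloud.com/listfolder?{query}") as resp:
        return json.load(resp)["metadata"]

def walk(meta: dict, prefix: str = ""):
    """Yield (path, pcloud_hash) for every file in a listfolder metadata tree."""
    path = f"{prefix}/{meta['name']}".lstrip("/")
    if meta.get("isfolder"):
        for child in meta.get("contents", []):
            yield from walk(child, path)
    else:
        yield path, meta["hash"]
```

With `walk()`, every file's pCloud hash is available from that single response, with no per-file requests.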

So could you please use the pCloud hash for rclone dedupe? And please also return the pCloud hash if I list my files via

rclone lsjson pcloud:

Even better would be if you directly returned the JSON that you receive from pCloud here.

And even better would be if the pCloud hash were globally accepted, so that it would also be checked when I sync between two different pCloud accounts. This would reduce all the individual SHA-1 requests to one single request.

And if someone asked pCloud how this hash is computed, maybe it would even be possible to compare the pCloud hash against local files in one single request.

What is your rclone version (output from rclone version)

rclone v1.68.1

  • os/version: darwin 11.7.10 (64 bit)
  • os/kernel: 20.6.0 (x86_64)
  • os/type: darwin
  • os/arch: amd64
  • go/version: go1.23.1
  • go/linking: dynamic
  • go/tags: cmount

Which OS you are using and how many bits (e.g. Windows 7, 64 bit)

Mac OS X 11.7.10 - 64 Bit

Which cloud storage system are you using? (e.g. Google Drive)

pCloud

The command you were trying to run (e.g. rclone copy /tmp remote:tmp)

rclone dedupe --dry-run --fast-list --by-hash -vv pcloud:

A log from the command with the -vv flag (e.g. output from rclone -vv copy /tmp remote:tmp)

2024/10/15 01:22:36 DEBUG : rclone: Version "v1.68.1" starting with parameters ["rclone" "dedupe" "--dry-run" "--fast-list" "--by-hash" "-vv" "pcloud:"]
2024/10/15 01:22:36 DEBUG : Creating backend with remote "pcloud:"
2024/10/15 01:22:36 DEBUG : Using config file from "/Users/masr/.config/rclone/rclone.conf"
2024/10/15 01:22:36 INFO : pcloud root '': Looking for duplicate sha1 hashes using interactive mode.


masrlinu avatar Oct 14 '24 23:10 masrlinu

It is an interesting idea, but what pCloud calls a hash is a bit weak and looks rather like some form of CRC:

# pcloud hash
6306013028049022731

which is 0x5783757035D8EF0B - a value of only 64 bits

vs.

# md5
0x95517998cb172714fc9270b17bdf075c (128 bits)
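The width comparison is easy to verify; a quick sanity check in Python, using the values quoted above:

```python
# The pCloud "hash" fits in 64 bits; an MD5 digest is 128 bits.
pcloud_hash = 6306013028049022731
assert hex(pcloud_hash) == "0x5783757035d8ef0b"
assert pcloud_hash.bit_length() <= 64
assert (0x95517998CB172714FC9270B17BDF075C).bit_length() == 128
```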

It might be OK-ish for detecting file changes, but it is too weak for reliable dedupe as it is collision-prone. IMO it would require special logic to eliminate false positives: retrieving a proper checksum (MD5, SHA-1, ...) for any files that share the same pCloud hash. This would be very different logic from what rclone normally does with hashes.
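The special logic described above could be sketched as a two-stage pass: group by the weak hash first, then fetch a strong hash only inside candidate groups. This is a minimal illustration, not rclone code; `weak_hash` and `read_bytes` are hypothetical callbacks standing in for the pCloud hash lookup and the file download.

```python
import hashlib
from collections import defaultdict

def confirmed_duplicates(files, weak_hash, read_bytes):
    """Group files by a weak (e.g. pCloud) hash, then eliminate false
    positives by comparing SHA-1 only within each candidate group.

    files: iterable of names; weak_hash(name) -> int; read_bytes(name) -> bytes.
    Returns {sha1_hex: [names, ...]} for confirmed duplicate groups."""
    candidates = defaultdict(list)
    for name in files:
        candidates[weak_hash(name)].append(name)

    confirmed = defaultdict(list)
    for group in candidates.values():
        if len(group) < 2:        # unique weak hash: cannot be a duplicate
            continue
        for name in group:        # strong hash computed only for candidates
            sha1 = hashlib.sha1(read_bytes(name)).hexdigest()
            confirmed[sha1].append(name)
    return {h: names for h, names in confirmed.items() if len(names) > 1}
```

The expensive strong-hash step then runs only on the (presumably small) set of weak-hash collisions instead of on every file.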

kapitainsky avatar Oct 15 '24 05:10 kapitainsky

Yes, you are right for very big collections, but the birthday-problem formula says that with 1,000,000 files, the probability that two of them share the same 64-bit hash is only about 0.0000027105%, and a regular person has fewer than 1,000,000 files in their pCloud.
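The birthday bound is easy to check; for 1,000,000 files and a 64-bit hash it comes out to roughly 2.71e-8, i.e. about 0.0000027105% when expressed as a percentage:

```python
import math

def collision_probability(n_files: int, hash_bits: int) -> float:
    """Birthday-problem approximation: P ~= 1 - exp(-n(n-1) / 2^(bits+1))."""
    return 1.0 - math.exp(-n_files * (n_files - 1) / 2 ** (hash_bits + 1))

# ~= 2.71e-8 for a million files over a 64-bit hash space
p = collision_probability(1_000_000, 64)
```

This assumes the hash is uniformly distributed, which, as discussed below, is exactly what nobody can guarantee for the pCloud value.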

Could you please at least give the user the option to be able to use the pCloud hash? So that I can start an interactive rclone dedupe process?

masrlinu avatar Oct 15 '24 10:10 masrlinu

> Yes, you are right for very big collections, but the birthday-problem formula says that with 1,000,000 files, the probability that two of them share the same 64-bit hash is only about 0.0000027105%, and a regular person has fewer than 1,000,000 files in their pCloud.

I am afraid you have no idea in pCloud's case, as nobody knows what this "hash" is. Maybe it only returns 10 different values? :) Maybe its distribution is heavily skewed, etc.

IMO it would be very unwise to use some undocumented value for file deduplication. Nobody knows what it really is. Established hashes have well-understood and well-studied behaviour (so we lay people trust the experts here). Since you have no idea how this value is calculated, you can't say anything about whether it can be used for dedupe purposes or not. Far too often, home-brewed solutions have fatal flaws - e.g. returning the same hash for two files of different lengths that contain only zeros (one I encountered myself in software written by some village genius).

I think a good approach would be to reach out to pCloud for either an explanation or a suggestion about an API modification. It would help with many things if they could return a bulk directory listing including hashes.

kapitainsky avatar Oct 15 '24 11:10 kapitainsky

Before I knew about rclone, I wrote a program myself that showed me duplicates by simply comparing the pCloud hash. The recursive metadata of all files in my pCloud is 262 MB, and the hash worked great, showing me only the duplicates.
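Minus the fetching step, such a program reduces to one grouping pass over the metadata. A minimal sketch (the `"path"` and `"hash"` field names here are assumptions for illustration, not rclone's lsjson schema):

```python
from collections import defaultdict

def duplicate_groups(records):
    """Group file records by their pCloud hash and keep only the groups
    that actually contain more than one path."""
    by_hash = defaultdict(list)
    for rec in records:
        by_hash[rec["hash"]].append(rec["path"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```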

Would it be okay to at least return the pCloud hash when I run the following?

rclone lsjson --fast-list pcloud:

Or maybe return the JSON from pCloud as-is, without changing it? Then I could use it for my own programs, because at the moment it is not possible to create new apps in pCloud, so the JSON from rclone would be really helpful.

I understand that you want to be 100% on the safe side. But would it be an option to create a pCloud-specific option like

rclone dedupe -i --fast-list --by-hash --use-pcloud-hash pcloud:

at everyone's own risk?

masrlinu avatar Oct 16 '24 02:10 masrlinu

It would be possible to use the pCloud hash, but we couldn't support it in the local backend, which would make some bits of rclone really awkward!

Here is an experiment - it implements a pcloud hash.

You can use it like this, and it will do a single recursive list to print the hashes:

rclone hashsum pcloud TestPcloud:

To be useful for dedupe, rclone dedupe needs a flag to choose the hash type, which I can add if the above works for you.

To answer @kapitainsky's concerns: the pcloud backend supports MD5, SHA1 or SHA256 in various combinations, and these will be preferred over the pcloud hash, so it would take deliberate user action to use the pcloud hash.

(If you run rclone hashsum, it shows you the hashes in preference order.)

v1.69.0-beta.8377.3f8619a15.fix-8133-pcloud-hash on branch fix-8133-pcloud-hash (uploaded in 15-30 mins)

ncw avatar Oct 22 '24 20:10 ncw

Sorry that I'm only answering now; I just saw your post. Yes, the checksums work for me, thank you! I would be happy if you could make it work with dedupe :-)

masrlinu avatar Oct 30 '24 19:10 masrlinu