Unable to compare existing files downloaded via other methods. As a result, onedriveClient is downloading/uploading everything again

Open modelmat opened this issue 7 years ago • 14 comments

If I use a directory that had previously been synced, this program uploads each file again as file {desktop}, which wastes my bandwidth.

Can the file hashes be compared first?

modelmat avatar Sep 30 '18 10:09 modelmat

What do you mean when you say "a directory is used"? Are the files going to change the timestamp? Is there any file that has its hash changed?

derrix060 avatar Oct 01 '18 07:10 derrix060

I.e., if I have previously copied my OneDrive directory (using another sync tool, for example), then syncing re-uploads every single file with a (desktop) suffix added to the end. I suppose this would be fixed by #17, though.

What I mean is: don't re-upload files with the (desktop) suffix if the file on the cloud has the same hash; if the local one is newer, overwrite the cloud copy; if the cloud one is newer, download it.
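The rule requested here can be sketched roughly as follows. This is only an illustration of the proposal, not the client's actual code: `Action`, `FileState`, and `decide` are hypothetical names, and the choice to download when the hashes differ but the timestamps are equal is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Action(Enum):
    SKIP = "skip"          # contents identical, nothing to transfer
    UPLOAD = "upload"      # local copy is newer, overwrite the cloud copy
    DOWNLOAD = "download"  # cloud copy is newer, fetch it


@dataclass(frozen=True)
class FileState:
    """Hypothetical snapshot of one side of the comparison."""
    sha1: str           # hex digest of the file contents
    modified: datetime  # timezone-aware last-modified time


def decide(local: FileState, remote: FileState) -> Action:
    # Same hash means same content, regardless of timestamps.
    if local.sha1.lower() == remote.sha1.lower():
        return Action.SKIP
    # Contents differ: the newer side wins.
    if local.modified > remote.modified:
        return Action.UPLOAD
    # Assumption: on equal timestamps we prefer the cloud copy.
    return Action.DOWNLOAD


if __name__ == "__main__":
    older = datetime(2018, 10, 1, 7, 0, tzinfo=timezone.utc)
    newer = datetime(2018, 10, 1, 8, 0, tzinfo=timezone.utc)
    print(decide(FileState("aaa", newer), FileState("bbb", older)).value)
```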

modelmat avatar Oct 01 '18 07:10 modelmat

The way it currently decides whether a file is different is to look first at the path (and the parent repository) plus the filename. If that matches the cloud, it then checks the timestamp and hash.

I'm changing this behaviour a little in #17...

Let me see if I understood what you are saying:

  • You set onedrive to sync with an empty directory
  • You downloaded the files from OneDrive using another tool or manually
  • You copied these files to the empty directory that you had set to sync

Am I right?

derrix060 avatar Oct 01 '18 08:10 derrix060

Yes.

modelmat avatar Oct 01 '18 08:10 modelmat

What I find weird in this case is that, when you set it to sync, it should download everything again...

I remember having the same issue when I first started looking at this project; what I did was give up and let onedrive download everything...

Can you make sure that the files are in the same structure and that the framework is uploading the file itself, not only the timestamp?

derrix060 avatar Oct 01 '18 08:10 derrix060

I actually tried it again and it seemed not to be, but I have just decided to re-download everything from scratch (deleted with rm :P), so I can't test until it syncs again.

modelmat avatar Oct 01 '18 08:10 modelmat

While investigating #21, I found out why the framework was uploading duplicates.

There are a couple of issues. I will try to explain the steps used to check whether an item is the same:

Check if the item exists locally:

  • id should match
  • c_tag or (size and timestamp) should match

Check if the item has changed:

  • size should match
  • timestamp or hash should match.
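Read literally, the two checks above could look something like this. It is a minimal sketch under assumptions: `ItemRecord` and its field names are hypothetical stand-ins for whatever the client stores, and "has changed" is expressed through its negation, `is_unchanged`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ItemRecord:
    """Hypothetical metadata for one item (local record or cloud item)."""
    id: str
    c_tag: str
    size: int
    modified: datetime
    sha1: Optional[str]  # None while the hash is not available yet


def is_same_item(local: ItemRecord, remote: ItemRecord) -> bool:
    # "Exists locally": id matches, and c_tag or (size and timestamp) matches.
    if local.id != remote.id:
        return False
    same_tag = local.c_tag == remote.c_tag
    same_size_and_time = (
        local.size == remote.size and local.modified == remote.modified
    )
    return same_tag or same_size_and_time


def is_unchanged(local: ItemRecord, remote: ItemRecord) -> bool:
    # "Not changed": size matches, and timestamp or hash matches.
    if local.size != remote.size:
        return False
    same_time = local.modified == remote.modified
    same_hash = local.sha1 is not None and local.sha1 == remote.sha1
    return same_time or same_hash
```

Because every branch relies on size, timestamp, or hash, any one of those being unreliable is enough to make the checks fail.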

Issues:

  • The size reported by OneDrive is not trustworthy, see https://github.com/OneDrive/onedrive-sdk-python/issues/88.
  • OneDrive takes some time to calculate the hash, see https://docs.microsoft.com/en-us/onedrive/developer/rest-api/resources/driveitem?view=odsp-graph-online and https://docs.microsoft.com/en-us/onedrive/developer/rest-api/resources/file?view=odsp-graph-online.
  • The timestamp can be slightly different (by some seconds), so should this be considered the same? (I still need to investigate why the time is different, but I presume it is the time between adding the task to the pool and it being consumed, or the time taken to upload/download a file. If this is the case, there is a MAJOR problem.)

One possible approach is to download the file, calculate the hash and see if it matches, or (as it is now) to upload the file with a different name. I will think more about how to know whether the file is the same, and figure out the best way.

derrix060 avatar Oct 02 '18 10:10 derrix060

I assume you meant #22 not 21.

Especially for this issue, where all the files would otherwise be downloaded or uploaded as dupes: as long as the times are pretty close, the files can be assumed to be the same. If there is a substantial difference, maybe the file should be uploaded (though this should definitely be offered to the user as an option).

modelmat avatar Oct 02 '18 11:10 modelmat

No I mean 21 haha. I was debugging that error and found this...

Usually, the download speed is higher than upload, so I will download the file (hope that the file is not large...) and compare the hash. If the hashes are different, I will keep both locally and on the cloud, letting the user decide which one is up-to-date.
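The comparison step of this plan could be sketched as below. The download itself is left out because it depends on the API client; `sha1_of_file` and `same_content` are hypothetical helpers, and SHA-1 is just an assumption here, since any algorithm works when both copies are hashed locally.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files are never fully in memory."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(local_path: Path, downloaded_path: Path) -> bool:
    """Compare the local file with a freshly downloaded cloud copy."""
    # Sizes measured on disk are reliable, so use them as a cheap early exit.
    if local_path.stat().st_size != downloaded_path.stat().st_size:
        return False
    return sha1_of_file(local_path) == sha1_of_file(downloaded_path)
```

If `same_content` returns `False`, the plan described above is to keep both versions and let the user decide.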

derrix060 avatar Oct 03 '18 06:10 derrix060

@derrix060 You will also run into this issue: https://github.com/OneDrive/onedrive-api-docs/issues/935

The timestamp can be slightly different (some seconds)

Baseline all timestamps (local and OneDrive) to drop fractional seconds. HH:MM:SS is what should be compared, otherwise timestamps will always be an issue.
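A minimal sketch of that baselining, assuming both timestamps are available as Python `datetime` objects. The helper names are hypothetical, and treating naive values as UTC is an assumption.

```python
from datetime import datetime, timezone


def to_whole_seconds(ts: datetime) -> datetime:
    """Normalise to UTC and drop the fractional-second part."""
    if ts.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        ts = ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc).replace(microsecond=0)


def same_second(local: datetime, remote: datetime) -> bool:
    """Compare two timestamps at HH:MM:SS resolution."""
    return to_whole_seconds(local) == to_whole_seconds(remote)
```

This only removes sub-second noise; timestamps that differ by whole seconds still compare as different.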

abraunegg avatar Dec 03 '18 20:12 abraunegg

@abraunegg thanks for the information! I'm planning to do a very bad workaround: download the file, calculate the checksum manually, and see if it matches...

BTW, you have a nice project, congrats!!

derrix060 avatar Dec 05 '18 00:12 derrix060

Maybe it should only download if the timestamp is within 5 minutes or so? This allows for timestamps to be slightly off on OneDrive's end, and even on the client's end due to clock drift.

modelmat avatar Dec 05 '18 04:12 modelmat

Why 5 minutes? It's possible to change a file on the remote and then, within 5 minutes, change it locally as well... It would cover some cases, but not all...

derrix060 avatar Dec 05 '18 05:12 derrix060

I was thinking that 5 minutes would be a reasonable time. Even 10 seconds or so would probably be enough. What I am trying to say is that there is no point downloading if the timestamps are, say, 2 years apart.
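The idea could be sketched as a simple gate in front of the download-and-hash check. `worth_verifying` is a hypothetical name, and the 10-second default is only the figure floated in this comment, not a tested value.

```python
from datetime import datetime, timedelta


def worth_verifying(
    local_mtime: datetime,
    remote_mtime: datetime,
    tolerance: timedelta = timedelta(seconds=10),
) -> bool:
    """True when the timestamps are close enough that the files may be identical.

    Only then is it worth downloading the cloud copy to compare contents;
    a gap of years means the files can be treated as different right away.
    """
    return abs(local_mtime - remote_mtime) <= tolerance
```

As noted earlier in the thread, a window like this cannot tell a clock offset apart from two genuine edits made within the tolerance, so it only decides whether to verify, not whether the files match.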

modelmat avatar Dec 05 '18 05:12 modelmat