Option for only checking modified date and size with jottacloudclientscanner

Open ariselseng opened this issue 9 years ago • 13 comments

I want to back up a lot of data to Jotta with jottacloudclientscanner. Using md5 to check every file each time takes ages to finish and puts a lot of strain on my drives. How hard would it be to make it checksum a file only if its size or modification time has changed? Rsync-style.

ariselseng avatar Sep 11 '15 09:09 ariselseng

hey @cowai! Nice to see you.

We could do that -- although it would have to be an optional thing, and it would come with a big warning.

That's because metadata like size and time are signals, but I consider a checksum more like truth.

There's one thing I would like to hear your opinion on, though. I recently added the possibility to store checksums in the file system, so we don't have to do all the checksum calculations over and over again.

What do you think about that?

havardgulldahl avatar Sep 12 '15 07:09 havardgulldahl

Hi. This would also work and have the same effect. But it would still be "signals". I mean, how do you determine when to recalculate a checksum without relying on metadata? Storing checksums locally is fine by me; it would double the local I/O accesses, but that is much better than checksumming each time :)

Related question: Does jottalib store the file's own timestamp at Jottacloud, or does it just create a new date at upload?

Btw, thank you so much for your work!

ariselseng avatar Sep 12 '15 08:09 ariselseng

Yeah, you're right, of course. The only way to be certain is to calculate the checksum every time. So we try to be as close to certain as possible.

What I'm thinking is that by storing the checksum, the last modified time and the file size together, we can trust the cached checksum as long as the other two still match the file. If the file size or the last modified time has changed, we recalculate.

What do you think?

Regarding your other question: I guess we'll be able to store mtime at jottacloud, but I haven't tried it. Currently, we're storing the time of upload. Look at JFS.post() for details.
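
The caching rule proposed above (trust the stored checksum while size and mtime still match, otherwise recalculate) could be sketched roughly like this. It is only an illustration, not jottalib's actual code: `CachedChecksum`, `md5_of_file` and `get_md5` are hypothetical names, and the cache is a plain in-memory dict rather than anything persisted.

```python
import hashlib
import os
from typing import Dict, NamedTuple


class CachedChecksum(NamedTuple):
    """Hypothetical cache record: a checksum plus the metadata it was computed for."""
    md5: str
    size: int
    mtime_ns: int


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Read the file in chunks and return its hex MD5 digest."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def get_md5(path: str, cache: Dict[str, CachedChecksum]) -> str:
    """Return the cached checksum if size and mtime are unchanged, else recompute."""
    st = os.stat(path)
    entry = cache.get(path)
    if entry is not None and entry.size == st.st_size and entry.mtime_ns == st.st_mtime_ns:
        return entry.md5  # metadata unchanged: trust the stored checksum
    md5 = md5_of_file(path)  # metadata changed or no entry yet: recalculate
    cache[path] = CachedChecksum(md5, st.st_size, st.st_mtime_ns)
    return md5
```

Note that this still relies on metadata as the signal for when to recalculate, which is exactly the trade-off discussed in this thread.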

havardgulldahl avatar Sep 12 '15 16:09 havardgulldahl

I guess it would work well. I am not convinced it is necessary, though. If rsync does not do it, then it cannot be that bad :P But of course it won't hurt either :) I have a question about that, though: Is there ever a situation where a file changes and the mtime does not change? I can only think of one case, and that is bitrot, and in that case I don't want my broken file to be reuploaded to jotta :P

Just remember that it will be a huge cache for some. We need to store the path, the size and the mtime. It will take around 100-200 bytes per file. In my case that is several hundred megabytes just for the cache (which in the end is not that much more secure). As I said earlier, it will also double the disk lookups (depending on whether you store it as simple JSON or in some sort of db).

If you go with that, I would also save a list of the state of jotta (as an option), so that we don't need to look things up remotely to know what to do, but can just push the changes. In my use case I will never ever use the web interface or any other platform.

What do you think?

ariselseng avatar Sep 12 '15 17:09 ariselseng

Yeah, I don't disagree with you. So I'm happy to add the option --no-checksum or something like that.

But we'll keep md5 checksumming as the default, because

  1. We keep feature parity with the official client (and we're exactly as safe as them)
  2. This is backups we're talking about, so we want to err on the side of caution
  3. The implementation would be dependent on a feature (server side mtime) that we don't know much about
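
As a rough idea of how such an opt-out could be exposed while keeping checksumming as the default, here is a minimal argparse sketch. The flag name is only the tentative `--no-checksum` from this comment, and `build_parser` and the `topdir` argument are hypothetical, not the scanner's real command line.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI sketch: md5 checking stays on unless explicitly disabled."""
    parser = argparse.ArgumentParser(description="Sketch of a scanner command line")
    parser.add_argument("topdir", help="local directory to scan")
    parser.add_argument(
        "--no-checksum",
        dest="checksum",
        action="store_false",  # the default for args.checksum is therefore True
        help="WARNING: compare only size and mtime instead of md5 checksums",
    )
    return parser
```

With this shape, `build_parser().parse_args(["/data"]).checksum` is `True`, and passing `--no-checksum` flips it to `False`.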

havardgulldahl avatar Sep 12 '15 19:09 havardgulldahl

Regarding local md5 cache.

I'm not particularly interested in maintaining a central cache, be it sqlite or a structured, flat json file. Caching is hard, and keeping that cache in sync sounds like a quick way to get in a bad mood.

But take a look at db30d406a04a3689f482e8a2a5e72a2fb889a32a. It's a way of keeping the calculated checksum along with the file itself, using xattr. No central cache. Just some bytes added in the file system, attached to the file.

Of course, you need a file system that supports this. So, it's not for everyone.

I'd appreciate it if you tried it out and let me know your thoughts!

havardgulldahl avatar Sep 12 '15 19:09 havardgulldahl

About md5 checksumming being always on: I didn't know the official client did this every time. Makes sense to make it do the same thing.

xattr seems like a good idea now that I actually know how it works! :)

ariselseng avatar Sep 12 '15 21:09 ariselseng

Well, I don't think they recalculate the checksum every time. They keep a sqlite db around where they store a lot of metadata:

CREATE TABLE jwt_fl (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, jwc_name, jwc_path, jwc_hash, jwc_phash, jwc_chksum, jwc_size, jwc_created, jwc_modified, jwc_mp, jwc_revision, jwc_lastchecked, jwc_err, jwc_nextupload, jwc_parentfolder, jwc_folderid);

So I reckon they keep using the cached checksum as long as the file size and date still match. But they always compare checksums with the online copy to see if they need to replace it with the local file.

havardgulldahl avatar Sep 12 '15 21:09 havardgulldahl

If we are going to implement this option, we need to save the date in xattr too, right? So that we can check whether the size and date in xattr match the file's actual size and modified time?
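
One way to picture this is a sketch that stores the checksum together with size and mtime in a single extended attribute and only trusts it while both still match. This is an assumption-laden illustration, not what the commit above does: the attribute name `user.example.md5cache`, the JSON layout and all function names are hypothetical, and `os.getxattr`/`os.setxattr` are Linux-only, so the sketch quietly falls back to plain hashing elsewhere.

```python
import hashlib
import json
import os
from typing import Optional

# Hypothetical attribute name; whatever jottalib actually uses may differ.
XATTR_NAME = "user.example.md5cache"


def encode_record(md5: str, size: int, mtime_ns: int) -> bytes:
    """Serialize checksum, size and mtime into one attribute value."""
    return json.dumps({"md5": md5, "size": size, "mtime_ns": mtime_ns}).encode("utf-8")


def cached_md5_if_fresh(raw: bytes, size: int, mtime_ns: int) -> Optional[str]:
    """Return the stored checksum only if the stored size and mtime still match."""
    try:
        record = json.loads(raw.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return None
    if not isinstance(record, dict):
        return None
    if record.get("size") == size and record.get("mtime_ns") == mtime_ns:
        return record.get("md5")
    return None


def md5_with_xattr_cache(path: str) -> str:
    """Use the xattr-cached checksum when fresh; otherwise hash and re-store."""
    st = os.stat(path)
    try:
        raw = os.getxattr(path, XATTR_NAME)
        cached = cached_md5_if_fresh(raw, st.st_size, st.st_mtime_ns)
        if cached is not None:
            return cached
    except (OSError, AttributeError):
        pass  # attribute missing, or xattrs unsupported on this platform/file system
    with open(path, "rb") as handle:
        md5 = hashlib.md5(handle.read()).hexdigest()  # whole-file read; fine for a sketch
    try:
        os.setxattr(path, XATTR_NAME, encode_record(md5, st.st_size, st.st_mtime_ns))
    except (OSError, AttributeError):
        pass  # caching is best effort
    return md5
```

Writing an extended attribute updates the file's ctime but not its mtime, so storing the record does not invalidate it.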

ariselseng avatar Sep 13 '15 13:09 ariselseng

fixed in https://github.com/havardgulldahl/jottalib/commit/64fdf1e480e85eb9c2d56f38df0d8232da7ce87d

havardgulldahl avatar Jan 22 '16 15:01 havardgulldahl

@havardgulldahl So now it can check only by mtime/size?

ariselseng avatar Jan 22 '16 17:01 ariselseng

Hmm I might have been a bit too eager here. ;)

We still have to patch jottacloud.replace_if_changed to only look at mtime if the right argument is passed.

Thanks for paying attention :)

havardgulldahl avatar Jan 22 '16 18:01 havardgulldahl

@havardgulldahl I will see if I can add that option to only check size and mtime, like the default rsync behaviour. I think that will be a lot faster with thousands of files than looking up xattr for each file. I want to back up 10 TB with ~2 million files without it taking hours each time. Every millisecond counts in my case :)
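
A minimal sketch of what such a size-and-mtime-only pass might look like, in the spirit of rsync's quick check. Everything here is an assumption for illustration: `RemoteIndex`, `iter_files` and `files_to_upload` are hypothetical names, and the remote sizes and mtimes are assumed to be available already (for example from a locally saved state list), which also presumes the original mtime is stored somewhere rather than only the upload time.

```python
import os
from typing import Dict, Iterator, Tuple

# Hypothetical view of the remote side:
# path relative to the backup root -> (size in bytes, mtime in whole seconds)
RemoteIndex = Dict[str, Tuple[int, int]]


def iter_files(root: str) -> Iterator[os.DirEntry]:
    """Yield regular files below root; os.scandir keeps per-file overhead low."""
    stack = [root]
    while stack:
        with os.scandir(stack.pop()) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    yield entry


def files_to_upload(root: str, remote: RemoteIndex) -> Iterator[str]:
    """Yield relative paths whose size or mtime differs from the remote record."""
    for entry in iter_files(root):
        st = entry.stat(follow_symlinks=False)
        rel = os.path.relpath(entry.path, root)
        # New files have no remote record, so they are always yielded.
        if remote.get(rel) != (st.st_size, int(st.st_mtime)):
            yield rel
```

No file content is read at all, so the cost per file is one directory entry and one stat call. The price is the one discussed earlier: a change that leaves both size and mtime untouched goes unnoticed.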

ariselseng avatar Jan 23 '16 11:01 ariselseng