Maestral is indexing remote folders against my will

Open raffaem opened this issue 3 years ago • 3 comments

Sam,

I reinstalled Maestral and asked it to selective sync only a small folder.

I thought it would need only a short time to finish. It doesn't.

Apparently it is indexing all the remote folders, even though it correctly does not download them, since I didn't ask it to.

The logs are as follows (level=DEBUG):

2021-05-22 12:56:42 maestral.sync DEBUG: Converted remote changes to SyncEvents
2021-05-22 12:56:42 maestral.sync INFO: Indexing 43879...
2021-05-22 12:56:42 maestral.sync DEBUG: Remote cursor saved: [LONG STRING HERE]
2021-05-22 12:56:43 maestral.sync DEBUG: Listed remote changes:
[<FolderMetadata(path_display=[A FOLDER I DIDN'T ASK TO SYNC])>,
 <FileMetadata(path_display=[A FILE I DIDN'T ASK TO SYNC])>,

raffaem avatar May 22 '21 10:05 raffaem

Uh, yes, I am aware of that, but there is currently no easy solution. There are basically two options for indexing the remote Dropbox:

  1. Hierarchically: List all parents first and their children next. This allows skipping excluded folders but requires multiple calls to the files/list_folder endpoint (one for the parent folder and then one for each of its children). Maestral originally did this.
  2. Recursively: This lets us get away with a single recursive call to the files/list_folder endpoint. It has two advantages. First, it is a lot easier to resume a previously interrupted indexing session (connection problems, shutdown during indexing, etc.). We don't have to remember where we were interrupted but can just pass the "cursor" from the last batch of synced files, and Dropbox handles the pagination for us. Second, resuming an interrupted indexing job and fetching remote changes share the same code flow, resulting in a smaller code base.
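
To make the trade-off between the two options concrete, here is a minimal sketch on a toy in-memory tree. Everything in it is an assumption for illustration: REMOTE, list_folder, index_hierarchically and index_recursively are hypothetical stand-ins, not Maestral's code or the Dropbox SDK. It only shows that the hierarchical walk needs one listing call per visited folder but never sees excluded content, while the recursive listing receives every entry and has to filter afterwards.

```python
from collections import deque

# Hypothetical stand-in for a remote Dropbox: folder path -> direct children.
# Paths ending in "/" are folders, everything else is a file.
REMOTE = {
    "/": ["/docs/", "/photos/", "/notes.txt"],
    "/docs/": ["/docs/a.txt", "/docs/drafts/"],
    "/docs/drafts/": ["/docs/drafts/b.txt"],
    "/photos/": ["/photos/1.jpg", "/photos/2.jpg"],
}


def list_folder(path):
    """One non-recursive listing call (stand-in for files/list_folder)."""
    return REMOTE[path]


def index_hierarchically(excluded):
    """List parents first, then children, never descending into excluded folders.

    Returns the indexed paths and the number of listing calls made.
    """
    calls = 0
    indexed = []
    queue = deque(["/"])
    while queue:
        folder = queue.popleft()
        calls += 1
        for child in list_folder(folder):
            if child in excluded:
                continue  # skipped: no call is ever made for this subtree
            indexed.append(child)
            if child.endswith("/"):
                queue.append(child)
    return indexed, calls


def index_recursively(excluded):
    """Simulate one recursive listing: every entry arrives, then gets filtered.

    Returns the kept paths and the number of entries received.
    """
    everything = [c for children in REMOTE.values() for c in children]
    kept = [p for p in everything if not any(p.startswith(e) for e in excluded)]
    return kept, len(everything)
```

With "/photos/" excluded, both functions end up with the same set of paths. The hierarchical walk gets there with three listing calls, the recursive one by receiving all eight entries and discarding three.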

The changes were made in https://github.com/SamSchott/maestral/pull/296. In principle, it would be possible to have the best of both worlds. This would require manually tracking which folders we have already visited during indexing and I've been too lazy to do this...

samschott avatar May 22 '21 15:05 samschott

But why do we need to index the remote Dropbox at all? It's mostly excluded folders.

raffaem avatar May 22 '21 20:05 raffaem

Well, I'm not saying that we need to index the entire Dropbox. I'm just saying that it's easier to implement. Imagine a folder structure like this:

/
 |
 |
[-] Folder 1
 |    |
 |   [x] Folder 1.1
 |   [ ] Folder 1.2
 | 
[x] Folder 2
[ ] Folder 3

The Dropbox API has a single files/list_folder endpoint with a recursive option. At the moment, we call it once with recursive=True for the root folder. When this call gets interrupted, we can easily resume from the last point with a "cursor" that Dropbox gives us as we progressively process more of the indexing results.
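
A rough sketch of why the cursor makes resuming cheap. FakeRemote and index are hypothetical names made up for this example, and the cursor here is just a stringified offset, whereas the real Dropbox cursor is an opaque string. The point is only that the client has to persist a single value between runs.

```python
class FakeRemote:
    """Hypothetical stand-in for the server side of a recursive listing."""

    def __init__(self, entries, page_size=2):
        self._entries = entries
        self._page_size = page_size

    def list_folder(self):
        """Start a listing. Returns (entries, cursor, has_more)."""
        return self._page(0)

    def list_folder_continue(self, cursor):
        """Continue a listing from a cursor handed out earlier."""
        return self._page(int(cursor))

    def _page(self, start):
        end = min(start + self._page_size, len(self._entries))
        return self._entries[start:end], str(end), end < len(self._entries)


def index(remote, state, max_pages=None):
    """Process listing pages, saving the cursor after each one.

    `state` is a dict that a real client would persist to disk.
    `max_pages` simulates an interruption after that many pages.
    Returns True if more pages remain, False if indexing finished.
    """
    if state.get("cursor") is None:
        entries, cursor, has_more = remote.list_folder()
    else:
        # Resuming takes the same path as an ordinary "continue" call.
        entries, cursor, has_more = remote.list_folder_continue(state["cursor"])
    pages = 0
    while True:
        state.setdefault("indexed", []).extend(entries)
        state["cursor"] = cursor  # the only resume state we need
        pages += 1
        if not has_more or (max_pages is not None and pages >= max_pages):
            return has_more
        entries, cursor, has_more = remote.list_folder_continue(cursor)
```

Calling index(remote, state, max_pages=1) and later index(remote, state) with the same state dict picks up exactly where the first call stopped, without re-listing anything.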

If we never want to index the content of excluded folders, we have to call files/list_folder with recursive=True for each of the fully included folders (Folder 1.1 and Folder 2 in this example) and with recursive=False for the partially included folders (Folder 1). If we are interrupted during indexing, we need to remember both which folders we have already indexed completely and how far we got with processing the results of the files/list_folder call that was interrupted. It's more state to track. I'm not saying that it's difficult to do, just tedious and possibly error-prone if a detail is missed.
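
For illustration, the set of calls for the example tree could be planned roughly like this. This is a sketch under assumptions, not how Maestral does it: it reads [x] as fully included, [-] as partially included and [ ] as excluded, treats the root as partially included, and Folder, plan_calls and remaining_calls are hypothetical names.

```python
from dataclasses import dataclass, field

# Assumed reading of the markers in the diagram.
FULL, PARTIAL, EXCLUDED = "x", "-", " "


@dataclass
class Folder:
    path: str
    state: str
    children: list = field(default_factory=list)


# The example tree from above, with the root treated as partially included.
TREE = Folder("/", PARTIAL, [
    Folder("/Folder 1", PARTIAL, [
        Folder("/Folder 1/Folder 1.1", FULL),
        Folder("/Folder 1/Folder 1.2", EXCLUDED),
    ]),
    Folder("/Folder 2", FULL),
    Folder("/Folder 3", EXCLUDED),
])


def plan_calls(folder):
    """Return the (path, recursive) listing calls needed for this subtree."""
    if folder.state == EXCLUDED:
        return []  # never listed at all
    if folder.state == FULL:
        return [(folder.path, True)]  # one recursive call covers everything below
    # Partially included: list direct children only, then plan each child.
    calls = [(folder.path, False)]
    for child in folder.children:
        calls.extend(plan_calls(child))
    return calls


def remaining_calls(plan, completed):
    """Calls still to be made after an interruption, given the finished ones."""
    return [call for call in plan if call not in completed]
```

On top of the list of completed calls, a resumable version would also have to store the cursor of whichever call was in flight when the interruption happened. That is the extra bookkeeping compared to keeping a single cursor.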

samschott avatar May 23 '21 13:05 samschott