xmpp-http-upload Add ability to maintain quotas and file logging

The README diff contains more info on the changes made.

I'm planning to update Prosody's mod_http_upload_external to prepend the hashed/salted JID to the path in the PUT URLs, so that admins can identify the files uploaded for a particular JID.

This can help with GDPR compliance (e.g. removal of a particular user's data upon request) and also makes per-user quotas possible.

If the hashed/salted JID is not in the URL, then the quota is enforced globally across all users.

Since files get removed to enforce the quotas, I figured it would be good to log that to a file.

Jan 13 '19 12:01 jcbrand

Also, unrelated to the review, please ensure that you talk to @mwild1, because I think he has some plans for a v3 of the upload protocol.

Jan 13 '19 13:01 horazont

Hi @horazont

Thank you for taking the time to look into the PR and for providing feedback.

I’m not entirely sure what the aim is. Given the lack of code for user-specific things, I assume that this is about a global quota for now.

That depends on the directory structure within which the files are stored. It can be global or per user.

I want to use it as a per-user quota. I've made changes to mod_http_upload_external so that a user's files are stored in a top-level directory that is a salted hash of their JID. By doing so I can apply the quota per user and I can also easily identify the files uploaded per JID (but only because I know the salt value) and I can remove them as per a GDPR request.

The global quota should not be managed by the XMPP server, but via configuration.

I first had it as local configuration, but I eventually want to have more fine-grained quotas. For example, I'd like to allow friends and family or people who pay a subscription to have larger quotas than complete strangers.

I think this information will more likely need to be managed and stored by the XMPP server rather than the file upload service.

The global quota is a delicate issue in and by itself: it allows a single user with lots of bandwidth to cause lots of churn on the entire thing and make files of other users be deleted.

Yes, so it comes with trade-offs and people need to decide whether they want to use a global quota or not. My intention is to implement a per-user quota.

The quota implementation as it stands is not at all multitasking safe (while the upload implementation is). This is a hard requirement, because the duration of an upload is determined by the speed of the client, and thus xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing); this needs to be solved on the file-system level.

This is interesting and not something I considered. I'm not sure however what you mean by "this needs to be solved on the file-system level".

This makes me think of the ZODB Python object database.

It's transactional (atomic) and allows rollbacks. You can then use ZEO to allow multiple processes to access a single database and then put a load-balancer in front of the different processes.

Something like that would pretty much solve the concurrency and thread-safety issue, but it's a big departure from the current design and I wouldn't be surprised if you're not interested in going in such a direction.

The quota implementation is O(n) in the number of uploaded files.

Yes, or rather the number of files per user for the case I have in mind.

In general, I think that quota data needs to be stored to reduce the complexity of the quota check (from O(n) to O(1)).

Yes, you're right in pointing out that looping over all files to calculate storage usage is not something that'll scale well. I like the idea of using a file to store the quota.
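To make the O(n) cost concrete, here is a minimal sketch of a usage check that walks a directory tree and sums file sizes. It is not the code from the PR; the function name `directory_usage` is hypothetical, and it only assumes that every file below the given root counts towards the quota.

```python
import os


def directory_usage(root: str) -> int:
    """Return the total size in bytes of all files below root.

    Every call visits every file, so the cost grows linearly with the
    number of uploads stored below root.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

With a per-user top-level directory, `root` would be that user's directory, so n is the number of files of one user rather than of all users.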

Jan 13 '19 23:01 jcbrand

Alright, now I get the big picture here. To summarise, so that we’re on the same page: You modified (or plan to modify) mod_http_upload_external to send the quota (as a query argument) and to prefix the URL with an HMAC of the JID and a secret key. Thus, the directory structure ends up being something like this:

/<hmac(jid, key)>/<random nonce>/<filename>.{data,meta}

This allows you to only look at the files in <hmac(jid, key)> to enforce a user's quota. So at least some of the code makes more sense to me now. Also, the decision to let the XMPP server hand over the quota via GET makes sense now.
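The layout above could be derived roughly as follows. This is a sketch under assumptions, not the actual mod_http_upload_external change: the use of SHA-256 as the HMAC digest and the function names `user_prefix` and `upload_path` are illustrative choices.

```python
import hashlib
import hmac
from pathlib import PurePosixPath


def user_prefix(jid: str, key: bytes) -> str:
    """Return a hex HMAC of the JID; only someone who knows key can map a JID to it."""
    # SHA-256 is an assumption here; the thread does not name a digest.
    return hmac.new(key, jid.encode("utf-8"), hashlib.sha256).hexdigest()


def upload_path(jid: str, key: bytes, nonce: str, filename: str) -> PurePosixPath:
    """Build /<hmac(jid, key)>/<random nonce>/<filename> for one upload."""
    return PurePosixPath("/") / user_prefix(jid, key) / nonce / filename
```

Because the prefix is deterministic for a given JID and key, an admin who holds the key can recompute it to locate all of a user's files, while the URL itself does not reveal the JID.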

Still, there are unsolved issues. Continuing the discussion.

The quota implementation as it stands is not at all multitasking safe (while the upload implementation is). This is a hard requirement, because the duration of an upload is determined by the speed of the client, and thus xhu must be able to run in multiple threads, coroutine tasks (possibly in the future) and/or processes concurrently. Simply decorating the thing with a few locks won’t do (due to multiprocessing); this needs to be solved on the file-system level.

This is interesting and not something I considered. I'm not sure however what you mean by "this needs to be solved on the file-system level".

The file system is, in the end, the shared resource that the tasks (be it threads, processes or coroutines) operate on. Thus, the synchronisation is best handled by the very same thing. For this, file-system level locks, such as a lock based on flock, would be suitable (but see my remarks about using flock and multithreading).
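As an illustration of what a flock-based lock might look like, here is a minimal sketch using Python's `fcntl` module (Unix only). The helper name `exclusive_lock` and the idea of a dedicated lock file are assumptions made for the example; flock locks are advisory, so this only helps if every task that touches the quota state takes the same lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path: str):
    """Hold an exclusive advisory flock on lock_path for the duration of the block.

    lock_path is a hypothetical dedicated lock file; it is created if missing.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Blocks until no other open file description holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A quota check and the following file removal or upload bookkeeping would then run inside `with exclusive_lock(path):`, so that two processes cannot interleave them.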

This makes me think of the ZODB Python object database.

It's transactional (atomic) and allows rollbacks. You can then use ZEO to allow multiple processes to access a single database and then put a load-balancer in front of the different processes.

Something like that would pretty much solve the concurrency and thread-safety issue, but it's a big departure from the current design and I wouldn't be surprised if you're not interested in going in such a direction.

Yes, it’s a huge deviation from the current design, which is extremely slim. This will be a nasty trade-off: either we use OS-specific tools (such as flock, which is BSD-specific but also available on Linux) or a high-level tool such as SQLite or ZODB. I’d prefer SQLite, because I’m more familiar with it; ideally, the transactions are all rather short, so we don’t have to worry about the lack of concurrency in SQLite. After all, serialisation is what we need on some level.
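For comparison, a short-transaction SQLite approach could look something like the sketch below. The table layout and the names `open_db` and `reserve` are hypothetical. The sketch only tracks per-user usage and reports whether an upload fits; what happens when it does not fit (for example removing older files) is left out.

```python
import sqlite3


def open_db(path: str) -> sqlite3.Connection:
    # isolation_level=None: no implicit transactions, we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS quota_usage ("
        "owner TEXT PRIMARY KEY, used_bytes INTEGER NOT NULL)"
    )
    return conn


def reserve(conn: sqlite3.Connection, owner: str, size: int, quota: int) -> bool:
    """Account for size bytes if that keeps owner within quota; return success."""
    # BEGIN IMMEDIATE takes the write lock up front, so the read and the
    # update below are serialised against other writers.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT used_bytes FROM quota_usage WHERE owner = ?", (owner,)
        ).fetchone()
        used = row[0] if row else 0
        fits = used + size <= quota
        if fits:
            conn.execute(
                "INSERT OR REPLACE INTO quota_usage (owner, used_bytes) VALUES (?, ?)",
                (owner, used + size),
            )
        conn.execute("COMMIT")
        return fits
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

The check reads a single row per upload instead of scanning the user's files, and each transaction covers only one read and at most one write.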

I am really not sure what would be the right way here. On the one hand, I’d like to avoid SQLite or anything like that, because I perceive it as overkill, even though it probably isn’t. On the other hand, stuff like flock won’t work in all scenarios, and will still be tricky to get right.

Actually, I think that this would be an excellent use-case for LMDB (which is an embeddable ordered key-value store database with support for transactions), but never having used LMDB in a project yet, I’m not confident in that either. However, since we might want to use LMDB in JabberCat, this might be a good playground to explore it. It would also allow us to reduce the complexity of various operations by storing basic file metadata in LMDB too.

The quota implementation is O(n) in the number of uploaded files.

Yes, or rather the number of files per user for the case I have in mind.

In general, I think that quota data needs to be stored to reduce the complexity of the quota check (from O(n) to O(1)).

Yes, you're right in pointing out that looping over all files to calculate storage usage is not something that'll scale well. I like the idea of using a file to store the quota.

The file to store the quota makes everything much trickier though. First, there are the aforementioned multitasking issues. Second, where would you put the file? It needs to be on the level of the user. In your current design, you cannot really be sure whether you’re working with multiple users in Prosody or not. And you cannot put the file in the user directory, because it is possible for the XMPP server to create a URL which matches that file name, and then all bets are off, unless characters which aren’t URL safe are used in the filename… such as \t or a space. Takes a bit of consideration though. And it’s still a hack.

Jan 14 '19 16:01 horazont