
[DISCUSS] Validate new document writes against max_http_request_size

Open janl opened this issue 6 years ago • 10 comments

This supersedes #1200.

New Behaviour

This variant introduces no new config variable and no formula. Instead, there is a set of three hurdles that each doc write has to pass:

  1. doc body size
  2. individual attachment size
  3. length of the multipart representation of the full doc body, including attachments.

The validation path is now the following:

  • If a new doc body is > max_document_size, we throw an error.
  • If a new attachment is > max_attachment_size, we throw an error.
  • If the new doc body plus new and/or existing attachments is > max_http_request_size, we throw an error.
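The three hurdles can be sketched as follows. This is a hypothetical Python illustration, not CouchDB's Erlang implementation; the limit values and the multipart-length calculation (approximated here as body length plus attachment lengths) are assumptions based on the discussion in this thread.

```python
import json

# Assumed limits, mirroring the config options discussed in this thread.
MAX_DOCUMENT_SIZE = 2 * 1024 ** 3      # 2 GiB, per the compatibility note below
MAX_ATTACHMENT_SIZE = 2 * 1024 ** 3
MAX_HTTP_REQUEST_SIZE = 4 * 1024 ** 3  # the 4 GiB request limit

def validate_doc_write(doc_body, attachments):
    """Apply the three hurdles to a doc write.

    doc_body: a JSON-serializable dict.
    attachments: mapping of attachment name -> length in bytes.
    """
    # Hurdle 1: doc body size. Note the extra encode just to measure the
    # body -- the analogue of the additional ?JSON_ENCODE in the patch.
    body_len = len(json.dumps(doc_body).encode("utf-8"))
    if body_len > MAX_DOCUMENT_SIZE:
        raise ValueError("document_too_large")

    # Hurdle 2: individual attachment size.
    for name, att_len in attachments.items():
        if att_len > MAX_ATTACHMENT_SIZE:
            raise ValueError("attachment_too_large: " + name)

    # Hurdle 3: rough stand-in for the multipart length of body + attachments.
    if body_len + sum(attachments.values()) > MAX_HTTP_REQUEST_SIZE:
        raise ValueError("request_too_large")
    return True
```

The re-serialization in hurdle 1 is exactly the performance cost flagged for review in the Notes.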

Notes

This is again just a sketch to show what something like this could look like. The patch is fairly minimal, but it does include a full additional ?JSON_ENCODE of the doc body, plus some munging of the attachment stubs, that I’d like to get a performance review for. I’m sure we can make this fast if we need to, but that would require a larger patch, so it’s this sketch for now.

Compatibility

This also sets max_document_size to 2 GB, restoring 1.x and 2.0.x compatibility as per https://github.com/apache/couchdb/pull/1200#issuecomment-370489809

I’d suggest we make this a breaking change in 3.0, lowering the default to the suggested 64 MB or whatever we feel is appropriate then.

Formalities

  • [x] Includes tests
  • [ ] Documentation has been updated // waiting for consensus before doing this

janl avatar Mar 29 '18 11:03 janl

The Travis failures here point to https://github.com/apache/couchdb/blob/master/test/javascript/tests/attachments.js#L300-L301, where we allow attachment stubs to be written verbatim and without length info. There is code to resolve this, but it requires reading the attachment info from disk.

I’m not yet implementing this because I want a review of the approach here first. It could be perf-prohibitive on the write path, though.
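A minimal sketch of that fix: for stubs written verbatim without length info, the length has to be fetched from disk. This is hypothetical Python, not the Erlang code; `read_length_from_disk` is an invented callback standing in for the actual on-disk attachment lookup, which is the part that could be perf-prohibitive.

```python
def resolve_stub_lengths(attachments, read_length_from_disk):
    """Fill in missing lengths on attachment stubs.

    attachments: mapping of name -> attachment metadata dict.
    read_length_from_disk: hypothetical callback that looks up an
    attachment's stored length; this is the extra disk read on the
    write path that the validation would require.
    """
    resolved = {}
    for name, att in attachments.items():
        if att.get("stub") and "length" not in att:
            # Stub written verbatim without length info: consult disk.
            att = dict(att, length=read_length_from_disk(name))
        resolved[name] = att
    return resolved
```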

janl avatar Mar 29 '18 13:03 janl

Note that whatever we do here - especially if this PR is not merged into 2.2.0 - needs to be documented for the 2.2.0 release, in light of the concerns raised in #1304.

wohali avatar May 01 '18 04:05 wohali

I propose the following:

  1. Cherry-pick the change that restores the 4 GB request size limit to master/2.2.0.
  2. Leave this branch/PR open until we have made the validate-request-size-on-write function a) complete and b) fast.
  3. When the function is ready, add it to a post-2.2.0 release with a note that it will be enabled by default in the future, plus a config option for folks to opt in at that point.
  4. Eventually make it opt-out.

I don’t see this being finished any time soon, and since the end result is functionally equivalent (sans the opt-in) for 2.2.0, this should not block 2.2.0.

janl avatar Jul 13 '18 15:07 janl

@janl I'll try to make the function. It is close enough to the other one. Otherwise, I think we can keep it.

nickva avatar Jul 13 '18 15:07 nickva

When we talk about 64 MB, 2 GB, etc., are we talking about default values that can be raised by the user, or hard limits that the end user cannot change?

For example, if 64 MB is selected for 3.0, would individual users still be able to set 4 GB or more?
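For reference, these are configurable defaults rather than hard limits. A sketch of the relevant local.ini settings (option names follow the CouchDB documentation, but the section holding the request-size limit has varied between releases, so treat the values and placement as illustrative):

```ini
[couchdb]
; document size limit; can be raised by operators who know what they are doing
max_document_size = 4294967296
; per-attachment limit; "infinity" disables the check
max_attachment_size = infinity

[chttpd]
; overall request size limit (the 4 GB default discussed in this thread)
max_http_request_size = 4294967296
```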

As far as we are concerned, we enjoy CouchDB for the ability to use it as a file server / streaming server and keep all the data in one place to ensure database consistency (compared to keeping references to third-party storage services such as S3, and then keeping everything in sync, ensuring links are not dead, etc.). CouchDB is a black box neatly integrated with the application layer, which greatly simplifies maintenance and system administration.

Filtered replication is particularly nice in this regard, since it becomes extremely easy to create clusters of multimedia attachments to balance load. For example, one cluster may contain only replicated videos, and the application layer uses command-query separation to route the user to the right cluster depending on the type of query being performed. All of this is transparent from the application's perspective: the addresses of the clusters handling specific query types simply need to be configured during deployment, and further load balancing can be done at the DNS level.

CouchDB hugely simplifies the infrastructure work. The management of large multimedia clusters becomes a breeze with continuous filtered replication, and consumption by application users is very efficient thanks to CouchDB's support for HTTP range requests.
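As an aside, the range-request consumption mentioned above works roughly like this. A hypothetical Python helper (names are illustrative, not a CouchDB API) showing the Range header a client sends and the slice a 206 Partial Content response carries:

```python
def range_header(start, end):
    """Build the Range header for an inclusive byte range, per HTTP semantics."""
    return {"Range": "bytes=%d-%d" % (start, end)}

def slice_for_range(data, start, end):
    """The bytes a 206 Partial Content response would return for that range."""
    return data[start:end + 1]
```

A client fetching the first kilobyte of a video attachment would send `range_header(0, 1023)`, i.e. `Range: bytes=0-1023`, instead of downloading the whole file.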

It would be interesting to retain the ability to have large document/attachment/request sizes for those who know what they are doing.

nkosi23 avatar Jun 30 '19 15:06 nkosi23

@janl This is very stale. Any plans to get this in? I am preparing to mass-close old PRs that never got merged.

wohali avatar Oct 09 '20 15:10 wohali

> @janl This is very stale. Any plans to get this in? I am preparing to mass-close old PRs that never got merged.

@janl / @wohali : ping

bessbd avatar Mar 29 '21 11:03 bessbd

@bessbd I think what Jan proposed in https://github.com/apache/couchdb/pull/1253#issuecomment-404873979 is basically done. We've pushed the default limit in 3.x pretty low, and as you know 4.0 changes all the rules.

I think it is probably safe to close this out, but I'd like to see @janl +1 that.

wohali avatar Mar 29 '21 16:03 wohali

Is there any update on this issue?

adityajoshi12 avatar May 04 '21 03:05 adityajoshi12

This is very stale.

saurabhprasadsah avatar Jan 11 '22 10:01 saurabhprasadsah