docs.rs
Add crate file size upload limits
Limits the maximum file size for uploaded files, configurable via Limits. The current default is 5 GB, but that's only a placeholder since I didn't know what the preferred number would be.
Closes #588
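A minimal sketch of the kind of per-file check this describes, assuming a hypothetical max_file_size value alongside the existing Limits configuration; the 5 GB constant is just the placeholder default from this PR, not a settled number.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical default; the PR currently uses a 5 GB placeholder.
const DEFAULT_MAX_FILE_SIZE: u64 = 5 * 1024 * 1024 * 1024;

/// Returns whether a file is small enough to be uploaded.
fn file_within_limit(path: &Path, max_file_size: u64) -> io::Result<bool> {
    // Only the file's metadata is read; the contents are never touched here.
    Ok(fs::metadata(path)?.len() <= max_file_size)
}

fn main() -> io::Result<()> {
    let ok = file_within_limit(Path::new("Cargo.toml"), DEFAULT_MAX_FILE_SIZE)?;
    println!("within limit: {}", ok);
    Ok(())
}
```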
I considered testing this via File::set_len on a dummy file, but I don't know if a ~5 GB file is something we want to impose on the test suite.
FWIW I wouldn't expect File::set_len on most filesystems/platforms to actually commit anything to disk, just adjust metadata.
According to the docs, extending the file fills the new region with zeroes, but that doesn't matter anyway: I can just lower the limit to something small for the test. No idea how I didn't realize that.
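One way that test could look, following the suggestion above: create a sparse dummy with File::set_len and compare it against a deliberately tiny limit, so no multi-gigabyte fixture is ever needed. The limit value and temp-file name here are made up for illustration.

```rust
use std::fs::{self, File};
use std::io;

#[test]
fn rejects_file_over_limit() -> io::Result<()> {
    const LIMIT: u64 = 512 * 1024; // lowered limit so the test stays small
    let path = std::env::temp_dir().join("docsrs_file_limit_dummy");
    let file = File::create(&path)?;
    // On most filesystems this only updates metadata (sparse file), so the
    // 1 MiB below is not actually written to disk.
    file.set_len(1024 * 1024)?;
    drop(file);
    let over_limit = fs::metadata(&path)?.len() > LIMIT;
    fs::remove_file(&path)?;
    assert!(over_limit);
    Ok(())
}
```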
Out of the docs I have, the largest file is
91513860 docs/web-sys/0.3.35/x86_64-pc-windows-msvc/src/web_sys/opt/rustwide/target/x86_64-pc-windows-msvc/debug/build/web-sys-caeee69dd050a8ab/out/bindings.rs.html
There are a few of those for the different platforms; after that it drops down to a handful of ~10-15 MB files from other crates. The largest non-src file is
12291596 docs/stm32f0/0.10.0/x86_64-pc-windows-msvc/search-index-20200414-1.44.0-nightly-edc02580e.js
Does 500MB sound like a reasonable default per-file limit?
I'm not sure this belongs in docs.rs.
a) For files that are part of the tarball, I think filtering based on file size should happen on crates.io, if at all. In particular, I don't like the idea of having crates that can be published but then will be automatically rejected by docs.rs, that seems like it should need an RFC at the minimum.
b) For files that are generated, I'm not sure I see the use case. The files that will be blocked are ... none, unless rustdoc starts generating enormously larger HTML files? In that case why do we need the limit? The only exception I see is files created by build.rs, but I think that should be addressed by limiting disk usage during builds, not disk usage during uploading (or otherwise the server itself will run out of disk space before finishing the build).
The point of the limit isn't to save space or to ever actually be triggered by rustdoc; it's purely for included files of an absurd size, as mentioned in the issue, e.g. a crate that ships a massive video file. Filtering input files is part of the 101s of web security: even if crates.io did this for us, I'd still advocate implementing it here, because it's a very simple fix that helps with security. And I don't think an RFC would be needed at all, since (a) the limit is absurdly high and (b) it's easily adjustable in the unlikely event that someone triggers it in a meaningful way; I implemented it the same way as the other limits on builds, such as memory and build timeouts. This is also a better solution than limiting disk usage, because it allows crates to generate whatever weird, large files they may need (so the build still works perfectly) while simply not uploading those files.
it's purely for included files of an absurd size
I think that should be handled by crates.io if at all.
Filtering input files somewhat is part of the 101s of web security, even if Crates.io did this for us, I'd still advocate for implementing this because it's a very simple fix that helps with security.
That's only relevant if we accept arbitrary uploads. We only accept uploads that were previously accepted by crates.io.
This is also a better solution than limiting disk usage, as it allows crates to generate whatever weird, large files they may need (and allowing the build to work perfectly), but then just not uploading said large weird files
I actually think silently not uploading the file is worse than failing the build because you don't notice it's not there until you try to visit the page.
Also, you haven't addressed the concern about running out of disk space during a build (which admittedly is somewhat of a separate issue).
For the first two comments, I'll restate that even if crates.io did this for us, I'd still advocate for the change. It's incredibly simple to implement, understand, and test (and I've already done it, so it's not a question of future work either), it won't affect performance whatsoever, and it's just good general practice, even though we're getting docs from a trusted source. The actual change itself is a single line of code, just an if file.metadata().unwrap().len() <= threshold check, that could potentially save us a lot of pain in the future.
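For illustration, here is that check written out with the error propagated instead of unwrapped, so an oversized or unreadable file surfaces in the build log rather than being skipped silently; the function name and error text are made up.

```rust
use std::fs;
use std::io;
use std::path::Path;

fn check_file_size(path: &Path, threshold: u64) -> io::Result<()> {
    let len = fs::metadata(path)?.len();
    if len > threshold {
        // Returning an error lets the caller fail the build with a useful
        // message instead of panicking or silently dropping the file.
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("{} is {} bytes, over the {}-byte limit", path.display(), len, threshold),
        ));
    }
    Ok(())
}
```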
I actually think silently not uploading the file is worse than failing the build because you don't notice it's not there until you try to visit the page.
It could be implemented to fail the build as well, but I want to reiterate that this should never affect a normal crate. Anyone who sets this off is either malicious (attempting to attack us/crates.io) or, at minimum, engaging in the terrible practice of producing very large build artifacts outside of the target/ directory. The former should be guarded against and the latter is easily fixed, since this only matters for very large files that we are attempting to send to S3.
Also, you haven't addressed the concern about running out of disk space during a build (which admittedly is somewhat of a separate issue).
That sounds like a separate, Docker-based PR: adding a limit to the container itself.
For files that are part of the tarball, I think filtering based on file size should happen on crates.io, if at all. In particular, I don't like the idea of having crates that can be published but then will be automatically rejected by docs.rs, that seems like it should need an RFC at the minimum.
crates.io does indeed have a limit, but we override it for some crates if they legitimately have a need for more space. Here are the current exceptions:
crates-io::DATABASE=> select (max_upload_size / 1024 / 1024) AS upload_size_mb, count(*) from crates where max_upload_size is not null group by max_upload_size;
 upload_size_mb | count
----------------+-------
             15 |     2
             20 |     5
             50 |     1
             55 |     1
            300 |     1
That's only relevant if we accept arbitrary uploads. We only accept uploads that were previously accepted by crates.io.
Unfortunately build scripts and proc macros are a thing :(
It could be implemented to fail the build as well, [...]
We will definitely want to fail the build.
Ok, if we're going to implement this it should be a restriction on the build, not on the uploads, and it should be per-crate, not per-file (or a malicious crate could upload a thousand files one byte less than the max).
@pietroalbini it doesn't look like Rustwide currently has an API for limiting the storage space on https://docs.rs/rustwide/0.7.1/rustwide/cmd/struct.SandboxBuilder.html, could that be added?
The reason I'm so adamant about limiting this during the build is that otherwise we could still have DoS attacks that take the site down when the hard drive fills up; higher storage costs are much less of an issue.
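A rough sketch of the per-crate accounting suggested above: sum everything the build produced and compare the total against a single cap, so many files just under a per-file limit can't add up unnoticed. The cap, names, and directory walk are illustrative only, not the actual docs.rs upload path.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical per-crate cap on the total size of the documentation output.
const MAX_CRATE_UPLOAD_SIZE: u64 = 5 * 1024 * 1024 * 1024;

/// Recursively sums the size of every file under `dir`.
fn total_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        total += if meta.is_dir() {
            total_size(&entry.path())?
        } else {
            meta.len()
        };
    }
    Ok(total)
}

/// Errors out (and thereby fails the build) when the whole output exceeds the cap.
fn check_crate_size(output_dir: &Path) -> io::Result<()> {
    let total = total_size(output_dir)?;
    if total > MAX_CRATE_UPLOAD_SIZE {
        return Err(io::Error::new(
            io::ErrorKind::Other,
            format!("documentation output is {} bytes, over the per-crate limit", total),
        ));
    }
    Ok(())
}
```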
I think we should have both limits on the disk space used and the uploads. There is no reason to block this PR only because a different limit is not enforced yet.
For the default limit, we should actually analyze the S3 inventory and see how much storage crates actually use.
For the default limit, we should actually analyze the S3 inventory and see how much storage crates actually use.
Do you know a reasonably efficient way to do that? Just querying the size of each file on S3 will take a very long time because we have so many.
We have the inventory of all the files present in the bucket already.
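For a rough per-crate storage estimate from such an inventory, something like the following could work, assuming the inventory is exported as CSV with bucket, key, and size columns and that object keys start with a prefix/crate/version/ path; the file name and the key layout are assumptions for illustration, not docs.rs specifics.

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, BufRead, BufReader};

fn main() -> io::Result<()> {
    // Assumed inventory export: one object per line, "bucket","key","size",...
    let reader = BufReader::new(File::open("inventory.csv")?);
    let mut per_crate: HashMap<String, u64> = HashMap::new();

    for line in reader.lines() {
        let line = line?;
        // Naive CSV handling: strip quotes and split on commas. Good enough
        // for an estimate, not for keys that themselves contain commas.
        let fields: Vec<&str> = line.split(',').map(|f| f.trim_matches('"')).collect();
        if fields.len() < 3 {
            continue;
        }
        let key = fields[1];
        let size: u64 = fields[2].parse().unwrap_or(0);
        // Assume the crate name is the second path component of the key.
        if let Some(krate) = key.split('/').nth(1) {
            *per_crate.entry(krate.to_string()).or_insert(0) += size;
        }
    }

    // Print the 20 largest crates by total stored size.
    let mut totals: Vec<_> = per_crate.into_iter().collect();
    totals.sort_by_key(|&(_, size)| std::cmp::Reverse(size));
    for (krate, size) in totals.iter().take(20) {
        println!("{:>8} MB  {}", size / 1024 / 1024, krate);
    }
    Ok(())
}
```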