Express returns a non-compliant HTTP/206 response when gzip is enabled
CDNs (notably Azure Front Door) use HTTP Range requests to retrieve data from the origin when caching is enabled.

Express should ignore the `Range:` request header when gzip compression is in place, since Express is unable to produce an HTTP/206-compliant response with a correctly computed `Content-Range` header that takes the compressed data length into account.

A fair, compliant workaround in this specific case, where computing `Content-Range` values would be too complex, is to ignore the client's `Range:` request header and answer with the whole compressed content in an HTTP/200 response. Handling `Range:` headers is optional, so answering with an HTTP/200 is OK. Answering with an HTTP/206 carrying wrong `Content-Range` values is not OK.

Meanwhile, another workaround is to disable compression and let the CDN handle it, or to disable CDN caching; however, it would be fair to expect Express to return a compliant HTTP response in any case.
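For illustration, here is a minimal sketch of that workaround as application-level middleware (not something the compression module does today): drop the `Range` header whenever responses may be compressed, so downstream middleware falls back to a full HTTP/200.

```js
const express = require('express');
const compression = require('compression');

const app = express();

// Sketch of the workaround described above: since a correct Content-Range
// for a compressed body cannot be computed cheaply, ignore the client's
// Range header entirely and let static answer 200 with the whole body.
app.use((req, res, next) => {
  if (req.headers.range) delete req.headers.range;
  next();
});

app.use(compression());
app.use(express.static('public'));
```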
References:
Details highlighting the Express behavior:
https://github.com/DanielLarsenNZ/nodejs-express-range-headers
If the client sends a `Range` request header and also an `Accept` request header with the `application/gzip` MIME type, then it is the client's responsibility to take into account whether the sent byte ranges correspond to the zipped file or not. This is the intuitive behavior.
Off the top of my head, I can name a couple of cases where this change might break applications that depend on genomics data and need to handle huge files, such as https://github.com/igvteam/igv.js/, which makes requests to `.bgz`, `fasta` or `bam` files. These files arrive at the client compressed, are decompressed by the browser, and then the client knows that the bytes correspond to the uncompressed payload. You can't just dump a 5 GB file to the client with compression enabled if there happens to be a range request.

The decompression is handled by the browser, and the final receiver gets the bytes of the payload that correspond to the uncompressed form. This has always been the case. For a lot of developers, intuition & usability >>> RFC compliance. Am I misinterpreting something?
PS: Not everyone uses Next.js or related Azure services. Some of us have to serve GBs of files that compress fairly well on request (1-10 MB chunks), such as genomic data, and CANNOT compute the total length of the compressed file ahead of time for performance reasons. That means those who do use compression for large files, the very files that benefit most from this library, will have to drop it or filter these files out. How these half-baked requirements make their way into RFCs is really appalling sometimes.
Is it possible to add a config option such as `app.use(compression({ enforceRFC7233: true }))`, which would default to `false` by primum non nocere, thus avoiding breaking apps that rely on the current behavior, and could be turned on by setting it to `true`, in which case the `Content-Range` header would be dropped and the status set to 200 whenever a `Range` request is encountered for a response with `Transfer-Encoding`? This would make life easier for those who want to be fully compliant with the RFC by just setting a config prop, and would spare those who would otherwise have to drop compression because the cost of staying compliant (calculating the gzipped size) for large files is greater than the benefit of being compliant.
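As a sketch of what such an option might look like from userland (the `enforceRFC7233` flag is hypothetical and does not exist in compression; the wrapper name and behavior are only an illustration of the idea):

```js
const compression = require('compression');

// Hypothetical wrapper approximating the proposed flag: when enabled and a
// Range header is present, drop it so downstream middleware answers 200 with
// the full body instead of a 206 whose Content-Range would not match the
// compressed length.
function compressionRFC7233({ enforceRFC7233 = false, ...rest } = {}) {
  const middleware = compression(rest);
  return (req, res, next) => {
    if (enforceRFC7233 && req.headers.range) delete req.headers.range;
    middleware(req, res, next);
  };
}

// usage: app.use(compressionRFC7233({ enforceRFC7233: true }));
```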
Hi @IbrahimTanyalcin, you can make any rule you like to determine whether the response should be compressed or not using the `filter` option. You can find more information and an example at https://github.com/expressjs/compression#filter-1
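For context, a `filter` along these lines (one possible rule, sketched here, assuming the usual `app = express()`) skips compression whenever the request carries a `Range` header and otherwise falls back to the default check:

```js
const compression = require('compression');

app.use(compression({
  filter: (req, res) => {
    // Leave ranged requests uncompressed so Content-Range stays meaningful.
    if (req.headers.range) return false;
    // Otherwise defer to the module's default filter.
    return compression.filter(req, res);
  }
}));
```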
@dougwilson I understand that I can pass a filter function, but that means dropping compression for the plain-text file types (fasta etc.) that would actually benefit the most from it. The benefit of compressing a, let's say, 300 kB bundle from Next.js is nothing compared to the benefit of compressing a 10 MB chunk of a 5 GB genomic sequence file. Am I wrong in my reasoning? It would be so nice if it were possible to devise a solution that wouldn't break the apps of people like me and would also allow @mscbpi and others to achieve what they want.
I'm not sure I understand. If you return `true` from the filter, then it would compress those files you would like compressed. I'm not sure what the default behavior is for the files you are referring to, however. If you think they should be compressed by default, without needing to write a filter function, please let us know the details of how to detect that type of file for compression.
In genomics we deal with large files, let's say `mysequence.fasta`. This is a plain-text file that reads like `ATTTCCGGTTT...` and is around 5 GB:

- These files have accompanying files called indexes that record the byte ranges of each line. The index file ends with the `.fai` extension (e.g. `mysequence.fai`) and is generally small (10-100 kB).
- Tools like https://github.com/igvteam/igv.js calculate the portion of the sequence that fits the user's viewport and make a range request based on the index file; let's say they request a 5 MB portion.
- Using your compression library, this 5 MB portion is extracted from the static file by Express, compressed, and sent to the browser.
- The browser decompresses the payload, and IGV knows that the requested byte range corresponds to the uncompressed file, not the compressed one. In fact, from a technical standpoint, this is the only cost-effective way to make `Range` requests while also using compression, without costing the server resources (looking up offsets in index files rather than telling the server "calculate the correct Content-Range for me", which is impractical). A sketch of this client-side flow follows below.
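A rough sketch of that client-side flow, assuming the samtools faidx `.fai` column layout (NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH); the function names are illustrative:

```js
// Map a 0-based base-pair interval to a byte range in the *uncompressed* FASTA,
// using the per-sequence entry from the .fai index.
function regionToByteRange({ offset, linebases, linewidth }, start, end) {
  const toByte = pos =>
    offset + Math.floor(pos / linebases) * linewidth + (pos % linebases);
  return [toByte(start), toByte(end)];
}

async function fetchRegion(fastaUrl, faiEntry, start, end) {
  const [from, to] = regionToByteRange(faiEntry, start, end);
  const res = await fetch(fastaUrl, { headers: { Range: `bytes=${from}-${to}` } });
  // The offsets refer to the uncompressed file; any Content-Encoding applied
  // on the wire is undone transparently by the browser before the body is read.
  return res.text();
}
```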
The RFC requires the server to know the gzipped size beforehand so that the `Content-Range` can be calculated correctly. In truth it is just a semantic compliance that would force genomics apps using Express to drop compression, because they obviously do not want Express to send a 200 and dump the whole file, nor do they want to gzip the entire file just to calculate the correct `Content-Range` header from it. There are plenty of other apps that rely on the current behavior.
I think a solution that doesn't break backwards compatibility, doesn't resort to `filter` (effectively dropping support for that file type), and still allows people like @mscbpi to achieve what they want (because they rely on Next.js and other CDNs' behavior) is more reasonable. At the end of the day, it is your call. Thanks for your time.
Ok, here is some more info. In my route file for serving static content, I have something like:

```js
app.use('/static', compression(), _static, function (req, res, next) { /* downstream middleware ... */ });
```

where `_static` is the Express static middleware.
I also dug into what the request headers are for such large fasta files and compared them to regular js/html/css files.

Here are the request headers for the usual js/html/css files:

And here are the request headers I send for a supposedly large genomic file:
So it turns out the client logic is sending an `Accept-Encoding: identity` header, which means the compression library does NOT compress; the request is passed on to static, which doesn't know what to do with a `.fasta` file and therefore adds `Content-Type: application/octet-stream`.

This means that in my case this wouldn't seem to break the behavior with large files, as they are already not compressed. I was wrong. (There might still be other genomics apps that do not send the correct encoding request header and expect it to work, though 🤷)
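As an aside, if the generic `application/octet-stream` type is undesirable, `express.static` accepts a `setHeaders` callback; a small sketch, assuming the `app` and `_static` setup from the snippet above (the `text/x-fasta` type is just an example, not a registered standard):

```js
const path = require('path');
const express = require('express');
const compression = require('compression');

// Give .fasta files an explicit Content-Type instead of the default
// application/octet-stream that express.static falls back to.
const _static = express.static('public', {
  setHeaders: (res, filePath) => {
    if (path.extname(filePath) === '.fasta') {
      res.setHeader('Content-Type', 'text/x-fasta'); // example type, not standardized
    }
  }
});

app.use('/static', compression(), _static);
```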
> If the client sends a `Range` request header and also an `Accept` request header with the `application/gzip` MIME type, then it is the client's responsibility to take into account whether the sent byte ranges correspond to the zipped file or not. This is the intuitive behavior.
Hi @IbrahimTanyalcin, thanks a lot.

This is not just Microsoft Azure's point of view / interpretation of the standard (and they may themselves be wrong):
https://learn.microsoft.com/en-us/azure/frontdoor/front-door-caching?pivots=front-door-standard-premium#delivery-of-large-files

> Tip: If your origin compresses the response, ensure that the `Content-Range` header value matches the actual length of the compressed response.

Other resources / SRs lead to the same observation.
@mscbpi yes, you are correct. I never objected to the RFC; however, from a technical standpoint it is very costly to compute the total zipped size and adjust the `Content-Range` value.

I wish a provisional header like `Content-Range-Origin: transformed | identity` existed so that the ambiguity were clarified and servers were given a choice. But it is OK; I guess the safest bet is to send `Accept-Encoding: identity`, or to use the `filter` option to drop support for that file type and hope for the best.
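For completeness, the `filter`-based fallback mentioned above could look like this, assuming the usual `app = express()` (the extension list is just an example for genomic formats):

```js
const compression = require('compression');

const GENOMIC = /\.(fa|fasta|fai|bam|bai|bgz)$/i;

app.use(compression({
  filter: (req, res) => {
    // Never compress genomic files; keep the default rule for everything else.
    if (GENOMIC.test(req.path)) return false;
    return compression.filter(req, res);
  }
}));
```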