
ncdump issues HTTP GET requests with headers that are too big (>8k bytes)

Open ndp-opendap opened this issue 9 months ago • 16 comments

Overview

Our good friends at NASA's GSFC are having trouble with data that they are serving (through Hyrax-1.16.3) and accessing said data with ncdump. The problem is that the web machinery is rejecting ncdump's data retrieval request because the HTTP request header is too large: well in excess of the 8k limit found in web servers like Apache httpd and Tomcat. While Tomcat makes this setting easy to configure, it would appear that Apache httpd does not, especially with regard to the crucial mod_proxy_ajp module, which seems to be locked in at 8k.

Example

Here's a working example from the same collection (granules contain ~300 variables):

ncdump -v 'RetrievalGeometry_retrieval_longitude' https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/221/oco2_L2StdND_53744a_240808_B11205r_240921171946.h5

This dataset is in the same collection but fails during ncdump's data retrieval phase:

ncdump -v "RetrievalGeometry_retrieval_longitude"  https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/107/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5

Here's the output:

data:

syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or SCAN_ERROR
context: <html^><head><title>414 Request-URI Too Large</title></head><body><center><h1>414 Request-URI Too Large</h1></center><hr><center>nginx/1.22.1</center></body></html>
NetCDF: Access failure
Location: file vardata.c; line 478

Users at NASA report that

The ncdump utility creates a resultant URL that is 13837 characters long beginning with “/opendap”

And that's a problem because web machinery like Apache httpd and its mod_proxy_ajp module limit the request header size to 8k.
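Incidentally, the 414 response in the error output above was generated by nginx, which enforces its own request-line limit. If an nginx front end is in the path, the nginx core docs describe raising that limit with `large_client_header_buffers` (a hedged sketch; the buffer count and size here are illustrative, and whether this applies to GESDISC's deployment is unverified):

```nginx
# http or server context: raise the maximum request line / header
# size from the default (typically 8k) to 16k per buffer
large_client_header_buffers 4 16k;
```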

I think it may be an upstream swim to get the various web services configured to accept this behavior.

It would, imho, be better if ncdump would detect that the request URL path is too big and make multiple requests, each smaller than 8k.
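The "multiple smaller requests" idea could look roughly like this. This is a hypothetical sketch, not netcdf-c code: `split_projection`, the base URL, and the 8k threshold are all assumptions, and it simply groups DAP2 projection variables into batches so each constrained URL stays under the limit:

```python
# Hypothetical sketch: split a DAP2 projection list into several
# requests so that each resulting URL stays under a server's limit.

def split_projection(base_url, variables, max_url_len=8000):
    """Group projection variables into batches so that each constrained
    URL (base_url?var1,var2,...) stays under max_url_len bytes."""
    batches, current = [], []
    for var in variables:
        candidate = current + [var]
        url = base_url + "?" + ",".join(candidate)
        if current and len(url) > max_url_len:
            # Adding this variable would overflow: close the batch.
            batches.append(current)
            current = [var]
        else:
            current = candidate
    if current:
        batches.append(current)
    return [base_url + "?" + ",".join(b) for b in batches]
```

A client could then issue each URL in turn and stitch the responses together; the trade-off is more round trips in exchange for staying under the proxy's header limit.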

ndp-opendap avatar Mar 20 '25 16:03 ndp-opendap

Being lazy, I will ask you if you know if HTTP chunking is being used for these requests?

DennisHeimbigner avatar Mar 21 '25 00:03 DennisHeimbigner

Being lazy, I will ask you if you know if HTTP chunking is being used for these requests?

I do not know. But the endpoint:

https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/107/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5

Is an instance of Apache httpd, so it's entirely possible.

I should have said before that when they reconfigured their Tomcat https Connector definition to include a larger maximum header size:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11AprProtocol"
           maxThreads="150" maxHttpHeaderSize="65536" SSLEnabled="true">
    <UpgradeProtocol className="org.apache.coyote.http2.Http2Protocol" />
    <SSLHostConfig>
        <Certificate certificateKeyFile="conf/localhost-rsa-key.pem"
                     certificateFile="conf/localhost-rsa-cert.pem"
                     certificateChainFile="conf/localhost-rsa-chain.pem"
                     type="RSA" />
    </SSLHostConfig>
</Connector>

Then the ncdump command worked. Unfortunately, the public-facing side of this server is Apache httpd, using the mod_proxy_ajp module to talk to Tomcat. From what I could glean from the web page, it seems that mod_proxy_ajp doesn't support raising the maximum request header size above 8k.
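For what it's worth, the mod_proxy_ajp documentation does describe a matched pair of settings, httpd's ProxyIOBufferSize and the Tomcat AJP Connector's packetSize, that may raise the AJP packet limit above 8k. Whether that applies to this deployment is unverified, and the values below are purely illustrative:

```apache
# httpd side (mod_proxy + mod_proxy_ajp): raise the AJP packet size.
# The Tomcat AJP <Connector> must set a matching packetSize="65536".
ProxyIOBufferSize 65536
ProxyPass /opendap ajp://localhost:8009/opendap
```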

This is why I think the problem is the 13k request header submitted by ncdump: they made it work for direct Tomcat access by increasing maxHttpHeaderSize.

ndp-opendap avatar Mar 21 '25 02:03 ndp-opendap

My login at NASA has expired, so I cannot work on this until they renew it. Hopefully that will be by Friday.

DennisHeimbigner avatar Mar 21 '25 03:03 DennisHeimbigner

@DennisHeimbigner -

For what it's worth, I put a copy of the file here:

http://test.opendap.org:8080/opendap/GESDISC/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5

You can use that URL to download the file.

I don't think the test.opendap.org system is a good test endpoint for this issue. The NASA team is running a much older server than the one at test.opendap.org, and the systems have different configurations. In particular, the GESDISC system is configured to flatten Groups, while the test.opendap.org machine is configured for a DAP4-centric view and thus preserves the Group hierarchies.

I can try setting up a test server with Group flattening feature enabled if that seems useful.

I tried ncdump with the dataset hosted on test.opendap.org and had mixed results. It is certainly a DAP4 dataset; it contains Groups and Int64 variables.

I tried ncdump with the above endpoint and the dap4 protocol:

ncdump -h "dap4://test.opendap.org/opendap/GESDISC/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5"

And that worked.

When I tried for the variable:

ncdump -v "/RetrievalGeometry/retrieval_longitude" "dap4://test.opendap.org/opendap/GESDISC/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5"

It failed with the error message:

checksumhack=0
Error:Checksum mismatch: aerosol_model

NetCDF: DAP failure
Location: file vardata.c; line 478
   retrieval_longitude = 

Which could be a Hyrax problem for sure.

ndp-opendap avatar Mar 21 '25 14:03 ndp-opendap

See comment https://github.com/Unidata/netcdf-c/issues/3105#issuecomment-2744603539

DennisHeimbigner avatar Mar 21 '25 22:03 DennisHeimbigner

The checksum problem represents an issue for the dap4 spec that needs to be resolved. See this discussion https://github.com/OPENDAP/dap4-specification/discussions/6

DennisHeimbigner avatar Mar 21 '25 22:03 DennisHeimbigner

The checksum problem represents an issue for the dap4 spec that needs to be resolved. See this discussion OPENDAP/dap4-specification#6

I agree!

But this issue is really about DAP2, ncdump, and this complicated dataset being served by NASA. I'll set up a server that looks more like what they are running so we can test without the burden of authentication.

ndp-opendap avatar Mar 22 '25 02:03 ndp-opendap

Ok, then a work-around for the large URI request, as noted in the comment above, is to use #noprefetch.

DennisHeimbigner avatar Mar 22 '25 03:03 DennisHeimbigner

Ok, then a work-around for the large URI request is, as noted in comment, is to use #noprefetch.

I'm not that fluent in ncdump etc. Is there an invocation of ncdump you could share that shows this in action?

ndp-opendap avatar Mar 22 '25 13:03 ndp-opendap

Here is an example of your original request with the noprefetch modification (see the end of the URL):

ncdump -v "RetrievalGeometry_retrieval_longitude" "https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/107/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5#noprefetch"

This URL now fails for a different reason I am investigating, but it at least gets by the "URI too large" problem.

DennisHeimbigner avatar Mar 22 '25 20:03 DennisHeimbigner

That is weird. The cause of the new error (after adding #noprefetch) is an authorization error. That means I can read the meta-data without authorization, but I cannot read the actual data because I do not have authorization (BTW We --netcdf-- need better error reporting).

Anyway, if someone with authorization tries that ncdump command (with #noprefetch), the whole command should work.

BTW, if you want just the relevant metadata, you can rewrite the command to look like this:

ncdump "https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/107/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5?RetrievalGeometry_retrieval_longitude#noprefetch"

where the required variable is used as a DAP2 constraint. The two ncdump commands should be equivalent, and in DAP4, they are. I should fix this, but it is low on my todo list.

DennisHeimbigner avatar Mar 22 '25 21:03 DennisHeimbigner

That is weird. The cause of the new error (after adding #noprefetch) is an authorization error. That means I can read the meta-data without authorization, but I cannot read the actual data because I do not have authorization (BTW We --netcdf-- need better error reporting).

This makes complete sense. I am almost certain that server oco2.gesdisc.eosdis.nasa.gov/opendap is configured to allow a user to navigate the catalog and inspect metadata without authenticating, but for data access the client must authenticate with EDL credentials

ndp-opendap avatar Mar 22 '25 23:03 ndp-opendap

That explains it.

DennisHeimbigner avatar Mar 23 '25 00:03 DennisHeimbigner

I set up my local system like this and reran:

ncdump -v "RetrievalGeometry_retrieval_longitude" "https://oco2.gesdisc.eosdis.nasa.gov/opendap/OCO2_L2_Standard.11.2r/2024/107/oco2_L2StdND_52084a_240416_B11205r_240610060300.h5#noprefetch"

And it worked great.

ndp-opendap avatar Mar 24 '25 12:03 ndp-opendap

Good to hear. I probably should consider some kind of limit on how much data ever gets prefetched. The problem is that I cannot easily figure out the maximum request length for URIs. It is probably system dependent.

DennisHeimbigner avatar Mar 24 '25 17:03 DennisHeimbigner

I scanned the Apache httpd and mod_proxy_ajp documentation and they both define a maximum of 8k bytes for the request header. There's clearly a way to make it smaller, but I did not see a way to make it larger. Tomcat also defaults to 8k.

I think that limiting it to 8k might be a reasonable place to start. Another idea: regardless of the limit, if the remote service returns a 414 URI Too Long error, a retry with a smaller request might be in order.
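The retry-on-414 idea could be sketched like this. This is a hypothetical illustration, not netcdf-c code: the function name, variable names, and limits are assumptions, and the HTTP call is injected as a `fetch(url) -> status_code` callable so the strategy can be exercised without a live server:

```python
# Hypothetical sketch: on an HTTP 414, halve the number of projected
# variables per request and retry with smaller batches.

def fetch_in_batches(base_url, variables, fetch, min_batch=1):
    batch = len(variables)  # start optimistically with one big request
    results = []
    remaining = list(variables)
    while remaining:
        url = base_url + "?" + ",".join(remaining[:batch])
        status = fetch(url)
        if status == 414 and batch > min_batch:
            batch = max(min_batch, batch // 2)  # shrink the request and retry
            continue
        results.append((url, status))
        remaining = remaining[batch:]
    return results
```

Starting with one large request and halving on 414 keeps the common case (a server with generous limits) down to a single round trip, while degrading gracefully behind an 8k proxy.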

ndp-opendap avatar Mar 24 '25 17:03 ndp-opendap