Authorization header problem - push to datastore fails - HTTP status code: 400

Open 3vivekb opened this issue 8 years ago • 15 comments

Whenever I try to use datapusher, it consistently fails.
Datapusher log:

ckan          | 2017-02-23 21:58:40,052 INFO  [ckan.lib.base]  /dataset/test-4/resource_data/1aec102a-035e-4f7b-856b-e9eaf49510d8 render time 0.144 seconds
ckan          | 2017-02-23 21:58:40,451 INFO  [ckan.lib.base]  /dataset/49de5ea7-29c2-431f-ab27-eaaf74d9469b/resource/1aec102a-035e-4f7b-856b-e9eaf49510d8/download/techcrunchcontinentalusa-2.csv render time 0.461 seconds
ckan          | 2017-02-23 21:58:40,520 INFO  [ckan.lib.base]  /api/i18n/en render time 0.002 seconds
ckan          | 2017-02-23 21:58:40,645 INFO  [ckan.lib.base]  /api/3/action/datapusher_hook render time 0.015 seconds
ckan_datapusher | --------------------------------------------------------------------------------
ckan_datapusher | ERROR in scheduler [/usr/lib/python2.7/site-packages/apscheduler/scheduler.py:520]:
ckan_datapusher | Job "push_to_datastore (trigger: RunTriggerNow, run = True, next run at: None)" raised an exception
ckan_datapusher | --------------------------------------------------------------------------------
ckan_datapusher | Traceback (most recent call last):
ckan_datapusher |   File "/usr/lib/python2.7/site-packages/apscheduler/scheduler.py", line 512, in _run_job
ckan_datapusher |     retval = job.func(*job.args, **job.kwargs)
ckan_datapusher |   File "build/bdist.linux-x86_64/egg/datapusher/jobs.py", line 321, in push_to_datastore
ckan_datapusher |     request_url=resource.get('url'), response=e.read())
ckan_datapusher | HTTPError

Ckan HTML Log:

Error: DataPusher received a bad HTTP response when trying to download the data file 
HTTP status code: 400 
Response: <?xml version="1.0" encoding="UTF-8"?> <Error><Code>InvalidArgument</Code><Message>Authorization header is invalid -- one and only one ' ' (space) required</Message><ArgumentName>Authorization</Argume... 
Requested URL: http://data2.vta.org/dataset/49de5ea7-29c2-431f-ab27-eaaf74d9469b/resource/1aec102a-035e-4f7b-856b-e9eaf49510d8/download/techcrunchcontinentalusa-2.csv

This is using a docker-deploy with ckan 2.6.0 and datapusher 0.10.0

The problem lies with this area of code: https://github.com/ckan/datapusher/commit/2394e81c47205abe7df4be7215aa66fa63bedc1e I commented out the line about adding an authorization header and then rebuilt the datapusher - it worked for all my public files and failed for all my private files.

I'm not sure if this problem is in my code and config or updates to ckan.

3vivekb avatar Feb 24 '17 23:02 3vivekb

Can you please provide more information about the setup? Are you using the local filestore, or are the files stored on some external service?

tino097 avatar Feb 28 '17 10:02 tino097

The files were pushed to S3 and the data, through the api, is pushed to Postgres on RDS.

The full config is here: https://github.com/vta/Open-Data-Portal

3vivekb avatar Mar 01 '17 00:03 3vivekb

We had a similar problem when datapusher was getting the file from S3. In that case we started our server with more workers. Can you please try the same?

tino097 avatar Mar 01 '17 09:03 tino097

I don't totally understand the concept of workers, and I don't think I have workers in my setup. There is celery, which does the uploads. I found a command to spin up more workers but it didn't do anything. The failure isn't intermittent either: it fails every time, not just some of the time. I think the relevant issue is around authentication.

3vivekb avatar Apr 14 '17 17:04 3vivekb

Was there a resolution for this? I am seeing this same issue.

ghost avatar May 09 '17 16:05 ghost

What I think is happening here is that when https://github.com/ckan/datapusher/commit/2394e81c47205abe7df4be7215aa66fa63bedc1e was introduced, the only available backend for the FileStore was local storage. What this commit did was add an Authorization header to DataPusher requests for uploaded files so that CKAN itself could check the authorization.

But when using another backend (e.g. S3), the Authorization header is presumably being passed along to that backend as well, and AWS is not happy about it:

<?xml version="1.0" encoding="UTF-8"?> <Error><Code>InvalidArgument</Code><Message>Authorization header is invalid -- one and only one ' ' (space) required</Message><ArgumentName>Authorization</Argum
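Going only by the wording of the error above, the storage service seems to expect an Authorization value of the form `<scheme> <credentials>`, whereas the value DataPusher adds is a bare API key with no space in it. The sketch below is just that reading expressed in Python; the function name and sample values are made up for illustration, and it is not S3's actual validation logic.

```python
def has_exactly_one_space(header_value):
    """Return True if the value contains one and only one space character."""
    return header_value.count(" ") == 1


# Hypothetical values, for illustration only.
bare_api_key = "my-ckan-api-key"                 # what DataPusher sends to CKAN
scheme_and_credentials = "Scheme my-credentials"  # "<scheme> <credentials>" shape

print(has_exactly_one_space(bare_api_key))
print(has_exactly_one_space(scheme_and_credentials))
```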

@3vivekb what extension are you using for S3 storage? @Rjones8901 are you using a different storage backend?

amercader avatar May 11 '17 15:05 amercader

We are using S3 for the file source and RDS for the datastore. I found that if I edited jobs.py to use api_key = api_key.strip(api_key), it threw a new error related to an invalid datastore user. I agree with your assessment of the problem. I made this change to api_key at line 340 of jobs.py in datapusher: https://github.com/ckan/datapusher/blob/master/datapusher/jobs.py
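A note on the `api_key.strip(api_key)` edit: in Python, `str.strip(chars)` treats its argument as a set of characters to trim from both ends, so stripping a string by itself always yields an empty string. That may be why the error changed (an empty Authorization value would leave CKAN without an API key), though nothing in this thread confirms it. A minimal sketch with a made-up key:

```python
# Hypothetical API key, for illustration only.
api_key = "1b2c3d4e-aaaa-bbbb-cccc-123456789abc"

# strip(chars) removes any leading/trailing characters found in `chars`.
# Every character of api_key is in that set, so nothing survives.
stripped_by_itself = api_key.strip(api_key)

# strip() with no argument only trims surrounding whitespace.
stripped_whitespace = ("  " + api_key + "\n").strip()

print(repr(stripped_by_itself))
print(repr(stripped_whitespace))
```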

I should note that a workaround for this is commenting out lines 342/345; however, the S3 file must then be public for datapusher to be able to grab it.

ghost avatar May 11 '17 15:05 ghost

@Rjones8901 what extension are you using for the S3 storage?

amercader avatar May 11 '17 15:05 amercader

ckanext_cloudstorage

ghost avatar May 11 '17 15:05 ghost

@Rjones8901 @3vivekb I believe this is an issue with ckanext-cloudstorage, see https://github.com/TkTech/ckanext-cloudstorage/issues/13

amercader avatar May 15 '17 15:05 amercader

@amercader I think this is an issue with datapusher rather than with ckanext-cloudstorage.

When you use CloudStorage, the attempt to get the resource in jobs.push_to_datastore returns a redirect that points to the correct remote location. CloudStorage only returns the new location; it doesn't set any headers.

The version of datapusher that I am using uses urllib2.urlopen to get the resource, and that follows the redirect but keeps all the same headers.
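The header-forwarding behaviour described above can be observed without any network access by calling the stock redirect handler directly. The sketch below uses Python 3's `urllib.request` (the successor to `urllib2`), on the assumption that its default handler copies headers the same way; the URLs and the key are placeholders.

```python
import urllib.request

# Placeholder URLs and key, for illustration only.
original = urllib.request.Request(
    "http://ckan.example.org/dataset/x/resource/y/download/file.csv",
    headers={"Authorization": "my-ckan-api-key"},
)

# Ask the default handler what request it would issue for a 302 that
# points at a different host. No HTTP traffic happens here.
default_handler = urllib.request.HTTPRedirectHandler()
redirected = default_handler.redirect_request(
    original, None, 302, "Found", {}, "https://bucket.example.com/file.csv"
)

print(redirected.full_url)
print(redirected.get_header("Authorization"))
```

If the second line printed is still the API key, the redirected request would carry the CKAN credential to the other host.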

I think that datapusher should remove the Authorization header that it added if the redirect is to a different domain. My installation is working correctly with private resources on AWS S3 using ckanext-cloudstorage using the following code:

import urllib2
import urlparse


class HTTPRedirectHandler(urllib2.HTTPRedirectHandler):
    """Remove authorization header if redirecting to a new domain."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Compare the host of the original request with the redirect target.
        oldloc = urlparse.urlparse(req.get_full_url()).netloc
        newloc = urlparse.urlparse(newurl).netloc
        if 'Authorization' in req.headers and newloc != oldloc:
            # Don't forward the CKAN API key to a different host (e.g. S3).
            del req.headers['Authorization']
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, newurl)

Then in push_to_datastore:

        if resource.get('url_type') == 'upload':
            # If this is an uploaded file to CKAN, authenticate the request,
            # otherwise we won't get file from private resources
            request.add_header('Authorization', api_key)

        opener = urllib2.build_opener(HTTPRedirectHandler())
        response = opener.open(request, timeout=DOWNLOAD_TIMEOUT)
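For anyone on Python 3, where `urllib2` and `urlparse` no longer exist, a rough equivalent of the handler above might look like the following. This is a sketch, not datapusher code: the class and function names are made up, and it removes the header from the new request rather than mutating the original one.

```python
import urllib.request
from urllib.parse import urlparse


class StripAuthOnCrossDomainRedirect(urllib.request.HTTPRedirectHandler):
    """Drop the Authorization header when a redirect leaves the original host."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Let the stock handler build the follow-up request first.
        new_req = super().redirect_request(req, fp, code, msg, headers, newurl)
        if new_req is None:
            return None
        old_netloc = urlparse(req.full_url).netloc
        new_netloc = urlparse(newurl).netloc
        if old_netloc != new_netloc:
            # A no-op if the header is not present.
            new_req.remove_header("Authorization")
        return new_req


def build_download_opener():
    """Return an opener that uses the handler above instead of the default."""
    return urllib.request.build_opener(StripAuthOnCrossDomainRedirect())
```

Usage would mirror the snippet above: add the Authorization header to the request for uploaded resources, then call `build_download_opener().open(request, timeout=...)` in place of `urlopen`.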

I will try a version of datapusher from trunk and see if we have the same problem with the current version, which uses requests instead of urllib2.urlopen.

rhunwicks avatar Oct 13 '17 14:10 rhunwicks

I can't recreate this issue using the current version of datapusher that uses requests.get instead of urllib2.urlopen to get the resource. It can be closed as far as I can tell, although obviously I didn't open it.

rhunwicks avatar Oct 13 '17 15:10 rhunwicks

@3vivekb @Rjones8901 if either of you are still active, have you had a chance to retry this with ckanext-datapusher master?

TkTech avatar Oct 13 '17 16:10 TkTech

Negative. I haven't had a chance to retry this with a new master yet; however, I was seeing this about 2 weeks ago with CKAN 2.7.

ghost avatar Oct 13 '17 17:10 ghost

Do we have any resolution for this issue, wherein DataPusher errors with access denied for S3?

rokkala avatar Mar 29 '19 10:03 rokkala