census-postgres-scripts

Download shell scripts for .tar files forbidden

Open RobertSellers opened this issue 6 years ago • 14 comments

This is partly cross-posted from https://github.com/aria2/aria2/issues/973. It seems that downloads via wget, curl, and aria2 are all being refused. The .gz extension is also missing now. Are there any known workarounds?

12/20 16:25:03 [ERROR] CUID#8 - Download aborted. URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
Exception: [AbstractCommand.cc:351] errorCode=29 URI=https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar
  -> [HttpSkipResponseCommand.cc:231] errorCode=29 The response status is not successful. status=503

RobertSellers avatar Dec 20 '18 16:12 RobertSellers

I ran into this last year and spoke with some IT folks at Census about it. Apparently they were enforcing some rules about SSL and so required forged User-Agent and Strict-Transport-Security request headers. This worked last year, but isn't working this year. I think they're also blocking wide ranges of AWS IP addresses.

I got around this temporarily by downloading the files at home and uploading them to the server doing the data load. I subsequently ran into a couple of other problems:

  • this year the nationwide .tar files contain state level .zip files that have a different packaging than before
  • one of the estimate files doesn't match the sequence/metadata files in terms of header count

I haven't had a chance to look into these issues yet, which is why Census Reporter hasn't gotten the latest release added yet. I'm hoping to figure it out this weekend.

iandees avatar Dec 20 '18 16:12 iandees

I appreciate the feedback. Also, yes, I'm running on AWS and haven't tested anywhere else so far.

RobertSellers avatar Dec 20 '18 17:12 RobertSellers

I can add: the exact same problem occurs from my local PC when using wget in the Windows Subsystem for Linux on Windows 10, so this might not be a problem targeted at AWS.

RobertSellers avatar Dec 20 '18 17:12 RobertSellers

Can you try something that forges the User-Agent header? For example:

wget --debug \
   --header="User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0" \
   --header "Strict-Transport-Security: max-age=31536000" \
   https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar

iandees avatar Dec 20 '18 17:12 iandees

No luck. It's a wall of 403 errors. uGet Desktop in Windows 10 also isn't working. Yeesh. This data isn't hosted anywhere else in bulk?

RobertSellers avatar Dec 20 '18 18:12 RobertSellers

Hi everyone. I'm sorry to hear you're having issues with this. @iandees with whom did you speak at Census? Can you copy me/forward the email ([email protected])?

loganpowell avatar Dec 20 '18 19:12 loganpowell

Hi @loganpowell! I spoke with Jeff Meisel and Lori Carrig last year. I'll forward the email chain.

iandees avatar Dec 20 '18 20:12 iandees

@loganpowell It seems that your Akamai CDN might be blocking .tar downloads from some user agents? I can use wget on the .zip files without trouble, but the .tar files are failing.

iandees avatar Dec 20 '18 22:12 iandees
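
One way to narrow down a block like the one described above is to request only the headers for a .zip URL and a .tar URL with the same User-Agent, then compare the status codes. The following is a minimal sketch and not something posted in the thread: the function names `probe_status` and `describe_status` are hypothetical, it assumes `curl` is installed, and it assumes the server treats HEAD requests the same way it treats GET requests, which may not hold behind a CDN.

```shell
# Hypothetical helpers for comparing how a server responds to different URLs.

# Print the HTTP status code for a HEAD request.
#   $1: URL to probe
#   $2: User-Agent string to send
probe_status() {
    curl --silent --head --output /dev/null \
         --write-out '%{http_code}' \
         --user-agent "$2" "$1"
}

# Map a status code to a short label describing what it suggests.
describe_status() {
    case "$1" in
        2??) echo "ok" ;;
        403) echo "forbidden" ;;
        503) echo "unavailable" ;;
        000) echo "no-response" ;;   # curl reports 000 when no HTTP reply arrived
        *)   echo "other" ;;
    esac
}

# Example use (performs a network request, so it is left commented out):
#   describe_status "$(probe_status "$url" "$user_agent")"
```

If the two URLs return different labels for the same User-Agent, that points at a rule keyed on the file type or path rather than on the client address.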

I was able to get the download working on AWS with this:

aria2c \
    --allow-overwrite=true \
    --auto-file-renaming=false \
    --dir=/mnt/tmp/acs2017_5yr \
    --max-connection-per-server=5 \
    --force-sequential=true \
    --header='Connection: keep-alive' \
    --header='User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36' \
    --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
    --header='Accept-Encoding: gzip, deflate, br' \
    --header='Accept-Language: en-US,en;q=0.9' \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/All_Geographies_Not_Tracts_Block_Groups.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/Tracts_Block_Groups_Only.tar" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/data/5_year_entire_sf/2017_ACS_Geography_Files.zip" \
    "https://www2.census.gov/programs-surveys/acs/summary_file/2017/documentation/user_tools/ACS_5yr_Seq_Table_Number_Lookup.txt"

iandees avatar Dec 21 '18 02:12 iandees

This seems to be working as required. Thank you for your diligent work on this.

RobertSellers avatar Dec 21 '18 13:12 RobertSellers

@iandees are .tars now cooperating for you?

loganpowell avatar Dec 21 '18 13:12 loganpowell

Naive question: do all AWS requests come from the same IP address, or from a small set of them?

loganpowell avatar Dec 21 '18 13:12 loganpowell

@loganpowell they are, but it sure would be nice to figure out a way to download this data without having to go through all this header trickery. Other parts of the government might call forging these headers fraud 😬.

Requests from AWS come from different IP addresses, but there is a relatively small range of IP addresses and Akamai is probably able to figure them out. My guess that it was an IP block was based on it working from home and not from AWS machines. It's more likely that Census is using some Akamai product to prevent denial of service attacks and it's set to be too restrictive.

iandees avatar Dec 21 '18 13:12 iandees

@iandees I've actually had this happen to me on my own IP (from home, using wget for cartography files). I was blacklisted and had to be manually removed from the blacklist. I'm not an expert here, but I believe the problem arises when trying to pull a lot of data over the wire very quickly. Have you tried throttling your requests?

Btw, I'm very happy you figured out a workaround. I don't think what you're doing to work around the blacklisting issue would be considered fraud. You're simply doing what is needed to provide a very important public service.

loganpowell avatar Dec 21 '18 14:12 loganpowell
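
The throttling suggestion above could be tried with wget's built-in rate limiting. This is a rough sketch under stated assumptions: the function name `throttled_fetch` and the `RATE_LIMIT`, `WAIT_SECONDS`, and `DRY_RUN` variables are hypothetical, and the default values are illustrative guesses, not limits published by the Census Bureau. Nothing in the thread confirms that throttling alone avoids the block.

```shell
# Hypothetical helper: download one or more URLs slowly with wget.
# Optional environment variables (names are assumptions for this sketch):
#   RATE_LIMIT    bandwidth cap passed to --limit-rate (default 2m)
#   WAIT_SECONDS  pause between retrievals passed to --wait (default 10)
#   DRY_RUN       if non-empty, print the command instead of running it
throttled_fetch() {
    # Rebuild the positional parameters as the full wget command line.
    set -- wget --continue --tries=3 \
        "--wait=${WAIT_SECONDS:-10}" --random-wait \
        "--limit-rate=${RATE_LIMIT:-2m}" "$@"

    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

Running it with `DRY_RUN=1 throttled_fetch <url>` prints the command without touching the network, which makes it easy to check the flags before starting a multi-gigabyte download.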