cannot download and parse GEO files

Open Mengflz opened this issue 4 years ago • 11 comments

After I downloaded the Series Matrix File(s), the GEOparse.get_GEO function fails and reports that there is no series. [image]

So I tried to use GEOparse.get_GEO to download the files from the website instead. This is what happened: [image] It seems like the URL is wrong.

Mengflz avatar Jul 10 '20 05:07 Mengflz

@Mengflz Hi, could you give a complete snippet so I can look at it?

guma44 avatar Jul 13 '20 08:07 guma44

[image] [image] Here is the error message. I ran into problems when downloading GSE52562, even though this dataset can be downloaded from GEO directly.

Mengflz avatar Jul 13 '20 11:07 Mengflz

Could you also share your Python and GEOparse versions? For me it is working without any problems.

guma44 avatar Jul 13 '20 12:07 guma44

My GEOparse version is 2.0.1, and my Python version is 3.6.8.

Mengflz avatar Jul 14 '20 04:07 Mengflz

Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage: Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet. I can download the files manually, but there are a lot of them, and the two I tested couldn't be opened with GEOparse as local files either. I use GEOparse v2.0.1 and Python 3.8.

daniwelter avatar Feb 12 '21 14:02 daniwelter

As a workaround you can do export GEOPARSE_USE_HTTP_FOR_FTP=yes before running your code.
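
The same workaround can also be applied from inside Python. This is a minimal sketch, assuming the variable is read from the process environment; setting it before GEOparse is imported is the conservative choice. `fetch_series` is a hypothetical helper name, and the accession and destination directory are only illustrative.

```python
import os

# Equivalent of `export GEOPARSE_USE_HTTP_FOR_FTP=yes` for this process.
# Set before GEOparse is imported, in case the value is read at import time.
os.environ["GEOPARSE_USE_HTTP_FOR_FTP"] = "yes"


def fetch_series(accession, destdir="./"):
    """Download and parse one GEO series (hypothetical helper)."""
    import GEOparse  # imported late so the variable is already set

    return GEOparse.get_GEO(geo=accession, destdir=destdir)
```

Calling `fetch_series("GSE47598")` should then fetch the SOFT file over HTTP instead of FTP.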

@daniwelter and I tested the very same code in a virtualenv with the same packages on macOS (Big Sur 11.1 in my case), so I have no clue what's happening.

Packages I used:

certifi==2020.12.5
chardet==4.0.0
GEOparse==2.0.3
idna==2.10
numpy==1.20.1
pandas==1.2.2
python-dateutil==2.8.1
pytz==2021.1
requests==2.25.1
six==1.15.0
tqdm==4.56.2
urllib3==1.26.3

For me it was working with FTP, but the fix above solved the issue for daniwelter.

This leads me to think that there is an issue in the _download_ftp function, perhaps with how total_size is calculated. The _download_http function takes the size from the headers, but I don't see any reason why the total_size check would differ between the two functions; everything looks normal. What if len(data) is bigger than what f.write(data) returns? (write returns the number of characters written.) For example, data might contain odd characters that are not written, such as multiple EOF or EOS characters. It works in my case, so I can't reproduce the issue and this is just a guess. Maybe some underlying network library is doing something odd with the packets.

My MTU seems fine (I know some VPNs mess around with the MTU), and I can't think of any other differences.
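
The write hypothesis can be checked in isolation. The sketch below is not GEOparse's code. It assumes only that the file is opened in binary mode, as a download normally would be, and compares the chunk length, the value returned by `write`, and the size on disk for a chunk containing NUL, Ctrl-Z (DOS EOF) and Ctrl-D bytes. `written_vs_length` is a hypothetical helper name.

```python
import os
import tempfile


def written_vs_length(data):
    """Write bytes to a temp file in binary mode and return
    (len(data), value returned by write, size on disk)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            returned = f.write(data)
        return len(data), returned, os.path.getsize(path)
    finally:
        os.remove(path)


chunk = b"abc\x00\x1a\x04" * 1000
print(written_vs_length(chunk))  # (6000, 6000, 6000)
```

All three numbers agree in binary mode, so a mismatch would more likely come from a file opened in text mode, where `write` counts characters rather than bytes, or from a connection that ended early.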

networksetup -getMTU en0
Active MTU: 1500 (Current Setting: 1500)

carlosvega avatar Feb 12 '21 16:02 carlosvega

I have the same problem with GSE39582. I don't think it's GEOparse's fault, because I have the same problem when downloading the SOFT file manually: using Firefox or wget, I constantly get files of different sizes. Using HTTP instead of FTP (export GEOPARSE_USE_HTTP_FOR_FTP=yes) solves the issue.

robertcv avatar Feb 18 '21 10:02 robertcv

Hi, sorry for not replying for a long time. One reason might be that you are behind a corporate proxy. The option GEOPARSE_USE_HTTP_FOR_FTP=yes was introduced because FTP did not work with Travis CI. I will check the functions that @carlosvega mentioned, but it is hard to debug these issues since everything is working for me.

guma44 avatar Feb 18 '21 13:02 guma44

Yes, but I was using the same VPN as @daniwelter and it was working for me, so my guess is that it is some network issue… If it doesn't work in the browser, then it is not your code's fault. But maybe you could add GEOPARSE_USE_HTTP_FOR_FTP as an argparse option or as a failover for FTP.
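
To illustrate the failover idea, here is a rough sketch that is independent of GEOparse's internals. `to_https` and `download_with_failover` are hypothetical names, and the sketch assumes the server exposes the same path over HTTPS as over FTP.

```python
import urllib.request


def to_https(url):
    """Rewrite an ftp:// URL to https:// on the same host and path."""
    if url.startswith("ftp://"):
        return "https://" + url[len("ftp://"):]
    return url


def download_with_failover(url, dest, fetch=urllib.request.urlretrieve):
    """Try the URL as given; on a network error retry once over HTTPS.

    `fetch(url, dest)` must raise OSError on failure (urllib's URLError
    is an OSError subclass). Returns the URL that succeeded.
    """
    try:
        fetch(url, dest)
        return url
    except OSError:
        fallback = to_https(url)
        if fallback == url:
            raise  # already not FTP, nothing to fall back to
        fetch(fallback, dest)
        return fallback
```

Passing `fetch` as a parameter keeps the sketch testable offline, and a size check could be added inside the fetch function so that a truncated FTP transfer also triggers the fallback.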

carlosvega avatar Feb 18 '21 16:02 carlosvega

@carlosvega @guma44 I wasn't using a VPN at all but the FTP > HTTP switch worked for me either way.

daniwelter avatar Feb 18 '21 16:02 daniwelter

> Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet. I can download the files manually but there are a lot of them and the two I tested I couldn't open using GEOparse on local files either. I use GEOparse v2.0.1 and python 3.8.

I saw this sometimes, but sometimes it went away... I'm not sure what the problem is; I suspect it's NCBI's side dropping connections.
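
If the failures really are intermittent, simply retrying may be enough when fetching many series. A minimal sketch, assuming a failed download surfaces as an exception; `retry` is a hypothetical helper, not part of GEOparse.

```python
import time


def retry(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or attempts run out.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between tries
    and re-raises the last error if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, `retry(lambda: GEOparse.get_GEO(geo="GSE47598", destdir="./"))` would try the same series up to three times before giving up.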

CholoTook avatar May 17 '21 12:05 CholoTook