GEOparse
Cannot download and parse GEO files
After I downloaded the Series Matrix File(s), the GEOparse.get_GEO function didn't work and reported that there is no series. So I tried using GEOparse.get_GEO to download the files from the website instead, but that did not work either. It seems like the URL is wrong.
@Mengflz Hi, could you give a complete snippet so I can look at it?
Here is the error message. When I downloaded GSE52562, I ran into problems. This dataset can be downloaded from GEO directly.
Could you also share your Python and GEOparse versions? For me it is working without any problems.
My GEOparse version is 2.0.1 and my Python version is 3.6.8.
Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage:

> Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet.
I can download the files manually, but there are a lot of them, and the two I tested couldn't be opened with GEOparse from local files either.
I use GEOparse v2.0.1 and Python 3.8.
As a workaround you can do

```shell
export GEOPARSE_USE_HTTP_FOR_FTP=yes
```

before running your code.
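If setting the variable in the shell is inconvenient, the same workaround can be applied from Python itself. A minimal sketch; the only assumption is that the variable has to be set before GEOparse triggers the download (the get_GEO call is shown commented out so the snippet stands on its own):

```python
import os

# Force GEOparse to use HTTP instead of FTP for downloads.
# Must be set before GEOparse performs the download.
os.environ["GEOPARSE_USE_HTTP_FOR_FTP"] = "yes"

# import GEOparse
# gse = GEOparse.get_GEO(geo="GSE47598", destdir="./data")
```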
@daniwelter and I tested the very same code in a virtualenv with the same packages, on macOS (in my case Big Sur 11.1). So no clue what's happening.
Packages I used:

```
certifi==2020.12.5
chardet==4.0.0
GEOparse==2.0.3
idna==2.10
numpy==1.20.1
pandas==1.2.2
python-dateutil==2.8.1
pytz==2021.1
requests==2.25.1
six==1.15.0
tqdm==4.56.2
urllib3==1.26.3
```
For me it was working over FTP, but the fix above solved the issue for @daniwelter.
This leads me to think that there is an issue in the `_download_ftp` function, perhaps with how `total_size` is calculated there. The `_download_http` function takes the size from the headers, but I don't see any reason why the `total_size` check would behave differently in those two functions; everything looks normal. What if `len(data)` is bigger than what `f.write(data)` returns? (`write` returns the number of characters written.) For example, if `data` contains some odd character that won't be written, such as multiple EOF or EOS characters. It works for me, so I can't reproduce the issue; this is just a guess. Maybe some underlying network library is doing weird stuff with the packets.
My MTU seems fine (I know some VPNs mess with MTU); I can't think of any other differences.

```shell
$ networksetup -getMTU en0
Active MTU: 1500 (Current Setting: 1500)
```
I have the same problem with GSE39582. I don't think it's GEOparse's fault, because I have the same problem when downloading the SOFT file manually: using Firefox or `wget` I consistently get files of different sizes. Using HTTP instead of FTP (`export GEOPARSE_USE_HTTP_FOR_FTP=yes`) solves the issue.
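For anyone fetching the SOFT files manually over HTTPS, the path layout can be inferred from the FTP URL in the error message above (the range directory replaces the accession's last three digits with `nnn`). A small helper, assuming NCBI serves the same directory tree over HTTPS as over FTP:

```python
def soft_url(accession: str, scheme: str = "https") -> str:
    """Build the family SOFT file URL for a GSE accession.

    The range directory replaces the last three digits with 'nnn',
    e.g. GSE47598 -> GSE47nnn, matching the FTP URL in the error.
    """
    stub = accession[:-3] + "nnn"  # "GSE47598" -> "GSE47nnn"
    return (
        f"{scheme}://ftp.ncbi.nlm.nih.gov/geo/series/"
        f"{stub}/{accession}/soft/{accession}_family.soft.gz"
    )

print(soft_url("GSE47598"))
```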
Hi, sorry for not replying for a long time. One possible cause is that you are behind some corporate proxy. The option `GEOPARSE_USE_HTTP_FOR_FTP=yes` was introduced because FTP did not want to work with Travis CI. I will check the functions that @carlosvega mentioned, but these issues are hard to debug because everything works for me.
Yes, but I was using the same VPN as @daniwelter and it worked for me, so my guess is that it's some network issue. If it doesn't work in the browser either, then it's not your code's fault. But maybe you could expose GEOPARSE_USE_HTTP_FOR_FTP as an option, or use HTTP as a failover when FTP fails.
@carlosvega @guma44 I wasn't using a VPN at all but the FTP > HTTP switch worked for me either way.
> Is there any update on this issue? I have the same problem trying to get the metadata from 190 GEO series. Downloads fail during the checksum stage:
>
> Download failed due to 'Downloaded size do not match the expected size for ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE47nnn/GSE47598/soft/GSE47598_family.soft.gz'. ID could be incorrect or the data might not be public yet.
>
> I can download the files manually but there are a lot of them and the two I tested I couldn't open using GEOparse on local files either. I use GEOparse v2.0.1 and python 3.8.
I saw this sometimes, but sometimes it went away... Not sure of the cause; I suspect it's NCBI's side dropping connections.