microdadosBrasil
ftp server data: should warn user of download errors
Sometimes a file on an FTP server will not download even after a minute of trying. R then proceeds to the next file, but we need to warn the user about the missing file.
Also, maybe the download function should do a second pass, trying to download only the files that failed in the first (maybe even a third one).
Maybe add an option for interactive feedback from the user.
Running into this issue -- why is the host at ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/ so bad? All my requests end up timing out and I am left with a set of empty zip files.
Perhaps you could build in a continuous loop from which the files that successfully download are removed. The connection seems sporadic -- sometimes they download, sometimes not -- so this is something one could leave running overnight with an assurance that the download will eventually complete.
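The loop suggested above might look something like this minimal sketch. Nothing here is part of the package: `file_links`, `dest_paths`, and the injectable `download_one` argument are all hypothetical (the default just wraps `download.file()`; passing a different function makes the retry logic testable without a network).

```r
# Sketch only: keep retrying failed downloads until everything succeeds
# or max_passes is exhausted; warn about anything still missing.
download_all <- function(file_links, dest_paths, max_passes = 10,
                         download_one = function(url, dest)
                           utils::download.file(url, dest, mode = "wb")) {
  pending <- seq_along(file_links)
  for (pass in seq_len(max_passes)) {
    if (length(pending) == 0) break
    failed <- integer(0)
    for (i in pending) {
      ok <- tryCatch({download_one(file_links[i], dest_paths[i]); TRUE},
                     error = function(e) FALSE,
                     warning = function(w) FALSE)
      if (!ok) failed <- c(failed, i)
    }
    pending <- failed  # only the failures go into the next pass
  }
  if (length(pending) > 0)
    warning("Could not download: ",
            paste(file_links[pending], collapse = ", "))
  invisible(pending)  # indices of files that never downloaded
}
```

Whatever is still pending after the last pass is exactly the set of files the user should be warned about, which also addresses the original issue.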
@steveofconnell I just implemented your suggestion in #69 and I think it worked; I left it running overnight and the download completed. Please let me know if it works for you.
I have had a lot of problems with corrupted files in this dataset (the files were downloaded but I can't unzip them). Did you have any similar problem in your attempts?
@steveofconnell thanks! @nicolassoarespinto, I see your solution (nice!). However, isn't there some download package/function in R (maybe as part of RCurl) that does that automatically? It would be something equivalent to wget. That would be preferable to us inventing the download rules (how many times to retry, etc.). Maybe google this or ask on StackOverflow.
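For reference, one way to get wget-style retry behavior without reinventing it in R is to shell out to wget itself. This is only a sketch under the assumption that the wget binary is on the PATH; the function name is made up here.

```r
# Sketch: delegate retries/resume to wget rather than hand-rolling them.
# -c resumes partial files; --tries and --waitretry set the retry policy.
wget_fetch <- function(url, dest, tries = 10) {
  status <- suppressWarnings(
    system2("wget",
            c("-c", paste0("--tries=", tries), "--waitretry=5",
              "-O", shQuote(dest), shQuote(url)),
            stdout = FALSE, stderr = FALSE))
  identical(status, 0L)  # TRUE only if wget reported success
}
```

A non-zero exit status (or a missing wget binary) comes back as FALSE, so failures can still be fed into a second-pass loop on the R side.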
@nicolassoarespinto
I haven't even gotten to the stage of unzipping yet -- still trying to download everything from that server. One error message I got today, which I had not seen before, is pasted below: the downloaded size of the file ends up not matching the "reported size" (not sure where the size is reported from, probably something internal -- error 3 below). It might come from a timeout on their end; not sure. You might want to build in another validation if you can access the reported size, or look for that error within the loop and, when it is thrown, delete the file and retry the download for that file:
2: In download_sourceData("RAIS", i = j) :
  There are files in .7z format inside the main folder, please unzip manually:
  2007/AC2007.7z, 2007/AL2007.7z, 2007/AM2007.7z, 2007/AP2007.7z, 2007/BA2007.7z, 2007/CE2007.7z, 2007/DF2007.7z, 2007/ES2007.7z, 2007/ESTB2007.7z, 2007/GO2007.7z, 2007/MA2007.7z, 2007/MG2007.7z, 2007/MS2007.7z, 2007/MT2007.7z, 2007/PA2007.7z, 2007/PB2007.7z, 2007/PE2007.7z, 2007/PI2007.7z, 2007/PR2007.7z, 2007/RJ2007.7z, 2007/RN2007.7z, 2007/RO2007.7z, 2007/RR2007.7z, 2007/RS2007.7z, 2007/SC2007.7z, 2007/SE2007.7z, 2007/SP2007.7z, 2007/TO2007.7z
3: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
  downloaded length 522728 != reported length 399587785
4: In download.file(file_links[y], destfile = paste(c(dest, file_dir, :
  URL 'ftp://ftp.mtps.gov.br/pdet/microdados/RAIS/2008/SP2008.7z': status was 'Failure when receiving data from the peer'
You might also consider adding a validation function on file size (compare downloaded file sizes to known file sizes for the target year/file) -- as a standalone function run by the user at their whim, as an option in the RAIS download, or carried out automatically in the download function.
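That validation could be as small as the sketch below. The expected sizes would have to be recorded somewhere beforehand (e.g. from the FTP directory listing); the function and its arguments are hypothetical, not part of the package.

```r
# Sketch: flag files whose on-disk size differs from the expected size.
# `files` and `expected` are parallel vectors; file.size() returns NA
# for files that are missing entirely, which also counts as "bad".
validate_sizes <- function(files, expected) {
  actual <- file.size(files)
  bad <- is.na(actual) | actual != expected
  if (any(bad))
    warning("Size mismatch or missing file: ",
            paste(files[bad], collapse = ", "))
  files[bad]  # these are the candidates for re-download
}
```

The returned vector of bad files can be fed straight back into the download loop.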
I am finally at the point of unzipping. I occasionally get errors for corrupted files. I then go re-download them directly and they are fine. Both files (corrupted and uncorrupted) are exactly the same size, so I have no idea how to detect the corruption without trying to unzip first.
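Since size alone can't catch this, one option is to let the archiver itself test the file. A sketch for the .7z files in this dataset, assuming a `7z` binary is on the PATH (none of this is part of the package):

```r
# Sketch: "7z t" tests archive integrity without extracting anything.
# Exit status 0 means the archive is intact; anything else (including a
# missing 7z binary) is treated as a failed check.
is_archive_ok <- function(path) {
  status <- suppressWarnings(
    system2("7z", c("t", shQuote(path)), stdout = FALSE, stderr = FALSE))
  identical(status, 0L)
}
```

Files flagged this way could be deleted and fed back into the re-download pass.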
FYI, I just posted this question to the R-pkg-devel mailing list: