ldc_downloader

file name munging

Open jonmay opened this issue 8 years ago • 2 comments

Filenames created by this script are somewhat abnormal.

For example, take LDC2016E75. The 'file name' column of the LDC downloads page (an imperfect guess at the true filename that the web interface would download) lists it as 'LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered'. This script downloads it as 'LDC2016E75__LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame__LDC2016E75_LORELEI_IL3_Dual_Annotation_for_Simple_Named_Entity_and_Situation_Frame_Unsequestered.tgz'.

Note (a) the doubling of the entry and (b) the extra underscore in the first usage.

Additionally, the script seems to be hard-coded to produce .tgz files, but not all files come from LDC as .tgz. This is mostly a bug in LDC's presentation, since I haven't found a way to predict ahead of time what the file name will be; a simple kludge in the Python version of this script is to let the user specify the filename.

jonmay avatar Oct 13 '16 17:10 jonmay

Thanks for the issue. What would your suggested resolution look like?

dowobeha avatar Oct 13 '16 17:10 dowobeha

Here's the relevant section:

TSV_LINE=$(grep "${LDC_CORPUS}" "${DOWNLOAD_FILE}")
CORPUS_URL=$(cut -f 5 <<< "${TSV_LINE}")
CORPUS_NAME=$(cut -f 2 <<< "${TSV_LINE}" | tr ' ' '_')
CORPUS_FILE=$(cut -f 6 <<< "${TSV_LINE}" | sed 's,.tgz$,,')
LDC_CORPUS_FILENAME="${LDC_CORPUS}__${CORPUS_NAME}__${CORPUS_FILE}.tgz"

dowobeha avatar Oct 13 '16 17:10 dowobeha
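
One possible shape for a resolution, offered as a rough sketch and not as the script's actual fix: build the name from the 'file name' column alone, prepend the corpus ID only when it is missing, keep whatever archive extension is present, and let a user-supplied name override everything. The function name `build_corpus_filename`, the optional third argument, and the list of recognised extensions are all assumptions made for illustration; none of them come from ldc_downloader, and the fallback to .tgz is still only a guess.

```bash
#!/usr/bin/env bash
# Sketch only: build_corpus_filename is a hypothetical helper, not part of ldc_downloader.
# Arguments: corpus ID, the TSV "file name" guess, and an optional user-chosen name.
build_corpus_filename() {
    local ldc_corpus corpus_file override ext
    ldc_corpus="$1"
    corpus_file="$2"
    override="${3:-}"

    # A user-supplied name wins outright, since the real name cannot be predicted.
    if [ -n "${override}" ]; then
        printf '%s\n' "${override}"
        return 0
    fi

    # Split off a recognised archive extension (this list is an assumption).
    # With no recognised extension, fall back to .tgz as the current script does.
    ext=".tgz"
    case "${corpus_file}" in
        *.tar.gz) ext=".tar.gz"; corpus_file="${corpus_file%.tar.gz}" ;;
        *.tgz)    corpus_file="${corpus_file%.tgz}" ;;
        *.zip)    ext=".zip"; corpus_file="${corpus_file%.zip}" ;;
    esac

    # Prefix the corpus ID only when the guessed name does not already start with it.
    case "${corpus_file}" in
        "${ldc_corpus}"*) printf '%s%s\n' "${corpus_file}" "${ext}" ;;
        *)                printf '%s_%s%s\n' "${ldc_corpus}" "${corpus_file}" "${ext}" ;;
    esac
}
```

In the section quoted above, this could stand in for the final assignment, for example `LDC_CORPUS_FILENAME=$(build_corpus_filename "${LDC_CORPUS}" "${CORPUS_FILE}" "${USER_FILENAME:-}")`, where `USER_FILENAME` is a hypothetical variable holding a name chosen by the user. For the LDC2016E75 example this yields the single, undoubled name with one `.tgz` suffix.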