custom_translate getting called without externalMT
Describe the bug
When using either DIC or NDA as the documentAligner, the custom_translate rule still executes and crashes. I believe this is because the MT command evaluates to "None" in the following pipeline, which causes pigz to segfault:
zcat /data/shards/en/182/1/sentences.gz | eval "$filter_command" | b64filter cache ${parallel_cmd} None | pigz -c > "$output"
To Reproduce
Steps to reproduce the behavior: use this .yaml file:
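To illustrate the suspected mechanism, here is a minimal Python sketch of how an unset MT command could end up as the literal word None in a shell pipeline. The template and both function names are hypothetical and only mirror the shape of the command above; this is not Bitextor's actual code.

```python
from typing import Optional

# Hypothetical template that mirrors the shape of the failing pipeline.
TEMPLATE = "zcat {src} | b64filter cache {mt_command} | pigz -c > {out}"


def render_unchecked(src: str, out: str, mt_command: Optional[str]) -> str:
    """Format the pipeline with no validation.

    If mt_command is None, str.format() inserts the text "None",
    so the shell would later try to run a command called None.
    """
    return TEMPLATE.format(src=src, out=out, mt_command=mt_command)


def render_checked(src: str, out: str, mt_command: Optional[str]) -> str:
    """Same as above, but refuse to build a pipeline without an MT command."""
    if not mt_command:
        raise ValueError("no MT command configured; the translation step should be skipped")
    return TEMPLATE.format(src=src, out=out, mt_command=mt_command)


if __name__ == "__main__":
    print(render_unchecked("sentences.gz", "sentences_fr.gz", None))
```

A check like the one in render_checked would turn the confusing pipeline failure into an explicit configuration error, assuming the rule is reached at all.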
until: segalign
verbose: True
# BASIC VARIABLES
dataDir: data
permanentDir: final
transientDir: transient
tempDir: temp
# DATA SOURCES - CRAWLING
hostsFile: test.url
crawler: "wget"
crawlWait: 5
crawlFileTypes: ["html", "pdf"]
crawlTimeLimit: "1800s"
# PREPROCESSING
preprocessor: "warc2text"
shards: 8 # 2^8 = 256 shards
batches: 1024 # each shard split into chunks of 1024 MB
lang1: 'en'
lang2: 'fr'
# ALIGN
documentAligner: 'DIC'
dic: en-fr/en-fr.dic
sentenceAligner: 'hunalign'
This also seems to happen when documentAligner is "NDA" (with vecalign as the sentenceAligner).
Expected behavior
I do not believe this rule should execute, as no externalMT system is specified.
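The expectation can be written down as a small predicate over the configuration. This is only a sketch of the behavior I expected, assuming translation is tied to documentAligner being "externalMT"; the function name needs_translation is hypothetical and the real workflow may take other options into account.

```python
from typing import Any, Dict


def needs_translation(config: Dict[str, Any]) -> bool:
    """Return True only when the document aligner relies on external MT.

    Assumption: no other option in the config requires translated text.
    """
    return config.get("documentAligner") == "externalMT"


if __name__ == "__main__":
    # Values taken from the configuration in this report.
    print(needs_translation({"documentAligner": "DIC", "sentenceAligner": "hunalign"}))
    print(needs_translation({"documentAligner": "NDA", "sentenceAligner": "vecalign"}))
```

Under that assumption, both configurations in this report would skip custom_translate.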
Log
Error in rule custom_translate:
jobid: 19
output: /exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz
shell:
mkdir -p /exp/rwicks/bitextor/temp
initial_nolines=$(zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz | base64 -d | wc -l)
output="/exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz"
filter_command="cat"
if [[ "False" == "True" ]]; then
zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz | python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/apply_command_b64_doc.py --empty-docs-value "" "cut -f 2" > "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs"
para_nolines=$(cat "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs" | base64 -d | grep -E "^p[0-9]+/[0-9]+s[0-9]+/[0-9]+$|^p-1s-1$" | sed "/^\s*$/d" | wc -l)
if [[ "$initial_nolines" -ne "$para_nolines" ]]; then
>&2 echo "Lines count differs: source $initial_nolines, paragraph identification $para_nolines"
exit 1
fi
output=$(mktemp /exp/rwicks/bitextor/temp/custom_translate.tmp_output.XXXXX.gz)
filter_command="python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/apply_command_b64_doc.py --empty-docs-value '' 'cut -f 1'"
fi
parallel_cmd=$([[ 1 -gt 1 ]] && echo "parallel --gnu --halt 2 --pipe --j 1 -k" || echo "")
zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz | eval "$filter_command" | b64filter cache ${parallel_cmd} None | pigz -c > "$output"
n_after=$(zcat "$output" | base64 -d | wc -l)
if [ $initial_nolines -ne $n_after ]; then
>&2 echo "Lines count differs: source $initial_nolines, target $n_after"
exit 1
fi
if [[ "False" == "True" ]]; then
paste <(zcat "$output") <(cat "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs") | python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/join_b64_docs.py | pigz -c > "/exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz"
fi
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
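As a side note on the strict mode message, the sketch below shows why one failing stage is enough to abort a whole rule. It runs a three-stage pipeline whose middle command does not exist, once with bash defaults and once with set -euo pipefail. The command name no_such_command_for_demo is a placeholder; this does not confirm what b64filter or pigz actually did in my run, and it assumes bash is on the PATH.

```python
import subprocess

# Middle stage is a deliberately nonexistent command (placeholder name).
PIPELINE = "printf 'x\\n' | no_such_command_for_demo | cat"


def run_bash(script: str) -> int:
    """Run a script with bash and return its exit code, discarding output."""
    result = subprocess.run(
        ["bash", "-c", script],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


def run_default() -> int:
    """Without pipefail, the pipeline status is that of the last stage (cat)."""
    return run_bash(PIPELINE)


def run_strict() -> int:
    """With pipefail, a failure in any stage makes the pipeline fail."""
    return run_bash("set -euo pipefail; " + PIPELINE)


if __name__ == "__main__":
    print("default exit code:", run_default())
    print("strict exit code:", run_strict())
```

With bash defaults the broken middle stage goes unnoticed, while under strict mode the same pipeline reports a failure, which matches the kind of message shown at the end of the log.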
Hi!
I think the cause might be until: segalign, but since you're crawling, it is difficult for me to reproduce the issue with the same configuration. Could you share the whole log, please?
Closing. If you need further assistance, please re-open this issue.