bitextor

custom_translate getting called without externalMT

Open rewicks opened this issue 3 years ago • 1 comments

Describe the bug

When using either DIC or NDA as the documentAligner, the custom_translate rule still executes and crashes. I believe this is because, in the command

zcat /data/shards/en/182/1/sentences.gz | eval "$filter_command" | b64filter cache ${parallel_cmd} None | pigz -c > "$output"

the MT command evaluates to "None", which causes pigz to segfault.
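To illustrate the suspected mechanism, here is a minimal sketch of how an unset value can end up as the literal word "None" in a shell line. It is not bitextor's actual code: the `config` dictionary, the key name `hypotheticalMTCommand` and the shortened template are all assumptions made for this example.

```python
# Hypothetical stand-in for the pipeline configuration: no MT command is set.
config = {"documentAligner": "DIC", "lang1": "en", "lang2": "fr"}

# "hypotheticalMTCommand" is an assumed key name used only for illustration.
# Looking up a missing key with .get() yields Python's None.
mt_command = config.get("hypotheticalMTCommand")

# Formatting None into a shell template produces the literal text "None",
# which the shell would then treat as a command name or argument.
template = 'zcat sentences.gz | b64filter cache {cmd} | pigz -c > "$output"'
shell_line = template.format(cmd=mt_command)

print(shell_line)
```

If this is what happens in the rule, the generated shell line contains `b64filter cache None`, matching the command quoted above.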

To Reproduce

Steps to reproduce the behavior: use the following .yaml file:

until: segalign
verbose: True

# BASIC VARIABLES
dataDir: data
permanentDir: final
transientDir: transient
tempDir: temp

# DATA SOURCES - CRAWLING
hostsFile: test.url
crawler: "wget"

crawlWait: 5
crawlFileTypes: ["html", "pdf"]
crawlTimeLimit: "1800s"

# PREPROCESSING
preprocessor: "warc2text"
shards: 8 # 2^8 = 256 shards
batches: 1024 # each shard split into chunks of 1024 MB

lang1: 'en'
lang2: 'fr'

# ALIGN
documentAligner: 'DIC'
dic: en-fr/en-fr.dic

sentenceAligner: 'hunalign'

This also seems to happen when documentAligner is "NDA" (with vecalign as the sentenceAligner).

Expected behavior

I do not believe this rule should execute, as there is no externalMT system specified.
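The expectation above could be expressed as a small guard. This is only a sketch under the assumption that translation is needed solely when documentAligner is set to externalMT. The function name `needs_translation` is hypothetical and does not come from the bitextor code base.

```python
def needs_translation(config: dict) -> bool:
    """Return True only when the config asks for MT-based document alignment.

    Assumption for this sketch: "externalMT" is the only documentAligner
    value that requires the translation rule to run.
    """
    return config.get("documentAligner") == "externalMT"


# The two configurations from this report should not trigger translation.
for aligner in ("DIC", "NDA", "externalMT"):
    print(aligner, needs_translation({"documentAligner": aligner}))
```

With such a check in place, the DIC and NDA configurations described here would skip the translation step instead of running it with an empty MT command.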

Log

Error in rule custom_translate:
    jobid: 19
    output: /exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz
    shell:
        
        mkdir -p /exp/rwicks/bitextor/temp
        initial_nolines=$(zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz | base64 -d | wc -l)
        output="/exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz"
        filter_command="cat"

        if [[ "False" == "True" ]]; then
            zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz                 | python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/apply_command_b64_doc.py --empty-docs-value "" "cut -f 2" > "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs"

            para_nolines=$(cat "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs" | base64 -d                 | grep -E "^p[0-9]+/[0-9]+s[0-9]+/[0-9]+$|^p-1s-1$" | sed "/^\s*$/d" | wc -l)

            if [[ "$initial_nolines" -ne "$para_nolines" ]]; then
                >&2 echo "Lines count differs: source $initial_nolines, paragraph identification $para_nolines"
                exit 1
            fi
            output=$(mktemp /exp/rwicks/bitextor/temp/custom_translate.tmp_output.XXXXX.gz)
            filter_command="python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/apply_command_b64_doc.py --empty-docs-value '' 'cut -f 1'"
        fi

        parallel_cmd=$([[ 1 -gt 1 ]] && echo "parallel --gnu --halt 2 --pipe --j 1 -k" || echo "")

        zcat /exp/rwicks/bitextor/data/shards/en/182/1/sentences.gz             | eval "$filter_command"             |  b64filter cache ${parallel_cmd} None             | pigz -c > "$output"

        n_after=$(zcat "$output" | base64 -d | wc -l)

        if [ $initial_nolines -ne $n_after ]; then
            >&2 echo "Lines count differs: source $initial_nolines, target $n_after"
            exit 1
        fi
        if [[ "False" == "True" ]]; then
            paste <(zcat "$output") <(cat "/exp/rwicks/bitextor/temp/custom_translate_182_1.paragraphs")                 | python3 /home/hltcoe/rwicks/.conda/envs/bitextor/lib/python3.8/site-packages/bitextor/utils/join_b64_docs.py                 | pigz -c > "/exp/rwicks/bitextor/data/shards/en/182/1/sentences_fr.gz"
        fi
        
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
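The strict-mode note at the end of the log can be illustrated with a simplified stand-in. This is an assumption-laden sketch, not a reproduction of the real rule. It runs "None" directly as a command (the real pipeline passes it to b64filter), it uses `cat` in place of pigz, and it assumes strict mode means `set -euo pipefail`.

```python
import subprocess

# Simplified stand-in for the failing pipeline. The middle stage is the
# literal word "None", which is not an executable command.
script = """
set -euo pipefail
echo aGVsbG8K | None | cat > /dev/null
echo "not reached"
"""

# Run the script with bash and capture what it prints.
result = subprocess.run(["bash", "-c", script], capture_output=True, text=True)

print("exit status:", result.returncode)
print("stdout:", repr(result.stdout))
```

Because of `pipefail` and `-e`, the failing middle stage makes the whole script exit with a non-zero status before the final `echo` runs, which is the kind of failure Snakemake reports for the rule.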

Additional context

Add any other context about the problem here.

rewicks, Aug 04 '22 19:08

Hi!

I think the cause might be until: segalign, but since you're crawling, it is difficult for me to reproduce the issue with the same configuration. Could you share the whole log, please?

cgr71ii, Aug 05 '22 06:08

Closing. If you need further assistance, please re-open this issue.

cgr71ii, Nov 12 '22 12:11