augur BUG in mask: empty (header only) .bed file causes unhandled error

Open corneliusroemer opened this issue 2 years ago • 7 comments

Current Behavior

When the .bed file passed to mask contains just a header and nothing to mask, an unhandled error is thrown.

Expected behavior

Empty or header-only .bed files are accepted and simply result in no masking.

How to reproduce

Run augur mask --sequences results/global/aligned.fasta --mask config/mask.bed --mask-from-beginning 7000 --mask-from-end 7000 --output results/global/masked.fasta with a header-only .bed file.

Augur 15.0.2

Stacktrace:

Traceback (most recent call last):
  File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 678, in read_bed_file
    bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'ChromStart'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/bin/augur", line 10, in <module>
    sys.exit(main())
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
    return augur.run( argv[1:] )
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
    return args.__command__.run(args)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/mask.py", line 215, in run
    mask_sites.update(load_mask_sites(args.mask_file))
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 734, in load_mask_sites
    mask_sites = read_bed_file(mask_file)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 682, in read_bed_file
    bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
    self._engine = self._make_engine(f, self.engine)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1235, in _make_engine
    return mapping[engine](f, **self.options)
  File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file

May 21 '22 11:05 corneliusroemer

Also, an empty file should be accepted as well. But currently it throws an error, so not even a simple workaround exists :/

ERROR: config/mask.bed is an empty file.

May 21 '22 11:05 corneliusroemer

Should update code to catch this error and print the message above.

May 25 '22 19:05 victorlin

Canonical BED format specification

May 25 '22 19:05 tsibley

@tsibley are you saying an empty bed file is not legal? I couldn't find out whether that's the case glancing through the spec. Do you know?

Somewhat related to #946

May 26 '22 14:05 corneliusroemer

Empty bed files should be fine. Nothing in the specification seems to indicate that they're not, and the following works fine:

$ touch empty
$ bedtools sort -i empty 
$ echo $?
0

May 26 '22 21:05 jameshadfield

are you saying an empty bed file is not legal?

No, not saying that. Was just linking to the spec for reference. Sorry I didn't include more context.

May 26 '22 22:05 tsibley

Keeping this comment descriptive rather than prescriptive (at least for the moment).

There are two separate things here:

1. BED files with no regions/data lines ("empty")
2. BED files with a header line

which sometimes occur together and sometimes occur separately. This issue is about both 1 with 2 (just a header) and 1 without 2 (the zero-byte BED file).

Regardless of how Augur handles zero-byte BED files, there's no standard header for BED files, cf. the spec above and bedtools behaviour:

$ cat config/mask.bed 
Chrom	ChromStart	ChromEnd	locus tag	Comment
chr	6400	7500		very diverse region
chr	133050	133250		indel variation and long homopolymers

$ bedtools sort -i config/mask.bed 
Unexpected file format.  Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?
[1]

$ bedtools sort -i <(tail -n +2 config/mask.bed)
chr	6400	7500		very diverse region
chr	133050	133250		indel variation and long homopolymers

although bedtools does (at least) ignore lines starting with #. This applies to all lines though, not just the first line.

Augur's read_bed_file() currently skips the first line if there's a parsing failure and tries again, which handles ad-hoc headers but will also mask other errors in the first line.

May 26 '22 23:05 tsibley
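
To make the two cases discussed above concrete, here is a minimal sketch of a BED reader that returns no sites for a zero-byte or header-only file. It is not Augur's implementation: the name read_bed_sites is hypothetical, it uses only the standard library rather than pandas, and it assumes that at most one ad-hoc header line is tolerated (the first non-comment line, and only when its start/end columns are not integers), that blank lines and lines starting with # are ignored, and that regions follow the 0-based, half-open BED convention.

```python
import io


def read_bed_sites(handle):
    """Return the set of 0-based positions covered by the BED regions in `handle`.

    Hypothetical helper for illustration only, not part of Augur.
    A zero-byte or header-only input yields an empty set (nothing to mask).
    """
    sites = set()
    first_record_seen = False  # becomes True after the first non-comment line

    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")

        # Assumption: blank lines and '#' comment lines are skipped anywhere.
        if not line.strip() or line.startswith("#"):
            continue

        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_number}: expected at least 3 tab-separated fields")

        try:
            start, end = int(fields[1]), int(fields[2])
        except ValueError:
            if not first_record_seen:
                # Assumption: tolerate exactly one ad-hoc header line.
                first_record_seen = True
                continue
            raise ValueError(f"line {line_number}: non-integer start or end") from None

        first_record_seen = True
        if start < 0 or end < start:
            raise ValueError(f"line {line_number}: invalid interval {start}-{end}")

        # BED intervals are 0-based and half-open: start is included, end is not.
        sites.update(range(start, end))

    return sites


if __name__ == "__main__":
    header_only = "Chrom\tChromStart\tChromEnd\tlocus tag\tComment\n"
    print(read_bed_sites(io.StringIO("")))           # zero-byte file
    print(read_bed_sites(io.StringIO(header_only)))  # header-only file
```

Unlike a blanket "skip the first line on any failure" retry, this sketch only skips a first line whose start/end columns fail integer parsing. A non-integer value on any later line is reported with its line number instead of being swallowed.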
