augur
BUG in mask: empty (header only) .bed file causes unhandled error
Current Behavior
When the .bed file passed to mask contains just a header and nothing to mask, an unhandled error is thrown.
Expected behavior
Empty, header-only .bed files are accepted and simply cause no masking.
How to reproduce
Run augur mask --sequences results/global/aligned.fasta --mask config/mask.bed --mask-from-beginning 7000 --mask-from-end 7000 --output results/global/masked.fasta
with a header-only .bed file.
Augur 15.0.2
Stacktrace:
Traceback (most recent call last):
File "pandas/_libs/parsers.pyx", line 1113, in pandas._libs.parsers.TextReader._convert_tokens
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 678, in read_bed_file
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
return parser.read(nrows)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
index, columns, col_dict = self._engine.read(nrows)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
chunks = self._reader.read_low_memory(nrows)
File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas/_libs/parsers.pyx", line 1119, in pandas._libs.parsers.TextReader._convert_tokens
ValueError: invalid literal for int() with base 10: 'ChromStart'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/bin/augur", line 10, in <module>
sys.exit(main())
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/__main__.py", line 10, in main
return augur.run( argv[1:] )
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/__init__.py", line 75, in run
return args.__command__.run(args)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/mask.py", line 215, in run
mask_sites.update(load_mask_sites(args.mask_file))
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 734, in load_mask_sites
mask_sites = read_bed_file(mask_file)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/augur/utils.py", line 682, in read_bed_file
bed = pd.read_csv(bed_file, sep='\t', header=None, usecols=[1,2],
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
return _read(filepath_or_buffer, kwds)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 575, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 933, in __init__
self._engine = self._make_engine(f, self.engine)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1235, in _make_engine
return mapping[engine](f, **self.options)
File "/usr/local/Caskroom/mambaforge/base/envs/nextstrain/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 75, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 551, in pandas._libs.parsers.TextReader.__cinit__
pandas.errors.EmptyDataError: No columns to parse from file
Also, an empty file should be accepted as well. But currently it throws an error, so not even a simple workaround exists :/
ERROR: config/mask.bed is an empty file.
Should update code to catch this error and print the message above.
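For reference, a minimal sketch of what catching this could look like. The function name `read_bed_sites` is hypothetical and this is not Augur's actual code; it only assumes the same `pd.read_csv` arguments that appear in the stack trace, and it chooses to return no sites for a zero-byte file. Printing the error message above and exiting would be the other option inside the `except` branch. A header-only file is not handled here.

```python
import io

import pandas as pd
from pandas.errors import EmptyDataError


def read_bed_sites(bed_file):
    """Return sorted 0-based sites covered by a headerless BED file.

    Hypothetical helper: a zero-byte file yields an empty list
    instead of an unhandled EmptyDataError.
    """
    try:
        bed = pd.read_csv(bed_file, sep="\t", header=None, usecols=[1, 2],
                          dtype={1: int, 2: int})
    except EmptyDataError:
        # Nothing to parse at all, so there is nothing to mask.
        return []

    sites = set()
    for start, end in zip(bed[1], bed[2]):
        # BED intervals are half-open: start is included, end is not.
        sites.update(range(int(start), int(end)))
    return sorted(sites)


# Usage with in-memory "files" (paths work the same way).
print(read_bed_sites(io.StringIO("")))
print(read_bed_sites(io.StringIO("chr\t2\t5\nchr\t4\t6\n")))
```

Note that `EmptyDataError` is a subclass of `ValueError`, so any existing `except ValueError` retry has to be ordered with that in mind.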
@tsibley are you saying an empty bed file is not legal? I couldn't find out whether that's the case glancing through the spec. Do you know?
Somewhat related to #946
Empty bed files should be fine. Nothing in the specification seems to indicate that they're not, and the following works fine:
$ touch empty
$ bedtools sort -i empty
$ echo $?
0
are you saying an empty bed file is not legal?
No, not saying that. Was just linking to the spec for reference. Sorry I didn't include more context.
Keeping this comment descriptive rather than prescriptive (at least for the moment).
There are two separate things here:
1. BED files with no regions/data lines ("empty")
2. BED files with a header line
which sometimes occur together and sometimes occur separately. This issue is about both 1 with 2 (just a header) and 1 without 2 (the zero-byte BED file).
Regardless of how Augur handles zero-byte BED files, there's no standard header for BED files, cf. the spec above and bedtools behaviour:
$ cat config/mask.bed
Chrom ChromStart ChromEnd locus tag Comment
chr 6400 7500 very diverse region
chr 133050 133250 indel variation and long homopolymers
$ bedtools sort -i config/mask.bed
Unexpected file format. Please use tab-delimited BED, GFF, or VCF. Perhaps you have non-integer starts or ends at line 1?
[1]
$ bedtools sort -i <(tail -n +2 config/mask.bed)
chr 6400 7500 very diverse region
chr 133050 133250 indel variation and long homopolymers
although bedtools does (at least) ignore lines starting with #. This applies to all lines though, not just the first line.
Augur's read_bed_file() currently skips the first line if there's a parsing failure and tries again, which handles ad-hoc headers but will also mask other errors in the first line.
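To make the trade-off concrete, here is a rough sketch of a stricter reader. It is not what Augur does, and the name `parse_bed_sites` and the header heuristic are assumptions for illustration only. It skips blank lines and lines starting with # anywhere in the file, treats the first remaining line as an ad-hoc header only when neither the start nor the end field is an integer, and raises on anything else that fails to parse. Zero-byte and header-only inputs both yield no sites.

```python
def _is_int(text):
    """True if text is a plain non-negative ASCII integer."""
    return text.isascii() and text.isdigit()


def parse_bed_sites(lines):
    """Return sorted 0-based sites from an iterable of BED lines.

    Hypothetical helper illustrating one possible policy:
    - blank lines and lines starting with '#' are ignored everywhere;
    - the first remaining line is treated as a header only if BOTH
      coordinate fields are non-integers (a heuristic, not a standard);
    - any other unparsable line raises ValueError.
    """
    sites = set()
    seen_first = False
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        is_first = not seen_first
        seen_first = True

        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {lineno}: expected at least 3 tab-separated fields")

        start_ok, end_ok = _is_int(fields[1]), _is_int(fields[2])
        if is_first and not start_ok and not end_ok:
            continue  # looks like an ad-hoc header such as "Chrom ChromStart ChromEnd"
        if not (start_ok and end_ok):
            raise ValueError(f"line {lineno}: non-integer start or end")

        # BED intervals are half-open: start is included, end is not.
        sites.update(range(int(fields[1]), int(fields[2])))
    return sorted(sites)


# Usage: a header-only file gives no sites; a typo in line 1 is reported.
print(parse_bed_sites(["Chrom\tChromStart\tChromEnd\tlocus tag\tComment\n"]))
try:
    parse_bed_sites(["chr\t64x\t7500\tvery diverse region\n"])
except ValueError as err:
    print("rejected:", err)
```

With this policy a first line with a single mistyped coordinate is reported instead of silently skipped, at the cost of rejecting headers that happen to have an integer in one of those two columns.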