augur
augur copied to clipboard
io: Parse metadata with C engine and tab separator
This PR improves slowness issues identified in #789. It is only a slight improvement on my machine, but seems substantial on @huddlej's machine based on https://github.com/nextstrain/augur/issues/789#issuecomment-985931991.
The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g. metadata.csv
). This would be a breaking change.
Testing
- No tests
References
You could also use csv.Sniffer.sniff()
to return a Dialect
, which can be passed in to pandas.read_csv()
as the dialect
keyword argument.
The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g.
metadata.csv
). This would be a breaking change.
I'm not sure if this is an acceptable trade off. Diving into history, Augur's supported CSV for metadata since at least 2018 and delimiter sniffing since mid-2020.
@huddlej may have thoughts here too.
I agree, @tsibley, that we can't drop CSV support. The original context for the current implementation is described in https://github.com/nextstrain/augur/issues/574. There are two separate problems:
- determining the delimiter of an input file
- efficiently parsing a file given its delimiter
In the older Augur implementations, we addressed problem 1 by inspecting the extension of the input filename. This led to the problems in #574. We opted for the convenience of pandas's delimiter sniffer in Python parser mode, to solve this problem at the expense of a slower solution to problem 2.
As @fanninpm points out, we could use csv.Sniffer
directly on the first line of the input file in read_metadata
. Once we know the delimiter, we could call read_csv
with that delimiter and use the C engine to parse. This would solve both problems without breaking backward compatibility.