augur icon indicating copy to clipboard operation
augur copied to clipboard

io: Parse metadata with C engine and tab separator

Open victorlin opened this issue 3 years ago • 3 comments

This PR improves slowness issues identified in #789. It is only a slight improvement on my machine, but seems substantial on @huddlej's machine based on https://github.com/nextstrain/augur/issues/789#issuecomment-985931991.

The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g. metadata.csv). This would be a breaking change.

Testing

  • No tests

References

victorlin avatar Dec 10 '21 20:12 victorlin

You could also use csv.Sniffer.sniff() to return a Dialect, which can be passed in to pandas.read_csv() as the dialect keyword argument.

fanninpm avatar Jan 05 '22 19:01 fanninpm

The only concern would be if anyone is leveraging the Python parser for a non-tab-delimited file (e.g. metadata.csv). This would be a breaking change.

I'm not sure if this is an acceptable trade off. Diving into history, Augur's supported CSV for metadata since at least 2018 and delimiter sniffing since mid-2020.

@huddlej may have thoughts here too.

tsibley avatar Feb 02 '22 19:02 tsibley

I agree, @tsibley, that we can't drop CSV support. The original context for the current implementation is described in https://github.com/nextstrain/augur/issues/574. There are two separate problems:

  1. determining the delimiter of an input file
  2. efficiently parsing a file given its delimiter

In the older Augur implementations, we addressed problem 1 by inspecting the extension of the input filename. This led to the problems in #574. We opted for the convenience of pandas's delimiter sniffer in Python parser mode, to solve this problem at the expense of a slower solution to problem 2.

As @fanninpm points out, we could use csv.Sniffer directly on the first line of the input file in read_metadata. Once we know the delimiter, we could call read_csv with that delimiter and use the C engine to parse. This would solve both problems without breaking backward compatibility.

huddlej avatar Feb 02 '22 23:02 huddlej