Consider using numpy loadtxt under the hood for fast ASCII reading
Description
Interesting progress from numpy, which now has a C-based CSV parser built in as loadtxt. See the copied numpy announcement below.
I have not looked at it, but I wonder if it is worth investigating to replace our custom ASCII fast C reader. Maybe the answer is a simple "no". The obvious benefit is greatly reducing maintenance of this difficult code. I suspect the numpy version will have better speed and memory performance as well.
Downsides:
- There are some things built into the astropy fast reader that might not work out of the box, e.g. the FastCsv reader supports missing elements at the end of a row as masked values (see the sketch after this list).
- A non-trivial amount of work to fix something that is not really broken. But it might be a clean and well-defined GSoC project.
- Not clear if the very careful handling of Fortran formats and other details from @dhomeier made it to the numpy parser.
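As a minimal sketch of that first point (the file contents here are invented, and the exact loadtxt error message may vary by version):

from io import StringIO
import numpy as np
from astropy.io import ascii

csv_data = "a,b,c\n1,2,3\n4,5\n"  # second data row is missing its last element

# astropy's fast CSV reader pads the short row and masks the missing entry
t = ascii.read(StringIO(csv_data), format="csv", fast_reader=True)
print(t["c"])  # masked column: [3, --]

# np.loadtxt has no such fallback and rejects the ragged row outright
try:
    np.loadtxt(StringIO(csv_data), delimiter=",", skiprows=1)
except ValueError as err:
    print(err)  # complains that the number of columns changed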
Cc: @dhomeier @hamogu
Numpy announce
https://github.com/numpy/numpy/pull/20580
is now merged. This moves np.loadtxt to C, mainly making it much faster. There are also some other improvements and changes though:
- It now supports quotechar='"' to support Excel dialect CSV.
- Parsing some numbers is stricter (e.g. removed support for _ or hex float parsing by default).
- max_rows now actually counts rows and not lines. A warning is given if this makes a difference (blank lines).
- Some exceptions will change; parsing failures now (almost) always give an informative ValueError.
- converters=callable is now valid to provide a single converter for all columns.
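A minimal sketch of the two headline additions, quotechar and a single converters callable (the decimal-comma data is just an invented example):

from io import StringIO
import numpy as np

# Excel-dialect CSV: quoted fields may contain the delimiter
data = StringIO('1,"2,5",3\n4,"5,5",6\n')

# converters=callable applies one converter to every column,
# here turning the quoted decimal commas into decimal points
arr = np.loadtxt(data, delimiter=",", quotechar='"',
                 converters=lambda s: float(s.replace(",", ".")))
print(arr)  # [[1.  2.5 3. ]
            #  [4.  5.5 6. ]]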
Additional context
xref numpy/numpy#20580
Seems interesting to consider! Adding astropy to npreadtext's benchmark:
>>> b = np.loadtxt("test.csv", delimiter=",")
722 ms ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> c = pd.read_csv("test.csv", delimiter=",")
140 ms ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> d = pd.read_csv("test.csv", delimiter=",", float_precision="round_trip")
427 ms ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> e = _loadtxt("test.csv", delimiter=",")
276 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> f = ascii.read("test.csv", delimiter=",", format="csv")
313 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
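The benchmark file itself is not reproduced in this thread; a hypothetical stand-in could be generated along these lines (size and value range are guesses):

import numpy as np

# Hypothetical stand-in for the npreadtext benchmark input; the real
# test.csv used above is not shown in this thread
rng = np.random.default_rng(42)
np.savetxt("test.csv", rng.uniform(-1.0, 1.0, size=(500_000, 5)),
           delimiter=",", fmt="%.17g")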
Yes; note that Pandas' default (faster) parser is comparable in precision to Astropy's fast converter (~2e-16 difference to the reference values – in fact they read numerically identical results, c and j in the extended example below). _loadtxt, in contrast, is as exact as the other, slightly slower readers d and h (and b, g):
>>> b = np.loadtxt("test.csv", delimiter=",", dtype=dt)
588 ms ± 16.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> c = pd.read_csv("test.csv", delimiter=",", names=colnames)
85.2 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> d = pd.read_csv("test.csv", delimiter=",", float_precision="round_trip", names=colnames)
265 ms ± 1.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> e = _loadtxt("test.csv", delimiter=",", dtype=dt)
157 ms ± 5.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> f = _loadtxt("test.csv", delimiter=",", dtype=dt) # post BIDS-numpy/npreadtext#99
174 ms ± 16.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> g = ascii.read("test.csv", format="csv", data_start=0, delimiter=",", fast_reader=False, names=colnames)
846 ms ± 75.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> h = ascii.read("test.csv", format="csv", data_start=0, delimiter=",", fast_reader=True, names=colnames)
275 ms ± 6.24 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> j = ascii.read("test.csv", format="csv", data_start=0, delimiter=",", fast_reader={"use_fast_converter": True}, names=colnames)
119 ms ± 1.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# file converted to '8.127607995859026380D-01' notation
>>> k = ascii.read("fortran.csv", format='csv', data_start=0, delimiter=",", fast_reader={'exponent_style': 'd'}, names=colnames)
121 ms ± 4.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> l = _loadtxt("fortran.csv", delimiter=",", sci='D', dtype=dt)
164 ms ± 9.95 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> m = np.loadtxt("test.csv", delimiter=",", dtype=dt) # numpy 1.23.0.dev0+698.ga6f55fe29
161 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> n = np.loadtxt("fortran.csv", delimiter=",", dtype=dt, converters=lambda s: float(s.replace(b"D", b"e")))
354 ms ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
On another note, npreadtext did have a custom float parser that included a custom exponent-character option sci, but that was removed in BIDS-numpy/npreadtext#99 before merging into np.loadtxt.
Perhaps we can lobby for restoring this in a later version, though that may require some persuasion, given there already was some scepticism in the numpy PR about replacing the old, working loadtxt at all.
In the meantime it would also be possible to emulate it with the more powerful converters option in the new loadtxt, but that will be considerably slower than our converter (see the last examples above).
My 2¢: I don't care at all (but then, while I wrote the code, I don't actually care all that much about text reading :)). Adding it is technically trivial. So that is purely an API decision.
The other differences are what would worry me (i.e. what can you do that NumPy can't?). E.g. your comment handling seems different, and comment markers are currently restricted to single characters in NumPy (everything else triggers slow paths that do not support quotes). So you might have to pre-process each line if you don't want that.
That "fast" float parsing irritates me a bit personally, I have to admit. It seems to mainly make things much faster if you have 12 or more decimal digits. Is that gap between 12-15 digits important enough? And if I have 15-17 digits (full precision) do I really want to lose that by default?
You seem to have whitespace stripping; the numpy code has leading-whitespace stripping (the numerical parsers ignore whitespace), but the option is not exposed. The NumPy C code is 100% unicode and iterable-compatible, though. The numpy code currently does not release the GIL.
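Since the C code is iterable-compatible, one line-level workaround (for the comment handling above, or for Fortran exponents) is a generator that rewrites each line before it reaches the parser. A sketch, where d_to_e is a hypothetical helper and blindly replacing every "D" is only safe for purely numeric files:

import numpy as np

def d_to_e(path):
    # Hypothetical helper: stream lines, rewriting Fortran 'D' exponents
    # to 'e' on the fly; only safe when no text column contains a 'D'
    with open(path) as fh:
        for line in fh:
            yield line.replace("D", "e")

# np.loadtxt accepts any iterable of lines, so no converted copy of the
# file is materialised; whether this beats the per-field converters route
# (example n above) would need benchmarking
arr = np.loadtxt(d_to_e("fortran.csv"), delimiter=",")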
That "fast" float parsing irritates me a bit personally, I have to admit. It seems to mainly make things much faster if you have 12 or more decimal digits. Is that gap between 12-15 digits important enough? And if I have 15-17 digits (full precision) do I really want to lose that by default?
For clarification, the default is fast_reader=True to use the compiled reader where possible, but with the standard strtod converters. Only explicitly setting it to {'use_fast_converter': True} (I am not a fan of that "dictionary_of_extra_args" syntax either) switches to the xstrtod-optimised parser.
I found the fast pandas and astropy versions to perform better in particular on the full range of float64 – factors of 2-3 over npreadtext on random values from -1e300 to 1e300, even with 8-12 digits of precision, which seems a not-so-uncommon case. Is it worth the loss of precision? I share your thoughts in https://github.com/numpy/numpy/pull/20580#issuecomment-993678618 that one should rather think about a really performant format if those kinds of optimisations become relevant, but the demand is obviously out there. It's probably not worth the effort, though, to add that as an extra feature to the numpy reader.
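A hypothetical spot-check of that precision loss on a single value (the ~2e-16 relative difference quoted above corresponds to about one ULP):

from io import StringIO
import numpy as np
from astropy.io import ascii

val = "8.127607995859026380e-01"
exact = float(val)  # Python's strtod, correctly rounded

fast = ascii.read(StringIO("x\n" + val), format="csv",
                  fast_reader={"use_fast_converter": True})["x"][0]

# difference in units in the last place; typically 0-1 for xstrtod
print(abs(fast - exact) / np.spacing(exact))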
Ah, so the slowness of the one shipped with Python hits for exponents. That does indeed seem like something users may be interested in.
Some updates on this now that Astropy requires Numpy >= 1.23, which guarantees the fast loadtxt.
As already discussed, its C-based reader takes the lead on shorter numbers (floats with <~ 10 significant digits, integers <~ 2**42).
Also interesting to note that the default strtod converter now (on a macOS 14 system with Xcode 15) shows no performance difference to our hand-tuned xstrtod fast converter. So the only reason to keep the latter around will probably be the support for Fortran-style exponent characters (though there might be better options to convert them before parsing the actual numbers).
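A hypothetical way to repeat that comparison on any given platform (file and setup as in the benchmarks above):

import timeit
from astropy.io import ascii

# Time the default strtod-based fast reader against the xstrtod converter
for fr in (True, {"use_fast_converter": True}):
    t = min(timeit.repeat(lambda fr=fr: ascii.read("test.csv", format="csv",
                                                   delimiter=",", fast_reader=fr),
                          number=1, repeat=5))
    print(fr, f"{t:.3f} s")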
Circling back to this, I have the feeling that the lack of missing-value support in the fast np.loadtxt reader could be a show-stopper. It is not obvious how we could work around this in a performant way.
I'm going to mark this as Close?. Feel free to re-open.
I did just start looking at pyarrow, which looks somewhat promising, but with the obvious downside of being a new dependency. I don't really know much about the package on the whole.
@dhomeier - is it easy enough to plug this into your performance tool and do a quick evaluation?
Yes, should be doable; I'll look into this. But if I am not mistaken this would not only come as a new Python dependency, but involve the linked C++ libraries on top.
Yeah, this is a bit of a long-shot. I'll only say that pip install pyarrow was fast and entirely painless. I've gotten much less afraid of dependencies, and in theory "fast reading" could be considered optional. :smile:
What got me intrigued is the multithread option and whether that provides much benefit.
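A minimal sketch of what that would look like (pyarrow's CSV reader is multithreaded by default; the options are spelled out here for clarity):

from pyarrow import csv

# use_threads=True is pyarrow's default; written out because the
# multithreading is exactly the interesting part here
tbl = csv.read_csv(
    "test.csv",
    read_options=csv.ReadOptions(use_threads=True),
    parse_options=csv.ParseOptions(delimiter=","),
)

df = tbl.to_pandas()  # or per-column .to_numpy(), zero-copy where types allow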