fastparquet
RuntimeWarning: invalid value encountered in reduce return umr_maximum(a, axis, None, out, keepdims, initial)
This warning message is not understandable. I searched the net and found some references to it but have not been able to understand whether it is important or not.
Using the debugger, I found that this warning is issued while writing a pandas dataframe containing NaN values.
It is caused by line 517 of the file fastparquet/writer.py, function write_column(..) which calculates max and min of the column data:
max, min = data.values.max(), data.values.min()
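For context, here is a minimal editorial sketch (not part of the original report) of how reducing a float array that contains NaN can produce this exact RuntimeWarning; whether the warning actually fires depends on the numpy version and floating-point error state, so treat it as an assumption:

import warnings
import numpy as np

values = np.array([1.0, np.nan, 2.0])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = values.max()   # the maximum of a float array containing NaN is NaN

print(result)                             # nan
print([str(w.message) for w in caught])   # may include "invalid value encountered in reduce"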
However, the following lines overwrite these values of max and min except in the case where selement.type == parquet_thrift.Type.BYTE_ARRAY and selement.converted_type is None:
if selement.type == parquet_thrift.Type.BYTE_ARRAY:
    if selement.converted_type is not None:
        max = encode['PLAIN'](pd.Series([max]), selement)[4:]
        min = encode['PLAIN'](pd.Series([min]), selement)[4:]
else:
    max = encode['PLAIN'](pd.Series([max]), selement)
    min = encode['PLAIN'](pd.Series([min]), selement)
It follows that this warning message is caused by a calculation that is not necessary, and hence it is a bug.
It is easy to rearrange the above code by moving line 517 inside the if clause; this would fix the bug.
Would it be possible for you to write a test case which exposes the bug, and then propose the changes that you would make to fix it?
Here is the test case:
import numpy
import pandas
import fastparquet

# Writing a float column that contains NaN, with has_nulls=False, triggers the warning.
df = pandas.DataFrame({'c': pandas.Series([1., numpy.nan, 2.])})
fastparquet.write('test.parquet', df, has_nulls=False)

import os
os.remove('test.parquet')
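As a side note (an editorial sketch, not part of the original report), a small round-trip check suggests that the file written under this warning still loads back with the NaN preserved as a real float value; fastparquet.ParquetFile and to_pandas() are assumed to be available in the version used:

import numpy
import pandas
import fastparquet

df = pandas.DataFrame({'c': pandas.Series([1., numpy.nan, 2.])})
fastparquet.write('test.parquet', df, has_nulls=False)

# Read the file back; the NaN should come back as a plain float NaN.
back = fastparquet.ParquetFile('test.parquet').to_pandas()
print(back['c'].tolist())   # expected: [1.0, nan, 2.0]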
Here is how I suggest changing lines 517-524 of writer.py (not tested):
if selement.type == parquet_thrift.Type.BYTE_ARRAY:
    if selement.converted_type is not None:
        max = encode['PLAIN'](pd.Series([max]), selement)[4:]
        min = encode['PLAIN'](pd.Series([min]), selement)[4:]
    else:
        max, min = data.values.max(), data.values.min()
else:
    max = encode['PLAIN'](pd.Series([max]), selement)
    min = encode['PLAIN'](pd.Series([min]), selement)
Thanks, I'll look into it.
As a note, the warning happens because of the NaN values, when choosing not to use nulls. In this case, there is no defined max/min for each chunk, but there is no way to know that beforehand. I also note that the values are not ignored: the new values of max/min in the original code are based on those calculated in the problematic line. In short, I don't think it is a bug.
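If the warning is just noise for a given workload, one option (an editorial suggestion, not a fastparquet feature) is to silence that specific RuntimeWarning around the write call using the standard warnings module:

import warnings
import numpy
import pandas
import fastparquet

df = pandas.DataFrame({'c': [1., numpy.nan, 2.]})

# Suppress only the "invalid value encountered" RuntimeWarning for this call;
# other warnings are unaffected.
with warnings.catch_warnings():
    warnings.filterwarnings('ignore', message='invalid value encountered',
                            category=RuntimeWarning)
    fastparquet.write('test.parquet', df, has_nulls=False)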
You might be interested in rescuing the stat= keyword from branch speedup_speedup (also PR #259), which allows you to forgo the max/min calculation for columns of your choice. That PR has not been merged because, although it improves performance for string IO, it does not work on py2.
Just to point out that pandas uses NaN to indicate a missing value. Thus, the choice not to use nulls is not mine but rather pandas's, and we are discussing a function that writes a pandas dataframe into fastparquet.
As for your statement:
I also note that the values are not ignored: the new values of max/min in the original code are based on those calculated in the problematic line.
I do not understand it, since in my test case the element type is not BYTE_ARRAY but numeric, and hence max and min are calculated by lines 523 and 524, which use pandas max and min and overwrite the values calculated on line 517.
I am not interested in a speedup. Rather, I was testing different tools for large datasets and got this warning from fastparquet that I could not understand. As the warning came from an attempt to write a legitimate pandas dataframe, I tried to research it.
But look at max = encode['PLAIN'](pd.Series([max]), selement): the new value of max depends on the previous value, so it does not get ignored. Try your version; you should get a NameError.
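To illustrate the point, here is an editorial sketch (not the fastparquet source): because max is assigned somewhere in the function, Python treats it as a local name throughout, so the rearranged branches that read it before binding it raise UnboundLocalError, a subclass of NameError. str(max) below stands in for the encode call:

import numpy as np

def write_stats(values, is_byte_array, has_converted_type):
    if is_byte_array:
        if has_converted_type:
            max = str(max)        # reads the local name `max` before it is bound
        else:
            max = values.max()    # only this branch binds `max` first
    else:
        max = str(max)            # same problem in this branch
    return max

# Numeric column, as in the test case above:
write_stats(np.array([1.0, 2.0]), is_byte_array=False, has_converted_type=False)
# UnboundLocalError: local variable 'max' referenced before assignment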
Agreed that pandas uses NaN to indicate null. When writing to parquet you get the choice of whether to convert them to nulls or whether to indicate no nulls (the column is marked as REQUIRED) and store them as real values that happen to be NaN. If loading back to pandas, the two are equivalent. This is controlled by the nulls= parameter in write().
I see about max, thanks, and sorry for my confusion.
I did not find a nulls= parameter in write(..). According to the documentation, there is a has_nulls= parameter, which I passed as False. The doc says about it:
Whether columns can have nulls. If a list of strings, those given columns will be marked as "optional" in the metadata, and include null definition blocks on disk. Some data types (floats and times) can instead use the sentinel values NaN and NaT, which are not the same as NULL in parquet, but functionally act the same in many cases, particularly if converting back to pandas later. A value of 'infer' will assume nulls for object columns and not otherwise.
Are you saying that passing has_nulls=True will convert NaNs into nulls? I could not infer this from the doc and for this reason decided to pass has_nulls=False for efficiency.
Yes, has_nulls=, sorry.
Are you saying that passing has_nulls=True will convert NaNs into nulls
More specifically, it will use parquet's OPTIONAL column spec and store whether a value is real or null separately from the data. Indeed it takes longer, but you would want to do this if you intend to use the data with a parquet framework that expects the storage in this mode.
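A hedged editorial sketch comparing the two modes; the exact output depends on the fastparquet version, and it is an assumption here that printing ParquetFile.schema shows the column's repetition type:

import numpy
import pandas
import fastparquet

df = pandas.DataFrame({'c': [1., numpy.nan, 2.]})

# OPTIONAL column: NaN is converted to a parquet null, tracked via definition levels.
fastparquet.write('optional.parquet', df, has_nulls=True)
# REQUIRED column: NaN is stored as a real float value, with no null bookkeeping.
fastparquet.write('required.parquet', df, has_nulls=False)

print(fastparquet.ParquetFile('optional.parquet').schema)   # 'c' expected OPTIONAL
print(fastparquet.ParquetFile('required.parquet').schema)   # 'c' expected REQUIRED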
Thanks, I understand it now.
I tried this call with has_nulls=True and it did not give me these warning messages.
I can only suggest clarifying the doc for the has_nulls= parameter to indicate that True implies conversion of NaNs into nulls.
Many thanks for your clarifications!
Hi All, I just had a similar issue. @BaruchYoussin you mentioned editing the writer.py file. I am having difficulty locating that file on my system. Please advise.
(it would be even better to post the change as a PR, so everyone can benefit)
@jotes35 The file is https://github.com/dask/fastparquet/blob/master/fastparquet/writer.py
However, I am sure it has changed a lot over the past two years.
@martindurant If you refer to my question two years ago, it is no longer relevant.