Mixed units in column cause wrong results for basic operations on dataframe columns
The following code produces wrong results:
import pandas as pd
import pint
import pint_pandas  # type: ignore

merged = pd.DataFrame(
    {
        "some_label": ["a", "b"],
        "some_values": [
            1 * pint.get_application_registry().t,
            1 * pint.get_application_registry().kg,
        ],
        "factors": [
            1,
            1,
        ],
    }
)
merged = merged.astype({"factors": "pint[kg/t]"})
merged["result"] = merged["factors"] * merged["some_values"]
print(merged["factors"])
print(merged["some_values"])
print([q.to("kg") for q in merged.result.values])
Output:
0 1.0
1 1.0
Name: factors, dtype: pint[kilogram / metric_ton]
0 1 metric_ton
1 1 kilogram
Name: some_values, dtype: object
[<Quantity(1.0, 'kilogram')>, <Quantity(1.0, 'kilogram')>]
The second value of the result column is wrong. I would have expected this output:
0 1.0 kilogram / metric_ton
1 1.0 kilogram / metric_ton
Name: factors, dtype: object
0 1 metric_ton
1 1 kilogram
Name: some_values, dtype: object
[<Quantity(1.0, 'kilogram')>, <Quantity(0.001, 'kilogram')>]
which is what I get when I build the factors column from individual quantities (the same unit in every row) instead of casting it with astype:
import pandas as pd
import pint
import pint_pandas  # type: ignore

merged = pd.DataFrame(
    {
        "some_label": ["a", "b"],
        "some_values": [
            1 * pint.get_application_registry().t,
            1 * pint.get_application_registry().kg,
        ],
        "factors": [
            1 * pint.get_application_registry().kg / pint.get_application_registry().t,
            1 * pint.get_application_registry().kg / pint.get_application_registry().t,
        ],
    }
)
merged["result"] = merged["factors"] * merged["some_values"]
print(merged["factors"])
print(merged["some_values"])
print([q.to("kg") for q in merged.result.values])
This means that with mixed unit rows in a dataframe the results of operations might be wrong. Am I using this wrong or is this a bug?
When you created the some_values and factors columns, you provided lists of quantities, which pandas stores as plain objects; you can see this in the dtypes. Calling merged = merged.astype({"factors": "pint[kg/t]"}) converts only the factors column to a PintArray. Do the same for some_values and it will work.
I suggest you look at the example notebook which shows several different ways to create columns in dataframes.
Thanks, but this was not a question about how to do this. What I am saying is that I get a wrong result without any warning, just by creating the dataframe columns one way rather than another. This is dangerous af. Is there a way to raise an error instead of silently returning a wrong result?