Add optional validation output to joins
(originally by https://github.com/pola-rs/polars/issues/2292#issuecomment-1007962185 )
@austospumanto: Separately, on the problem of two columns becoming one column in the join result: it would be great if polars could retain both columns in the join like pandas does (when the two columns have different names). This is useful for checking for nulls in non-inner joins to see which rows found matches, and also for situations like the one you stated. I find myself duplicating+suffixing columns before joining to get this behavior in polars.
Suggestion by me: It may be easier if we had an indicator as an optional output, as in pandas: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge
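For reference, this is roughly how the pandas option behaves. The sketch uses made-up frames; `indicator=True` adds a categorical `_merge` column to the result.

```python
import pandas as pd

# Hypothetical example frames sharing a "key" column.
left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

# indicator=True adds a "_merge" column whose values are
# "left_only", "right_only", or "both".
merged = left.merge(right, on="key", how="outer", indicator=True)

# Map each key to its origin for a quick look at which rows matched.
origin = merged.set_index("key")["_merge"].astype(str).to_dict()
print(origin)
```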
The indicator seems like a good option to me. If we choose that to be a boolean then it will also be very memory efficient.
If boolean, that would have to be two indicators, one for whether it occurs in the original left dataframe, and one for the right dataframe. I see Pandas opts for a categorical to cover left/right/both.
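To illustrate how the two representations relate, here is a small pandas sketch that derives two boolean flags from the categorical `_merge` column. The column names `in_left` and `in_right` are hypothetical and not part of any existing API.

```python
import pandas as pd

# Hypothetical example frames.
left = pd.DataFrame({"key": [1, 2, 3]})
right = pd.DataFrame({"key": [2, 3, 4]})

merged = left.merge(right, on="key", how="outer", indicator=True)

# One categorical (left_only / right_only / both) carries the same
# information as two booleans, one per input frame.
flags = merged.assign(
    in_left=merged["_merge"] != "right_only",
    in_right=merged["_merge"] != "left_only",
).drop(columns="_merge")
```

A row with both flags true corresponds to `both` in the pandas categorical.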
Still would save 2/8s of RAM. Two indicators sounds good to me.
I think the validate option in pandas would be good also. (originally suggested in #5883, which was closed as a dupe of this issue)
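As a rough sketch of what the pandas option does (example data is made up): `validate="one_to_one"` makes the merge raise if the join keys are not unique on both sides.

```python
import pandas as pd

# Hypothetical example frames; the right frame repeats key 2.
left = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
right = pd.DataFrame({"key": [2, 2], "b": [20, 21]})

try:
    left.merge(right, on="key", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError:
    # Raised because the right keys are not unique.
    validation_failed = True

# The same data passes a one-to-many check, since the left keys are unique.
ok = left.merge(right, on="key", validate="one_to_many")
```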
Note that the validate option has recently been added in https://github.com/pola-rs/polars/pull/9278.
Hi. I'm not an expert on suggesting changes, so I apologize if this isn't the correct method. Although the validation option has been added, it would be very helpful to include an 'indicator' option, similar to what is available in pandas. As far as I know, this feature hasn't been added yet, and the issue explicitly requesting this option (#5983) has been closed. Thanks!