orca
orca copied to clipboard
Broadcast erroneously claims cast_on and onto_on are equivalents of left_on and right_on parameters of pandas.merge.
They used to be, since there were passed through as of at least orca 1.1.
But now, they are assumed to be type str (or None) whereas pandas merge accepts label or list, or array-like.
So the following, which worked under orca 1.1, no longer works:
orca.broadcast(
cast='persons',
onto='trips',
cast_on=['hh_id', 'person_num'],
onto_on=['hh_id', 'person_num'])
At least the documentation should be revise, whcih currently reads:
def broadcast(cast, onto, cast_on=None, onto_on=None,
cast_index=False, onto_index=False):
"""
Register a rule for merging two tables by broadcasting one onto
the other.
Parameters
----------
cast, onto : str
Names of registered tables.
cast_on, onto_on : str, optional
Column names used for merge, equivalent of ``left_on``/``right_on``
parameters of pandas.merge.
"""
Are you getting a specific error?
The cast_on
and onto_on
values are passed straight through to pandas.merge
here:
onto_table = pd.merge(
onto_table, cast_table,
suffixes=['_'+onto, '_'+cast],
left_on=bc.onto_on, right_on=bc.cast_on,
left_index=bc.onto_index, right_index=bc.cast_index)
However, using sequences may break other parts of the code, for example this looks like it might not properly handle lists of column names.
The problem is on line 1808. I don't recall the exact error, but this is where things go wrong:
# intersection is ok if it's the join key
intersection.discard(bc.onto_on)
intersection.discard(bc.cast_on)
The comment for broadcast pretty much says it all:
cast_on, onto_on : str, optional
Column names used for merge, equivalent of ``left_on``/``right_on``
parameters of pandas.merge.
It wants expects that cast_on and onto_on are strings, but pandas.merge also accepts lists.
If you want the exact error, I can easily generate it.
The error is:
File ".../orca/orca.py", line 1803, in merge_tables
intersection.discard(bc.onto_on)
TypeError: unhashable type: 'list'
generated by passing lists of strs to broadcast:
orca.broadcast(cast='persons_merged',
onto='disaggregate_trips',
cast_on=['hh_id', 'person_idx'],
onto_on=['hh_id', 'person_idx'])
Thanks, that's helpful. Looks like that code was added in e52fe6ccc as part of #20. That and the code I mentioned above both look like they wouldn't properly handle lists as cast_on
/onto_on
arguments.
Yes.