floweaver icon indicating copy to clipboard operation
floweaver copied to clipboard

Don't enforce names on input datasets (source, target, value)

Open ELC opened this issue 7 years ago • 6 comments

When working with pre-processed datasets it may be the case where there is no column name source or target or value and even worse, they may be present with a completely different meaning!

I suggest to add optional parameters to set the name of the columns corresponding to this function, this way one shouldn't be "renaming" columns just to fit the library specs.

ELC avatar Aug 02 '18 12:08 ELC

Hi, thanks for your comment and sorry for the slow response. I like the suggestion in principle, but I don't know if would actually be worth the extra complexity? Currently you can do something like this:

dataset = floweaver.Dataset(my_data_frame.rename(columns={
    'my_source': 'source',
    'my_target': 'target',
}))

while what you're suggesting I guess would look like this?

dataset = floweaver.Dataset(my_data_frame,
                            source_column='my_source',
                            target_column='my_target')

which is probably worthwhile though not all that much neater. Is that what you mean?

Note that the value column name is already completely flexible, see the measures argument to weave() and this example.

ricklupton avatar Sep 12 '18 20:09 ricklupton

I didn't know about the measures argument! That solves the problem with the value column.

I believe it is worth it because of the following situations:

  • It is much clearner and safer to make fewer assumptions about the dataset we are expecting so a default value for the columns could be source and target with a Warning showed to the user when they weren't explicitly set up. It makes everything seems less "You have to do this because it's the framworks way".
  • A dataset may have already a column named source or target that are indeed use in some part of the sankey so in this case more than 2 columns should be renamed and those renamings should be remember from that point.
  • A dataset might be used in several ways but with the same structure, so one dataset could be use to represent a flow but maybe the reversal flow is also required so in this case the source and target has to be swapped and that can lead to lots of confussions because constant renaming.
  • A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

ELC avatar Sep 13 '18 01:09 ELC

A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name

There's a difference between the flows in the dataset, and the way they're visualised. In this case why don't you change the partition of the source?

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition1),
    'y': Waypoint(partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=['y'])
]

To make it look like it starts with the first waypoint (partition2) it would become

nodes = {
   'x': ProcessGroup(['a', 'b', 'c'], partition2),
    'z': ProcessGroup(['end'], partition3),
}

bundles = [
    Bundle('x', 'z', waypoints=[])
]

You don't have to change the source because it's the /same data/, just grouped differently.

ricklupton avatar Sep 13 '18 20:09 ricklupton

But generally, ok, I think it'd be reasonable to add this. But it's not trivial as source and target are hard-coded in lots of places. Would you like to have a go?

ricklupton avatar Sep 13 '18 20:09 ricklupton

I can try but not in the short term, I really like the framework and I think it's worth doing. Is there any way to quickly test the code? I don't want to introduce new errors

ELC avatar Sep 14 '18 00:09 ELC

Thanks, see my reply in #55

ricklupton avatar Sep 14 '18 14:09 ricklupton