Don't enforce names on input datasets (source, target, value)
When working with pre-processed datasets it may be the case where there is no column name source or target or value and even worse, they may be present with a completely different meaning!
I suggest to add optional parameters to set the name of the columns corresponding to this function, this way one shouldn't be "renaming" columns just to fit the library specs.
Hi, thanks for your comment and sorry for the slow response. I like the suggestion in principle, but I don't know if would actually be worth the extra complexity? Currently you can do something like this:
dataset = floweaver.Dataset(my_data_frame.rename(columns={
'my_source': 'source',
'my_target': 'target',
}))
while what you're suggesting I guess would look like this?
dataset = floweaver.Dataset(my_data_frame,
source_column='my_source',
target_column='my_target')
which is probably worthwhile though not all that much neater. Is that what you mean?
Note that the value column name is already completely flexible, see the measures argument to weave() and this example.
I didn't know about the measures argument! That solves the problem with the value column.
I believe it is worth it because of the following situations:
- It is much clearner and safer to make fewer assumptions about the dataset we are expecting so a default value for the columns could be
sourceandtargetwith a Warning showed to the user when they weren't explicitly set up. It makes everything seems less "You have to do this because it's the framworks way". - A dataset may have already a column named
sourceortargetthat are indeed use in some part of the sankey so in this case more than 2 columns should be renamed and those renamings should be remember from that point. - A dataset might be used in several ways but with the same structure, so one dataset could be use to represent a flow but maybe the reversal flow is also required so in this case the
sourceandtargethas to be swapped and that can lead to lots of confussions because constant renaming. - A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a
source, atargetand 10waypointsin between, now, one is asked to represent the same flow but instead of using the samesourceone should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as asource. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if thesourceandtarget(in this case just thesource) could be set via a parameter and not a column name
A dataset could be used to represent flows with increasing level of detail. Imaging a flow with a source, a target and 10 waypoints in between, now, one is asked to represent the same flow but instead of using the same source one should start with the 1st waypoint, and this goes on until one finally make a flow with the 10th waypoint as a source. Renaming the columns on each step just to follow the framwork spec is a nightmare! (I had to do exactly this kind of work at University). Leaving partitioning, ordering and nodes aside, one could easily manage this situation if the source and target (in this case just the source) could be set via a parameter and not a column name
There's a difference between the flows in the dataset, and the way they're visualised. In this case why don't you change the partition of the source?
nodes = {
'x': ProcessGroup(['a', 'b', 'c'], partition1),
'y': Waypoint(partition2),
'z': ProcessGroup(['end'], partition3),
}
bundles = [
Bundle('x', 'z', waypoints=['y'])
]
To make it look like it starts with the first waypoint (partition2) it would become
nodes = {
'x': ProcessGroup(['a', 'b', 'c'], partition2),
'z': ProcessGroup(['end'], partition3),
}
bundles = [
Bundle('x', 'z', waypoints=[])
]
You don't have to change the source because it's the /same data/, just grouped differently.
But generally, ok, I think it'd be reasonable to add this. But it's not trivial as source and target are hard-coded in lots of places. Would you like to have a go?
I can try but not in the short term, I really like the framework and I think it's worth doing. Is there any way to quickly test the code? I don't want to introduce new errors
Thanks, see my reply in #55