activity-browser icon indicating copy to clipboard operation
activity-browser copied to clipboard

Improved Sankey Diagram

Open BenPortner opened this issue 1 year ago • 14 comments

Description

This PR improves the existing sankey diagram feature by two points:

  • Sankey diagrams now include biosphere nodes. Currently, one has to switch between the "Sankey" and "EF contributions" tab to see, which EFs cause impacts. Furthermore, one has to "guess" how the EF impacts are distributed across the activties. In the improved version, biosphere flows are drawn just like technosphere flows, allowing to see exactly where impacts come from.
  • Positive and negative flows from the same activity no longer cancel. Currently, a node, which produces +x units in one instance and -x units in another instance does not appear in the Sankey diagram because its total impact is zero. In the new version, both the node and the corresponding flows are drawn, allowing better insight into systems, which use the substitution method to account for by-products.

Demonstration

current: image

improved: image

Implementation Details

The central change is the introduction of a new module activity-browser.bwutils.superstructure.graph_traversal, which replaces bw2calc.graph_traversal. Changes in the other files are minor.

activity-browser.bwutils.superstructure.graph_traversal.py:

Basically a complete rewrite of bw2calc.graph_traversal version 1.8.1. The rewrite is based on the legacy version because bw 2.5 is not yet stable (at least I couldn't get it to run). I will push to include the new features in brightway so that the activity-browser does not have to maintain a separate version. Important changes to GraphTraversal.calculate:

  • function signature: introduced additional max_depth control parameter
  • ~~return values: the number of LCA calculations is no longer counted (counter is None)~~
  • return values: nodes keys are now actual activity keys instead of LCA matrix indices. This change was necessary because it is not clear from the matrix indices whether the activity is from the biosphere or technosphere, making it impossible to fetch metadata like name, location...
  • return values: for the same reason, from and to fields in edges are now activity keys
activity-browser.bwutils.superstructure.graph_traversal_with_scenario.py:

Uses new graph_traversal.py.

activity-browser.ui.web.sankey_navigator.py:

Adapted to account for new function signature and changed keys (reverse dictionaries no longer necessary). For biosphere nodes categories is used instead of location.

activity-browser.bwutils.commontasks.py and activity-browser.static.css.sankey_navigator.css:

Adapted to add a green frame to biosphere nodes.

BenPortner avatar Sep 13 '22 14:09 BenPortner

Coverage Status

Coverage decreased (-0.4%) to 54.037% when pulling 5986596c1b156e1a76c9435b396038ff59dc8880 on BenPortner:better-sankey into 44ceac14c5a2addb7a8938e15fdaa6366d89f459 on LCA-ActivityBrowser:master.

coveralls avatar Sep 13 '22 14:09 coveralls

@BenPortner Thanks a lot for this amazing PR! I'm really happy to see this contribution.

As you also mention, AB doesn't currently support BW2.5, we're looking at it behind the screens but it's mostly on @cmutel's side at the moment.

Personally, I think we should not merge this in the current form for 2 reasons:

  1. I think it's not wise to use a different version of graph traversal, even temporarily, but rather merge the graph traversal upstream and then implement it in AB.
  2. Less important/small detail: It looks like you write it for superstructure only. We have two implementations, and it looks like it's currently only written for scenarios.

@cmutel and @bsteubing Please have a look, this is very nice work!

marc-vdm avatar Sep 13 '22 14:09 marc-vdm

This is impressive and useful work. Here are the parts that I would hope could be addressed before it is merged:

  1. I cannot endorse adding more functionality to the AB which overlaps BW functionality, but is different. We already have too much of this, and it makes life difficult. The proposed changes can be added to the bw2legacy branch of bw2calc.
  2. The new GraphTraversal doesn't use the heap, and instead substitutes a breadth-first search with pruning. In my testing, the important-first search is always preferable; if we should switch, then this choice needs to be justified.
  3. Using activity keys is in direct contradiction to packages A and B of the BW strategic plan. bw2calc should be completely independent of whatever database is used, and should be usable in cloud installations where there are only datapackages. An alternative is to have two lists of edges (B/T), and to add a function to the 2.0 branch of bw2data to rehydrate metadata from matrix indices.
  4. I don't understand the justification for limiting to demand dicts with only one product.

cmutel avatar Sep 13 '22 16:09 cmutel

Hi @BenPortner , thanks a lot for these excellent ideas. I would very much like to have these features. However, let's first clarify a few content and implementation things.

Here some ideas for further improvement:

  • can the biosphere flows be visualized as going to "circles" instead of "rectangles", which are reserved for economic processes? I think that would visually help.
  • I think we need an option to switch "on/off" the biosphere flows as we might not want to see this in every case
  • have you tested what this will look like with real-world ecoinvent? I imagine that we have 1000s of processes emitting "CO2".
  • not sure if this idea is good, but "CO2", e.g., has several compartments and we may just be interested in "CO2" overall...
  • thinking further, if it becomes a big mess in real-world systems, an alternative implementation could be to "skip" the nodes of the biosphere flows and just have small horizontal arrows leaving the respective processes that indicate the size and name of the biosphere substance
  • the flow you show above seems to say that CO2 is "going into" process A, but I think this should be the other way around
  • question: IF a process shows up in the Sankey, does this mean that ALL of its biosphere flows (contributing to a specific impact category) will be shown or ONLY THOSE above the cut-off value. We need to be clear on this and we should test this for processes and impact categories where a single process has MANY biosphere flows. Take for example the plastics datasets (from plasticseurope), which are aggregated (system) processes. Here the process basically represents the life cycle inventory and that means that it will probably have 10s, 100s, or even more biosphere flows relevant to specific impact categories

Concerning implementation:

  • I tend to agree with @marc-vdm that this should directly be implemented in bw...
  • however, I recently learned that bw 2.5 may introduce things that break the graph traversal and I think before we take a decision, I'd like to hear what @cmutel has to say here
  • if implementing this in bw is not a good idea for some reason, then I think we can temporarily host it in the AB

bsteubing avatar Sep 13 '22 16:09 bsteubing

@marc-vdm

I think it's not wise to use a different version of graph traversal, even temporarily, but rather merge the graph traversal upstream and then implement it in AB.

I agree that this is the preferred option. However, this would require building a new conda package for bw2calc, which is not something I can do.

Less important/small detail: It looks like you write it for superstructure only. We have two implementations, and it looks like it's currently only written for scenarios.

I am afraid I do not understand this one. I tested this without scenarios, only.

@cmutel

I cannot endorse adding more functionality to the AB which overlaps BW functionality, but is different. We already have too much of this, and it makes life difficult. The proposed changes can be added to the bw2legacy branch of bw2calc.

See my answer to marc.

The new GraphTraversal doesn't use the heap, and instead substitutes a breadth-first search with pruning. In my testing, the important-first search is always preferable; if we should switch, then this choice needs to be justified.

The new approach is depth-first. I have trouble seeing how importance first is preferable. The number of calculations should be the same (importance-first calculates the LCA score for all inputs to set an order, depth-first calculates the LCA score for all inputs to decide whether to go deep).

bw2calc should be completely independent of whatever database is used,

The current implementation treats the biosphere differently from the technosphere. The improved implementation treats them equally. Hence, I would argue that the improved implementation is actually helping to achieve this goal ("treating all databases the same").

should be usable in cloud installations where there are only datapackages.

I don't see how the improved implementation prevents this.

An alternative is to have two lists of edges (B/T), and to add a function to the 2.0 branch of bw2data to rehydrate metadata from matrix indices.

As long as the biosphere and the technosphere are separate matrices, it is not possible to restore metadata from the matrix indices. How is the algorithm supposed to know if 1 means the first entry in the biosphere matrix or the technosphere matrix?

I don't understand the justification for limiting to demand dicts with only one product.

Single-product demands are what I could test with the AB.

@bsteubing

  • can the biosphere flows be visualized as going to "circles" instead of "rectangles", which are reserved for economic processes? I think that would visually help.
  • I think we need an option to switch "on/off" the biosphere flows as we might not want to see this in every case

This should be easy to implement.

have you tested what this will look like with real-world ecoinvent? I imagine that we have 1000s of processes emitting "CO2".

Then these 1000s of processes will link to CO2. The real question is: Who wants to have 1000s of processes in their sankey diagram in the first place? My implementation makes it easier to limit the amount of processes by introducing a depth limitation. With a reasonable number of processes, I am not worried about clearness.

not sure if this idea is good, but "CO2", e.g., has several compartments and we may just be interested in "CO2" overall...

This might be true for CO2, which has the same CF for all categories. But users might be interested in different categories for other impact methods (e.g. toxicity).

thinking further, if it becomes a big mess in real-world systems, an alternative implementation could be to "skip" the nodes of the biosphere flows and just have small horizontal arrows leaving the respective processes that indicate the size and name of the biosphere substance

The way to implement this is to keep the nodes and make the size very small. Should be easy to implement if desired.

the flow you show above seems to say that CO2 is "going into" process A, but I think this should be the other way around

This is how brightway currently works. Likewise, electricity is shown as "going into" process B, even though it really leaves as a by-product. If we are to start switching directions for biosphere flows, we should do it for technosphere by-products, too.

question: IF a process shows up in the Sankey, does this mean that ALL of its biosphere flows (contributing to a specific impact category) will be shown or ONLY THOSE above the cut-off value

Only those above the cut-off.

we should test this for processes and impact categories where a single process has MANY biosphere flows. Take for example the plastics datasets (from plasticseurope), which are aggregated (system) processes. Here the process basically represents the life cycle inventory and that means that it will probably have 10s, 100s, or even more biosphere flows relevant to specific impact categories.

There might be a lot of edges in cases where three conditions are fulfilled: a) a node in the diagram is an aggregated dataset and b) many of the biosphere flows of this node are above the cut-off and b) the chosen impact method has lots of CFs (e.g. ecotoxicity)

In my opinion, all three conditions coming together is rather rare. If this proves to be false, one could i) implement a limit to the number of biosphere flows shown or ii) aggregate the nodes according to some logic

Concerning implementation:

  • I tend to agree with @marc-vdm that this should directly be implemented in bw...
  • however, I recently learned that bw 2.5 may introduce things that break the graph traversal and I think before we take a decision, I'd like to hear what @cmutel has to say here
  • if implementing this in bw is not a good idea for some reason, then I think we can temporarily host it in the AB

I would say this depends on Chris' willingness to further develop (and package) legacy versions of bw.

BenPortner avatar Sep 13 '22 17:09 BenPortner

Less important/small detail: It looks like you write it for superstructure only. We have two implementations, and it looks like it's currently only written for scenarios.

I am afraid I do not understand this one. I tested this without scenarios, only.

I had a look at the code, things are in places that they shouldn't be, but that's not your fault, it was already like this. We'll need to move some things around I think. Anyway, that's not a problem for now.

not sure if this idea is good, but "CO2", e.g., has several compartments and we may just be interested in "CO2" overall...

This might be true for CO2, which has the same CF for all categories. But users might be interested in different categories for other impact methods (e.g. toxicity).

Agree, this may be problematic. IMO better to keep this as individual flows. This would be the same as grouping all processes with product name electricity, high voltage -good for process contributions, not for Sankey-.

the flow you show above seems to say that CO2 is "going into" process A, but I think this should be the other way around

This is how brightway currently works. Likewise, electricity is shown as "going into" process B, even though it really leaves as a by-product. If we are to start switching directions for biosphere flows, we should do it for technosphere by-products, too.

Could you clarify on your example of electricity and process B?

From how I understood the sankey process B just has 2 inputs for electricity, one positive and one negative of the same amount. If my understanding is correct, the flows should both still point from the electricity to B. With splitting pos/neg flows, we get potentially 4 edges between nodes, the arrow should indicate the direction of the edge, the color should indicate the sign (positive/negative).

As for applying this to biosphere: I think we should think about this more. The convention in LCA is that both resource extraction (coal mined) and emissions (CO2 emitted) are represented as positive numbers. -Personally I think this convention is wrong as for the technosphere, inputs are negative and outputs are positive, IMO biosphere should follow that convention too, but I don't have the power to change that.-

I don't see an immediate way to do this automatically unless we had a list of all biosphere flows that were classed on their direction (though we could perhaps infer them based on the categories column.

We could fix this by simply not having arrows in the biosphere edges? Or alternatively: we could fix this partly by having a checkbox for users with flip biosphere flows, to use at their own discretion?

we should test this for processes and impact categories where a single process has MANY biosphere flows. Take for example the plastics datasets (from plasticseurope), which are aggregated (system) processes. Here the process basically represents the life cycle inventory and that means that it will probably have 10s, 100s, or even more biosphere flows relevant to specific impact categories.

There might be a lot of edges in cases where three conditions are fulfilled: a) a node in the diagram is an aggregated dataset and b) many of the biosphere flows of this node are above the cut-off and b) the chosen impact method has lots of CFs (e.g. ecotoxicity)

In my opinion, all three conditions coming together is rather rare. If this proves to be false, one could i) implement a limit to the number of biosphere flows shown or ii) aggregate the nodes according to some logic

I agree, I don't see this as a problem. Users can currently also make unwieldy graphs by misusing the cutoff/depth options -e.g. with electricity markets with many process inputs-, this should just be user discretion on choosing a proper cutoff/depth. Though we may look into this and refine the automatic cutoff/depth values we provide for starters.

marc-vdm avatar Sep 14 '22 07:09 marc-vdm

A couple points of clarification:

The new approach is depth-first.

Indeed, sorry for my misunderstanding.

I have trouble seeing how importance first is preferable. The number of calculations should be the same (importance-first calculates the LCA score for all inputs to set an order, depth-first calculates the LCA score for all inputs to decide whether to go deep).

They will produce the same results if max_calc is infinite; otherwise, importance-first is strictly better, as it will produce the most information. Depth first can exhaust the number of calculations permitted in one branch of the supply chain before even seeing other branches.

should be usable in cloud installations where there are only datapackages.

I don't see how the improved implementation prevents this.

Yes, you are right. Sorry.

An alternative is to have two lists of edges (B/T), and to add a function to the 2.0 branch of bw2data to rehydrate metadata from matrix indices.

As long as the biosphere and the technosphere are separate matrices, it is not possible to restore metadata from the matrix indices. How is the algorithm supposed to know if 1 means the first entry in the biosphere matrix or the technosphere matrix?

Well, if you are using a list called biosphere_edges you might have a hint :)

cmutel avatar Sep 14 '22 08:09 cmutel

Thank you for your input!

@marc-vdm

Less important/small detail: It looks like you write it for superstructure only. We have two implementations, and it looks like it's currently only written for scenarios.

I am afraid I do not understand this one. I tested this without scenarios, only.

I had a look at the code, things are in places that they shouldn't be, but that's not your fault, it was already like this. We'll need to move some things around I think. Anyway, that's not a problem for now.

Please feel free to move things around as necessary. Sorry if this causes extra work.

the flow you show above seems to say that CO2 is "going into" process A, but I think this should be the other way around

This is how brightway currently works. Likewise, electricity is shown as "going into" process B, even though it really leaves as a by-product. If we are to start switching directions for biosphere flows, we should do it for technosphere by-products, too.

Could you clarify on your example of electricity and process B?

From how I understood the sankey process B just has 2 inputs for electricity, one positive and one negative of the same amount. If my understanding is correct, the flows should both still point from the electricity to B. With splitting pos/neg flows, we get potentially 4 edges between nodes, the arrow should indicate the direction of the edge, the color should indicate the sign (positive/negative).

B has a negative electricity input = positive electricity output (by-product). A has a (positive) electricity input. My point is that the arrow direction and the sign of the flow amount express the same thing. Currently, AB keeps the flow directions true to the model and indicates negative flow amounts° by changing the color of the arrow to green. Alternatively, one could flip all green arrows, ending up with positive flow amounts only. I think that both approaches are fine, but they shouldn't be mixed. Therefore, if we start switching arrows for biosphere flows, we should do the same for by-products and materials for treatment.

°Actually the color indicates negative impacts, but since most processes have positive impacts, a negative impact usually indicates a reversed direction of flow.

As for applying this to biosphere: I think we should think about this more. The convention in LCA is that both resource extraction (coal mined) and emissions (CO2 emitted) are represented as positive numbers. -Personally I think this convention is wrong as for the technosphere, inputs are negative and outputs are positive, IMO biosphere should follow that convention too, but I don't have the power to change that.-

I don't see an immediate way to do this automatically unless we had a list of all biosphere flows that were classed on their direction (though we could perhaps infer them based on the categories column.

I agree that this is rather unintuitive. The way of thinking here is as follows: An emission of CO2 is treated as a damage to the environment, as is the extraction of a resource. Hence, both have a positive sign (indicating environmental damage rather than benefit). The sign is not about the flow of mass but about the flow of damages. The same goes for the CO2 flow in the demonstration example: Although CO2 leaves from A, a damage is allocated to A.

We could fix this by simply not having arrows in the biosphere edges? Or alternatively: we could fix this partly by having a checkbox for users with flip biosphere flows, to use at their own discretion?

Doable if desired. But I don't feel there is a necessity. What do the others think?

@cmutel

I have trouble seeing how importance first is preferable. The number of calculations should be the same (importance-first calculates the LCA score for all inputs to set an order, depth-first calculates the LCA score for all inputs to decide whether to go deep).

They will produce the same results if max_calc is infinite; otherwise, importance-first is strictly better, as it will produce the most information. Depth first can exhaust the number of calculations permitted in one branch of the supply chain before even seeing other branches.

Makes sense. This should be changed.

An alternative is to have two lists of edges (B/T), and to add a function to the 2.0 branch of bw2data to rehydrate metadata from matrix indices.

As long as the biosphere and the technosphere are separate matrices, it is not possible to restore metadata from the matrix indices. How is the algorithm supposed to know if 1 means the first entry in the biosphere matrix or the technosphere matrix?

Well, if you are using a list called biosphere_edges you might have a hint :)

Haha, true. My mistake. But do you still see a necessity for keeping biosphere and technosphere edges separate? I don't think it should cause incompatibilities with the bw agenda.

BenPortner avatar Sep 14 '22 10:09 BenPortner

I tried changing the shape of the biosphere nodes to ellipses and it turned out more complicated than expected. Unfortunately, there are problems with overflowing labels. I tried to fix it by centering the labels, but the result is still not very pleasing. Judge for yourselves:

image

BenPortner avatar Sep 15 '22 10:09 BenPortner

Personally, I don't think the current labeling is problematic this way.

But in the top left, why is the node Carbon dioxide, fossil, air, urban close to ground 0% and it's flow Carbon dioxide, fossil 60%? That doesn't sound right to me.

marc-vdm avatar Sep 15 '22 11:09 marc-vdm

@marc-vdm Just like the technosphere nodes, the color of the biosphere nodes shows the total impact that the corresponding activity adds to the functional unit (i.e. not just the impact of one edge but the sum of all edges, also the ones below the cutoff). In this particular case, the total impact caused by Carbon dioxide, fossil, air, urban close to ground is zero, because electricity is consumed by A but the same amount of electricity is produced by B , thereby cancelling the consumption of electricity and the carbon emission. As a matter of fact, the edge list correctly reflects this by having a negative and a positive edge between Carbon dioxide... and electricity production...:

edges = [
...
{'to': ('apos371', 'e53ed2834a03c816b6e233e0dc7e1dfa'), 'from': ('biosphere3', 'f9749677-9c9f-4678-ab55-c607dfdc2cb9'), 'amount': -0.6000000238418579, 'exc_amount': 0.6000000238418579, 'impact': -0.6000000238418579, 'from_type': 'biosphere'}
...
{'to': ('apos371', 'e53ed2834a03c816b6e233e0dc7e1dfa'), 'from': ('biosphere3', 'f9749677-9c9f-4678-ab55-c607dfdc2cb9'), 'amount': 0.6000000238418579, 'exc_amount': 0.6000000238418579, 'impact': 0.6000000238418579, 'from_type': 'biosphere'}
]

However, for some reason the negative edge is not drawn by d3-dagre. I need to look into this.

Edit: fixed.

image

BenPortner avatar Sep 15 '22 11:09 BenPortner

Update @marc-vdm @cmutel @bsteubing:

  • I did another major rewrite of graph_traversal.py. There are now classes for nodes, edges and their respective lists. This makes the code cleaner and self-documenting.
  • biosphere nodes and technosphere nodes are now stored in separate lists
  • added function for importance first traversal and made it the default
  • added switch use_keys: True means activity keys will be used in node list and edge list, False means matrix indices will be used (note: the latter option is not compatible with ABs downstream processing)
  • added switch include_biosphere: whether to include biosphere nodes in the node list (and hence sankey diagram) or not
  • improved speed: Under comparable conditions (excluding biosphere nodes and keys), the new algorithm is slightly faster than the bw2calc version:
from bw2data import projects, Database
from activity_browser.bwutils.superstructure.graph_traversal import GraphTraversal as newGT
from bw2calc import GraphTraversal as oldGT
import time

projects.set_current("ab")
method = ("ILCD 2.0 2018 midpoint", "climate change", "climate change total")
db = Database("apos371")
acts = [db.random() for _ in range(10)]

# new
start = time.time()
res_new = [newGT(include_biosphere=False, use_keys=False).calculate(demand={act:1}, method=method, cutoff=0.05, max_calc=250) for act in acts]
end = time.time()
print(f"New GraphTraversal needed {end-start:.1f} s for 10 activities.")
# New GraphTraversal needed 39.9 s for 10 activities.

# old
start = time.time()
res_old = [oldGT().calculate(demand={act:1}, method=method, cutoff=0.05, max_calc=250) for act in acts]
end= time.time()
print(f"Old GraphTraversal needed {end-start:.1f} s for 10 activities.")
# Old GraphTraversal needed 42.5 s for 10 activities.

BenPortner avatar Sep 16 '22 15:09 BenPortner

@BenPortner Nice.

I think we need a new library, bw_graph_traversal, which should include your work and the multifunctional traversal class. If you decide to create this, I will make PRs/issues with some small changes. It would be a pity to have this hidden away in the AB code.

I would made the node and edge classes dataclasses - they are exactly the use case for this functionality.

There is no reason to limit this to one functional unit, you can just iterate over the keys here and add multiple nodes. The traverse functions don't need a starting node, just take the top one on the heap.

I changed the name of this class to AssumedDiagonalGraphTraversal, as it assumes values exist on the diagonal and are special. This is not generally the case, see above multifunctional merged PR.

cmutel avatar Sep 17 '22 10:09 cmutel

@cmutel

I think we need a new library, bw_graph_traversal

I don't see any advantage in creating yet another library. On the contrary, I think it has a lot of disadvantages for developers (e.g. more dependencies to manage manually) and users (e.g. unclarity where to post issues). I have described these before here. Due to the reasons stated there I am opposed to a new library.

I would made the node and edge classes dataclasses - they are exactly the use case for this functionality.

Done.

There is no reason to limit this to one functional unit, you can just iterate over the keys here and add multiple nodes.

The improved version now supports multiple functional units.

The traverse functions don't need a starting node, just take the top one on the heap.

It needs a starting node to keep compatible with the traverse_depth_first function syntax.

I changed the name of this class to AssumedDiagonalGraphTraversal, as it assumes values exist on the diagonal and are special. This is not generally the case, see above multifunctional merged PR.

I personally feel like AssumedDiagonalGraphTraversal is a rather irritating name. It sounds like the traversal happens along the diagonal of a matrix, rather than describing the implicit assumption this graph traversal class assumes that diagonal values exist and are special . Perhaps it would be better to have a boolean parameter has_diagonal that changes the behavior of GraphTraversal? What do the others think?

BenPortner avatar Sep 20 '22 17:09 BenPortner