WIP: Refactoring the viz module to a more modular design.
The discussion for this PR started in #446, but the final aim is very different from what I intended at the beginning of that PR, so I prefer to open a new one.
Basically, I refactored the module to have a more modular and functional design while keeping the functionalities that I thought were important. The refactoring is still not complete: I have refactored PDOS and bands as a proof of concept and wanted to know your opinion before carrying on.
As a first introduction, and to display the differences, this is now the code of BandsPlot, which at a high level is much easier to understand from scratch than the previous design:
@Plot.from_func
def BandsPlot(bands_data: BandsData = None,
              Erange=[-2, 2], E0=0, E_axis: Literal["x", "y"] = "y", bands_range=None, spin=None,
              bands_style={'color': 'black', 'width': 1, "opacity": 1}, spindown_style={"color": "blue", "width": 1},
              gap=False, gap_tol=0.01, gap_color="red", gap_marker={"size": 7}, direct_gaps_only=False, custom_gaps=[],
              bands_mode: Literal["line", "scatter", "area_line"] = "area_line"):
    if bands_data is None:
        raise ValueError("You need to provide a bands data source in `bands_data`")
    # Filter the bands
    filtered_bands = filter_bands(bands_data, Erange=Erange, E0=E0, bands_range=bands_range, spin=spin)
    # Add the styles
    styled_bands = style_bands(filtered_bands, bands_style=bands_style, spindown_style=spindown_style)
    # Determine what goes on each axis
    x = get_axis_var(axis="x", var="E", var_axis=E_axis, other_var="k")
    y = get_axis_var(axis="y", var="E", var_axis=E_axis, other_var="k")
    # Get the actions to plot lines
    bands_plottings = PlotterNodeXY(data=styled_bands, x=x, y=y, set_axrange=True, what=bands_mode)
    # Gap calculation
    gap_info = calculate_gap(filtered_bands)
    # Plot it if the user has asked for it.
    gaps_plottings = draw_gaps(bands_data, gap, gap_info, gap_tol, gap_color, gap_marker, direct_gaps_only, custom_gaps, E_axis=E_axis)
    return CompositePlotterNode(bands_plottings, gaps_plottings, composite_method=None)
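As a quick usage sketch (this is a hypothetical call; how you obtain bands_data depends on which data entry point you use):

# Hypothetical call of the plot function defined above.
plot = BandsPlot(bands_data=bands_data, Erange=[-4, 4], gap=True)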
The main ideas are explained in this notebook (I include the HTML output so that you don't need to run it), because I thought it was much easier to explain in a notebook. I tried not to go into the details and to explain the framework at a high level, so I expect it will be a fast read even if it seems a bit long.
Looking forward to hearing your thoughts whenever you have time!
This pull request introduces 140 alerts and fixes 5 when merging 439171148b5e4af063cfe297eadbb00b2e5097b8 into ea81dcdca561ad37180972e00d44ad4773cd6989 - view on LGTM.com
new alerts:
- 26 for Unused import
- 22 for 'import *' may pollute namespace
- 17 for Unused local variable
- 11 for Modification of parameter with default
- 10 for Except block handles 'BaseException'
- 7 for Conflicting attributes in base classes
- 7 for Missing call to `__init__` during object initialization
- 7 for Unnecessary 'else' clause in loop
- 7 for Module is imported with 'import' and 'import from'
- 7 for Wrong number of arguments in a call
- 6 for Unreachable code
- 4 for Wrong name for an argument in a call
- 3 for Redundant assignment
- 2 for Signature mismatch in overriding method
- 2 for Unnecessary pass
- 1 for `__eq__` not overridden when adding attributes
- 1 for Variable defined multiple times
fixed alerts:
- 4 for 'import *' may pollute namespace
- 1 for Unused import
Codecov Report
Attention: 1033 lines in your changes are missing coverage. Please review.
Comparison is base (8d9c331) 87.42% compared to head (bc1ea37) 87.27%. Report is 8 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #476 +/- ##
==========================================
- Coverage 87.42% 87.27% -0.16%
==========================================
Files 376 355 -21
Lines 48172 47680 -492
==========================================
- Hits 42113 41611 -502
- Misses 6059 6069 +10
Files | Coverage Δ
---|---
src/sisl/_environ.py | 87.17% <100.00%> (+0.33%) :arrow_up:
src/sisl/io/siesta/tests/test_eig.py | 100.00% <100.00%> (ø)
src/sisl/io/tests/test_xsf.py | 100.00% <ø> (ø)
src/sisl/nodes/__init__.py | 100.00% <100.00%> (ø)
src/sisl/nodes/context.py | 88.46% <ø> (ø)
src/sisl/nodes/dispatcher.py | 0.00% <ø> (ø)
src/sisl/nodes/syntax_nodes.py | 100.00% <100.00%> (ø)
src/sisl/nodes/tests/test_context.py | 97.43% <ø> (ø)
src/sisl/nodes/tests/test_node.py | 97.76% <100.00%> (-0.19%) :arrow_down:
src/sisl/nodes/tests/test_workflow.py | 86.17% <ø> (ø)
... and 83 more
:umbrella: View full report in Codecov by Sentry.
This pull request introduces 150 alerts and fixes 5 when merging 0b8546a4eddebe5d167dd43bbba644523d826912 into 7f1f979cfc4c2de3f4d0230c8dbfcbb1e3ea357b - view on LGTM.com
new alerts:
- 33 for Unused import
- 23 for 'import *' may pollute namespace
- 18 for Unused local variable
- 11 for Modification of parameter with default
- 10 for Except block handles 'BaseException'
- 7 for Conflicting attributes in base classes
- 7 for Missing call to `__init__` during object initialization
- 7 for Unnecessary 'else' clause in loop
- 7 for Module is imported with 'import' and 'import from'
- 7 for Wrong number of arguments in a call
- 6 for Unreachable code
- 4 for Wrong name for an argument in a call
- 3 for Redundant assignment
- 2 for Signature mismatch in overriding method
- 2 for Unnecessary pass
- 2 for Variable defined multiple times
- 1 for `__eq__` not overridden when adding attributes
fixed alerts:
- 4 for 'import *' may pollute namespace
- 1 for Unused import
This pull request introduces 152 alerts and fixes 5 when merging 0b035be475d788f7760d92731027c8f434395767 into 7f1f979cfc4c2de3f4d0230c8dbfcbb1e3ea357b - view on LGTM.com
new alerts:
- 33 for Unused import
- 23 for 'import *' may pollute namespace
- 18 for Unused local variable
- 12 for Modification of parameter with default
- 10 for Except block handles 'BaseException'
- 7 for Conflicting attributes in base classes
- 7 for Missing call to `__init__` during object initialization
- 7 for Unnecessary 'else' clause in loop
- 7 for Module is imported with 'import' and 'import from'
- 7 for Wrong number of arguments in a call
- 6 for Unreachable code
- 4 for Wrong name for an argument in a call
- 3 for Unnecessary pass
- 3 for Redundant assignment
- 2 for Signature mismatch in overriding method
- 2 for Variable defined multiple times
- 1 for `__eq__` not overridden when adding attributes
fixed alerts:
- 4 for 'import *' may pollute namespace
- 1 for Unused import
This pull request introduces 149 alerts and fixes 5 when merging 5f40203aca9454afb16d72846a3a3119e43f86ac into 29b81cfeea48985c8514bc4d64e483dd6565d614 - view on LGTM.com
new alerts:
- 33 for Unused import
- 20 for 'import *' may pollute namespace
- 18 for Unused local variable
- 12 for Modification of parameter with default
- 10 for Except block handles 'BaseException'
- 7 for Conflicting attributes in base classes
- 7 for Missing call to `__init__` during object initialization
- 7 for Unnecessary 'else' clause in loop
- 7 for Module is imported with 'import' and 'import from'
- 7 for Wrong number of arguments in a call
- 6 for Unreachable code
- 4 for Wrong name for an argument in a call
- 3 for Unnecessary pass
- 3 for Redundant assignment
- 2 for Signature mismatch in overriding method
- 2 for Variable defined multiple times
- 1 for `__eq__` not overridden when adding attributes
fixed alerts:
- 4 for 'import *' may pollute namespace
- 1 for Unused import
Heads-up: LGTM.com's PR analysis will be disabled on the 5th of December, and LGTM.com will be shut down completely on the 16th of December 2022. It looks like GitHub code scanning with CodeQL is already set up for this repo, so no further action is needed :rocket:. For more information, please check out our post on the GitHub blog.
This is not to push you, but what is your ETA on this? I would like to get the current developments out pretty soon, so if your PR is close to being finished I could wait; otherwise I would try and push out a 0.13 release ASAP.
Hmm, I think it's better not to rush it, so you can publish the release without it. Actually, I was waiting for your opinion on the core ideas, because maybe there are some things that can be simplified/made more functional.
Whenever you have time, if you could look at the HTML file with the rendered notebook that I provided here, that would be nice :)
Ok, I'll have a look (hopefully this week or next)! Thanks!
Ok, I have had a look. And all-in-all I really like your ideas!
I have uploaded a new notebook with my comments in it: Sislvizrefactordisplay.txt (change the extension to ipynb :)
I say we go! :)
Thanks for looking at it!
I will go through some of your comments here:
> Will this be problematic for nodes storing large data-sets? Imagine nodes with grids and doing a workflow: read -> do function -> reduce grid. This would result in 2 full grids, and one at reduced size? Not only that, also the arguments to the functions (which need to be in memory for recalculations).
Yes, I also thought that this would be problematic with grids. But then my thought was that if it does not make sense to store a function's inputs and outputs, it shouldn't be a node. E.g. if you have a plot that does ... -> interpolate_grid -> smooth_grid -> reduce_grid -> ..., you should probably pack all that functionality into a single node: ... -> process_grid -> ....
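As a sketch of that idea (the helper names come from the example above; the signature is invented for illustration):

# Hypothetical sketch: only process_grid becomes a node, so the
# intermediate grids are never stored as node inputs/outputs.
def process_grid(grid, interp_shape, sigma, factor):
    grid = interpolate_grid(grid, interp_shape)
    grid = smooth_grid(grid, sigma)
    return reduce_grid(grid, factor)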
> Will a static code analyzer figure out that a node with a return value is valid input? (not necessary, just out of interest)
No, it doesn't. This is one of the reasons I want to get rid of defining all functions as nodes (see design tweak below).
> Perhaps there is something I am missing, but couldn't you just have done @Node.from_func as well? What is the difference from a Workflow?
Yeah, this is a bit tricky, but I think Workflow deserves a class of its own. The difference between a simple Node and a Workflow is that a node can contain any computation inside it, while a workflow is just a way of organizing nodes. A workflow just links nodes with each other; there is nothing computed outside the nodes it contains.
Of course you can also consider that you can split all the functionality in a node into individual pieces that are linked together, but you wouldn't want to store checkpoints for all of them :sweat_smile:
My thoughts for a design tweak
I didn't like the fact that there was @Node.from_func on top of every function of sisl.viz (I had to introduce the default lazy_context=False to make them usable as normal functions). Also, I didn't like that to use any other function in a workflow you need to define it explicitly as a node. I.e.:
from sisl import some_func
from sisl.viz import Node, Workflow

node_func = Node.from_func(some_func)

@Workflow.from_func
def my_workflow(...):
    ...
    a = node_func()
    ...
I now have an implementation (not merged here) that, on Workflow definition, replaces all function calls with Node calls. I.e. it allows you to do:
from sisl import some_func
from sisl.viz import Workflow

@Workflow.from_func
def my_workflow(...):
    ...
    a = some_func()
    ...
and it will be exactly the same as the previous code.
To me this is a great improvement because:
- You don't need to redefine functions as nodes.
- sisl.viz functionality doesn't need to be defined as nodes.
- It makes going from workflow to function very simple (just remove the decorator). It is also much easier the other way around.
- Because of the previous two points, there is no need now for lazy_context. If you use a node it will be lazy; otherwise, just use the normal functions.
I do the change from function calls to node calls by getting the Abstract Syntax Tree (AST), modifying it and recompiling. I don't know if it is too "hacky", but it seems very robust to me, and cleaner than any other implementation. Do you think it is fine, or should I stick to "more normal" Python trickery?
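For reference, a minimal sketch of what such a rewriting could look like (CallToNode and nodify are illustrative names, not the actual implementation in this PR; it assumes Node is resolvable from the function's globals):

import ast
import inspect
import textwrap

class CallToNode(ast.NodeTransformer):
    # Rewrite every call `f(...)` into `Node.from_func(f)(...)`.
    def visit_Call(self, node):
        self.generic_visit(node)  # rewrite nested calls first
        from_func = ast.Attribute(
            value=ast.Name(id="Node", ctx=ast.Load()),
            attr="from_func",
            ctx=ast.Load(),
        )
        wrapped = ast.Call(func=from_func, args=[node.func], keywords=[])
        return ast.Call(func=wrapped, args=node.args, keywords=node.keywords)

def nodify(func):
    # Parse the function's source, rewrite its calls, recompile, and return
    # a new function object; the original function is left untouched.
    tree = ast.parse(textwrap.dedent(inspect.getsource(func)))
    tree.body[0].decorator_list = []  # do not re-apply decorators
    tree = ast.fix_missing_locations(CallToNode().visit(tree))
    namespace = dict(func.__globals__)
    exec(compile(tree, filename="<nodified>", mode="exec"), namespace)
    return namespace[func.__name__]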
Also, if you don't need to define all functionalities as nodes, I thought that this might be useful not only for sisl.viz: you could create a workflow with any function in sisl (in fact, with any function in any module). So do you think it makes sense to have something like sisl.nodes to keep this functionality?
Cheers!
I'll have to wrap my head a bit more around the node stuff, I am a bit worried it might make the code more complicated and not as intuitive.
I have, however, thought about the dispatch methods, since there are some shortcomings; here the nodes might be interesting.
A note on the workflows: it seems to me that if you do it via ast you'll be forcing all things into the node tree, and thus you'll be surprised about a workflow that does the process_grid thing, no?
> I'll have to wrap my head a bit more around the node stuff, I am a bit worried it might make the code more complicated and not as intuitive.
Which part will be more complicated? I don't mean to define sisl functions as nodes; I just mean to store the Node and Workflow classes in sisl.nodes instead of sisl.viz.nodes. Although, since the functionality has nothing to do with sisl or science, I also thought of creating a separate Python package that sisl.viz would depend on. I don't know 😅
> A note on the workflows: it seems to me that if you do it via ast you'll be forcing all things into the node tree, and thus you'll be surprised about a workflow that does the process_grid thing, no?
What do you mean by doing the process_grid thing? Inside the workflow you would just call process_grid(), so process_grid is just converted to a node without caring about what's inside. All you know is that the developer wants to checkpoint the process_grid function.
- Ok, sure, let's do that! Could you make a separate PR? Then we can get it in immediately; I'll promise to just accept any PRs in that sub-module so you can quickly go about it ;)
- It is more that end-users need to be aware of the shortcomings of the workflow. Meaning they have to consider that every function call will be a node (static memory consumption). And also, what happens for x = y + z code lines? Will this be turned into a lambda and then a node? I found this article, which arguably has some points on the matter: https://www.georgeho.org/manipulating-python-asts/ It might be dangerous to change the code in-place. But if you like it the way it is, then fine!
- I want to clarify that I'm not replacing the code in place. When you create a workflow, you get a new object that is the workflow, but the original function remains untouched (it's the same as for a normal node).
Regarding x = y + z, I don't know what to do with it. The thing is that if y or z is a node, then the sum produces a node as well. So the only problem is if y and z are both direct inputs of the workflow, i.e. not nodes. Then you would need to do what you are saying. I don't know if it is worth it to try and catch these things: you need to determine all the dependencies of that line of code to turn it into a node.
(Update) Just converting all workflow inputs to a Node when they are received, e.g. y -> WorkflowInputNode(y), would solve the problem. In fact I didn't realise it, but this is what I'm doing now.
My first idea was to restrict the usage of workflows to very simple things. At the beginning I thought I would only allow assignments from function calls. E.g.:
@Workflow.from_func
def my_workflow(a, b):
    val = calc_val(a, b)
    shift = calc_shift(a)
    return shift_val(val, shift)
I think if you do this you are forcing devs to create easy-to-understand workflows, and it is very clear where the checkpoints are.
Then I thought I could just replace function calls by node calls and everything would work the same. So in my current implementation it is now valid to do:
@Workflow.from_func
def my_workflow(a, b):
    return shift_val(calc_val(a, b), calc_shift(a))
and it produces the same workflow with the same checkpoints. My idea in doing this was that it would be much simpler to convert any given function into a workflow, which is nice.
However, I don't know if I should go back to forcing the first syntax. I like it because it explicitly declares the checkpoints, and you can access checkpoints of the workflow from variable names (e.g. workflow.val or workflow.shift). In some sense these are not just checkpoints; it is as if you had a state, only that you know the minimal set of functions to run to keep it up to date.
What I'm thinking now is that I could support the second syntax, but a variable assignment could mean that we should create a checkpoint. This is a bit different, because checkpoints would be a thing external to nodes instead of using their intrinsic functionality. One could allow:
@Workflow.from_func
def my_workflow(a, b):
    # Private variables indicate that these are just temporary values
    # and should not be checkpointed
    _val_left = calc_val(a, b)
    _val_right = calc_val(b, a)
    # I want to create a checkpoint here
    val = _val_left + _val_right
    # Another checkpoint is implicitly created for the output.
    return shift_val(val, calc_shift(a))
Sorry for the long text :) Any ideas on what you think is best?
In the last idea, you wouldn't necessarily need to use _* for temporary variables; it could also be:
@Workflow.from_func(store=["val"])
def my_workflow(a, b):
    # Temporary values that are not checkpointed
    val_left = calc_val(a, b)
    val_right = calc_val(b, a)
    # I want to create a checkpoint here
    val = val_left + val_right
    # Another checkpoint is implicitly created for the output.
    return shift_val(val, calc_shift(a))
Would all nodes (calc_val, calc_shift) be present in the workflow node list? I think one shouldn't be able to access variables; rather, they should access nodes and use get:

val = my_workflow.nodes.calc_val.get()

should be the way to get it, no? Since each node is accessible and the values themselves are .get()'ed, you won't have this problem.
In this way, changing variable names is not a problem! Consider taking a workflow that does A, but now it should also do B, and you accidentally used a variable name in the first workflow that clashes with the understanding of the 2nd workflow...
To add to that, if a user wants a check-point, they should use a node! :)
So, wanting to checkpoint your sum, you would need to nodify it ;)
This of course gets problematic when you have multiple calls to the same node function... hmm... How to do that!
> So, wanting to checkpoint your sum, you would need to nodify it ;)

Yes, but that's what happens automatically, thanks to your advice on ufuncs. Any operation that you do with nodes results in a node. E.g. if you do y = node_1 + node_2, then y is a "SumNode" with its inputs linked to the outputs of node_1 and node_2. That is as if you did y = SumNode(node_1, node_2).
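A minimal sketch of the mechanism (Node, ValueNode and SumNode here are illustrative stand-ins, not the actual sisl classes):

class Node:
    # Illustrative only: a node stores its inputs (plain values or other
    # nodes) and evaluates lazily when .get() is called.
    def __init__(self, *inputs):
        self.inputs = inputs

    def get(self):
        # Evaluate inputs first, recursing into input nodes.
        args = [i.get() if isinstance(i, Node) else i for i in self.inputs]
        return self.compute(*args)

    def __add__(self, other):
        # node_1 + node_2 automatically builds a SumNode linking both.
        return SumNode(self, other)

class ValueNode(Node):
    def compute(self, value):
        return value

class SumNode(Node):
    def compute(self, left, right):
        return left + right

node_1, node_2 = ValueNode(1), ValueNode(2)
y = node_1 + node_2  # y is a SumNode; nothing is computed until y.get()
assert y.get() == 3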
> This of course gets problematic when you have multiple calls to the same node function... hmm... How to do that!
Yes, exactly. Naming checkpoints by their node name has the downside that if you use the same node more than once, you need to modify the names. What this PR implements right now is to add a number to the name if that's the case, i.e. there are calc_val and calc_val_2.
I thought that giving names to checkpoints was much nicer for this, and also because I believe a normal user would understand it much better without knowing the inner workings. calc_val is the generic name for the function that computes the value, but if you give a name to its output, it can be much more meaningful in the context of the workflow.
The naming of the checkpoints would just be a local table used to access the node outputs; it doesn't impose anything on the node itself. So a separate workflow would be able to use the same node and map its output to another name. (This is just the normal way variables work in Python, nothing particularly strange: you can have multiple variables pointing to the same object, each named differently.)
One could even argue whether the variable name is more stable than the node name: you might change the way you compute a given variable of your workflow, or the function name from a third-party package might simply change.
Well, everything is up for changes ;) The problem is sharing workflows: it requires documentation.
With this, I think your idea of notifying the workflow what to expose is the best one:

@Workflow.from_func(store=["val"])

You could imagine a Workflow.add_store("variable") in case users want to examine a temporary variable.
The above approach ensures that the implementors of the workflow will document the stored variables, for consistency and ease of use.
> You could imagine a Workflow.add_store("variable") in case users want to examine a temporary variable.

Yes, my idea was exactly that! It would generate a copy of the workflow with this extra checkpoint.
> The problem is sharing workflows: it requires documentation.

I would say it requires the same amount of documentation as any other object, no? You need to document the variables that you expose and try not to change them; private variables need less documentation since they are not expected to be used by users, etc...
> > You could imagine a Workflow.add_store("variable") in case users want to examine a temporary variable.
>
> Yes, my idea was exactly that! It would generate a copy of the workflow with this extra checkpoint.

Good.
> > The problem is sharing workflows: it requires documentation.
>
> I would say it requires the same amount of documentation as any other object, no? You need to document the variables that you expose and try not to change them; private variables need less documentation since they are not expected to be used by users, etc...

Yes, but here it is a bit more difficult, right? The workflow returns something, but you want users to be able to access temporary variables defined in the workflow! This is not the same as documenting a function! Variables in a function are by definition private; what you are doing is making them public in some way.
Ideally, add_store should change the documentation and insert the variable name and some documentation on the variable; in this way it becomes automatic, and users will also more easily document their changes.
> The workflow returns something, but you want users to be able to access temporary variables defined in the workflow! This is not the same as documenting a function! Variables in a function are by definition private; what you are doing is making them public in some way.

Yes, I think of the workflow more as an object with state than as a function. So what I meant is that you need to document the checkpoints of your workflow as you do with the state of any other object in Python.
Yes, just so we agree that the documentation needs to follow the store values ;)
Ok, so it is now clean enough for you to check (not finished yet). I feel it's better to look at the code directly instead of looking at the changes.
The new viz module has the following submodules:
- data: Contains the data structures, with multiple entry points, to make sure the data looks the same regardless of where it came from.
- processors: Contains many functions to process data in a functional way.
- plotters: Functions that determine the actions to perform on the plot (backend agnostic).
- figure: Functions that actually draw the plot. Each backend has its own figure implementation, implementing the drawing actions that sisl.viz uses (draw_line, draw_scatter...).
- plots: Combining all of the above, the implementation of specific plots (bands, pdos, geometry) as functions. We also define the workflow (sisl.nodes.Workflow) version of them. When registering plotables, the workflow version is used, so that the plot can be modified with update_inputs (update_settings has been deprecated); see the usage sketch after this list.
- data_sources: This is a bit experimental; the functionality here could be in processors. It contains functions that process data to generate, usually, styles for atoms, bonds, etc... The idea is that they can help link different plots. E.g. an AtomPDOS data source will help styling atoms in a geometry plot from PDOS data, etc... I still need to think about the best way to organize this.
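As a usage sketch of the workflow-based plots (hypothetical values; update_inputs is the update mechanism mentioned above):

# Hypothetical usage: updating an input of the workflow version of a plot
# only re-runs the nodes that depend on the changed input.
plot = BandsPlot(bands_data=bands_data, Erange=[-2, 2])
plot.update_inputs(gap=True, gap_color="orange")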
This modular and functional design has allowed the tests to be much more "normal" now. See for example processors/tests/*, where I am able to test the functionality of each function separately, as it should be :) It also makes it easier to have high coverage (>90% for processors).
Arguably, data and most functionality in processors should not belong to the viz module, but I found it best for now to encapsulate the refactoring of the viz module in something that is self-contained.
Now it's almost finished (I'd say it is finished, but I would like to give it a final look after I sleep).
All the documentation has been updated (it is much simpler :partying_face: ).
Thanks for this.
Some comments:
- Your naming choices seem very blended: I have a hard time distinguishing what data vs data_sources exactly means. Are they related? If so, could data_sources be moved into data?
- Same thing with plots and plotters; could we name them something that defines them more appropriately?
- processors is probably just me, but these look like tools to manipulate sisl objects, or results thereof? processors makes me think of CPUs...

Otherwise, looks great! It is big, so it will take me time to get used to this :)
Also, could you run isort on the code?
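(For reference, isort can be run from the repository root as `isort src/`, assuming it is installed; it sorts and groups the imports in place.)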