pymatgen [WIP] Fixed the mask_sphere function in cube.py

Summary

Include a summary of major changes in bullet points:

The mask_sphere function returned a mask of distance values that doesn't vary with site
Changed this use the get_points_in_sphere from Lattice

TODO (if any)

A discussion about what this returns:
I think just returning the indices rather than a full array would be better.
Was it intentional to multiply by distance or is how I have set it up now (just 1 or 0)
Should I type hint the functions?

Checklist

Work-in-progress pull requests are encouraged, but please put [WIP] in the pull request title.

Before a pull request can be merged, the following items must be checked:

[x] Code is in the standard Python style. The easiest way to handle this is to run the following in the correct sequence on your local machine. Start with running black on your new code. This will automatically reformat your code to PEP8 conventions and removes most issues. Then run pycodestyle, followed by flake8.
[x] Docstrings have been added in the Google docstring format. Run pydocstyle on your code.
[ ] Type annotations are highly encouraged. Run mypy to type check your code.
[ ] Tests have been added for any new functionality or bug fixes.
[ ] All linting and tests pass.

Jan 28 '22 17:01 adam-kerrigan

Coverage decreased (-0.6%) to 83.392% when pulling b83b8031d324e387c02c6deb1b2f9deebf8cb9ac on adam-kerrigan:masked_sphere_correction into 07794afb624f7613b3dbcebdf0034cab4c9ee7ea on materialsproject:master.

Jan 28 '22 17:01 coveralls

Thanks @adam-kerrigan!

@jmmshn, do you have any thoughts on this? I remember you having issues with masking in VolumetricData/Chgcar.

Jan 31 '22 20:01 mkhorton

https://github.com/materialsproject/pymatgen/blob/fedfdd0a69b75d9d7890942954a4ffa3604ac4ba/pymatgen/analysis/defects/utils.py#L1272 I think these are indeed the same. I think the right thing to do is to refactor the distance mask code into Volumetric data, since my old code used a "private" function for that part it should be minimally invasive.

Jan 31 '22 21:01 jmmshn

I think using the sphere function of the lattice instead of distances would still be better though as you aren't spawning massive grids for what is often only a couple of hundred points around a site. This is why I thought just returning the indices for the mask might be ideal also, so you could store several sites and not have to use them one at a time for fear of OOMing but I don't think that is what people would expect returned from a mask function.

I think personally, in my unqualified-to-address-the-scale-of-the-task opinion, I would also prefer it was refactored into VolumetricData (with that being moved into core, with a sub-class for charge densities and another for local potentials etc?) and used as a parent for cubes as well as chgcars etc but I don't know how many other classes are using specifically the Chgcar class or this Cube one that would then need to be made more general.

On the code you linked @jmmshn I can see it is called in the function above. Here you define sites whereby the charge density is a minimum and integrate around it. I was wondering if you could tell me what you were using this for? I've just started looking at critical points and haven't seen many uses for cage points and am interested to learn more.

Jan 31 '22 22:01 adam-kerrigan

On the code you linked @jmmshn I can see it is called in the function above. Here you define sites whereby the charge density is a minimum and integrate around it. I was wondering if you could tell me what you were using this for? I've just started looking at critical points and haven't seen many uses for cage points and am interested to learn more.

The code in there was used in this paper to identify candidate cation insertion sites with low steric interactions.

https://www.nature.com/articles/s41524-020-00422-3

Feb 01 '22 06:02 jmmshn

I think using the sphere function of the lattice instead of distances would still be better though as you aren't spawning massive grids for what is often only a couple of hundred points around a site. This is why I thought just returning the indices for the mask might be ideal also, so you could store several sites and not have to use them one at a time for fear of OOMing but I don't think that is what people would expect returned from a mask function.

I can see pros and cons here but I think we should always err on the side of more explicitly and less efficient unless an efficiency bottleneck is actually encountered. Also returning the indices of the mask still might not allow for fast access even if the data is much smaller. My general impression is that the masking operation in NumPy is pretty fast and looping over indices is pretty slow.

Feb 01 '22 06:02 jmmshn

I agree completely that being more explicit over being more efficient is the right way to go, however this function returns a spherical mask by calling get_points_in_sphere(), over creating a distance matrix for the entire array and then using that to create another array based on a conditional. I think arguments can be made both ways for explicitness.

For an idea of comparison between the two I ran them both using a reasonably small sized grid (140x112x378) and obviously the _dist_mat function is invariant to radius (0.98 seconds per site) where as there was a greater than 300 times speed up for radii of 1.6 and under (0.0028 seconds per site) and 50 times speed up for radii of 3.2 and under (0.019 seconds per site). This meant I could look at every site in the file (912) in under 3 seconds for a typical ionic radius length, in under 20 seconds for 3.2 Ang which is massive and this all compared to 15 minutes using the _dist_mat method.

Regarding your thoughts on the indices vs masking I don't think the function you linked uses any masking. From what I can see it just does multiplication between two arrays of one of density one of bools. Again I ran a quick test and for a 1.6 Ang radius masking this way takes 0.01 seconds to create the new array, indexing takes 0.00006. But I think what is important about this note is the memory efficiency. When working with voxel data I think considerations around memory are important, as the data itself can be huge. Storing the whole array in several places (3 times as the output of a meshgrid, another 3 times as a vstack) feels excessive when there is a way to do it without copying the array once.

I have attached the tests I did incase you want to have a play around. test.zip

Feb 01 '22 13:02 adam-kerrigan

Sorry, I had a brain fart there. I forgot that the mask needs to be generated for each origin. So this all makes sense to me now.

I think personally, in my unqualified-to-address-the-scale-of-the-task opinion, I would also prefer it was refactored into VolumetricData (with that being moved into core, with a sub-class for charge densities and another for local potentials etc?)

I also agree with this, but this has to be done without too much disruption to the codebase so we might not be able to change what we used defined in VolumetricData and where it is defined. Maybe we can add a PeriodicField class somewhere (either in pymatgen.io or pymatgen.core) that contains some common stuff between the VASP and CUBE codes. Since these voxel data is usually used in post-processing I lean towards creating a file or even an io/common directory to hold similar codes. @mkhorton Thoughts?

Feb 01 '22 16:02 jmmshn

pymatgen pymatgen copied to clipboard

[WIP] Fixed the mask_sphere function in cube.py

Summary

TODO (if any)

Checklist

pymatgen
pymatgen copied to clipboard