
Bader analysis via Pymatgen is extremely slow

Open Andrew-S-Rosen opened this issue 2 years ago • 5 comments

Describe the bug

When using the bader_caller in Pymatgen (specifically bader_analysis_from_path), a Bader analysis can take many times longer than running the standalone scripts provided by the Henkelman group. I suspect this is caused by excessive file I/O: the CHGCAR, AECCAR0, and AECCAR2 are read in every time, and the CHGCAR is written out again every time (as are the summed AECCARs, but that is necessary). I've run standalone Bader analyses on systems with hundreds of atoms, and they never take more than a minute or so, whereas with Pymatgen the same analysis can take tens of minutes.

We should track down the bottleneck here and then figure out how to fix this. I think having an option to do the following would be ideal:

  • Copy the AECCAR0, AECCAR2, CHGCAR, and POTCAR files to scratch (is this even necessary? I don't think bader will overwrite anything).
  • Use chgsum.pl AECCAR0 AECCAR2 directly. Don't read in the AECCARs via Pymatgen. This would require chgsum.pl to be in the PATH in addition to just the bader executable.
  • Run bader CHGCAR -ref CHGCAR_sum. Repeat this step but yield the spin densities if the calculation is spin-polarized.

At no point in the above should Pymatgen read or write the CHGCAR, in contrast with the current approach.
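The steps above could be sketched roughly as follows. This is only a minimal sketch, not pymatgen's implementation: the function names and the `runner` parameter are made up for illustration, and it assumes that both `chgsum.pl` and `bader` are in the PATH and that the calculation directory already holds CHGCAR, AECCAR0, and AECCAR2.

```python
import subprocess
from pathlib import Path


def build_bader_commands(use_aeccar_ref=True):
    """Return the external commands for the file-only workflow, in run order."""
    if not use_aeccar_ref:
        return [["bader", "CHGCAR"]]
    return [
        # chgsum.pl (VTST scripts) writes the summed density to CHGCAR_sum.
        ["chgsum.pl", "AECCAR0", "AECCAR2"],
        # Partition CHGCAR using the summed all-electron density as reference.
        ["bader", "CHGCAR", "-ref", "CHGCAR_sum"],
    ]


def run_bader_in_place(calc_dir, use_aeccar_ref=True, runner=subprocess.run):
    """Run the workflow inside calc_dir without parsing any charge file in Python.

    `runner` is a hypothetical hook (defaults to subprocess.run) so the command
    sequence can be inspected without the executables being installed.
    """
    calc_dir = Path(calc_dir)
    for cmd in build_bader_commands(use_aeccar_ref):
        runner(cmd, cwd=calc_dir, check=True)
    # bader writes its per-atom summary to ACF.dat in the working directory.
    return calc_dir / "ACF.dat"
```

Because the commands run directly in the calculation directory, nothing is copied to scratch; that relies on the assumption that bader only adds new files and leaves the inputs alone.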

Tagging @mkhorton since we briefly discussed this earlier today.

Andrew-S-Rosen, Apr 05 '22 23:04

If we believe bader doesn't modify anything, we can do the analysis without copying to scratch. Feel free to make the modifications to pymatgen and submit a PR.

shyuep, Apr 07 '22 16:04

I'm also fine with these changes, but interested:

> Use chgsum.pl AECCAR0 AECCAR2 directly. Don't read in the AECCARs via Pymatgen. This would require chgsum.pl to be in the PATH in addition to just the bader executable.

Is this substantially faster? If so, why is pymatgen's parsing so slow, comparatively?

> Repeat this step but yield the spin densities if the calculation is spin-polarized.

Can bader handle the spin densities automatically, or does it require writing out a new CHGCAR with just the spin channel included (i.e., requiring the extra I/O step)?

mkhorton, Apr 07 '22 18:04

I have to do some benchmarking on the individual steps. If the AECCAR summing ends up being comparable, then it's easier to just leave that step in Pymatgen. Right now, the CHGCAR file is also parsed in that step, so we will want to avoid that at the least.
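A small timing harness along these lines would be enough to compare the individual steps. It is a generic sketch: `time_steps` and `slowest_step` are illustrative names, and the steps to time are passed in as callables, so nothing here depends on how pymatgen is structured internally.

```python
import time


def time_steps(steps):
    """Run each (label, callable) pair once and return {label: seconds}."""
    timings = {}
    for label, func in steps:
        start = time.perf_counter()
        func()
        timings[label] = time.perf_counter() - start
    return timings


def slowest_step(timings):
    """Return the label of the step that took the longest."""
    return max(timings, key=timings.get)


# Possible usage (not run here; assumes pymatgen is installed and the files exist):
#   from pymatgen.io.vasp import Chgcar
#   timings = time_steps([
#       ("parse CHGCAR", lambda: Chgcar.from_file("CHGCAR")),
#       ("parse AECCAR0", lambda: Chgcar.from_file("AECCAR0")),
#   ])
#   print(slowest_step(timings), timings)
```

Timing each step once is crude, but with files of this size the differences should be large enough that repeated runs are not needed to spot the bottleneck.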

Regarding spin densities, a new CHGCAR file will need to be written, which is usually done via chgsplit.pl CHGCAR if using the VTST scripts.
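For illustration, the spin step might look like the sketch below. Several things here are assumptions rather than confirmed details: that chgsplit.pl writes CHGCAR_tot and CHGCAR_mag, as the VTST documentation describes; that the magnetization density should be partitioned with the same CHGCAR_sum reference; and the `run_spin_bader` name, the `runner` hook, and the ACF_charge.dat file name, which are all made up.

```python
import subprocess
from pathlib import Path


def run_spin_bader(calc_dir, ref_file="CHGCAR_sum", runner=subprocess.run):
    """Split CHGCAR with chgsplit.pl, then run bader on the magnetization density."""
    calc_dir = Path(calc_dir)
    charge_acf = calc_dir / "ACF.dat"
    if charge_acf.exists():
        # Keep the charge results: the next bader run writes ACF.dat again.
        charge_acf.replace(calc_dir / "ACF_charge.dat")
    # Assumption: chgsplit.pl (VTST scripts) writes CHGCAR_tot and CHGCAR_mag.
    runner(["chgsplit.pl", "CHGCAR"], cwd=calc_dir, check=True)
    # Assumption: reuse the summed all-electron density as the reference.
    runner(["bader", "CHGCAR_mag", "-ref", ref_file], cwd=calc_dir, check=True)
    return calc_dir / "ACF.dat"
```

The extra write of CHGCAR_mag is done by the Perl script rather than by Pymatgen, so this step still adds I/O; whether it is cheaper than the current approach is something the benchmarking would have to show.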

I'll do some digging into this because there's definitely something substantially increasing the overall runtime. I'll report back.

Andrew-S-Rosen, Apr 07 '22 18:04

@arosen93 I just want to clarify that in pymatgen, the default is to go through the Python objects because we never know where the CHGCAR comes from. For example, we could conceivably download a CHGCAR from MP and do a Bader analysis, which would require writing it out. However, the I/O is extremely slow (we are parsing hundreds of MB, or even GB, and then writing them out). I think a good compromise would be a wrapper that runs a simple Bader analysis on existing files without the extra I/O, for the majority of cases where we are analyzing local files.
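For such a wrapper, the results can be read back from the ACF.dat file that bader writes, which is a small text table, so no charge-density file needs to be parsed. The sketch below is a hypothetical helper, not part of pymatgen; the sample text follows the usual ACF.dat column layout, but the numbers in it are made up.

```python
# Made-up sample in the usual ACF.dat layout (values are illustrative only).
SAMPLE_ACF = """\
    #         X           Y           Z       CHARGE      MIN DIST   ATOMIC VOL
 --------------------------------------------------------------------------------
    1    0.000000    0.000000    0.000000    6.123456     1.234567    12.345678
    2    1.500000    1.500000    1.500000    7.876544     1.111111    20.000000
 --------------------------------------------------------------------------------
    VACUUM CHARGE:               0.0000
    VACUUM VOLUME:               0.0000
    NUMBER OF ELECTRONS:        14.0000
"""


def parse_acf(text):
    """Return one dict per atom from the text of an ACF.dat file."""
    atoms = []
    for line in text.splitlines():
        fields = line.split()
        # Atom rows have 7 columns and start with the integer atom index;
        # the header, separator, and summary lines do not match this shape.
        if len(fields) == 7 and fields[0].isdigit():
            x, y, z, charge, min_dist, volume = map(float, fields[1:])
            atoms.append(
                {
                    "index": int(fields[0]),
                    "coords": (x, y, z),
                    "charge": charge,
                    "min_dist": min_dist,
                    "atomic_vol": volume,
                }
            )
    return atoms
```

On a real run the text would come from something like `Path("ACF.dat").read_text()` in the calculation directory.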

shyuep, Jun 22 '22 15:06

@shyuep -- Thanks. I agree with everything you said. I still need to do some benchmarking to confirm the rate-limiting step is what I think it is, but I'm a bit over-subscribed at the moment. I will come back to this and my other opened issues... in time.

Andrew-S-Rosen, Jun 22 '22 23:06