pyiron_atomistics
new sx structure parser that tolerates unexpected items
In part, this is an experiment in parsing.
Idea 1 is to parse while reading the file, instead of reading the entire file first. Idea 2 is to use a more tabular style of programming, by creating a table of keywords with associated parsing routines. Whenever a keyword is encountered, the corresponding parsing routine is called. This is organized in a multi-level way: certain keywords simply add additional keywords and their associated parsing routines, which are removed from the table again when the next higher-level keyword appears. This allows larger sections with multiple subitems to be parsed. The hope is to produce more readable (and thus more maintainable) parsing routines.
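Roughly, the pattern looks like this (a minimal sketch with hypothetical keyword names and handlers, not the actual PR code):

```python
def parse_stream(file_obj):
    results = {"cells": [], "species": []}
    sub = {}  # sub-keywords of the currently open section

    def enter_structure(line):
        # entering a "structure" section adds its sub-keywords to the table
        sub["cell"] = lambda l: results["cells"].append(l.split("=", 1)[1].strip())
        sub["species"] = lambda l: results["species"].append(l.split("=", 1)[1].strip())

    top = {"structure": enter_structure}  # top-level keyword table

    for line in file_obj:
        for keyword, handler in {**top, **sub}.items():
            if keyword in line:
                if keyword in top:
                    sub.clear()  # a higher-level keyword closes the open section
                handler(line)
                break
    return results
```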
I have implemented this for the rarely used sphinx.structure.read_atoms routines, which we need in order to parse some old data I created outside of pyiron. The new version supports multiple structures in the same file and ignores unexpected items (such as labels) much more reliably than the old version.
OK, after having added all the comments I finally understood what it does, but wouldn't it be easier to parse the entire file once and use whatever is needed?
This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files. It also renders certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.
A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum size of input is to be able to extract information).
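As a sketch of why memory stays bounded (with a hypothetical header format, not SPHInX's actual output): a section starts at a header line and ends implicitly at the next header, so we only ever hold the current section in memory.

```python
import re

HEADER = re.compile(r"^=+ (?P<name>\w+) =+")  # hypothetical header format

def iter_sections(file_obj):
    name, body = None, []
    for line in file_obj:
        match = HEADER.match(line)
        if match:  # a new header terminates the previous section
            if name is not None:
                yield name, body
            name, body = match.group("name"), []
        elif name is not None:
            body.append(line)
    if name is not None:
        yield name, body  # flush the final section at end of file
```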
Alternatively, I would suggest doing it based solely on regex. I feel that it's very difficult to keep an eye on the consistency of `line`, `lineview`, `lineno`, etc.
The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses `lineview`. If you can parse from a single line alone, you need not worry at all. If you need more lines, there is currently only the `read_until` auxiliary, but one could easily think of additional auxiliaries such as `append_line` or `next_line`.
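To sketch how a derived parser looks (the names follow the discussion above, but treat the exact API as illustrative):

```python
# Illustrative subclass; assumes KeywordTreeParser maps keywords to methods
# and exposes `lineview` (the current line) plus the `read_until` auxiliary.
class CellParser(KeywordTreeParser):
    def __init__(self, filename):
        super().__init__(filename, keywords={"cell": self.parse_cell})
        self.cells = []

    def parse_cell(self):
        if "]" in self.lineview:
            # single-line case: everything needed is already in `lineview`
            self.cells.append(self.lineview)
        else:
            # multi-line case: pull further lines until the closing bracket
            self.cells.append(self.lineview + self.read_until("]"))
```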
I like the new architecture and agree that going line by line only once is important for both speed and memory consumption. But I do have some Python nits that I'll add later.
I am wondering if we can combine this with the concepts we developed for the interactive LAMMPS parser. Here is a current example from the LAMMPS parser inside the atomistics package: https://github.com/pyiron/atomistics/blob/main/atomistics/calculators/lammps/calculator.py#L111
Writing the LAMMPS parser is easier, as LAMMPS already provides Python bindings; still, the general principle is:
```python
def parse_lammps(
    file,
    quantities=("energy", "forces", "stress"),  # ... further quantities
    **kwargs,
):
    # one parsing function per quantity; only requested quantities are used
    interactive_getter_dict = {
        "forces": function_to_parse_force_from_line,
        "energy": function_to_parse_energy_from_line,
        "stress": function_to_parse_stress_from_line,
        # ... one entry per supported quantity
    }
    result_dict = {q: [] for q in quantities}
    # scan the file exactly once; each requested getter sees every line
    for line in file:
        for q in quantities:
            interactive_getter_dict[q](previous_result=result_dict[q], line=line)
    return result_dict
```
This approach combines both the flexibility to parse only specific properties and the option to parse each line only once.
One thing that makes it generally a bit hard to read is that the file status (current line, file object, etc.) is passed around implicitly via attributes. It would be more readable if that were passed around explicitly via arguments to the parser generator functions, imo, but I can't judge how much rewriting that would involve.
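For instance (a hypothetical refactoring, not code from the PR), the handlers could receive the current line and the line iterator explicitly:

```python
# Hypothetical sketch: the file state is an explicit argument instead of a
# parser attribute, so every read is visible at the call site.
def parse_cell(line, lines):
    while "]" not in line:
        line += next(lines)  # explicit look-ahead into the file
    return line

def parse(file_obj, handlers):
    lines = iter(file_obj)
    results = []
    for line in lines:
        for keyword, handler in handlers.items():
            if keyword in line:
                results.append((keyword, handler(line, lines)))
    return results
```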
@freyso Are you available to join the pyiron meeting today at 3pm? https://github.com/orgs/pyiron/discussions/207 Then we could start with this topic, that might be faster than the asynchronous discussion and separate developments at different locations.
> This is what you guys currently do almost everywhere, but it becomes annoying and memory-wasteful when you need to parse very large files.
Annoying, I don't know, but do we really have such large files in SPHInX?
> It also renders certain parsing tasks more difficult than stream-reading does. Typical log files, for instance, have "sections" in their output with a rather well-defined beginning, but often no unique end marker. Instead, you will find a new output section, and it may depend on the input which output section appears next. The current parsing concept is adapted to this.
But the end of a section is currently recognised by a marker, right? We can do the same with regex.
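For a file read into a string `text`, a lookahead for the next header works even without an explicit end marker (hypothetical header format):

```python
import re

# Capture each section body from one header up to the next header (or EOF);
# the "end marker" is the *next* start marker, expressed as a lookahead.
section = re.compile(
    r"^=+ (?P<name>\w+) =+\n(?P<body>.*?)(?=^=+ \w+ =+|\Z)",
    re.MULTILINE | re.DOTALL,
)
sections = {m["name"]: m["body"] for m in section.finditer(text)}
```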
> A sequential parser can parse GBs of data with a memory demand of a few KB (or whatever the maximum size of input is to be able to extract information).
This part is quite horrible right now in pyiron anyway, because pyiron currently loads the whole input and output regardless. SPHInX allows @pmrv's lazy loading, but it's not supported elsewhere and is so buggy that I don't think we should rely on it.
> The KeywordTreeParser is supposed to encapsulate the line reading; the actual parser derived from it only uses `lineview`. If you can parse from a single line alone, you need not worry at all. If you need more lines, there is currently only the `read_until` auxiliary, but one could easily think of additional auxiliaries such as `append_line` or `next_line`.
I understand the idea, and if Python were a human, I would definitely think the way you describe it is how it should work. The reality, however, is that regex is so blazingly fast that anything we do with for loops and string checking becomes totally obsolete. As a matter of fact, reading line by line is what we used to do for SPHInX, but the parsing was slow beyond imagination at that time.
In addition to this, regex also makes @jan-janssen's idea easier to implement: the output parsers become more modular, because parsing takes place only when the user asks for it, whereas in your case the parsing must take place again if the user wants something additional.
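To illustrate the modularity argument (hypothetical patterns, not SPHInX's actual output format):

```python
import re

# One pattern per quantity; only requested quantities are ever parsed, and
# adding a new quantity later does not touch the existing ones.
patterns = {
    "energy": re.compile(r"TOTAL ENERGY\s*=\s*(\S+)"),
    "forces": re.compile(r"FORCE\s*=\s*(\S+\s+\S+\s+\S+)"),
}

def parse(text, quantities=("energy",)):
    return {q: patterns[q].findall(text) for q in quantities}
```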