PSyclone
Generalise ModuleManager?
In working on #2413 (and unifying the way we search for Containers in particular) I thought it would be good to use the `ModuleManager` as an efficient way of parsing only the files that we have to, and each of those only once. (This is needed, for instance, when attempting to 'module-inline' called routines: it's likely that we're going to end up querying the same modules repeatedly.)
However, it turns out the `ModuleManager` is currently LFRic-specific, so we should probably make it generic and then subclass it for LFRic. Initially I thought that perhaps we should use fparser to determine which modules a source file contains but, as @hiker predicted, this is prohibitively expensive. (Especially since we'd have to do it every time PSyclone is run.) A better solution may therefore be to use a regex to identify modules. The LFRic subclass could optimise this away, as the coding standards mean we know that there is only one module per file and that its name is essentially given by that of the file.
EDIT: another selling point for using the `ModuleManager` is that it finally frees us from the assumption that a module is found in a source file of the same name.
I've tried a simple regex solution in #2413 and that seems to work OK. It might be good to cache the Fortran sources that we read. Currently, this is held in a `ModuleInfo` object but we don't create that until after we've determined which modules a source file contains.
Another thought: `ModuleInfo` currently offers caching of the fparser2 parse tree. It might also be good if it offered caching of the PSyIR of that tree? (Although it's the parsing that is the most expensive part.)
My solution currently looks like:
```python
import re

_MODULE_PATTERN = re.compile(r"^\s*module\s+([a-z]\S*).*$",
                             flags=(re.IGNORECASE | re.MULTILINE))

def get_modules_in_file(self, filename):
    with open(filename, "r", encoding='utf-8') as file_in:
        source_code = file_in.read()
    mod_names = _MODULE_PATTERN.findall(source_code)
    return mod_names
```
> Another thought: `ModuleInfo` currently offers caching of the fparser2 parse tree. It might also be good if it offered caching of the PSyIR of that tree? (Although it's the parsing that is the most expensive part.)
It might be worth checking #2296 (PR #2367) first; the caching of the PSyIR is implemented there already. Plus, there was some significant restructuring between master and #2296, so I am actually not very keen to touch the module manager before #2296 is merged: it might create a lot of hard-to-resolve conflicts :(
> It might be worth checking #2296 first (PR #2367), the caching of the PSyIR is implemented there already. Plus there was some significant restructuring between master and #2296; I am actually not very keen to touch the module manager before #2296 is merged, it might create a lot of hard-to-resolve conflicts :(
I'll take a look. I was pushing #2413 forwards as it's my way of exploring how we handle searching for things and that will influence my opinion of #2367! I think I'm pretty much there now though.
Also, it's worth noting that my quick solution above failed for at least one NEMO source file (`zpshde.f90`) because of an encoding error: sometimes the comments include characters that they shouldn't. We have a workaround for this in fparser, so we may need to reuse that in whatever we do in `ModuleManager`.
A suggestion for a different approach (prompted by my trying to kernel-extract LFRic with UM physics, which does not follow the LFRic coding style, so a more flexible approach that avoids renaming would be great :) ). Instead of an API-specific implementation, maybe we could have a three-level implementation, which could be selected in the config file (or on the command line??):
1. Use the file name to try to deduce the name(s) of the module(s) in that file. The rule could be 'somehow' specified (flexible enough to express things like 'if the name ends in `_mod`, remove `_mod` and the rest is the module name', or 'append `_mod` to the filename to determine the module name') ... or a no-op, which will just return nothing.
2. Regex search (as discussed above).
3. Use fparser :)
So the config file would select which rules to enable/disable (e.g. you could specify a condition for the first level, disable the second and only use the third, etc.).
The module manager would search for candidates based on the lowest available rule, then confirm going up to the highest (e.g. in LFRic it would find the file and might then use a regex to confirm, or only use fparser, assuming that we will need the AST anyway, etc.).
I would need to work on the details a bit :) But before doing this, it would be great to get some feedback on whether this kind of approach would work for you (you would just disable the file-based rule in the config file, use the regex to find candidates, and then fparser to confirm, or so). To some degree it would make the actual usage of the module manager easier (since it would still be just one, API-independent singleton), and it would easily allow a mixture of coding styles.
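As a rough sketch of the proposed three-level lookup (all names here are hypothetical, invented purely to illustrate the control flow, and are not part of PSyclone):

```python
# Hypothetical sketch of the three-level lookup described above. None of
# these names exist in PSyclone; they only illustrate the control flow.
import re

_MODULE_PATTERN = re.compile(r"^\s*module\s+([a-z]\S*)",
                             flags=re.IGNORECASE | re.MULTILINE)

def filename_rule(path):
    """Level 1: guess the module name from the file name (no file I/O)."""
    stem = path.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return [stem]

def regex_rule(source):
    """Level 2: cheap textual scan for 'module <name>' statements."""
    return _MODULE_PATTERN.findall(source)

def find_module(name, path, source, enabled=("filename", "regex")):
    """Return True if 'name' is believed to live in 'path', applying the
    cheapest enabled rule first; level 3 (a full fparser parse) would
    give the final confirmation."""
    if "filename" in enabled and name not in filename_rule(path):
        return False
    if "regex" in enabled and name.lower() not in (
            n.lower() for n in regex_rule(source)):
        return False
    # Level 3: parse 'source' with fparser here, if enabled.
    return True
```

Disabling the filename rule (`enabled=("regex",)`) would then match the mixed-coding-style case, at the cost of reading every candidate file.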
I'd be a bit worried that a user would have to make sure that the config file was set up to match the API they are using (unless we allow rule configuration in each API section of the file?).
I do like your idea of some initial rule that perhaps looks at a set of filenames and decides which have "most in common" (e.g. longest shared substring) with the module we are looking for. That would work pretty well in most codes I suspect. Presumably there's a Python package that will do the "most in common" search? As you say, we could then confirm with a regex and then, if that is OK, go ahead and parse the file. That way it is general but means we don't parse everything!
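The standard-library `difflib` module can do this kind of "most in common" ranking; a minimal sketch (the function name is invented for illustration):

```python
import difflib

def rank_candidate_files(module_name, filenames):
    """Order candidate files by how much their stem has in common with
    the module name we are after. SequenceMatcher.ratio() rewards long
    shared subsequences, approximating 'longest shared substring'."""
    def score(fname):
        stem = fname.rsplit(".", 1)[0]
        return difflib.SequenceMatcher(None, module_name, stem).ratio()
    return sorted(filenames, key=score, reverse=True)

files = ["constants_mod.f90", "field_mod.f90", "io_utils.f90"]
# 'field_mod.f90' should come out on top for module 'field_mod'.
print(rank_candidate_files("field_mod", files))
```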
As an aside, it would be really useful if fparser could serialise its parse tree to file and read it back in again. That way we can cache the result of a parse on disk and have it live between invocations of PSyclone.
> I'd be a bit worried that a user would have to make sure that the config file was set up to match the API they are using (unless we allow rule configuration in each API section of the file?).
I see it differently: since (with my suggestion) the module manager would be API-independent and would only depend on the coding style (or lack of one :) ), it depends on the project rather than the API. And each project should (imho) use its own config file, so it would be easy to define the rules in that one place.
> I do like your idea of some initial rule that perhaps looks at a set of filenames and decides which have "most in common" (e.g. longest shared substring) with the module we are looking for. That would work pretty well in most codes I suspect. Presumably there's a Python package that will do the "most in common" search? As you say, we could then confirm with a regex and then, if that is OK, go ahead and parse the file. That way it is general but means we don't parse everything!
I'll think about the details then :)
> As an aside, it would be really useful if fparser could serialise its parse tree to file and read it back in again. That way we can cache the result of a parse on disk and have it live between invocations of PSyclone.
Yes, that would indeed be great. Or (I got this idea from a totally different project): run fparser as a service/daemon. Parsing would just contact this service when it needs something, and the parse tree would be 'somehow' communicated back. And obviously this service would be persistent, so it could cache data :)
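Assuming the parse result can be pickled (whether fparser2 trees pickle cleanly would need checking), a persistent on-disk cache keyed on file modification time might look something like this sketch (all names invented; a plain callable stands in for the parser):

```python
# Sketch of a persistent parse cache using pickle; 'parse_fn' stands in
# for whatever produces the expensive parse result.
import os
import pickle

def cached_parse(filename, parse_fn, cache_dir=".psyclone_cache"):
    """Return parse_fn(filename), reusing an on-disk pickle when it is
    at least as new as the source file."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir,
                              os.path.basename(filename) + ".pkl")
    if (os.path.exists(cache_file) and
            os.path.getmtime(cache_file) >= os.path.getmtime(filename)):
        with open(cache_file, "rb") as fin:
            return pickle.load(fin)
    result = parse_fn(filename)
    with open(cache_file, "wb") as fout:
        pickle.dump(result, fout)
    return result
```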
I've just happened across the fparser2 code that deals with encoding errors so thought I'd put it here:
```python
import codecs
import logging

def log_decode_error_handler(err):
    """
    A custom error handler for use when reading files. Removes any
    characters that cause decoding errors and logs the error.

    :returns: 2-tuple containing the replacement for the bad chars (an
        empty string) and the position from which decoding should continue.
    :rtype: Tuple[str, int]
    """
    message = f"character in input file. Error returned was {str(err)}."
    # Log the fact that this character will be removed from the input file.
    logging.getLogger(__name__).warning("Skipped bad %s", message)
    return ("", err.end)

codecs.register_error("fparser-logging", log_decode_error_handler)
```
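This relies on Python's standard `codecs.register_error` mechanism, so once a handler is registered its name can be passed via the ordinary `errors=` argument of `open()` or `bytes.decode()`. A self-contained illustration (with an invented handler name, to avoid clashing with fparser's):

```python
import codecs
import logging

def drop_and_log(err):
    """Replace undecodable bytes with an empty string and log a warning,
    mirroring the fparser handler above (the handler name is invented)."""
    logging.getLogger(__name__).warning("Skipped bad character: %s", err)
    return ("", err.end)

codecs.register_error("drop-logging", drop_and_log)

# b'\xff' is not valid UTF-8, so decoding would normally raise; with the
# registered handler the bad byte is simply dropped.
text = b"module my_mod \xff comment\n".decode("utf-8",
                                              errors="drop-logging")
print(repr(text))  # 'module my_mod  comment\n'
```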
> I see it differently: since (with my suggestion) the module manager would be API independent, and only depend on the coding style (or lack of :) ) - which is dependent on the project. And each project should (imho) use its own config file, and so it would be easy to define the rules in that one place.
Ah yes, that does make sense. I guess I'm coloured by my own practices, where I have just one config file and switch between projects. However, I'm thinking that it would probably be best to avoid having to encode rules in the config file at all. That feels like quite a lot of work when the general approach we're discussing will probably be fine.
Interestingly, I could get um-physics+lfric to work just by assuming that any file `XX.f90` contains a module named `XX`. That worked for the LFRic coding style (`XX_mod.f90` contains `XX_mod`) and for the UM (where no `_mod` is required). I wonder if this would be sufficient for our 'first filter'; I don't think it should be PSyclone's task to enforce/verify that each module source file name ends in `_mod`.
That would then end up with three levels:
- Filename minus extension is the module name
- Use a regex to search for module statements
- Use fparser
The config file could then store up to three values (keywords to be decided):

```
module_finding_rules = filename, regex, fparser
```

So if an app has no coding style you could just remove `filename`, or if you know that most applications don't need the AST you could remove `fparser`.
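Parsing such a config entry would be straightforward; a sketch (the keyword `module_finding_rules` is the placeholder from this discussion, not a real PSyclone config option):

```python
# Hypothetical parsing of the proposed 'module_finding_rules' entry.
VALID_RULES = ("filename", "regex", "fparser")

def parse_module_finding_rules(value):
    """Turn a comma-separated string such as 'filename, regex, fparser'
    into an ordered tuple of rule names, rejecting unknown ones."""
    rules = tuple(item.strip() for item in value.split(","))
    for rule in rules:
        if rule not in VALID_RULES:
            raise ValueError(f"Unknown module-finding rule '{rule}'")
    return rules

print(parse_module_finding_rules("regex, fparser"))  # ('regex', 'fparser')
```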
I've just realised that there is a bad assumption in #2296: originally, `ModuleInfo.get_psyir()` would return a FileContainer. That does not make any sense in the case of multiple modules in one file. I refactored this in #2296 so that we now get the module's Container object (instead of the FileContainer). Still, for full support of multiple modules in one file, when parsing is triggered for one module we need to update all the modules in that file to get the right PSyIR pointer.
I'm getting to the point where #2413 is almost ready for review. However, it is big and one of the changes it contains is to modify ModuleManager to use a regex search instead of the current LFRic-specific naming convention. It's not a big change (https://github.com/stfc/PSyclone/pull/2413/files#diff-7eda7179b2125cf3871f4e242b34370125db37a87e786f6b68024e732cdd1d84). Would putting this on (I could create a separate PR for it) cause trouble for you @hiker?
Actually, my simple change means that every source file is read twice in two different locations: once for the regex to check which modules it contains (in `ModuleManager`) and a second time in `ModuleInfo`. This is clearly not good. I've a feeling you had a plan to address this @hiker?
I've just refactored what I have in #2413 to remove this duplication. If you have a look at my latest changes you'll see I've altered `ModuleInfo` to take an optional `src` argument. What do you think?
> I've just refactored what I have in #2413 to remove this duplication. If you have a look at my latest changes you'll see I've altered `ModuleInfo` to take an optional `src` argument. What do you think?
I had a quick look. As far as I can see, you are basically reading in and storing all source files of all search directories (assuming that the file you are looking for is the last ;) ). That seems to be a huge overhead (e.g. in my driver creation I pass in the whole LFRic source tree, so potentially this would mean all of the LFRic sources would need to be read - over 6800 files). Wouldn't it be better to rely on the coding style to do a first filter, and only read them all if we can't find anything?
I just read the history of this ticket, and my assumption was always that there will be an option to use the filename to determine the module names (potentially verified by then doing a regex on that one file only) - with the idea that more rules to select filenames based on module names could be added (e.g. by a setting in the config file).
Also, the handling of `src` feels wrong: we read in the source code using a static method. Why is that in `ModuleInfo` and not the `ModuleManager`? That somehow doesn't feel right.
I am not sure whether this would affect me. I would expect a performance impact (which could be an issue, though it needs to be measured) and a memory-usage impact (which probably doesn't matter much). But reading a significant(?) part of LFRic over and over for each file we process feels like a huge waste of resources.