PSyclone
PSyclone copied to clipboard
(Closes #2462) generalise ModuleManager
@hiker commented on this implementation (in #2462):
I had a quick look. As far as I can see, you are basically reading in and storing all source files of all search directories (assuming that the file you are looking for is the last ;) ). That seems to be a huge overhead (e.g. in my driver creation I pass in the whole LFRic source tree, so potentially this would mean all of the LFRic sources would need to be read - over 6800 files). Wouldn't it be better to rely on the coding style to do a first filter, and only read them all if we can't find anything?
I just read the history of this ticket, and my assumption was always that there will be an option to use the filename to determine the module names (potentially verified by then doing a regex on that one file only) - with the idea that more rules to select filenames based on module names could be added (e.g. by a setting in the config file).
Also, the handling of src feels wrong. We read in the source code using a static method - why is that in ModuleInfo, and not the ModuleManager? That somehow doesn't feel right,
I am not sure if this would affect me. I would expect a performance (which could be an issue, though it needs to be measured) and memory usage impact (which probably doesn't matter much). But reading a significant(?) part of LFRic over and over for each file we process feels like a huge waste of resources.
My concern is not LFRic but everything else. However, I take your point about potentially, repeatedly reading 6800 files! I've taken a look at the NEMO source and the rule there appears to be that a file <some_name>.f90 contains module some_name
. In WaveWatchIII that rule largely holds but there are exceptions. @LonelyCat124 and @sergisiso, could you comment on the other sources that you've been looking at? As @hiker said earlier in #2462, a good first test would be if module_name in file_name
.
The "where to put the file reading" problem is a bit of a chicken-and-egg. ModuleManager
has to read a file to confirm that it contains a given module (unless this is strictly enforced as it is in LFRic) and it is only then that it creates a ModuleInfo
object for that module. However, currently it is the ModuleInfo
that contains the functionality for reading (and parsing, processing) the source file.
Using:
from difflib import SequenceMatcher
score = SequenceMatcher(None, str1, str2).ratio()
gives a floating-point measure of how similar two strings are (1.0 for identical, 0.0 for nothing in common).
We could have several rules implemented, e.g. module 'xxx' is in file xxx.[fF]90, xxx_mod.[fF]90, and then only handle a greatly reduced number of files (to then either regex and/or parse them).
And/or additionally, we could provide a list of exceptions in the config file (since this is project specific)? When I tried the kernel extraction on the um physics, I had to rename a file or two, but being able to just add them to a config file would have been great.
I love ❤️ the similarity one! Can we try to test various module names with the basename of the files? We could define a threshold (again, in the config file?) for which files are similar enough.
I love ❤️ the similarity one! Can we try to test various module names with the basename of the files? We could define a threshold (again, in the config file?) for which files are similar enough.
Yes, I think that might be a good, general-purpose solution that doesn't require too much work. (I need this in a hurry really.)
Socrates is ~ 700 files - it seems to be either module_name_mod.f90 or module_name.f90. There are some non-module containing files but I guess this just doesn't deal with those.
The problem at the moment is that the ModuleManager expects to process all files it encounters as it searches for a particular module (in get_module_info
) whereas we really need it to search through all files and identify the most likely candidates before actually processing them. We may need to keep a record of all files we have seen but not processed in order to do this which is quite a big change to how it currently works?
Can it (or indeed PSyclone at all) deal with multiple modules inside a single file?
The problem at the moment is that the ModuleManager expects to process all files it encounters as it searches for a particular module (in
get_module_info
) whereas we really need it to search through all files and identify the most likely candidates before actually processing them. We may need to keep a record of all files we have seen but not processed in order to do this which is quite a big change to how it currently works?
Yes. ATM we immediately store a mapping from the (unverified/assumed) module names to ModuleInfo object, which (initially) only stores the path to the file. So, if we can't deduce the mod name (based on coding style), we should store this information to avoid having to back to the file system again.
Can it (or indeed PSyclone at all) deal with multiple modules inside a single file?
Somewhat. It can map several module information to the same file. But (from memory), if you then request the source code (or fparser or psyir info), each module info will independently read and parse the source code again. It shouldn't be hard to fix that though, each module info could keep a list of other modules (and then use the module manager to update the other objects. I feel there might be an even better solution ... maybe we can share some state info between these module info objects that come from the same file??
I feel there might be an even better solution ... maybe we can share some state info between these module info objects that come from the same file??
Should we have an intermediate e.g. FileInfo
object that then stores references to related ModuleInfo
instances if they've been created for it (or an empty list if the contents of the file are yet to be examined)? These FileInfo
objects would then keep track of the files we've found and their full path (+...?). I did wonder whether this should be handled in fparser
but decided that fparser should probably just stick to parsing.
Should we have an intermediate e.g.
FileInfo
object that then stores references to relatedModuleInfo
instances if they've been created for it (or an empty list if the contents of the file are yet to be examined)? TheseFileInfo
objects would then keep track of the files we've found and their full path (+...?). I did wonder whether this should be handled infparser
but decided that fparser should probably just stick to parsing.
That was actually my very first thought, but then I expected you to say that this should be done in the PSyIR - just make FileContaier to store that info (and be populated later on parsing time) :) That was pretty stupid reasoning, so yes, I think that would make sense, and the very clean solution to handle multiple modules in one file.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 99.86%. Comparing base (
b136831
) to head (fddba96
).
Additional details and impacted files
@@ Coverage Diff @@
## master #2564 +/- ##
=======================================
Coverage 99.86% 99.86%
=======================================
Files 354 356 +2
Lines 48366 48458 +92
=======================================
+ Hits 48301 48393 +92
Misses 65 65
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
I'm on to final tidying now but have a design question for @hiker and @sergisiso: should I extend the Config class' setter for include_paths
so that it also passes these paths on to the ModuleManager
or should I leave that to be the responsibility of the code that wants to use the ModuleManager? I prefer the former because, just as with Config, ModuleManager is a singleton. Actually, now that I write that I realise that we will then have two separate places that store include/search paths and that's not good. Perhaps they should only live in ModuleManager? What do you think?
@arporter I am not sure if I understand why do you need to extend the setter? I would expect the Module manager to use the same include_dir (what is passed with the -I option). So it just needs to consult the property of this value. Or is this different? I do agree they should live just in a single place.
The issue is that currently, the ModuleManager and Config objects are both singletons and have their search paths set separately (albeit, the ones passed in to ModuleManager are looked up from Config). On reflection, probably ModuleManager should not store these paths at all and simply look them up from the Config object?
EDIT: actually it's not that simple. ModuleManager recursively searches directories and caches all those that it has seen. This is a key part of its optimised search strategy.
@arporter Does it cache if the regex has found the modules? I am surprised this is necessary, the regex should be very fast.
(Unrelated to this PR I am going to propose to move away from singletons as they cause many issues which are currently in the code inadvertently - but this is a separate discussion maybe for next psyclone meeting)
No, it caches which directories it has searched. (The concern I think is recursively exploring the directories.) It also caches the contents of a file once it has been read (which obviously it has to be to use the regex). I'm thinking we just set its search path at the same time as we store it in the Config object. Since it recursively searches, what it actually stores is not just the include paths that the Config object has.
No, it caches which directories it has searched. (The concern I think is recursively exploring the directories.)
Still, I can grep for a string in all LFRic trunk in 0.2s
I understand doing it once and getting a map of module_name -> file_where_it_lives. But why do we have to repeat this (since the include_path won't change in a psyclone execution), couldn't we do this once and the beginning and then just look up the map?
I have to admit that I am also unsure about the recursive functionality, because then it diverges from the -I flag classic of compilers.
No, it caches which directories it has searched. (The concern I think is recursively exploring the directories.)
Still, I can grep for a string in all LFRic trunk in 0.2s
Wow.
I understand doing it once and getting a map of module_name -> file_where_it_lives. But why do we have to repeat this (since the include_path won't change in a psyclone execution), couldn't we do this once and the beginning and then just look up the map?
It isn't done repeatedly. The search just expands until it finds the requested module. The files it finds on the way are cached so that the next time a module is requested they don't have to be found again.
I have to admit that I am also unsure about the recursive functionality, because then it diverges from the -I flag classic of compilers.
True, it's optional but the default is to do it.
All of this is really a question of how the ModuleManager was originally designed and isn't something I was proposing to change here. Perhaps @hiker can comment?
It isn't done repeatedly. The search just expands until it finds the requested module. The files it finds on the way are cached so that the next time a module is requested they don't have to be found again.
Would it impact the performance to remove the current stop and let all the -I path files be mapped at this stage?
Would it impact the performance to remove the current stop and let all the -I path files be mapped at this stage?
I suppose the issue is that if we do that then we pay the full cost of the mapping every time we launch PSyclone (that's probably only a small percentage of the total PSyclone run time though). Apart from a reduction in code complexity, I'm not sure what we gain by doing this? Would it be that the ModuleManager then stores a map of files rather than a list of directories (and thus has a clearer distinction from the info. held by Config)?
Would it impact the performance to remove the current stop and let all the -I path files be mapped at this stage?
I suppose the issue is that if we do that then we pay the full cost of the mapping every time we launch PSyclone (that's probably only a small percentage of the total PSyclone run time though). Apart from a reduction in code complexity, I'm not sure what we gain by doing this? Would it be that the ModuleManager then stores a map of files rather than a list of directories (and thus has a clearer distinction from the info. held by Config)?
I would very much prefer to stop asap instead of doing a full scan. At this stage with the Fab build system we have no way of communicating this mapping from the worker processes back (we might look at that some time in the future), and the same holds for the old makefile-based builds. Which means we potentially have dozens of PSyclone processes (48 on our machine) running at the same time, each one then doing a full scan of all source files (at least when driver creation is requested, we need to look at all directories, including jules, ukca, um, ...). I am not sure how happy lustre will be about this. I am uncertain if the information will be/can be cached, since potentially new files will be created in the tree (psy and alg files) - and this is mostly meta-data info, which can be expensive.
I have to admit that I am also unsure about the recursive functionality, because then it diverges from the -I flag classic of compilers.
True, it's optional but the default is to do it.
All of this is really a question of how the ModuleManager was originally designed and isn't something I was proposing to change here. Perhaps @hiker can comment?
First of all, the behaviour was implemented to mirror existing behaviour of PSyclone when specifying the kernel path - it searches recursively as well. Secondly, we don't really want to constantly ask users to update the build system when adding a new directory (alternatively, the build system would need to scan for directories and provide them on the command line).
No, it caches which directories it has searched. (The concern I think is recursively exploring the directories.)
Still, I can grep for a string in all LFRic trunk in 0.2s
Are you sure that the data is not cached by the OS? Here what I see on our research computer on a lustre filesystem:
[jxh903@gadi-login-07 lfric]$ time grep -ir "where is this" .
real 1m1.211s user 0m0.693s sys 0m3.839s [jxh903@gadi-login-07 lfric]$ time grep -ir "where is this" .
real 0m4.780s user 0m0.589s sys 0m2.192s
So >10 times faster the second time around. And I got similar timings on a different node (I don't think I can drop cache myself)
The issue is that currently, the ModuleManager and Config objects are both singletons and have their search paths set separately (albeit, the ones passed in to ModuleManager are looked up from Config). On reflection, probably ModuleManager should not store these paths at all and simply look them up from the Config object?
They should be only in one place, I agree. The reason why I didn't do more towards this was that I think the kernel path and module manager path should be combined, but since module manager will contain more files (kernel path only includes lfric.../kernels
, while the module manager will contain lfric.../*
) - and that could in theory change the behaviour, so instead I opened #2011 to discuss this :) So I guess, this PR will also close #2011.
I really don't want to do #2011 here :-) I want this one to go on as soon as possible because it's needed for the LFRic GPU work. I can add some TODOs to the relevant places for now.
Are you sure that the data is not cached by the OS?
Aha, yes it was (but also for potentially consecutive psyclone invocations). In my laptop it goes from 5s (first) to 0.2s (following). But I do see the problem with Luster now, which is a common target and agree with the need for caching/stop searching.
I still see some problems with the recursive behaviour for generic code, as I would expect to put the same INCLUDE_PATH as its provided to the compiler, but since it acts differently in may get a different result if it has another module with the same file in a subdirectory. I suppose we at least do a BFS search so we potentially always pick the one in the top directory?
I still see some problems with the recursive behaviour for generic code, as I would expect to put the same INCLUDE_PATH as its provided to the compiler, but since it acts differently in may get a different result if it has another module with the same file in a subdirectory. I suppose we at least do a BFS search so we potentially always pick the one in the top directory?
It uses os.walk to find the directories, which (by default) goes top to bottom. It then searches all files in the first directory found, then the next etc - so yes, BFS.
Same file name in the build tree is a recipe for disaster :) I don't think we need to be concerned about that. FWIW, Fab will immediately abort if it finds the same file name twice (so there's a lot of cherry-picking involved then importing um, jules, ukca, ... packages, since they often have the same utilities file ... in different versions).
I just realised that recursive is a flag when adding the search path (defaulting to True), so it would be fairly easy to switch this off (e.g. different command line options could add a path either recursive or not).
Ready for review now from either @hiker or @sergisiso . It generalises the ModuleManager so that it no longer assumes a strict mapping between filename and the module that it contains. More significantly, it also changes the interface-resolving mechanism to use the ModuleManager. Integration tests are running.