fasten
fasten copied to clipboard
[FASTEN server] Fix paths in input records on-the-fly
Description
This PR makes the following changes to the FASTEN server:
- Adds an optional argument
--pi
, which is a base directory for reading input records. E.g., the MetadataDBExtension requires this argument as it reads CG files from the disk. - Adds the method
fixPathInRecord
, which adds the base directory to the relative paths so that the plug-ins can read input records. - Stores relative path in the
dir
field ofpayload
in the output records.
This is a backward-compatible change. Therefore, output records with absolute paths can still be processed.
Motivation and context
As discussed in the previous dev. call, in output records, we would like to store relative paths to the FASTEN data on the disk such as CGs and GID graphs.
Testing
Tested with the DC locally.
Task list
- [x] Fix paths in input records on-the-fly
- [ ] Add the
--pi
argument to the Docker compose setup. - [ ] Add the
--pi
argument to the production deployments
@mir-am I am a bit confused, didn't the FASTEN server have the --po
option already?
Should it say that an optional --pi
argument is added? What does it mean though optional? What happens if you don't add it, will it still work with absolute paths as before?
I am not 100% sure yet, whether the proposed fix addresses the problem, as the broken path might be in different locations in the JsonObject
and it is not necessarily in the root... I tend to say that an easy (yet more expensive) fix would be to do a replacement directly on the Json string, before the JsonObject is even created. For example, replace("/mnt/fasten/pypi-test/", "/mnt/fasten/pypi/")
or replace("/mnt/fasten/pypi/pypi/", "/mnt/fasten/pypi/")
or maybe we should simply remove this base path altogether (replace with ""
) in favor of using an environment variable.
@gdrosos or @michelescarlato: Could you chime in here and help to clarify a bit, in which cases messages currently contain broken paths and how these paths should be fixed?
One fundamental problem with our current file handling is that we have multiple base paths and it is impossible to distinguish them... we have the FASTEN base folder that is mounted (/mnt/fasten/
), plus we have base folders for every plugin (e.g., the folder into which repos are checked out or call graphs are placed). If a message now contains a full path to a repository checkout, it is not clear where the FASTEN file structure ends and the repo content starts.
What I would therefore advocate for is that we should either 1) have multiple of these base path variables (e.g., baseDir
, baseRepoCloner
, baseCallGraphs
), so plugins can use these config variables instead of constructing paths themselves or 2) what I would find even better would be to introduce an abstraction that will take care of finding the right folder... for example, we could have a RepoClonerIO
that could be asked for the right path to the checked out folder of a particular Revision
.
The challenge that we are facing is that replacing file paths is a fundamental change which might break multiple components in the system, so we need to be extra careful before we put this into production.
@mir-am As a minor request, please stop introducing these ultra-short names, as it becomes really hard to decipher what they mean. There is no harm in using a full name to describe the option, the only moment in which they are used is in a config file, so "readability" > "saved keystrokes".
@mir-am I am a bit confused, didn't the FASTEN server have the
--po
option already?
Sorry, I've corrected the PR text. The introduced CLI arg is --pi
. The --po
option was already there.
Should it say that an optional
--pi
argument is added? What does it mean though optional? What happens if you don't add it, will it still work with absolute paths as before?
Only plug-ins that have dir
in the payload of their input records like metadataDBExtension
and CallableIndexerPlugins
must add --pi
to specify the base path where their input records are stored.
So, if a plug-in has dir
in the payload of its input record and the --pi
is not given, then the FASTEN server throws RuntimeException
and reminds the user to specify --pi
.
It also works with absolute paths for backward compatibility.
I agree with Sebastian that this is quite a fundamental change, which can impact many components. What path-containing-fields are we actually changing here? There is dir
for plugins that output but there is also eg. sourcePath
and perhaps others.
We should keep in mind that other plugins, also non-Java ones, should be able to be configured easily to account for paths becoming relative.
Since we are trying to synchronise these paths across plugins, I always thought have a sort-of "shared root" was not a bad idea, and it seems /mnt/fasten
currently has this role. I would like to understand what issue it actually causes, and whether there are not other solutions.
I agree with Sebastian that this is quite a fundamental change, which can impact many components. What path-containing-fields are we actually changing here? There is
dir
for plugins that output but there is also eg.sourcePath
and perhaps others.We should keep in mind that other plugins, also non-Java ones, should be able to be configured easily to account for paths becoming relative.
Since we are trying to synchronise these paths across plugins, I always thought have a sort-of "shared root" was not a bad idea, and it seems
/mnt/fasten
currently has this role. I would like to understand what issue it actually causes, and whether there are not other solutions.
At the moment, there are no issues, but the issue arises, for example, when we want to move fasten data (/mnt/fasten/data
) to another place, then the paths in the input records are no longer valid.
That said, I agree that this is a fundamental change and we should consider all the possible consequences and involve partners.
@gdrosos or @michelescarlato: Could you chime in here and help to clarify a bit, in which cases messages currently contain broken paths and how these paths should be fixed?
Some messages of the output topic of the cscout which is employed to produce C Call Graphs contained broken paths due to a minor bug described here. I have fixed the issue and I am redeploying the plugin on the mnt/fasten/data/debian
folder.
@gdrosos or @michelescarlato: Could you chime in here and help to clarify a bit, in which cases messages currently contain broken paths and how these paths should be fixed?
Some messages of the output topic of the cscout which is employed to produce C Call Graphs contained broken paths due to a minor bug described here. I have fixed the issue and I am redeploying the plugin on the
mnt/fasten/data/debian
folder.
@gdrosos, is the path to C call graphs absolute or relative?
@gdrosos or @michelescarlato: Could you chime in here and help to clarify a bit, in which cases messages currently contain broken paths and how these paths should be fixed?
Some messages of the output topic of the cscout which is employed to produce C Call Graphs contained broken paths due to a minor bug described here. I have fixed the issue and I am redeploying the plugin on the
mnt/fasten/data/debian
folder.@gdrosos, is the path to C call graphs absolute or relative?
The path specifying the sourcePath is absolute in both CScout and PyCG. Also I noticed that the messages of the MetadataBBCExtension also contain aboslute paths in the payload field