opal icon indicating copy to clipboard operation
opal copied to clipboard

Improve handling of `java/lang/System` entry points in CG generation

Open johannesduesing opened this issue 1 year ago • 11 comments

This PR resulted from issues discussed in #230, in particular the fact that calls to System.out.println are not found by OPAL with multiple CG algorithms when the RTJar is loaded (either interfaces-only or fully). The reasons for this are depending on the CG algorithm and configuration, but boil down to:

  • RTA cannot find an instantiation of PrintWriter (the type of System.out) when RTJar is loaded as interfaces only - it is simple never instantiated
  • Somewhere between JDK9 and JDK11, they switched from using System.initializeSystemClass (which we model as an initial entrypoint in the config) to using System.initPhase1, System.initPhase2 and System.initPhase3 - all of which are called from native code and currently not configered to be initial entry points. This is another reason why RTA - even when the RTJar is loaded fully - does not find calls to println.
  • Propagation-Based algorithms never found a write to System.out that initializes the field (even when RTJar is loaded fully) - the reason for this is that the fields are set in native code

Side-Note: This was never a problem when the RTJar was not loaded, because CallGraphAnalysis.unknownLibraryCall caught any calls to virtual methods (like println in this scenario).

Changes in this PR This PR introduces some changes to tackle the problems mentioned above. This includes:

  • Declare initPhase1, initPhase2 and initPhase3 of type java/lang/System to be initial entry points in the config - see the OpenJDK implementation for details on this.
  • Always consider java/lang/Thread, java/lang/ClassLoader, java/io/PrintStream and java/io/InputStream to be instantiated via config
  • Improve propagation-based cg algorithms - they now always assume that System.out, System.in and System.err are instantiated with their declared types by adding them to the respective typeSetEntity - furthermore, we now always assume that the configured instantiated types are available for the ExternalWorld type set entity.

With these changes CHA, RTA and all Propagation-Based CG algorithms will now find calls to System.out.println, regardless of whether the RTJar is loaded at all, as interfaces only, or fully.

What's still to do

  • [ ] Find more types that should always be considered instantiated - if possible define a method for identifying those types and add them to the config
  • [x] Analyze why the CFA-Algorithms both do not find calls to println and fix the issue
  • [ ] There is still an issue with library fields if the library is loaded as interfaces only. We will likely have no writes to those fields, as the library method bodys are not loaded, therefore propagation-based algorithms will never consider the fields instantiated. They are also not considered ExternalWorld because their definition is available, therefore they cannot be treated differently compared to "normal" project fields. Decide on whether those cases deserve special handling or not - we might just have to live with that fact. The only way of dealing with the would be something like if(isLibraryField(field) && librariesAreInterfacesOnly) considerInstatiated(typeSetEntity(field), field.fieldType)

johannesduesing avatar Nov 29 '24 15:11 johannesduesing

Thank you for the extensive investigation and the PR. I fully agree with the additions to the config and with adding instantiated types to ExternalWorld.

I'm a little wary however with System.{in,out,err} special handling for the propagation-based CGs. Could this not lead to scenarios where RTA is less sound than the propagation-based CGs because RTA doesn't have this modeling? Also, different to changes to the config, this can never be overridden for an analysis of a Java System that might behave differently (Android maybe?). Is there another way this could be achieved, e.g., modeling the native methods that set the fields? If we decide to keep your approach, I'd suggest to make it more general by not hardcoding the System class into it. It could also be made configurable via the config files which fields should be initialized.

For the last future work item: That is a fair point. I think we could deal with it like you suggested, though maybe rather than setting them to the field type, they could access the ExternalWorld like they would without loading the interfaces? That would also solve the issue above with System.{in,out,err}, right?

errt avatar Nov 29 '24 16:11 errt

I think you make a good point. I restructured parts of this PR so that now:

  • There is an InitialInstantiatedFieldsKey and corresponding InstantiatedFieldsFinder to detect fields that shall always be considered instantiated
  • Very similar to the InstantiatedTypesFinder it has a configurable aspect so that users can specify fields to be considered instantiated. By default the field type will be considered only, but specific types and subtype-trees can be specified in the config.
  • The SoundLibraryInstantiatedFieldsFinder (implemented but not used by default) would consider all library fields to be instantiated with their declared type if libraries are interfaces only - this could address the last future work item

I think this might be a good solution to the problem - analyses for android can just remove the three lines from the configuration, and they also don't introduce any overhead of libraries are not loaded at all.

Regarding your suggestion for associating library fields with ExternalWorld if libraries are interfaces only: I had that exact solution implemented earlier, but i does not really scale to any library besides the RTJar - users would have to manually specify the types of their library fields every time they use a new / different library.

What do you think of this idea, does that work for you?

johannesduesing avatar Dec 02 '24 16:12 johannesduesing

Oh i forgot to mention: The rta.InstantiatedTypesAnalysis and xta.InstantiatedTypesAnalysis now both use the InitialInstantiatedFieldsKey for their respective initialization, thus (hopefully) fixing the issue that RTA could sometimes be less sound than XTA.

johannesduesing avatar Dec 02 '24 16:12 johannesduesing

This solution sounds reasonable. Thank you for your work!

About the ExternalWorld: true for manual specification of types. A suitable InitialInstantiatedTypesFinder would do this automatically however.

errt avatar Dec 03 '24 09:12 errt

My latest commit adds support for the initial intantiated fields to CFA-1-0 and CFA-1-1. For the latter i had to create dummy allocation sites as the fields are defined in native code - i don't know if that is the way to go, but it seems to work as expected.

So regarding the library fields and the ExternalWorld we now have two options:

  • Leave it as it currently is. We have the SoundLibraryInstantiatedFieldsFinder that can detect those fields and mark them as instantiated. All CG algorithms respect the initial instantiated fields in their respective initialization phases.
  • Explicitly associate library fields (if libraries are interfaces only) with the ExternalWorld set entity. This will really only work with Propagation-Based algorithms, since RTA and CFA-1-* do not work on set entities (at least i did not see them using ExternalWorld anywhere). It would require implementing a suitable InitialInstantiatedTypesFinder that would maybe rely on the InitialInstatiatedFieldsFinder. The downside would be that we would have to adapt the CG algorithm(s) when it comes to finding instantiated fields in order to encode this specific case.

Is there any specific reason you think we should use the second option? I'm currently thinking the first one should be less overhead, but maybe i'm missing something .. 🤓

johannesduesing avatar Dec 03 '24 13:12 johannesduesing

You can have a look at ConfiguredMethodsPointsToAnalysis, it needs to do the same for allocation sites in native or otherwise unavailable methods.

The main reason for choosing the second option would be to have fewer necessary modifications. For RTA, it would work immediately, as RTA just uses instantiated types and thus the proper InitialInstantiatedTypesFinder will automatically make it work. For the propagation-based ones it would be just the single line that adds the InitialInstantiatedTypes to the ExternalWorld that we probably want to have anyway. The InstantiatedFieldsFinders would not be necessary. For the points-to-based algorithms, it should be similar: one handling for instantiated types (that we probably need anyway if we don't yet have it) instead of that and the additional one for the fields. So my question is, where do you see the benefit of the first option? The one that I see is that it could be more specific, in adding types only to individual fields instead of a (potentially large) set of initially instantiated types being available through all (matching - but that could still be many, e.g., Object fields) fields.

errt avatar Dec 03 '24 15:12 errt

There are two points why i think the first option may be the better one:

  • Like you said, it's more specific and should keep loss of precision to a minimum for Propagation-Based algorithms and CFA - if we load large libraries as interfaces, the ExternalWorld could become very large.
  • The second option does not work for fields being set in native code if the library contents are actually loaded. If we go with the second option, at least for System.{out, err, in} the callgraphs will not capture println if the RTJar is actually loaded - it works if libraries are interfaces only.

So in my mind, if we want to capture fields being set in native code - and i'm open to discussing whether or not we want that - then we need an InitialInstatiatedFieldsFinder, otherwise it won't work in the case that libraries are actually fully loaded.

That being said, if you want me to implement the second option i'll do that - justed wanted to point that out 😄

johannesduesing avatar Dec 03 '24 16:12 johannesduesing

Shouldn't the second point already be addressed by the configured native methods analyses? I thought it was.

errt avatar Dec 03 '24 16:12 errt

After discussion, we will proceed as follows:

  • Remove InitialInstantiatedFieldsFinder
  • Use configured native methods to model effects of setOut0 & others
  • Add more instantiated types, see here
  • Also model effects of initPhase1 and others so calls can be detected if RTJar is interface only

johannesduesing avatar Dec 05 '24 14:12 johannesduesing

After some further inspection i think this is not going to be as easy as we thought it would. I found that:

  • setOut0 and others are already modeled with their field-setting effects in the config - with the new entrypoints in the config as well, RTA out of the Box finds calls to System.out.println if the RTJar is loaded
  • I was confused that Propagation-Based Algorithms did not find those calls with the RTJar loaded. I found that the PropagationBasedCallGraphKeys use the rta.ConfiguredNativeMethodsInstantiatedTypesAnalysisScheduler (see here). This does not make sense to me, this analysis was specifically written for RTA and attaches all configured instantiated types to the Project entity - as far as my understanding of any of the propagation-based CG algorithms goes, this is conceptually not compatible with XTA & co, which use typeSetEntities to attach types to.
  • When loading the RTJar as interfaces only, i am facing another issue: The EntryPointsFinder does not consider methods without bodies. This means that even though i carefully configured the effects of System.initPhase1 with its instantiations and invocations, it is simply never considered by the CG framework, which expects entrypoints to have a body.

If my assessment is correct, i think the following steps would be sensible to take next:

  • Write a new ConfiguredNativeMethodsInstantiatedTypesAnalysis for XTA & co. I started doing this, but found that because the configuration can state that "This fields is assigned the value of the first parameter" (which is what we have for setOut0), it needs to be a lot more complex than the one for RTA, as it would need continuations and dependencies on its caller property - maybe it would be sensible to integrate it into the existing TypePropagationAnalysis for XTA? (It would be structurally very similar and use a lot of the same constructs)
  • Consider entrypoints even if they are empty, provided that native method configurations exist for them - unfortunately the correpsonding configuration (plus deserialization utilities) are in the TAC project while entrypoints are in BR, meaning i can't access them directly. I removed the emptiness check for debugging purposes and it did not work as expected out of the box, so this would likely require more debugging

What do you think of this? Is this the right way to go?

johannesduesing avatar Dec 10 '24 10:12 johannesduesing

You are right, they are configured to use the RTA class there, which doesn't seem to make much sense. Not sure why that is, I thought there already was a variant for XTA, but apparently, there isn't. So yes, it looks like we need to add that.

For the other issue, it looks like entry points should always be considered, even without a body. I don't think it is necessary that they have configuration attached, they could still just exist without any callees. They should probably exist (so you don't get spurious methods in the CG that are configured as entry point for a different setup than you analyze), but bodies should not be necessary. I wonder why removing the emptiness check didn't just work, so yes, that should be looked into.

Thank you for your investigation!

errt avatar Dec 10 '24 10:12 errt

/it:test

johannesduesing avatar Sep 12 '25 18:09 johannesduesing

In my opinion, this PR is done now. My local experiments showed that OPAL can now correctly detect calls to System.out.println for CHA, RTA, XTA, 1-0-CFA and 1-1-CFA when the JRE is not loaded, loaded as interfaces only or fully loaded. Many things changed throughout the development, so here's the most important points:

  • The EntryPointsFinder now works on DeclaredMethods instead of Methods. Consequently, it now propagates configured entry point methods even if the do not exist in the project - they are then represented as VirtualDeclaredMethods.
  • All call graph related analyses- especially those related to reading configured behavior - can now deal with the fact that entry points may be virtual
  • DefinitionSites now use DeclaredMethods instead of Methods. This enables us to propagate virtual definition sites for configured points-to behavior in methods that are not actually part of the project. The points-to framework has been adapted to deal with this change.
  • The XTA framework got its own ConfiguredNativeMethodsInstantiatedTypesAnalysis - previously it used the same as RTA, which did work in the XTA context.

Let me know what you think of those changes - i'll wait for the IT tests to complete and see if the changes broke any test cases.

johannesduesing avatar Sep 12 '25 18:09 johannesduesing

As the failed it:test shows, this PR needs re-generation of the FPCFAnalysesIntegrationTest reference files. This seems to be because of changes to the entities for the escape analysis? Given the scope of this PR, a deeper look into the files might be sensible to ensure that nothing else was broken.

errt avatar Sep 15 '25 09:09 errt

/it:test

johannesduesing avatar Sep 17 '25 09:09 johannesduesing

@errt I updated most expected values for the IT test, i unfortunately do not have the resources at hand to generate the ExtensiveAnalysesTest-OPAL-MultiJar-SNAPSHOT-01-04-2018-dependencies.txt.gz file. I looked through the changes of two archives in depth and found that they really only differed in the toString representation of methods. If you want me to i can do some further investigation, but i am confident that there are no unintended behavioral changes.

Can you regenerate the missing expected value befor i rerun the IT tests?

johannesduesing avatar Sep 17 '25 12:09 johannesduesing

/it:test

johannesduesing avatar Sep 17 '25 14:09 johannesduesing