Discussion: Minimising code execution at import time / the downsides of the JPype import system

Open pelson opened this issue 3 years ago • 27 comments

The new(ish) JPype import hook (added in #224) is a neat trick to allow us to declare Java imports in a Pythonic way rather than using string based package access (e.g. from java.lang import Object vs Object = jpype.JClass('java.lang.Object')). There are some major downsides to using it though, so I wanted to open this ticket to discuss (and hopefully mitigate) them.


Pros / Cons

Downsides

Python imports should be side-effect free & import order shouldn't matter

In Python it is frowned upon to have a major side-effect at import-time. I'm no Java expert, but I've seen a few Java packages in which side-effects are common (starting a background thread, initialising static members, etc.), so perhaps this is a significant cultural difference between the Python & Java languages?

Unfortunately the JPype import mechanism forces us to either start the JVM in our code before importing Java packages, or to import things in a specific order such that another module has a JVM starting side-effect. For example:

import jpype
import jpype.imports
jpype.startJVM()

from java.lang import String

Or

import the_previous_example

# Works because the previous example started the JVM and enabled the ``jpype.imports`` hook.
from java.lang import String

Both examples demonstrate import side-effects (the JVM gets started!) and the import order is critical to the successful operation of the code. This is more than just a theoretical "you shouldn't do that" - it puts our code at odds with essential development tools such as linters, and our helpful IDE/isort (which fix import order automatically) will actively break both examples by "fixing" our import order for us.

The import side-effects / order requirements alone mean that I've been unable to recommend the use of the JPype import system for anything other than end-user applications (and categorically not libraries). In all honesty, this is the single reason why I'm writing this issue - I'd love to find a single solution that is workable for all types of code which use JPype.
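For comparison, here is a minimal sketch of the string-based workaround a library can use today to stay free of import-time side-effects. The helper names java_class and make_greeting are made up for illustration; only jpype.isJVMStarted and jpype.JClass are real JPype calls, and the application is assumed to start the JVM itself.

```python
import functools


@functools.lru_cache(maxsize=None)
def java_class(name):
    """Resolve a Java class on first use instead of at import time.

    Hypothetical helper, not part of JPype.
    """
    # Imported lazily so that importing this module never touches the JVM.
    import jpype

    if not jpype.isJVMStarted():
        raise RuntimeError(f"The JVM must be started before using {name}")
    return jpype.JClass(name)


def make_greeting(text):
    # The lookup happens on the first call; later calls hit the cache.
    String = java_class("java.lang.String")
    return String(text)
```

Nothing here runs Java code at import, so import order no longer matters, but we are back to programming by string and lose the fail-fast behaviour discussed below.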

JPype takes ownership of Java TLD namespaces

The JPype import hook involves commandeering the top-level names used in Java. This is done by opt-in, except for a few pre-registered names. If a registered name collides with a Python package, the Java package wins if jpype.imports is enabled (and not otherwise). I believe it is possible to create an alias/special-prefix if there is a collision you particularly want to avoid, but this mechanism seems to be undocumented (the whole of jpype.imports is undocumented currently, including registerDomain).

Lots of imports

This is a fairly minor point, but in order to use something from Java using the new import system, you have to import it. This is a fair reflection of what you have to do in both Python and in Java, but it does indeed represent more LOC than the old mechanism. You could argue this is an advantage, as it is much more explicit, and allows things like import aliases etc. (see advantages below).

Only works on py3.6+

Honestly, this doesn't bother me in the slightest. Python 3.5 was end-of-life in September 2020.


Upsides

No more programming by string :tada:

The old jpype.JClass('java.lang.String') approach of accessing a class by string rather than using the import mechanism feels a little hacky, and is certainly not something that is easy to validate statically (I had an idea on that, but hit a brick wall in https://github.com/python/mypy/issues/10004, more to follow). Tab completion in an IDE is also not possible.

I don't fully understand the comment in the motivating PR though:

It prevents accidental use of non-imported classes which improves type safety and makes python code more robust to code refactors of the java code

Could you provide more detail on this please @Thrameos? I'm a little confused as JPype could/does manage the imports for us as we access names on a JPackage instance.

Fail fast

If we have the JVM running when we define classes/interfaces we can validate the implementation there and then, rather than deferring any exception to later on. This follows the fail-fast philosophy (definitely Pythonic style!), and means that exceptions are raised at the point of issue, rather than some other place such as at the constructor.

Take the following example:

import jpype
import jpype.imports
jpype.startJVM()

from org.foo import JavaInterface

@jpype.JImplements(JavaInterface)
class MyImplementation:
    pass

If MyImplementation doesn't implement the full set of methods defined in org.foo.JavaInterface then we get an exception at import time. This is a good thing - we haven't implemented the interface correctly, and we find out even earlier than with Python's own ABC, which only validates the implementation when the class is instantiated.

There are means to avoid doing this at import-time in JPype. I added the deferred flag for JImplements in #659, and the previous example can be written as the following (example from the docs):

import jpype as jp

@jp.JImplements("org.foo.JavaInterface", deferred=True)
class MyImplementation:
    pass

This time, since JPype doesn't yet have a running JVM, it cannot find out more about the org.foo.JavaInterface and therefore cannot know if the implementation is good or not. The result is that we end up getting the exception at instantiation:

MyImplementation()  # Can raise if the MyImplementation doesn't implement the interface correctly

This can be highly non-obvious to a user. Ideally a user would never see such an exception (even in the non-deferred case) - in Java for example this error would be seen by the MyImplementation developer, as their code would fail to compile (thought: perhaps we are missing a stage that would help us validate our implementation at a pseudo-compile-time, somehow).
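To make the difference in timing concrete, here is a pure-Python sketch that needs no JVM. The implements decorator is a hypothetical stand-in and is not how JImplements is implemented; it only shows where the exception surfaces in the eager and deferred cases.

```python
def implements(*required, deferred=False):
    """Hypothetical interface check; a stand-in, not JPype's JImplements."""

    def check(cls):
        missing = [n for n in required if not callable(getattr(cls, n, None))]
        if missing:
            raise NotImplementedError(
                f"{cls.__name__} is missing: {', '.join(missing)}"
            )

    def decorator(cls):
        if not deferred:
            check(cls)  # eager: fails while the module is being imported
            return cls

        original_init = cls.__init__

        def checked_init(self, *args, **kwargs):
            check(cls)  # deferred: fails only when someone instantiates
            original_init(self, *args, **kwargs)

        cls.__init__ = checked_init
        return cls

    return decorator


errors = []

try:
    @implements("run")
    class Eager:
        pass
except NotImplementedError as exc:
    errors.append(("definition", str(exc)))


@implements("run", deferred=True)
class Deferred:
    pass


try:
    Deferred()
except NotImplementedError as exc:
    errors.append(("instantiation", str(exc)))
```

In the eager case the author of the class sees the failure; in the deferred case it is whoever first calls the constructor.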

Type annotations possible

Having the JVM running at import time means we can use things like Java classes as return type annotations:

import jpype
import jpype.imports
jpype.startJVM()

from org.foo import MyClass

def factory_of_my_class() -> MyClass:
    return MyClass(...)

In #714 there is a prototype which generates stubs for Java packages/classes which would give us the ability to run static analysis (using tools like mypy) on our Java interactions, and this will also enable IDEs such as PyCharm to give us very convenient tab-completion, for example. In the example above this would work by generating a stub-package for the org (pseudo-)package exposed by JPype's import hook.


Summary

For full transparency: The reason I'm opening this discussion is because I find it hard to get fully behind the JPype imports mechanism - for me the requirement to have the JVM running at import time is a deal-breaker, despite the numerous very attractive benefits that the mechanism brings.

As a result, I'm curious if there is another approach that might be able to give us the benefits, without the drawbacks. I have one or two prototypes (mostly extending the ideas in #714 (stubgenj)) and believe there might be mileage in extending JPype's use of typing / annotations as a form of pseudo-compile step. It currently is based on the idea of accessing jp.JPackage(top_level_only).<thing_on_top_level> which can be fully type-checked by providing an exhaustive set of typing.Literal[top_level_name] overload annotations for jpype.JPackage. I don't really want this discussion to go too far down this route - I'm more than happy to open up another issue for that. Instead...
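A rough sketch of what those Literal overloads could look like, assuming generated stub types for each top-level package. The names package, JavaRoot and OrgRoot are invented for this example and are not JPype API; a real version would annotate jpype.JPackage in a stub file.

```python
from typing import Literal, overload


class JavaRoot:
    """Hypothetical stub type for the top-level ``java`` package."""


class OrgRoot:
    """Hypothetical stub type for the top-level ``org`` package."""


_ROOTS = {"java": JavaRoot, "org": OrgRoot}


@overload
def package(name: Literal["java"]) -> JavaRoot: ...
@overload
def package(name: Literal["org"]) -> OrgRoot: ...


def package(name):
    # Runtime behaviour is a plain lookup; the overloads above exist only
    # so a type checker can map each literal name to a precise type.
    try:
        return _ROOTS[name]()
    except KeyError:
        raise ValueError(f"unknown top-level package: {name}") from None
```

With this in place a checker such as mypy can infer that package("java") is a JavaRoot and reject a misspelled literal, all without a running JVM.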

My objective for this issue is to have an exhaustive set of pros/cons such that any alternative proposal to the JPype import mechanism could ensure that it is hitting the various requirements. Please comment if there is something I've missed, or if there is detail that can be added to any of the points. This issue doesn't need to remain open indefinitely, so please feel free to close as soon as the discussion has taken its course.

pelson avatar Feb 03 '21 09:02 pelson

@pelson Happy to discuss this.

First there have been many changes since the original PR most of which have been documented in https://jpype.readthedocs.io/en/latest/imports.html

Most notably we now support importing of a tld (i.e. "import org") without issues. In the old version we had to import all the way down to a single item to get proper error checking. In addition JPackage and JImports are merged into a common object so they have the same caching process. The only difference between them is that JImport has an import handler that produces ImportError rather than AttributeError. Lastly, at some point we did add support for Python 2.6 and 2.7 which likely included 3.5. However this may have been removed when 2.7 was dropped.

About the comment

The comment about failing on import simply means that using the style

from org.pkg import MyObject
from org.pkg import MyAttr

is better in that it will fast fail if the Java classes can not be located immediately. This is a huge advantage when the Java code gets refactored and you don't need to wait for a specific branch to be hit.

Also at the time JPackage did not check if the class actually existed. This was a huge defect. If you did JPackage('org').pkg.MyObject and MyObject did not exist, it gave a meaningless error saying that JPackage can not be called. This is because JPackage was basically an implementation of Mock which just added attributes whenever they were requested.

The new version of JPackage does not have any of these problems. It now will correctly error when an object does not exist and actually has a dir function so you can see what is available. I still strongly encourage the fast fail method of importing though as it is much less error prone.

Order and Side effects

As far as the side effect at import time, this is unfortunately a product of how the import system hooks work. You can't use an import hook until it is installed and you can't install a hook without importing a module with the hook. So for example if you wanted to use maven to install import hooks for python so that it automatically pulls an import then the order of the imports matters. This is separate from the requirement to start the JVM, so let's deal with the startup process in general as another subject.

Startup

The jpype startup pattern predates me by a lot. I have studied other packages, but unfortunately they all have similar problems. The main issue is that Python does not allow you to pass parameters into an import so there is no way to configure the JVM.

Let's consider two different styles. JPype current style... (for simplicity let's assume that tld and import registration was automatic)

import jpype
jpype.startJVM(options_to_jvm)  # explicit starting point
import java
from org.pkg import MyObject

versus the pyjnius method

import jpype.config
jpype.config.options = options_to_jvm
import jpype  # This does not actually start the JVM, but does prime the system to start on the first call.
import java   # JVM starts here on the first use of a Java package, but may start earlier
from org.pkg import MyObject

Same: In both, the order of the calls is important. jpype must be imported before other calls because of the import hooks.

Worse: In the pyjnius formulation we have an implicit start of the JVM which violates "explicit is better than implicit".

Better: The implicit form is much better when you have two jpype using modules. But there is still the problem of who gets to set the options and the class path. The situation is slightly improved with the recent ability to add to the class path after the JVM is started, but there are still a number of issues on that concept that need to be resolved, because Java requires certain things like database drivers to be loaded by the base class loader.

I do think that people have a preference towards the passive loading, and we could add a jpype.config module which would enable automatic starting of the JVM on the first call to JClass or JPackage. But that really doesn't solve the order issue.

If Python had a well defined method of how to configure a package or pass options to it, I think we should conform. But the convention is that we should have as few side effects as possible, so import jpype does not start the JVM.

Programming by String

The other key advantage on using the imports is that it is actually much faster. Calling JClass with a string is going to call Class.forName which is a very slow call and requires the string to get passed. I cringe every time I see code that looks like

   ls = [ JClass('org.pkg.MyFunc').call(i) for i in some_list ]

and I see this formulation or similar frequently when I review JPype using code.
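The fix is simply to hoist the lookup out of the loop. The sketch below counts lookups with a stand-in function so it runs without a JVM; lookup_class is hypothetical and only plays the role of a by-name call such as JClass('org.pkg.MyFunc').

```python
counter = {"lookups": 0}


def lookup_class(name):
    """Stand-in for an expensive by-name lookup; not a JPype function."""
    counter["lookups"] += 1
    return str.upper  # pretend this is the resolved Java callable


items = ["a", "b", "c"]

# Anti-pattern: the class is looked up by string once per element.
slow = [lookup_class("org.pkg.MyFunc")(i) for i in items]
repeated_lookups = counter["lookups"]

# Better: resolve once, then reuse the resolved object.
counter["lookups"] = 0
my_func = lookup_class("org.pkg.MyFunc")
fast = [my_func(i) for i in items]
hoisted_lookups = counter["lookups"]
```

A module-level import gives the same effect as the hoisted form, since the name is bound once.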

TLDs

I debated a lot on the tlds. I believe it is acceptable for these reasons.

  1. Jython already reserves the tlds for use by Python. In fact this was an early source of errors we had to clean up. Many packages that supported Jython would test if it was Jython simply by attempting to import a tld and if it worked assumed they could access Jython internals.

  2. Java packages are always last to be checked. Python's import order checks local directories first, then system packages, then extensions. So as my coworker discovered if you add a directory called "java" under a project you will suddenly get that python module loading and not the Java one. This means that no matter what you have under the classpath the conflict is always resolved in Python's favor. If you must deal with a conflict there is an alias command or use JPackage.
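The "last to be checked" behaviour is a general property of finders appended to the end of sys.meta_path. The sketch below shows that property using only the standard library; FallbackFinder and the module name demo_tld_xyz are made up, and this is not JPype's actual hook.

```python
import importlib.abc
import importlib.machinery
import sys


class FallbackFinder(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    """Hypothetical finder that claims a fixed set of top-level names."""

    def __init__(self, names):
        self.names = set(names)

    def find_spec(self, fullname, path=None, target=None):
        if fullname in self.names:
            return importlib.machinery.ModuleSpec(fullname, self)
        return None

    def create_module(self, spec):
        return None  # use the default module object

    def exec_module(self, module):
        module.ORIGIN = "fallback finder"


# Appended, so it is consulted only after Python's regular finders decline.
sys.meta_path.append(FallbackFinder({"demo_tld_xyz", "json"}))

import demo_tld_xyz  # nothing else provides it, so the fallback does
import json          # the regular finders win; the fallback is never asked
```

Here demo_tld_xyz.ORIGIN is set by the fallback, while json is the ordinary standard library module even though the fallback also claimed that name.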

Unfortunately a lot of Java packages don't use tlds. They are unfortunately on their own. Language conventions were defined for a reason and package names are not some stylistic thing.

Conclusion

The JImports system was meant to be a bolt on because I was unwilling to force the tld registration on the user. It has improved a lot since it was introduced. I often debate if it is sugar or a necessity. With the merge of JPackage it is more of a nicety as now JPackage does everything that JImport does in terms of safety. But it does mirror the same system in Jython and JEP so it is unlikely to get removed. Prior to the merge my feeling was that it should be the default so the "jpype.imports" is not necessary. But to do so the JVM startup issue needs to be resolved.

The order issue is more of a start up method which is another topic.

Thrameos avatar Feb 03 '21 17:02 Thrameos

The jpype startup pattern predates me by a lot. I have studied other packages, but unfortunately they all have similar problems. The main issue is that Python does not allow you to pass parameters into an import so there is no way to configure the JVM.

Agreed entirely. I doubt that it will ever be possible to pass parameters at import - so we are left having to figure out a workable solution with the tools in our box... the trick that we can play is that we get a "startup" execution when a module is imported - so long as we don't blow that trick on code execution / major side-effects.

In a non-interactive context it is also entirely reasonable to assume that all imports should happen at the start (that is recommended in PEP8: "Imports are always put at the top of the file, just after any module comments and docstrings, and before module globals and constants"). Given this, we could gather together all of our JVM requirements before actually starting the JVM:

from pathlib import Path

import jpype as jp
from jpype.doesnt_exist_yet import jvm_config, requires_jvm

@jvm_config.add_classpath(Path(__file__).parent / 'my_package.jar')
@requires_jvm
def my_factory() -> jp.JPackage('my_package').some_subpackage.SomeClass:
    return jp.JPackage('my_package').some_subpackage.SomeClass(...)

Many modules can do this kind of declaration, and the jvm_config thing can be the place that the requirements are gathered centrally. Once in place, there is no need to manually start the JVM - it is implicit by the requires_jvm, so if you call the my_factory function, it will start the JVM if it hasn't already been started.

If the JVM has already started before the module is imported (because it happens, especially in an interactive environment), then we have the context to know that the JVM config needed to be in a certain way. We can raise if we can't do anything about it, or if we have clever tricks up our sleeve for classpath etc., we can apply them automagically.
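A minimal, JVM-free sketch of that idea follows. JVMConfig, jvm_config and requires_jvm are hypothetical names taken from the pseudo-code above and do not exist in JPype; the actual JVM start is replaced by a flag so the control flow can be followed.

```python
import functools


class JVMConfig:
    """Hypothetical collector of JVM requirements; not part of JPype."""

    def __init__(self):
        self.classpath = []
        self.started = False

    def add_classpath(self, entry):
        if entry in self.classpath:
            return  # already satisfied, nothing to do
        if self.started:
            raise RuntimeError(
                f"JVM already started; cannot add {entry!r} to the classpath"
            )
        self.classpath.append(entry)

    def ensure_started(self):
        if not self.started:
            # A real version might call jpype.startJVM(classpath=...) here.
            self.started = True


jvm_config = JVMConfig()


def requires_jvm(func):
    """Start the (simulated) JVM lazily, on the first call of func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        jvm_config.ensure_started()
        return func(*args, **kwargs)

    return wrapper


# Import time: requirements are only recorded, nothing is started.
jvm_config.add_classpath("my_package.jar")


@requires_jvm
def my_factory():
    return "created"
```

Calling my_factory() flips the started flag; a later add_classpath("late.jar") then raises, which is the "we can raise if we can't do anything about it" case.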

One thing I don't know: how the type annotation would work out. You need the JVM running to be able to resolve the annotation :thinking:. Unless evaluation is postponed, that annotation is evaluated when the function is defined, which is an import-time side-effect, so it definitely wouldn't work without the JVM running. PEP 563 (from __future__ import annotations, available since Python 3.7) addresses this by postponing evaluation. With that in place there is a little trick we might be able to play inside JPype itself:

import typing

if typing.TYPE_CHECKING:
    # Replace the JVM requiring JPackage with one which doesn't actually do anything
    # other than mocking stuff
    JPackage = MockedJPackage
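On the library-author side, the same effect can be had with a quoted annotation, which is what PEP 563 does for every annotation automatically. In this sketch org.foo.MyClass is hypothetical, and a type checker would need a stub package for org (for example one generated as in #714) to resolve it.

```python
import typing

if typing.TYPE_CHECKING:
    # Only type checkers follow this branch, so no JVM is needed at import.
    from org.foo import MyClass


def factory_of_my_class() -> "MyClass":
    # The quoted annotation is stored as a string and never evaluated here.
    raise NotImplementedError("a real implementation needs a running JVM")


annotations_seen = factory_of_my_class.__annotations__
```

The module imports cleanly without a JVM, and the annotation is kept as the plain string 'MyClass' until something asks to resolve it.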

As discussed in #900, I would go one step further for packages - I would expose hooks to allow Java dependency declaration as part of the package metadata, rather than as runtime-metadata. This would mean that JPype can know about installed packages which need a running JVM to function, without having to import them before the JVM is started. This would dramatically help in the REPL environment, because in that context users often don't import things in a certain order.

In interactive mode the above looks like:

>>> import jpype as jp
>>> jp.startJVM(options_to_jvm)

>>> my_obj = jp.JPackage('org').pkg.Object(...)

>>> from jvm_using_pkg1 import my_factory
>>> my_factory()  # Works. The JVM was already started.

>>> from jvm_using_pkg2_with_runtime_config_defn import my_other_factory
ImportError: The JVM was already started, but my_other_factory needs to set some startup flags.
             Please import before starting the JVM

>>> from jvm_using_pkg2_with_pkg_metadata_config_defn import my_other_factory
# Works, because JPype knew before starting the JVM that it needed some extra config 
# for this package, even though the package itself wasn't imported until now.

The new version of JPackage does not have any of these problems. It now will correctly error when an object does not exist and actually has a dir function so you can see what is available. I still strongly encourage the fast fail method of importing though as it is much less error prone.

Great to hear about the significant improvements to jpype.JPackage! As I've previously said, I think you've done an awesome job with the improvements to JPype so far. In my view JPype is becoming a tool which could become seriously mainstream. For that to happen it is essential that JVM-using libraries can be written in a seamless way such that the user doesn't need to care much that the JVM is running (other than the fact that they need the JVM installed perhaps). Hence this discussion and an eagerness to try to dog-food this a little bit :stuck_out_tongue_winking_eye:.

For what it is worth, and just in case this is also one of the hurdles for convincing you, I don't love the spelling of jpype.JPackage('my_tld').my_subpackage. I'd happily move to a less explicit spelling such as jpype.jvm.my_tld.my_subpackage (or similar). By the nature of avoiding a dependence on the JVM running at import time though I would explicitly not allow import jpype.jvm.my_tld (or even import jpype.jvm) - I basically see this jvm thing as a package-level instance (possibly with other behaviour, such as a start() method). I don't want to muddy the conversation too much though - I don't have strong feelings about this change in API, and am much more interested in the discussion about the separation of import-time and runtime behaviour.

pelson avatar Feb 04 '21 05:02 pelson

Unfortunately we still have too many issues on the table for me to resolve in this thread.

My initial thoughts are that there were two abandoned PRs that I worked on long ago that may be relevant.

The first was jpype.classes. In order to give a home to all active classes for pickle, every JClass instance was automatically homed here dynamically and the reported module would be a subpackage of this dynamic module. This was abandoned for two reasons. First it is not possible to get a dir of this module because there is no dir of root in Java. Second it wasn't necessary for pickle.

The second was @jpype.onload. This was a decorator placed on a class (iirc) that had the effect of executing the contents when the jvm is loaded and adding the contents to the module it was defined from. The result being a portion of the module (imports, classes, and functions) would not appear into a module until later. This was abandoned because it interacted with the all import mechanism. The forward stubs stayed in their original form while the qualified symbols properly resolved. It is very unclear how this would interact with the typing system.

Perhaps there are solutions to these problems as both seem to be in the direction you are requesting.

Thrameos avatar Feb 04 '21 13:02 Thrameos

Here is an example of the onload that I had been working on. (recreated from memory)

import jpype
import jpype.imports

def export(mod, ldic):
    for p,v in ldic.items():
        print(p,v)
        setattr(mod, p, v)
    pass


def onload(func):
    import inspect
    mod = inspect.getmodule(func)
    # print(func.__code__.co_varnames) # We can use the names of locals to predefine stubs
    def call():
        ldic = func()
        if ldic is not None:
            export(mod, ldic)
    jpype.onJVMStart(call)
    return call


######################################################

@onload
def deferred():
    import jpype
    import java
    import java.lang.Object as Object

    def bar(o:Object) -> Object:
        pass

    @jpype.JImplements(java.io.Serializable)
    class Test(object):
        pass

    return locals()  # This is required export the locals

jpype.startJVM()
print(Object)
print(bar)
print(Test)

Here the symbols in deferred don't actually get executed until after the JVM starts. The symbols appear in the correct module. But they can't be accessed ahead of time nor appear in __all__ nor can they be fast forward checked.

The use of this pattern would be in __main__:

   import jpype
   import module_using_jpype
   import module2_using_jpype

   jpype.startJVM()

The modules would define their classpaths and jvm parameters using jpype.addClassPath() and jpype.addOption(), and then define a deferred section within each module for any jvm dependent code. The onload may be able to modify the getattr for the declaring module so that any deferred symbols that are accessed throw an exception. Later when startJVM is called, all of the deferred sections are executed in order which replaces any stubs with the actual symbols.

As no Java code gets executed on the initial import this satisfies the requirement that imports and import order do not produce side effects. The actual evaluation of the deferred code happens at a defined and explicit location.
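The stub-then-replace behaviour can be sketched without a JVM using a module-level __getattr__ (PEP 562). The names on_start and start below are hypothetical stand-ins for the onload decorator and startJVM, and a types.ModuleType object stands in for the declaring module.

```python
import types

_deferred = []


def on_start(module, names):
    """Hypothetical decorator: run func at start() and export its locals."""

    def decorator(func):
        def module_getattr(name):
            # Only reached when normal attribute lookup on the module fails.
            if name in names:
                raise AttributeError(
                    f"{name!r} is deferred until start() is called"
                )
            raise AttributeError(name)

        module.__getattr__ = module_getattr

        def run():
            for key, value in func().items():
                setattr(module, key, value)

        _deferred.append(run)
        return func

    return decorator


def start():
    """Stand-in for startJVM: execute every deferred section in order."""
    for run in _deferred:
        run()


mod = types.ModuleType("demo_deferred_module")


@on_start(mod, names={"Answer"})
def deferred():
    Answer = 42
    return locals()  # still required to export the locals
```

Before start() is called, mod.Answer raises an AttributeError with an explanatory message; afterwards it is the real value, because the exported attribute is found before __getattr__ is consulted.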

Things I don't like here, we have an explicit call to return locals() which is easy to miss. Of course we can catch this error as we can check the locals expected versus the dict we got back and fail so this may be acceptable. (I tried byte code patching but it is way too fragile). Pushing a bunch of code into a function looks really funky and is likely to break code analysis tools especially with the symbol relocation routine.

Thrameos avatar Feb 04 '21 18:02 Thrameos

Perhaps Python needs a when statement that embodies the concept of deferred actions whose results will appear in the scope they were called just like locals used prior to being defined. So then we can have clean code like

import jpype

when jpype.jvm_loaded:
     import java
     
     @jpype.JImplements(java.io.Serializable)
     class MyImpl(object):
         pass

MyImpl() # Error, deferred element accessed before condition met
jpype.startJVM()
MyImpl() # Success

Thrameos avatar Feb 04 '21 19:02 Thrameos

Any thoughts on the deferred module loading as a solution?

Thrameos avatar Feb 07 '21 19:02 Thrameos

Any thoughts on the deferred module loading as a solution?

The when statement is a neat idea. It has some curious side-effects though, and I can imagine it being hell to debug or run a static analysis on. Might be worth discussing on python-ideas to see if there is any interest in moving it forward - I suspect though it would take a huge amount of discussion to iron out the details (& implement). In practice we are going to need to do something else until such a construct is available...

Your onload implementation is consistent with the when concept, but is truly magical - I doubt an IDE will be able to look at that and not cry about undefined symbols everywhere :smiley:.

The first was jpype.classes. In order to give a home to all active classes for pickle

Your jpype.classes concept is the closest to what I was proposing I think. For a usecase such as pickle to be addressed though you'd want to be able to refer to "non-active" classes too (you shouldn't have to have fetched/imported a class in order to be able to unpickle it). It sounds like this usecase may have already been addressed, though I note that you can't currently use the built-in pickle module (I think this is doable in the future, fwiw. I have some history with the pickle module :wink:). I opened #935 to further the discussion on pickling using standard library.

The only difference between my straw-man above and the jpype.classes concept is that I was working at the package level, rather than the class level.

import jpype as jp

@jp.JImplements(jp.jvm.java.io.Serializable)  # Deferred validation
class MyImpl:
     pass

if __name__ == '__main__':
    MyImpl()  # Potential to start the JVM implicitly, or just raise as we do today.

In this case, MyImpl has a few behaviours:

  • It can start the JVM automatically, since we know we need it (I'm 50/50 on this, explicit vs implicit, cost of checking if JVM running each time, etc.) OR It can raise if the JVM isn't running (simpler, but involves explicit JVM startup, which can only be done once, so is the task of a script/application and never a library)
  • It can validate, at JVM startup time, not at instantiation time and either
    • raise if the implementation isn't complete
    • raise if the implements interface doesn't exist
  • It can be pickled

To see how it might look, I took a shot at implementing something which can catch the issue in the following code:

import jpype as jp
from jp_jvm_prototype import JVM

String = JVM.java.lang.WoopsThisDoesntExist


def my_jvm_using_func() -> String:
    return String('foo')


if __name__ == '__main__':
    jp.startJVM()

    my_jvm_using_func()

My prototype doesn't actually work in that:

  • String is never a thing that can be used in JPype (even if it were a valid class). This is basically just a job of proxying everything (I think this is pretty much what JPype does, so we can perhaps hook into the low-level stuff to have similar performance)
  • The type of JVM.java.lang.String should be isinstance-able to jp.JClass('java.lang.String') once the JVM is running
  • I didn't look at implementing the JImplements construct

What is nice though is that the exception is rather good considering we don't have immediate (fail-fast) exceptions as a result of our "no import-time side-effects" rule:

Traceback (most recent call last):
  File "/home/pelson/github/jpype/jpype/support/deferred_import_error.py", line 4, in <module>
    String = JVM.java.lang.WoopsThisDoesntExist
ImportError: Unable to access java.lang.WoopsThisDoesntExist from the JVM. Have you set your class path correctly?

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/pelson/github/jpype/jpype/support/deferred_import_error.py", line 12, in <module>
    jp.startJVM()
  File "/home/pelson/github/jpype/jpype/jpype/_core.py", line 223, in startJVM
    initializeResources()
  File "/home/pelson/github/jpype/jpype/jpype/_core.py", line 310, in initializeResources
    _jinit.runJVMInitializers()
  File "/home/pelson/github/jpype/jpype/jpype/_jinit.py", line 53, in runJVMInitializers
    func()
  File "/media/important/github/jpype/jpype/support/jp_jvm_prototype.py", line 76, in _realise_deferreds
    ref._resolve_jreference()
  File "/media/important/github/jpype/jpype/support/jp_jvm_prototype.py", line 59, in _resolve_jreference
    raise ImportError(
ImportError: Delayed unresolvable reference to java.lang.WoopsThisDoesntExist found

The implementation of this prototype looks like:

import inspect
import types
from weakref import WeakSet

import jpype as jp


class JReference:
    """Represents either a JPackage or a JClass available on the not yet running JVM.

    """
    #: All instances of JReference.
    _instances: 'WeakSet[JReference]' = WeakSet()

    def __init__(self, item, parent=None):
        self._item = item
        self.parent = parent

        # Track where this thing is coming from in the form of an exception.
        # This will allow us to have a better context of a failed access to the JVM.
        try:
            raise ImportError(f"Unable to access {self} from the JVM. Have you set your class path correctly?")
        except ImportError as err:
            # Two frames up is where the exception should come from.
            frame_info = inspect.stack()[2]
            frame = frame_info.frame
            tb = types.TracebackType(
                err.__traceback__.tb_next, frame,
                tb_lasti=0, tb_lineno=frame_info.lineno,
            )
            self._exception = err
            err.__traceback__ = tb

        # Track all instances of JReference for future resolving.
        self._instances.add(self)

    def __getattr__(self, item):
        return self.__class__(item, parent=self)

    def __str__(self):
        if self.parent is None:
            return self._item
        else:
            return f'{self.parent}.{self._item}'

    def _resolve_jreference(self):
        # Change this thing into a real JPackage or JClass from the JVM.
        # TODO: Should we somehow be distinguishing packages from classes to
        #  avoid this trial and error?
        try:
            r = jp.JPackage(str(self), strict=True)
            dir(r)  # Raises AttributeError if the thing doesn't exist. (strict doesn't work?)
        except (ImportError, AttributeError):
            try:
                r = jp.JClass(str(self))  # Raises TypeError if the thing doesn't exist.
            except (ImportError, TypeError):
                # Make sure we include the source exception so that we get good
                # context of what the problem is (and where to fix it).
                raise ImportError(
                    f"Delayed unresolvable reference to {self} found"
                ) from self._exception
        return r


class _JpypeJVM:
    def __getattr__(self, root_item):
        return JReference(root_item)


# Construct a JVM instance.
JVM = _JpypeJVM()


def _realise_deferreds():
    for ref in JReference._instances:
        ref._resolve_jreference()


# Register the realisation function to be run when the JVM starts.
jp.onJVMStart(_realise_deferreds)

pelson avatar Feb 09 '21 14:02 pelson

Just for clarification jpype.classes contains packages so it is just like what you are describing.

So lets see if i can boil this down a bit. You cant check if something is a package or a class until the JVM is started nor if it exists. The auto JPackage simple creates JPackage instances mock style until it hits a valid class. As once the object is created we can't unmake it, this is a problem.

Radical solution... make JClass and JPackage have the same memory layout. Then we can make them do mock behavior until the JVM is started. Then we check which are classes, and then we have to somehow polymorph an existing class object. I have tried this sort of magic in the past but it is really meta magic as you dangerously write over the pointers of an existing type object. I was planning this trick to make JString into java.lang.String when the JVM goes live. But as it was the only one, I decided to shelve it.

The downside is we only get fast fail after the jvm is started.

Thrameos avatar Feb 09 '21 15:02 Thrameos

I wanted to round off the ideas that are flying around with a couple of simple examples. In each case I don't explicitly state where the JVM starts because it can be in a number of places (before import, during import or after import) and it is also plausible that we could move to an auto-start (non explicit) model, should we so wish. (note that if we don't have auto-start then it would be probably sensible to have a decorator to allow us to declare that a function/class needs the JVM to be started in order to work)
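A minimal sketch of the decorator idea mentioned above, assuming a hypothetical name requires_jvm and a local stand-in flag for the JVM state (neither is part of JPype; with JPype installed, jpype.isJVMStarted() could plausibly be used for the check instead):

```python
import functools


class _JVMState:
    """Stand-in for the real JVM state so the sketch runs without JPype."""
    started = False


jvm_state = _JVMState()


def requires_jvm(func):
    """Hypothetical decorator: fail early and clearly if the JVM is not running."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # With JPype installed, jpype.isJVMStarted() could replace this flag.
        if not jvm_state.started:
            raise RuntimeError(f"{func.__name__}() requires a running JVM")
        return func(*args, **kwargs)
    return wrapper


@requires_jvm
def offset_days(offset):
    # Placeholder body; real code would call into Java here.
    return offset
```

The point of the design is that the check happens at call time, so importing the module that defines offset_days has no side effect.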

Examples (for googlers, these don't work, they are pseudo-code):

import typing

import jpype as jp
from jpype import jvm


@jp.JImplements(jvm.java.io.Serializable)
class MyImpl:
    pass


def offset_date(
        offset: int,
        date: typing.Optional[jvm.java.time.temporal.Temporal] = None,
        unit: jvm.java.time.temporal.TemporalUnit = jvm.java.time.temporal.ChronoUnit.DAYS,
):
    if date is None:
        date = jvm.java.time.LocalDate.now()
    return date.plus(offset, unit)

It is entirely conceivable that we can support both the attribute form and the import form. note:

  • the jvm is prefixed to avoid the namespace concern raised at the outset
  • this would basically use the same mechanism as is used currently in JPype
  • the imports would work without the JVM running, and would only be validated after the JVM has started (just like the attribute form)

Example:

import typing

import jpype as jp
from jpype import jvm

from jpype.jvm.java.io import Serializable
from jpype.jvm.java.time import LocalDate
from jpype.jvm.java.time.temporal import Temporal, ChronoUnit
from jpype.jvm.java.time.temporal import TemporalUnit


@jp.JImplements(Serializable)
class MyImpl:
    pass


def offset_date(
        offset: int,
        date: typing.Optional[Temporal] = None,
        unit: TemporalUnit = ChronoUnit.DAYS,
):
    if date is None:
        date = LocalDate.now()
    return date.plus(offset, unit)

The above examples are both entirely type-annotate-able I believe (using the same stubs), giving us static analysis that will warn if we try to access things that don't exist, and offering auto-completion (and in the latter case, auto-import and package identification for a class name). (note: I probably need to look in more detail at exactly how the ChronoUnit.DAYS thing can work as a default argument - I think it basically means that the JReference thing in my prototype above needs to proxy everything in the same way that JObject would do, I don't know JPype well enough to say how hard/practical that is)

Reminder: both of the above examples can co-exist. The preferred style is entirely down to the developer. There are pros and cons to each, and I can see an advantage in being able to mix-and-match. The key thing though is that neither approach requires the JVM to be running at import time and both styles can be used by scripts, applications, and libraries without the concern for having to have the substantial side-effect.

pelson avatar Feb 09 '21 15:02 pelson

Radical solution... make JClass and JPackage have the same memory layout. Then we can make them do mock behavior until the JVM is started. Then we check which are classes, and then we have to somehow polymorph an existing class object.

I looked at morphing the thing too, but in truth, aren't JPackage and JClass just proxies in JPype anyway (i.e. each has a list of methods & data and routes them through to Java at runtime)? It sounds like we just need to be able to modify the "proxy" such that once the JVM starts it no longer allows arbitrary method access, and gets a list of things that can be accessed. I agree that if we bake deferred as a concept in JPype then the logical conclusion is that you can't tell the difference between a JPackage and a JClass until the JVM is running.

The downside is we only get fast fail after the jvm is started.

Agreed. This is the logical conclusion of avoiding import-time behaviour whilst still allowing "Pythonic" code (e.g. import order isn't important, imports at top of code, type-annotate-able). Either way you have to have the JVM running to get the exception; the major difference is that the exception won't happen at the moment the erroneous code is executed, so we'd have to be careful to provide a helpful traceback (hence my prototype).
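To make the "helpful traceback" point concrete, here is a rough sketch, assuming a hypothetical DeferredRef class (not JPype API) and a plain dict standing in for what the JVM can load. It records where the reference was first created so that the late failure can still point back at the offending line:

```python
import traceback


class DeferredRef:
    """Hypothetical deferred reference; not JPype API."""

    def __init__(self, name):
        self.name = name
        # Remember the call site (drop this __init__ frame itself).
        self.created_at = traceback.extract_stack()[:-1]

    def resolve(self, known):
        # ``known`` stands in for what the JVM can actually load.
        if self.name in known:
            return known[self.name]
        origin = "".join(traceback.format_list(self.created_at[-1:]))
        raise ImportError(
            f"Delayed unresolvable reference to {self.name}, "
            f"first referenced at:\n{origin}"
        )
```

The error is raised at resolution time, but its message carries the file and line where the bad reference was written.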

pelson avatar Feb 09 '21 15:02 pelson

Proxying has some big speed implications over a morph. There are a lot of implications with referencing as the typing model for isinstance would have major implications. This is hit very hard during method resolution.

Thrameos avatar Feb 09 '21 16:02 Thrameos

Also my resistance to the name jpype.jvm is that I consider that to be where the JVM controls belong. I couldn't move them due to compatibility. But I consider anything that doesn't have a capital J as currently being out of place, with the exception of subpackages in the jpype module.

JPype itself already contains java and javax and should have had all TLDs as well as a package called notld for non-conforming packages.

Thrameos avatar Feb 09 '21 16:02 Thrameos

Proxying has some big speed implications over a morph. There are a lot of implications with referencing as the typing model for isinstance would have major implications. This is hit very hard during method resolution.

I'd like to explore that a little bit. You definitely wouldn't want to be doing a lookup on each attribute access, but if you save a successful lookup then the next lookup can be at the same speed as normal Python method resolution.

I've confirmed this with the following code:
class ObjectProxy:
    """A proxy to an object.

    We don't want to touch this class, so we make new classes as needed using

    NameOfThing = type("NameOfThing", (ObjectProxy, ), {'a_class_attr': 123, ...})

    """
    _java_canonical_name: str

    def __getattr__(self, name):
        # Some expensive operation to find/build an appropriate method.
        # We want to do this only once for a given name.

        if name == 'some_method':
            def meth(self, an_arg):
                return f'Calling _jpype.some_method_dispatcher_fn({self.__class__.__name__}, {name}, {an_arg})'
            setattr(self.__class__, name, meth)
            result = getattr(self, name)  # Get back the bound method.
        elif name == 'some_property':
            result = lambda self: f'Calling _jpype.some_property_dispatcher_fn({self.__class__.__name__}, {name})'
            result = property(result)
            setattr(self.__class__, name, result)
            result = getattr(self, name)
        else:
            # TODO: Cache this, so that we don't do expensive lookups for things that don't exist.
            raise AttributeError(f'Unknown attribute {name}')
        return result


String = type("String", (ObjectProxy, ), {
    '_java_canonical_name': 'java.lang.String',
})



# Now we build something which is pure python and has no expensive attribute lookup:

class JTypeSimpleBase:
    # This subclass exists exclusively to provide comparable classes for timing.

    _java_canonical_name: str

    def __getattr__(self, name):
        raise AttributeError(name)


class DirectString(JTypeSimpleBase):
    def some_method(self, an_arg):
        name = 'some_method'
        return f'Calling _jpype.some_method_dispatcher_fn({self.__class__.__name__}, {name}, {an_arg})'

    @property
    def some_property(self):
        name = 'some_property'
        return f'Calling _jpype.some_property_dispatcher_fn({self.__class__.__name__}, {name})'


if __name__ == '__main__':
    import timeit
    print('Direct string:', timeit.timeit(
        "DirectString().some_method('foo')",
        setup="from __main__ import String, DirectString",
        number=1000000,
    ))
    print('Proxy string:', timeit.timeit(
        "String().some_method('foo')",
        setup="from __main__ import String, DirectString",
        number=1000000,
    ))

The results are pretty clear that there is no measurable cost to the proxy approach when done in this way:

Direct string: 0.27402336400700733
attaching some method
Proxy string: 0.27132017398253083

There are a lot of implications with referencing as the typing model for isinstance would have major implications

Agreed. I think the conclusion that we'd have to draw is that you can't have an instance until the JVM is running, so an isinstance('foo', jp.jvm.java.lang.String) would have to raise if the JVM isn't running. Once the thing is running though, we can morph type inheritance and/or implement __instancecheck__ appropriately.
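As an illustration only, the "raise until the JVM is running" behaviour can be sketched with a metaclass hook. The names DeferredTypeMeta, jvm_state and _stand_in are made up for this example, and the Python str type merely stands in for java.lang.String:

```python
class _JVMState:
    """Stand-in for the real JVM state so the sketch runs without JPype."""
    started = False


jvm_state = _JVMState()


class DeferredTypeMeta(type):
    """Hypothetical metaclass: isinstance() is only answered once the JVM runs."""

    def __instancecheck__(cls, obj):
        if not jvm_state.started:
            raise RuntimeError(
                "isinstance() against a Java type needs a running JVM")
        # After startup, delegate to whatever the type resolved to.
        return isinstance(obj, cls._stand_in)


class String(metaclass=DeferredTypeMeta):
    # Assumption for the sketch: pretend java.lang.String resolves to str.
    _stand_in = str
```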

What is slightly uncomfortable is the distinction between packages, classes and instances (or anything else you can reference in Java). If a class has a static variable, then it is impossible for us to know that we are accessing an instance from a class vs accessing a class from a package. As a result we need to be able to handle any reference-able thing from Java in our proxy. However there is a clear difference in Python between a type and an instance, and it isn't possible to morph from one to the other.

pelson avatar Feb 09 '21 21:02 pelson

I should point out that the lookups for method resolution are currently in C++ and use much faster methods than you can use when testing from within Python. It really isn't possible to mock up tests of what proxies cost during method resolution, where the question is "is this a Java object or a Java proxy?". We were 4 to 20 times slower when we used pure Python, attribute-based lookups. When I switched to a dedicated slot, it was much faster. Unfortunately, that gain is lost if we have more than one path to look up the type during method resolution. I did a lot of testing at the time and determined that supporting a secondary path was as bad as if all objects were pure Python. The problem is that method resolution (not simply finding the name of the method, but choosing the overload by matching each JPype/Python type to each Java type) often has to fall through unless the matching overload was the first one found. Thus having two paths would hit the full cost of pure Python 90% of the time or more, so I had to reject the old proxy method of matching as a backup path.

Not saying that we can't do a proxy of some kind, but it would have to use the same slot mechanism to have reasonable efficiency. We would simply have to copy the slots over to the proxy so that it can be a direct lookup.

Important speedups that we are running under the hood.

  1. Every JClass, JObject, and JArray has a dedicated slot which is O(1) lookup to get the Java type and jvalue from a Python wrapper. It uses the second trick using the memory allocator slot (which is hard to reset when deriving a class).
  2. Classes which lack java slots encode using special values in certain slots. Thus if I see a slot that points to a JPype method then I can double check using the slow method, but this trick works only if the type cannot be derived as the derived classes replace the slots.
  3. Classes that must be derived use a fast type lookup by assuming the mro order. That is if the mro is stable and controlled we can place a certain type in a specific place (the second and third element to the last in most cases). Again this is O(1) lookup.
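The third trick can be pictured in pure Python, with the caveat that the real implementation lives in C++ and this is only an analogy. The marker class and helper below are hypothetical names, and the sketch assumes a controlled, single-rooted inheritance tree so that the marker always sits just before object in the MRO:

```python
class JObjectMarker:
    """Hypothetical marker base placed at the root of every wrapper class."""


class JavaBase(JObjectMarker):
    pass


class JavaString(JavaBase):
    pass


def is_java_fast(obj):
    """Constant-time check: look at one fixed MRO position, never search."""
    mro = type(obj).__mro__
    # In a controlled tree the marker is always second to last,
    # immediately before ``object``.
    return len(mro) >= 2 and mro[-2] is JObjectMarker
```

A linear isinstance-style search over the bases is avoided entirely, which is the property that matters when the check runs for every argument of every candidate overload.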

I know these are black magic. Python internally does most of this using a one-hot encoded bit field for the types that need acceleration. I also found one case of using the slot comparison trick. The Java slot trick uses the slot trick on the allocator followed by "extra" memory on the end of the object (invisible to Python). It is basically a replication of the dict and weakref system but hidden, as there are no type slots to support it.

I can do these sort of tricks on certain items but it would require the Proxy object to be in C but at that point morphing objects is just as easy. It would take me a week to study the current details (memory footprint) to be sure. I know that I can make JClass and JPackage work, but it would get much harder if I have to deal with enum values or static fields as well. Many objects have different layouts as they have to derive from exceptions, methods, int, long, object, and str. They simply can't morph. They are very polymorphic in memory layout.

The best I can do there would be a C reference that has a Javaslot reserve and then a proxy to real class. When they get referenced they would polymorph by setting the Java slot to the deferred item (which makes Java think they are the proper type) and then proxy their methods to a second copy which is built late (which makes Python think they are the proper type). But there are a large number of edge cases (such as isinstance and issubclass) that have to be considered. Another problem: the memory footprint of the objects would be huge, because if they are used as types they must be type objects. JImplements may be able to handle operating without an actual type object, but once you can extend objects that won't be true. There may be ways to work around this (intercepting the base classes when the JClass meta is building). Though let's not let perfect be the enemy of the good, and I will see what I can cook up first.

Ultimately I will be limited by the Python object model. The more deeply I patch it the more likely it will break in some future version of Python (unless they formalize some of my tricks).

Thrameos avatar Feb 09 '21 23:02 Thrameos

It isn't the time before the JVM is running that I am worried about. It is mostly the behaviors that happen when you mix objects created before and after the JVM is started. For example....

A = jpype.java.lang.String  # gets a JFuture
jpype.startJVM()  # JFuture is resolved
print(type(A) == type(jpype.java.lang.String))
print(A is jpype.java.lang.String)  # false because the proxy was replaced in the type tree.
print(isinstance("a", A))
print(issubclass(A, jpype.java.lang.Object))

Thrameos avatar Feb 09 '21 23:02 Thrameos

Some great detail here, thank you!

When I switched to a dedicated slot, it was much faster.

Indeed, I could have updated the prototype to use __slots__ (and morph the class as we go) - that would definitely be faster at lookup time, but you then end up defining the whole interface upfront (once JVM started). Given it is highly unlikely that somebody wants to use all of the methods on a class, does current JPype have to define everything on a class upfront (i.e. before we know what is used) or is there some just-in-time lookup of methods which haven't been accessed previously?

My guess is that overall, JIT method resolution (slow-ish) + caching would be faster than class slots (super fast C++) for every single method of every single imported class in most cases (unless you are building a stubgenerator or something and need to access all methods that are available).

... I am worried about... Most of the behaviors that happen when you mix objects before or after the JVM is created

Agreed. I think inevitably the conclusion is that the "pre-JVM proxy type" has to be the same thing as the "post-JVM proxy type", i.e. jpype.java.lang.String always gives a JThing subclass (which can still be specialised at JVM run-time to include the correct bases etc.), just with different lookup resolution semantics pre-and-post JVM startup (i.e. we error if you try to access something that doesn't exist once the JVM is running). We have hooks to allow us to override isinstance and issubclass, and these operations can only be done once the JVM is running.

I think this sounds like your "C reference that has a Javaslot reserve and then a proxy to real class" description, but I'm not sure.

it would get much harder if I have to deal with enum values or static fields as well

Yep. Perhaps this should raise when the JVM starts (we can know then if the thing was a class or an instance), and we have some explicit syntax to access a morphable reference (e.g. for the default value of my unit kwarg of the function defined in an earlier post, you'd have to always declare it as a proxy since it can't be morphed automatically. You'd end up typing: my_func(..., unit=jp.JProxied(jvm.java.time.temporal.ChronoUnit.DAYS))). The contract would be that:

  • this has a performance penalty for method resolution etc., so only use it when you can't guarantee that the JVM is running (i.e. globals, argument defaults, class attributes, etc.)
  • if you choose to use it, everything can continue to work and be passed around to both the JVM and the Python interpreter, except that:
  • it is not possible to do object id checking: (jp.JProxied(jvm.java.time.temporal.ChronoUnit.DAYS) is jvm.java.time.temporal.ChronoUnit.DAYS) is False always
  • but you can check types e.g. isinstance(jp.JProxied(jvm.java.time.temporal.ChronoUnit.DAYS), jvm.java.time.temporal.ChronoUnit) is True
  • and check that the thing is proxied: isinstance(thing, jp.JProxied) is True
  • and resolve the thing to be given the actual instance: (jp.JProxied.resolve(thing) is jvm.java.time.temporal.ChronoUnit.DAYS) is True
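The contract in the list above can be approximated in pure Python. This is a sketch under stated assumptions, not how JPype does or would do it: JProxied here takes a zero-argument resolver callable, and the Unit class with its DAYS instance is a stand-in for ChronoUnit.DAYS. The type check works because isinstance falls back to the object's __class__ attribute when the real type does not match:

```python
class JProxied:
    """Hypothetical late-resolving proxy; not part of JPype."""

    def __init__(self, resolver):
        # ``resolver`` is a zero-argument callable returning the real object.
        self._resolver = resolver

    @staticmethod
    def resolve(thing):
        """Return the actual object behind the proxy."""
        return thing._resolver()

    # Report the target's class so isinstance(proxy, TargetType) is True.
    __class__ = property(lambda self: type(JProxied.resolve(self)))

    def __getattr__(self, name):
        # Only reached for names not found on the proxy itself.
        return getattr(JProxied.resolve(self), name)


class Unit:
    """Stand-in for a Java enum type such as ChronoUnit."""

    def __init__(self, label):
        self.label = label


DAYS = Unit("Days")
proxied_days = JProxied(lambda: DAYS)
```

Every attribute access goes through the resolver, which is the performance penalty the first bullet warns about.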

What we are talking about here is quite a bit of effort. I suspect a few things I've said aren't new, and indeed I have the feeling that some of the things already exist in JPype (apologies for my re-invention of the wheel in those cases).

I think we should decide whether this is something you'd like to explore further. Are there any red lines that are being crossed for you (aside from the jpype.jvm name, and the fact that we have to maintain backwards compatibility)?

If you wanted to proceed, then I suggest we could thrash out a pure-Python prototype in fairly short order to fully understand the implications of the decisions, and then we can look at optimising the hell out of it later on. I don't mind leading on the prototype if that is helpful, but truth is that you'd probably have to lead on the "make this thing go like the clappers" given your JNI/JPype experience.

pelson avatar Feb 10 '21 09:02 pelson

I worked a bit on the prototype last night to see if it is workable. I am going with JForward as proxy is a very different thing with a specific meaning in Java.

I still think that you are a bit unclear on the details of the method resolution process. (Or maybe I misunderstand) It seems like you are viewing it as

A = JForward("java.lang.String")
jpype.startJVM()
A.__resolve__()
a = A()
a.substring(5)  # <== finding substring in A

Only here a is a real object so nothing has changed. No speed penalty. Unless we try to call for static fields or methods we don't even see the forward.

The issue is actually somewhere unexpected.

C = JForward("java.time.temporal.ChronoUnit.DAYS")
jpype.startJVM()
C.__resolve__()
A = jpype.java.lang.StringBuilder()
A.append(1)

Seems innocent... the call to append does not involve the JForward, and we didn't even pass a Java type. Except append is highly overloaded. Thus each and every type match for each potential overload must check whether the argument is of type JForward. So what is the deal here? Well, isinstance for all but the bit-vector mapped types means checking whether there is a metaclass, looking for __instancecheck__ and calling it, or else accessing the bases list and performing a linear search of it for the base. Thus if we have overloads where we need to add an else path, and that else path has a high cost, then the cost is paid everywhere, even if no one uses a single forward declaration. As method resolution is the largest bottleneck, any feature that touches it needs to be carefully weighed in a cost/value analysis.

This is the exact same issue that the Python folks would have when implementing a general concept of forward in Python. It requires creating a proxy type which automatically resolves when accessed. It would simply check the forward bit whenever the bytecode looks something up in the global/local/builtin dicts (and in the second path for locals). If the bit is set and the object is now set, then replace the dict entry with the real value. Once all dict copies go away, as it is accessed from any path, it is resolved everywhere. (Even the "is" contract is satisfied.) The issue is that we add a burden to every call even if not a single forward is ever used. Of course if we only have to add it at the bytecode interpreter then it may be fine. Except the C API may hold a copy. So that means we need to check every C entry point. And unlike the dict copy we can't replace the original reference. The end result is a burden everywhere. And modules that are not aware of the check may not work. Not impossible, but a huge lift to convince developers that it is a feature worth the cost.

I can likely solve this. I can't make it a bit field type but I have other tricks to make type checking fast. And rather than placing the check on each type check, if the check is instead at a common point, such as when the arguments get unpacked, that reduces the burden. Now its cost does not scale with the overload count. But that leaves about 20 edge cases, like when the forward is buried in a list, and the short-cutting checks for some basic container types. So still doable, but it clearly would be a major feature requiring a large test suite. And as the JVM is a start-once thing, this requires a big subprocess-type test bench. Perhaps there is another primary point where I can catch all points once without paying the cost on every use.
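A toy model of the "check once at argument unpacking" idea, written in pure Python purely for illustration (JPype's actual overload resolution is in C++ and far more involved). The Forward class, the unwrap helper and the OVERLOADS table are all invented for this sketch:

```python
class Forward:
    """Hypothetical resolved forward that still wraps its real value."""

    def __init__(self, target):
        self.target = target


def unwrap(arg):
    # The only place that knows about forwards.
    return arg.target if isinstance(arg, Forward) else arg


# Stand-ins for the parameter lists of a heavily overloaded method.
OVERLOADS = [(int,), (float,), (str,), (list,)]


def resolve_overload(args):
    # One forward check per argument, regardless of how many overloads exist.
    args = tuple(unwrap(a) for a in args)
    for signature in OVERLOADS:
        if len(signature) == len(args) and all(
            isinstance(a, t) for a, t in zip(args, signature)
        ):
            return signature
    raise TypeError("no matching overload")
```

The matching loop itself never mentions Forward, so its per-overload cost is unchanged. As noted above, a forward buried inside a list would slip past this single check.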

I think the requirements you listed are about right. And it doesn't look like morphing is going to work, so it is likely going to have to be a proxy object instance, as we don't want to pay the cost of a heap type. That means we have to proxy every potential slot used in every type in JPype. There is a central place for that, the list of slots in the pyjp_class constructor switch table.

Thrameos avatar Feb 10 '21 15:02 Thrameos

Example code:

import jpype

# This will be used for morphed types, everything must have a base tree
#  We need to make sure these are fast lookup so we don't change method resolution cost
class _JForwardProto(object):
    pass


class _JResolved(_JForwardProto):
    def __getattribute__(self, name):
        instance = object.__getattribute__(self, '_instance')
        return type(instance).__getattribute__(instance, name)

    def __setattr__(self, name, value):
        instance = object.__getattribute__(self, '_instance')
        return type(instance).__setattr__(instance, name, value)

    def __str__(self):
        instance = object.__getattribute__(self, '_instance')
        return str(instance)

    def __repr__(self):
        instance = object.__getattribute__(self, '_instance')
        return repr(instance)

    def __eq__(self, v):
        instance = object.__getattribute__(self, '_instance')
        return instance == v

    def __ne__(self, v):
        instance = object.__getattribute__(self, '_instance')
        return instance != v

# We need to specialize the forward based on the resolved object
class _JResolvedType(_JResolved):
    def __call__(self, *args):
        instance = object.__getattribute__(self, '_instance')
        return type(instance).__call__(instance, *args)

    def __matmul__(self, value):
        instance = object.__getattribute__(self, '_instance')
        return instance @ value

    def __getitem__(self, value):
        instance = object.__getattribute__(self, '_instance')
        return instance[value]

    def __instancecheck__(self, v):
        instance = object.__getattribute__(self, '_instance')
        return isinstance(v, instance)

    def __subclasscheck__(self, v):
        instance = object.__getattribute__(self, '_instance')
        return issubclass(v, instance)


class _JResolvedObject(_JResolved):
    pass

def _jvmrequired(*args):
    raise RuntimeError("JVM must be started")

#This is an unresolved symbol
#  Must have every possible slot type.
class JForward(_JForwardProto):
    def __init__(self, symbol):
        object.__setattr__(self, '_symbol', symbol)

    def __getattribute__(self, name):
        forward = object.__getattribute__(self, '_symbol')
        try:
            return object.__getattribute__(self, name)
        except AttributeError as ex:
            if name.startswith("_"):
                raise ex
            if name.startswith('['):
                value = JForward(forward+name)
            else:
                value = JForward(forward+"."+name)
            object.__setattr__(self, name, value)
            return value

    def __getitem__(self, value):
        if isinstance(value, slice):
            return self.__getattribute__("[]")
        if isinstance(value, tuple):
            depth = 0
            for item in value:
                if isinstance(item, slice):
                    depth += 1
                else:
                    raise RuntimeError("Cannot create array without a JVM")
            return self.__getattribute__("[]"*depth)
        raise RuntimeError("Cannot create array without a JVM")

    __setattr__ = _jvmrequired
    __call__ = _jvmrequired
    __matmul__ = _jvmrequired
    __instancecheck__ = _jvmrequired
    __subclasscheck__ = _jvmrequired


def _resolve(forward, value):
    if not isinstance(forward, _JForwardProto):
        raise TypeError("Must be a forward declaration")

    # Find the value using Java reflection
    #   value = _get(object.__getattr__(forward, _symbol))

    # replace the instance
    object.__setattr__(forward, "_instance", value)

    # morph the object so that the slots act more like their correct behavior
    #   types we have here
    #     - java classes
    #     - java array classes
    #     - java arrays
    #     - java packages
    #     - java static fields
    #     - java throwables
    #     - java primitives
    #     - other?
    if isinstance(value, type):
        object.__setattr__(forward, '__class__', _JResolvedType)
    elif isinstance(value, object):
        object.__setattr__(forward, '__class__', _JResolvedObject)
    else:
        raise TypeError("Forward type not supported")

    # Depending on how the base class is implemented and the amount of reserve space
    # we give it we may be able to mutate it into the actual object in C, but
    # only if we can dispose of its accessing resources (dict) and copy
    # over the new resources without overrunning the space requirements.


java = JForward("java")
Object = java.lang.Object
String = java.lang.String
Array = java.lang.String[:]
Array2 = java.lang.String[:,:]

import jpype
jpype.startJVM()

# This should happen automatically
_resolve(java, jpype.java)
_resolve(Object, jpype.java.lang.Object)
_resolve(String, jpype.java.lang.String)
_resolve(Array, jpype.java.lang.String[:])
_resolve(Array2, jpype.java.lang.String[:,:])

# Create instance
print(type(String("foo")) == type(jpype.java.lang.String("foo")))
print(String == jpype.java.lang.String) # works
print(not (String != jpype.java.lang.String)) # works
print(isinstance(String("foo"), String))  # works
print(not isinstance(1, String))  # works
print(issubclass(String, Object)) # fail

# Late use of static fields
print(String.CASE_INSENSITIVE_ORDER == jpype.java.lang.String.CASE_INSENSITIVE_ORDER)

Thrameos avatar Feb 10 '21 19:02 Thrameos

Seems like a total of 8 pointers should cover the needs for most morphs. 2 for base object, 2 for weak and dict, 2 for payload, and 2 hidden for java slot. This just leaves class, jchar, and throwable as resolved proxies.

Thrameos avatar Feb 11 '21 11:02 Thrameos

See https://github.com/Thrameos/jpype/blob/forward/forward.py for the current state.

Thrameos avatar Feb 11 '21 19:02 Thrameos

I posted a tweak to the branch in https://github.com/Thrameos/jpype/pull/55.

In there @Thrameos said:

So assuming we can port this in a C implementation with morphs for most field types (object, array, primitive) and leave class and exception as resolved would this satisfy what you would like to achieve? Does this prototype look reasonable?

Don't worry about the speed of instance and symbol as those will be dedicated C slots. The same with the getattr as in C we will have direct access.

The size of the forward object has to be two pointers anyway for the slots so the only types that don't fit are class and exceptions. Exceptions as static fields are very rare. Classes are common but they aren't customizable like the others so less likely to have issues. Arrays will take some work to morph as we will have to use copy constructors on the internal bits. We will also need to properly dispose of the old guts so we don't leak the slot values on the morph.

In terms of "does this look reasonable", the answer is a resounding yes! The devil is in the detail, but I've just added a tweak that hooks us into the import machinery. With this, it is entirely possible to write a module (library) which has Java elements (even type annotations) which don't require the JVM to be running at import time. For example, the following works as expected without needing the JVM to be running until we actually want to execute the function:

from jpype.jroot.java.lang import String


def concat_str(v: String, prefix: String = "pre-") -> String:
    prefix = String(prefix)
    return prefix.concat(v)

In the prototype currently we can't do things like implementing interfaces:

from jpype.jroot import java
import jpype as jp


@jp.JImplements(java.lang.Runnable)
class Runner1:
    # Not a valid interface, should raise only when the JVM has started.
    pass

But I think this one is perfectly resolvable given we know how to do deferred interface validation.

Furthermore, the prototype doesn't yet:

  • report tracebacks for invalid accesses as well as the prototype I presented earlier. e.g. when accessing java.doesnt_exist.
  • hold only weak references to JForwards in the cache. It currently holds strong ones, meaning that if somebody accidentally accesses something invalid, but then deletes it again, the prototype will still try to resolve it and cause an exception.
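For the second bullet, one possible direction is to track pending forwards in a weakref.WeakSet, so that a forward which has been deleted is never resolved. This is only a sketch with a simplified, hypothetical Forward class; in the real prototype the parent forward also caches its children as attributes, which would keep them alive and needs separate handling:

```python
import weakref


class Forward:
    """Simplified stand-in for JForward that registers itself weakly."""

    # Weak registry: entries vanish once the last strong reference is gone.
    _instances = weakref.WeakSet()

    def __init__(self, symbol):
        self.symbol = symbol
        Forward._instances.add(self)


def pending_symbols():
    """Symbols that would still need resolving when the JVM starts."""
    return sorted(f.symbol for f in Forward._instances)
```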

From your perspective how is it feeling? Do you think it is workable in a sustainable fashion? Do you see this as something that could become a primary path for JPype usage?

pelson avatar Feb 17 '21 21:02 pelson

It may be possible to just have the JImplements check if it is currently a forward and set the deferred flag accordingly.
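Roughly, that check could look like the following pure-Python stand-in. The implements decorator, the Forward class and the pending_validation list are hypothetical names for illustration and do not reflect how JImplements is actually written:

```python
class Forward:
    """Simplified stand-in for an unresolved JForward."""

    def __init__(self, symbol):
        self.symbol = symbol


# Classes whose interface validation must wait for the JVM.
pending_validation = []


def implements(interface):
    """Hypothetical decorator: defer validation when given a forward."""
    deferred = isinstance(interface, Forward)

    def decorator(cls):
        cls._interface = interface
        cls._deferred = deferred
        if deferred:
            # Validate later, once the forward has been resolved.
            pending_validation.append(cls)
        return cls

    return decorator
```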

Thrameos avatar Feb 17 '21 22:02 Thrameos

If it appears reasonable, then I can start cutting code on the morphing, which will change it from pure Python to actually mucking with the CPython internals. That will cut the speed penalty for all but class and exception to negligible. I have done reinterpret_cast type stuff in the past; it is just a matter of getting all the existing resources (dict, weakref list), rewriting the object, and then moving them to the correct locations. It is a bit of a pain if I mess up the reference counting and cause a leak.

@marscher with regard to the root directory for Java packages, we already have jpype.java and jpype.javax. I can either add all TLDs and jroot for non-TLD packages, or we can just add jroot for all and repoint the existing java and javax. Python recommends that we have only one obvious way, but the old system leaves the difficulty that notld packages have no meaningful home. Or perhaps I can deprecate jpype.java. Do you have a preference on the name of the root package?

Thrameos avatar Feb 17 '21 22:02 Thrameos

If it appears reasonable, then I can start cutting code on the morphing, which will change it from pure Python to actually mucking with the CPython internals. That will cut the speed penalty for all but class and exception to negligible.

Is the C implementation purely about performance, or are there other advantages? It seems to me that it will get a whole lot more complex (and therefore harder to maintain & less portable), so we should be sure that we can't hit the desired performance on the pure Python side.

pelson avatar Feb 18 '21 04:02 pelson

There is a critical difference between the Python and C versions.

In Python, the best we can achieve is to make an object that appears to proxy to the real thing. Depending on how many slots we implement and how those slots are accessed by the internals of Python it may or may not appear to be the same. To verify this we would have to repeat the entire test bench to verify every behavior on the resolved object.
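To make the proxy limitation concrete, here is a minimal sketch, assuming a hypothetical `Forward` class (not JPype's implementation) that forwards attribute access to a target. Ordinary attribute access looks the same, but identity, `type`, and any special method the proxy did not implement as a slot give it away:

```python
class Forward:
    """Hypothetical pure-Python proxy that forwards attribute access."""

    def __init__(self, target):
        object.__setattr__(self, "_target", target)

    def __getattr__(self, name):
        # Only called when normal lookup on the proxy itself fails.
        return getattr(object.__getattribute__(self, "_target"), name)


real = [1, 2, 3]
proxy = Forward(real)

# Regular attribute access is forwarded to the real object.
proxy.append(4)
forwarded_ok = real == [1, 2, 3, 4]

# Identity and type are those of the proxy, not of the real object.
same_object = proxy is real
same_type = type(proxy) is list

# Special methods are looked up on the type, bypassing __getattr__,
# so every slot would have to be implemented explicitly on the proxy.
try:
    len(proxy)
except TypeError:
    len_works = False
else:
    len_works = True
```

Each unimplemented slot is a behavior that would need its own test against the resolved object.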

In C, we get a very different result. For every object which has the same or a smaller memory footprint, we will be rewriting the memory of the object to be the real deal. Every behavior we have will be the same because the object will be the same object that exists after the JVM is started. There is little need for additional testing except for the few object types that we can't actually morph, but those are very limited. If there is a change in how Python visits the slots or some new behavior that we add to an existing object, then we get exactly that behavior with the resolved object. We would even satisfy contracts like type or is, which are impossible to satisfy in Python.

The reason this is possible has to do with how Python objects exist in memory. They are an arbitrary blob of memory which comes in two flavors (gc or no gc) in which the first pointer is to their identity (the class). The actual interpretation of that data is inherited from the pointer in the first position. Unlike C++, there is no vtable of private pieces that are internal and immutable. All of our types are gc so we don't need to worry about changing the collection policy. So if you change the class pointer you change its identity entirely. The Python version does something similar by replacing the __class__, but that has limits based on the inheritance tree (though you can break it by creating a tree, adding slots, and then changing the class pointer, which will cause the memory footprint to become misaligned). After all, Python doesn't keep track of the organization of the memory of objects at all. It is just the functions that use that data that give that memory meaning.
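The pure-Python form of this can be sketched with plain `__class__` assignment. The class names below are hypothetical and only illustrate the mechanism: the object keeps its identity and instance dict while its type changes, and CPython refuses the assignment when the two classes have incompatible layouts:

```python
class Unresolved:
    def describe(self):
        return "forward"


class Resolved:
    def describe(self):
        return "resolved"


obj = Unresolved()
alias = obj          # a second reference held somewhere else
obj.payload = 42     # state stored in the instance dict

# Morph in place: same object, new identity.
obj.__class__ = Resolved

morphed_type = type(alias) is Resolved
still_same_object = alias is obj
behaviour = alias.describe()
payload = alias.payload


class Slotted:
    __slots__ = ("x",)  # no instance dict, different memory layout


# CPython checks layout compatibility and rejects the change.
try:
    obj.__class__ = Slotted
except TypeError:
    layout_rejected = True
else:
    layout_rejected = False
```

The C-level morph discussed here would go further than this, since it could relocate or free the dict and weakref list itself instead of being limited to classes with matching layouts.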

They can of course add checks for this, which makes morphing of objects more difficult in the future within Python. In the C version, there are no limitations. We just have to make sure the memory footprint of the new object matches. If the dict is the 4th pointer in the struct and we change it to an object in which it is the third pointer, then we just have to find the old and new location and relocate it. If the new object has no dict then we simply have to free it. So ultimately, the number of portability problems and edge cases is actually a whole lot lower with a C version than with a Python version.

The only difficulty is coding it the first time. When we implement it in C we have to manually code each slot and its behavior. And C is much more verbose (and laborious) than Python. So we are trading a bunch of two-line Python implementations for 20-line C implementations. But then we only need two resolved classes and those are both pretty small.

Performance is not really the main driver. Unless the user made a whole lot of forwards or used those forwards in critical sections of code, it would be difficult to get a significant performance hit.

Does that clear it up?

Thrameos avatar Feb 18 '21 07:02 Thrameos

Does that clear it up?

Definitely, thank you! Though I have one follow-up (sorry!), since this isn't about performance and is more about fiddling with the underlying object pointers: how plausible is it to expose the morphing part (written in C) as a function in Python, thereby giving us the ability to continue to write the class-based logic in Python, but still benefit from the low-level morphing? I'm essentially saying: can we get away with writing those 2-line Python implementations and still benefit from the C-level pointer morphing, or is it not possible to have our cake and eat it? :cake:

pelson avatar Feb 18 '21 10:02 pelson

It is possible. We can define only the base class in C and then derive from it in Python to define most of its behavior in the derived class. That is usually the starting point for any implementation. However, this usually involves adding hooks for the C version to call, such as those that you find for implementing docstrings. That is, if something in C needs to access a portion of the derived behavior then there must be a defined path to do so. Depending on how many hooks are required, sometimes it is just easier to push it all into C rather than leave it halfway.

But this is really just implementation details. Most of the C slots are just like those in Python. There is some boilerplate code, then a call to the real version. So if I need to convert just one or two slots to avoid a hook, then cutting and pasting many slots and just changing the real call is likely just as easy. I won't really know until I finish whether I can get away with most in Python or most in C.

Thrameos avatar Feb 18 '21 14:02 Thrameos