pljava icon indicating copy to clipboard operation
pljava copied to clipboard

Work-in-Progress on PL/Java refactoring, API modernization

Open jcflack opened this issue 3 years ago • 6 comments

As a work-in-progress pull request, this is not expected to be imminently merged, but is here to document the objectives and progress of the ongoing work.

Why needed

A great advantage promised by a PL based on the JVM is the large ecosystem of languages other than Java that can be supported on the same infrastructure, whether through the Java Scripting (JSR 223) API, or through the polyglot facilities of GraalVM, or simply via separate compilation to the class file format and loading as jars.

However, PL/Java, with its origins in 2004 predating most of those developments, has architectural limitations that stand in the way.

JDBC

One of the limitations is the centrality of the JDBC API. To be sure, it is a standard in the Java world for access to a database, and for PL/Java to conform to ISO SQL/JRT, the JDBC API must be available. But it is not necessarily a preferred or natural database API for other JVM or GraalVM languages, and its design goal is to abstract away from the specifics of an underlying database, which ends up complicating or even preventing access to advanced PostgreSQL capabilities that could be prime drivers for running server-side code in the first place.

The problem is not that JDBC is an available API in PL/Java, but that it is the fundamental API in PL/Java, with its tentacles reaching right into the native C language portion of PL/Java's implementation. That has made alternative interface options impractical, and multiplied the maintenance burden of even simple tasks like adding support for new datatype mappings or fixing simple bugs. There are significant portions of JDBC 4 that remain unimplemented in PL/Java.

Experience building an implementation of ISO SQL/XML XMLQUERY showed that certain requirements of the spec were simply unsatisfiable atop JDBC, either because of inherent JDBC limitations or limits in PL/Java's implementation of it. An example of each kind:

  • The INTERVAL data type cannot be mapped as SQL/XML requires, because the only ResultSetMetadata methods JDBC defines for access to a type modifier are precision and scale, which apply to numeric values; the API defines no standard way to learn what the modifier of an INTERVAL says about whether months or days are present.
  • The DECIMAL type cannot be mapped as SQL/XML requires; for that case, the fault is not with JDBC (which defines the precision and scale methods), but with their incomplete implementation in PL/Java.

Those cases also illustrate that mapping some PostgreSQL data types to those of another language can be complex. An arbitrary PostgreSQL INTERVAL is representable as neither a java.time.Period nor a java.time.Duration alone (though a pair of the two can be used, a type that PGJDBC-NG offers). One or the other can suffice if the type modifier is known and limits the fields present. A PostgreSQL NUMERIC value has not-a-number and signed infinity values that some candidate language-library type might not, and an internal precision that its text representation does not reveal, which might need to be preserved for a mathematically demanding task. The details of converting it to another language's similar type need to be knowable or controllable by an application.

It is a goal of this work to give PL/Java an API that does not obscure or abstract from PostgreSQL details, but makes them accessible in a natural Java idiom, and that such a "natural PostgreSQL" API should be adequate to allow building a JDBC layer in pure Java above it. (The work of building such a JDBC layer is not in the scope of this pull request.)

Parameter and return-value mapping

PL/Java uses a simple, Java-centric approach where a Java method is declared naturally, giving ordinary Java types for its parameters and return, and the mappings from these to the PostgreSQL parameter and return types are chosen by PL/Java and applied transparently (and much of that happens deep in PL/Java's C code).

While convenient, that approach isn't easily adapted to other JVM languages that may offer other selections of types. Even for Java, it stands in the way of doing certain things possible in PostgreSQL, like declaring VARIADIC "any" functions.

In a modernized API, it needs to be possible to declare a function whose parameter represents the PostgreSQL FunctionCallInfo, so that the parameters and their types can be examined and converted in Java. That will make it possible to write language handlers in Java, whether for other JVM languages or for the existing PL/Java calling conventions that at present are tangled in C.

Elements of new API

Identification of data types

A PostgreSQL-specific API must be able to refer unambiguously to any type known to the database, so it cannot rely on any fixed set of generic types such as JDBCType. To interoperate with a JDBC layer, though, the identifier for types should implement JDBC's SQLType interface.

The API should support retrieving enough metadata about the type for a JDBC layer implemented above it to be able to report complete ResultSetMetaData information.

The new class serving this purpose is RegType.

As RegType implements the java.sql.SQLType interface, an aliasing issue arises for a JDBC layer. Such a layer should accept JDBCType.VARCHAR as an alias for RegType.VARCHAR, for example. JDBC itself has no methods that return an SQLType instance, so the question of whether it should return the generic JDBC type or the true RegType does not arise. A PL/Java-specific API is needed for retrieving the type identifier in any case.

The details of which JDBC types are considered aliases of which RegTypes will naturally belong in a JDBC API layer. At the level of this underlying API, a RegType is what identifies a PostgreSQL type.

While RegType includes convenience final fields for a number of common types, those by no means limit the RegTypes available. There is a RegType that can be obtained for every type known to the database, whether built in, extension-supplied, or user-defined.

Other PostgreSQL catalog objects and key abstractions

RegType is one among the types of PostgreSQL catalog objects modeled in the org.postgresql.pljava.model package.

Along with a number of catalog object types, the package also contains:

  • TupleDescriptor and TupleTableSlot, the key abstractions for fetching and storing database values. TupleTableSlot in PostgreSQL is already a useful abstraction over a few different representations; in PL/Java it is further abstracted, and can present with the same API other collections of typed, possibly named, items, such as arrays, the arguments in a function call, etc.
  • MemoryContext and ResourceOwner, both subtypes of Lifespan, usable to guard Java objects that have native state whose validity is bounded in time
  • CharsetEncoding

Mapping PostgreSQL data types to what a PL supports

The Adapter class

A mapping between a PostgreSQL data type and a suitable PL data type is an instance of the Adapter class, and more specifically of the reference-returning Adapter.As<T,U> or one of the primitive-returning Adapter.AsInt<U>, Adapter.AsFloat<U>, and so on (one for each Java primitive type). The Java type produced is T for the As case, and implicit in the class name for the AsFoo cases.

The basic method for fetching a value from a TupleTableSlot is get(Attribute att, Adapter adp), and naturally is overloaded and generic so that get with an As<T,?> adapter returns a T, get with an AsInt<?> adapter returns an int, and so on. (But see this later comment below for a better API than this item-at-a-time stuff.) (The U type parameter of an adapter plays a role when adapters are combined by composition, as discussed below, and is otherwise usually uninteresting to client code, which may wildcard it, as seen above.)

A manager class for adapters

Natural use of this idiom presumes there will be some adapter-manager API that allows client code to request an adapter for some PostgreSQL type by specifying a Java witness class Class<T> or some form of super type token, and returns the adapter with the expected compile-time parameterized type.

That manager hasn't been built yet, but the requirements are straightforward and no thorny bits are foreseen. (Within the org.postgresql.pljava.internal module itself, things are simpler; no manager is needed, and code refers directly to static final INSTANCE fields of existing adapters.)

Extensibility

PL/Java has historically supported user-defined types implemented in Java, a special class of data types whose Java representations must implement a certain JDBC interface and import and export values through a matching JDBC API. In contrast, PL/Java's first-class PostgreSQL data type support—the mappings it supplies between PostgreSQL and ordinary Java types that don't involve the specialized JDBC user-defined type APIs—has been hardcoded in C using Java Native Interface (JNI) calls, and not straightforward to extend. That's a pain point for several situations:

  • A mapping for another PostgreSQL data type (either a type newly added to PostgreSQL, or simply one that PL/Java does not yet have a mapping for) is not easily added for an application that needs it, but generally must be added in PL/Java's C/JNI internals and made available in a new PL/Java build.
  • A mapping of an existing PostgreSQL data type to a new or different Java type—same story. When Java 8 introduced the java.time package, developers wishing to have PL/Java map PostgreSQL's date and time types to the improved Java types instead of the older java.sql ones had to open issues requesting that ability and wait for a PL/Java release to include it.
  • Not every PostgreSQL data type has a single best PL type to be mapped to. One application using the geometric types might want them mapped to the Java types in the PGJDBC library, while another might prefer the 2D classes supplied by some Java geometry library. One application might want a PostgreSQL array mapped to a flat Java List, another to a multi-dimensioned Java array, another to a matrix class from a scientific computation library. The choices multiply when considering the data types not only of Java but of other JVM languages. C coding and rebuilding of PL/Java should not be needed to tailor these mappings.

Adapters implementable in pure Java

With this PR, code external to PL/Java's implementation can supply adapters, built against the service-provider API exposed in org.postgresql.pljava.adt.spi.

Leaf adapters

A "leaf" adapter is one that directly knows the PostgreSQL datum format of its data type, and maps that to a suitable PL type. Only a leaf adapter gets access to PostgreSQL datums, which it should not leak to other code. Code that defines leaf adapters must be granted a permission in pljava.policy.

Composing adapters

A composing, or non-leaf, adapter is one meant to be composed over another adapter. An example would be an adapter that composes over an adapter returning type T (possibly null) to form an adapter returning Optional<T>. With a selection of common composing adapters (there aren't any in this pull request, yet), it isn't necessary to provide leaf adapters covering all the ways application code might want data to be presented. No special permission is needed to create a composing adapter.

Java's generic types are erased to raw types for runtime, but the Java compiler saves the parameter information for runtime access through Java reflection. As adapters are composed, the Adapter class tracks the type relationships so that, for example, an Adapter<Optional<T>,T> composed over an Adapter<String,Void> is known to produce Optional<String>.

It is that information that will allow an adapter manager to satisfy a request to map a given PostgreSQL type to some PL type, by finding and composing available adapters.

Contract-based adapters

For a PostgreSQL data type that doesn't have one obvious best mapping to a PL type (perhaps because there are multiple choices with different advantages, or because there is no suitable type in the PL's base library, and any application will want the type mapped to something in a chosen third-party library), a contract-based adapter may be best. An Adapter.Contract is a functional interface with parameters that define the semantically-important components of the PostgreSQL type, and a generic return type, so an implementation can return any desired representation for the type.

A contract-based adapter is a leaf adapter class with a constructor that accepts a Contract, producing an adapter between the PostgreSQL type and whatever PL type the contract maps it to. The adapter encapsulates the internal details of how a PostgreSQL datum encodes the value, and the contract exposes the semantic details needed to faithfully map the type. Contracts for many existing PostgreSQL types are provided in the org.postgresql.pljava.adt package.

ArrayAdapter

The one supplied ArrayAdapter is contract-based. While a Contract.Array has a single abstract method, and therefore could serve as a functional interface, in practice it is not directly implementable by a lambda; there must be a subclass or subinterface (possibly anonymous) whose type parameterization the Java compiler can record. (A lambda may then be used to instantiate that.) An instance of ArrayAdapter is constructed by supplying an adapter for the array's element type along with an array contract targeting some kind of collection of the mapped type. As with a composing adapter, the Adapter class substitutes the element adapter's target Java type through the type parameters of the array contract, to arrive at the actual parameterized type of the resulting array or collection.

PostgreSQL arrays can be multidimensional, and are regular (not "jagged"; all sub-arrays at a given dimension match in size). They can have null elements, which are tracked in a bitmap, offering a simple way to save some space for arrays that are sparse; there are no other, more specialized sparse-array provisions.

Array indices need not be 0- or 1-based; the base index as well as the index range can be given independently for each dimension. PostgreSQL creates 1-based arrays by default. This information is stored with the array value, not with the array type, so a column declared with an array type could conceivably have values of different cardinalities or even dimensionalities.

The adapter is contract-based because there are many ways application code could want a PostgreSQL array to be presented: as a List or single Java array (flattening multiple dimensions, if present, to one, and disregarding the base index), as a Java array-of-arrays, as a JDBC Array object (which does not officially contemplate more than one array dimension, but PostgreSQL's JDBC drivers have used it to represent multidimensioned arrays), as the matrix type offered by some scientific computation library, and so on.

For now, one predefined contract is supplied, AsFlatList, and a static method, nullsIncludedCopy, that can be used (via method reference) as one implementation of that contract.

Java array-of-arrays

While perhaps not an extremely efficient way to represent multidimensional arrays, the Java array-of-arrays approach is familiar, and benefits from a bit of dedicated support for it in Adapter. Therefore, if you have an Adapter a that renders a PostgreSQL type Foo as Java type Bar, you can use, for example, a.a2().build() to obtain an Adapter from the PostgreSQL array type Foo[] to the Java type Bar[][], requiring the PostgreSQL array to have two dimensions, allowing each value to have different sizes along those dimensions, but disregarding the PostgreSQL array's start indices (all Java arrays start at 0).

Because PostgreSQL stores the dimension information with each value and does not enforce it for a column as a whole, it could be possible for a column of array values to include values with other numbers of dimensions, which an adapter constructed this way will reject. On the other hand, the sizes along each dimension are also allowed by PostgreSQL to vary from one value to the next, and this adapter accommodates that, as long as the number of dimensions doesn't change.

The existing contract-based ArrayAdapter is used behind the scenes, but build() takes care of generating the contract. Examples are provided.

Adapter maintainability

Providing pure-Java adapters that know the internal layouts of PostgreSQL data types, without relying on JNI calls and the PostgreSQL native support routines, entails a parallel-implementation maintenance responsibility roughly comparable to that of PostgreSQL client drivers that support binary send and receive. (The risk is slightly higher because the backend internal layouts are less committed than the send/receive representations. Because they are used for data on disk, though, historically they have not changed often or capriciously.)

The engineering judgment is that the resulting burden will be manageable, and the benefits in clarity and maintainability of the pure-Java implementations, compared to the brittle legacy Java+C+JNI approach, will predominate. The process of developing clear contracts for PostgreSQL types already has led to discovery of one bug (#390) that could be fixed in the legacy conversions.

For the adapters supplied in the org.postgresql.pljava.internal module, it is possible to use ModelConstants.java/ModelConstants.c to ensure that key constants (offsets, flags, etc.) stay synchronized with their counterparts in the PostgreSQL C code.

Adapter is a class in the API module, with the express intent that other adapters can be developed, and found by the adapter manager through a ServiceLoader API, without being internal to PL/Java. Those might not have the same opportunity for build-time checking against PostgreSQL header files, and will have to rely more heavily on regression tests for key data values, much as binary-supporting client drivers must. The same can be true even for PL/Java internal adapters for a few PostgreSQL data types whose C implementations are so strongly encapsulated (numeric comes to mind) that necessary layouts and constants do not appear in .h files.

Known open items

In no well-defined order ....

  • [ ] The to-PostgreSQL direction for Adapter, TupleTableSlot, and Datum.Accessor. These all have API and implementation for getting PostgreSQL values and presenting them in Java. Now the other direction is needed.
  • [ ] Provide API and implementation for a unified list-of-slots representation for a variety of list-of-tuple representations used in PostgreSQL, by:
    • [x] factoring out the list-of-TupleTableSlot classes currently found as preliminary scaffolding in TupleTableSlot.java
    • [x] providing such a representation for SPITupleTable ...
    • [ ] CatCList ...
    • [ ] Tuplestore? ...
    • [ ] ...?
  • [ ] Implement some form of offset memoization so fetching attributes from a heap TupleTableSlot stays subquadratic
  • [ ] Finish the unimplemented grants methods of RegRole and the unmplemented unary one of CatalogObject.AccessControlled. (Needs the CatCList support, for pg_auth_members searches.)
  • [x] A NullableDatum flavor of TupleTableSlot. One of the last prerequisites to enable pure-Java language-handler implementations, to which the function arguments will appear as a TupleTableSlot.
  • [ ] Complete the implementation of isSubtype with the rules from Java Language Specification 4.10. (At present it is a stub that only checks erased subtyping, enough to get things initially going.)
  • [ ] The adapter manager described above. (Requires isSubtype.)
  • [ ] Adapters for PostgreSQL types that don't have them yet (starting, perhaps, with the ones that already have contracts defined in org.postgresql.pljava.adt).
  • [ ] TextAdapter does not yet support the type modifiers for CHAR and VARCHAR. It needs a contract-based flavor that does.
  • [ ] ArrayAdapter (or Contract.Array) should supply at least one convenience method, taking a dimsAndBounds array parameter and generating an indexing function (a MethodHandle?) that has nDims integer parameters and returns an integer flat index. Other related operations? An index enumerator, etc.?
  • [ ] A useful initial set of composing adapters, such as:
    • [ ] one of the form As<Optional<T>,T>
      • [x] implement in an example class
      • [ ] integrate into PL/Java proper
    • [ ] one extending As<T,T> that returns null for null and values unchanged
      • why? because with adapter autoboxing, it can be composed over any primitive-returning adapter to enable it to handle null, by returning its boxed form
      • [x] implement in an example class
      • [ ] integrate into PL/Java proper
    • [ ] a set composing over primitive adapters to use a specified value in the primitive's value space to represent null.
      • [x] implement in an example class
      • [ ] complete the set and integrate into PL/Java proper
  • [ ] More work on CatalogObject invalidation. RegClass and RegType are already invalidated selectively; probably RegProcedure should be also. PostgreSQL has a limited number of callback slots, so it would be antisocial to grab them for all the supported classes: less critical ones just depend on the global switchpoint; come up with a good story for invalidating those. Also for how TupleDescriptor should behave upon invalidation of its RegClass. See commit comments for 5adf2c8.
  • [ ] Better define and implement the DualState behavior of TupleTableSlot.
  • [ ] Reduce the C-centricity of VarlenaWrapper. Goal: DatumUtils.mapVarlena doing more in Java, less in C.
    • [x] more of VarlenaWrapper's functionality moved to DatumImpl
    • [x] client code no longer casting Datum.Input to VarlenaWrapper to use it.
  • [ ] Adapter should have control over the park/fetch/decompress/lifespan decisions for VarlenaWrapper; currently the behavior is hardcoded for top-transaction lifespan, lazy detoasting, appropriate for SQLXML, which was the first VarlenaWrapper client.
  • [ ] Add MBeans with statistics for the new caches

And then

  • [ ] Choose some interesting JVM language foo and implement a simple PL/foo in pure Java, using these facilities.
  • [ ] Reimplement PL/Java's own language handler the same way.

jcflack avatar Jan 24 '22 02:01 jcflack