datanucleus-rdbms
datanucleus-rdbms copied to clipboard
Support bulk-fetch using JOIN
If we have a JDOQL query like
SELECT FROM Person WHERE this.firstName == :value
then this becomes
SELECT P.* FROM PERSON P WHERE P.FIRST_NAME = ?
If a Person has a Set<Address> then if the addresses field is in the fetch plan we already support a bulk-fetch mode "EXISTS" giving SQL of
SELECT A.* FROM ADDRESS A WHERE EXISTS (SELECT P.ID FROM PERSON P WHERE P.FIRST_NAME = ? AND A.PERSON_ID = P.ID)
We could potentially have a bulk-fetch mode "JOIN" as
SELECT A.* FROM PERSON P, ADDRESS A WHERE A.PERSON_ID = P.ID AND P.FIRST_NAME = ?
The reason why this is more complicated than the EXISTS case is that for EXISTS we can make use of the backing store getIteratorStatement for the basic statement, and then put the original query in an EXISTS clause. Here we need to start from the basic query (but clearing the select) and then adding the join to the element, while catering for all different combinations of set/list/collection whether with embedded elements or not, and whether via FK or JoinTable.
any plan to optimize the generated SQL?
No. That is dependent on contributions, this being open source and all. As per the "unresourced" tag on this issue
Another option that is supported on several databases is to use an "IN" clause instead of "EXISTS" which would be implicitly converted to a join without any of the risks associated with inadvertent many to many relationships. That would probably be a lot easier to implement than the join as the SQL is closely related to exists. I normally work on Hadoop Projects but since I'm looking at using some of this I'll start getting familiar with the code base and see if I can help.
Based on some testing against PostgreSQL 10.3, MaraiDB 10.2 and Oracle 12.2 this optimization is already happening with "EXISTS" and "IN". I can post the test scripts for other folks to look at but assuming you can use a relatively modern release of your database software you're already getting the benefit of using a join.
Thx for your input, interesting to hear.
A comparison of the 3 "bulk"/"batch" options (EXISTS, IN, JOIN) for EclipseLink JPA is present on this link https://java-persistence-performance.blogspot.co.uk/2010/08/batch-fetching-optimizing-object-graph.html