datanucleus-rdbms

Support bulk-fetch using JOIN

Open andyjefferson opened this issue 7 years ago • 5 comments

If we have a JDOQL query like SELECT FROM Person WHERE this.firstName == :value

then this becomes SELECT P.* FROM PERSON P WHERE P.FIRST_NAME = ?

If a Person has a Set<Address> and the addresses field is in the fetch plan, we already support a bulk-fetch mode "EXISTS", giving SQL of SELECT A.* FROM ADDRESS A WHERE EXISTS (SELECT P.ID FROM PERSON P WHERE P.FIRST_NAME = ? AND A.PERSON_ID = P.ID)

We could potentially have a bulk-fetch mode "JOIN" as SELECT A.* FROM PERSON P, ADDRESS A WHERE A.PERSON_ID = P.ID AND P.FIRST_NAME = ?

The reason why this is more complicated than the EXISTS case is that for EXISTS we can make use of the backing store getIteratorStatement for the basic statement, and then put the original query in an EXISTS clause. Here we need to start from the basic query (but clearing the select) and then add the join to the element, while catering for all the different combinations of set/list/collection, whether with embedded elements or not, and whether via FK or JoinTable.
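For illustration only, the EXISTS and JOIN statements quoted above can be run side by side to check that they return the same ADDRESS rows in the simple FK case. This is a minimal sketch using Python's built-in sqlite3 module, not DataNucleus code; the table layout, the CITY column, and the sample data are assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema mirroring the PERSON/ADDRESS example (FK from ADDRESS to PERSON).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PERSON (ID INTEGER PRIMARY KEY, FIRST_NAME TEXT);
    CREATE TABLE ADDRESS (ID INTEGER PRIMARY KEY,
                          PERSON_ID INTEGER REFERENCES PERSON(ID),
                          CITY TEXT);
    INSERT INTO PERSON VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Ann');
    INSERT INTO ADDRESS VALUES (10, 1, 'Leeds'), (11, 1, 'York'),
                               (12, 2, 'Bath'), (13, 3, 'Hull');
""")

# Existing bulk-fetch mode: original query wrapped in an EXISTS clause.
EXISTS_SQL = """
    SELECT A.* FROM ADDRESS A
    WHERE EXISTS (SELECT P.ID FROM PERSON P
                  WHERE P.FIRST_NAME = ? AND A.PERSON_ID = P.ID)
    ORDER BY A.ID"""

# Proposed bulk-fetch mode: original query with the element table joined in.
JOIN_SQL = """
    SELECT A.* FROM PERSON P, ADDRESS A
    WHERE A.PERSON_ID = P.ID AND P.FIRST_NAME = ?
    ORDER BY A.ID"""

def fetch_addresses(sql, first_name):
    """Run one of the bulk-fetch statements for the given first name."""
    return conn.execute(sql, (first_name,)).fetchall()

exists_rows = fetch_addresses(EXISTS_SQL, "Ann")
join_rows = fetch_addresses(JOIN_SQL, "Ann")
print(exists_rows == join_rows)
```

The ORDER BY is added only so the two result lists can be compared directly; it is not part of the statements discussed in the issue.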

andyjefferson avatar Mar 07 '17 13:03 andyjefferson

Any plans to optimize the generated SQL?

Hexiaoqiao avatar Nov 09 '17 06:11 Hexiaoqiao

No. That depends on contributions, this being open source and all, as per the "unresourced" tag on this issue.

andyjefferson avatar Nov 09 '17 07:11 andyjefferson

Another option, supported on several databases, is to use an "IN" clause instead of "EXISTS". The database would implicitly convert it to a join, without any of the risks associated with inadvertent many-to-many relationships. That would probably be a lot easier to implement than the join, as the SQL is closely related to the EXISTS form. I normally work on Hadoop projects, but since I'm looking at using some of this I'll start getting familiar with the code base and see if I can help.
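To make the many-to-many risk concrete, here is a small sketch (again Python's sqlite3, with an invented PERSON_ADDRESS join table and sample data, none of it taken from DataNucleus) where two matching persons share one address. The IN form returns that address once, while a plain join returns it once per matching owner.

```python
import sqlite3

# Hypothetical join-table mapping: an address can belong to several persons.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE PERSON (ID INTEGER PRIMARY KEY, FIRST_NAME TEXT);
    CREATE TABLE ADDRESS (ID INTEGER PRIMARY KEY, CITY TEXT);
    CREATE TABLE PERSON_ADDRESS (PERSON_ID INTEGER, ADDRESS_ID INTEGER);
    INSERT INTO PERSON VALUES (1, 'Ann'), (2, 'Ann'), (3, 'Bob');
    INSERT INTO ADDRESS VALUES (10, 'Leeds'), (11, 'York');
    INSERT INTO PERSON_ADDRESS VALUES (1, 10), (2, 10), (3, 11);
""")

# IN form: each ADDRESS row is tested once, so it appears at most once.
IN_SQL = """
    SELECT A.* FROM ADDRESS A
    WHERE A.ID IN (SELECT PA.ADDRESS_ID
                   FROM PERSON_ADDRESS PA, PERSON P
                   WHERE PA.PERSON_ID = P.ID AND P.FIRST_NAME = ?)"""

# JOIN form: one output row per (person, address) pair that matches.
JOIN_VIA_TABLE_SQL = """
    SELECT A.* FROM PERSON P, PERSON_ADDRESS PA, ADDRESS A
    WHERE PA.PERSON_ID = P.ID AND PA.ADDRESS_ID = A.ID
      AND P.FIRST_NAME = ?"""

in_rows = db.execute(IN_SQL, ("Ann",)).fetchall()
join_table_rows = db.execute(JOIN_VIA_TABLE_SQL, ("Ann",)).fetchall()

print(len(in_rows), len(join_table_rows))
```

Whether the repeated rows matter depends on how the results are consumed; a bulk fetch that also selects the owner id may actually want one row per owner, so this only illustrates the row-count difference.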

shawnweeks avatar Apr 07 '18 02:04 shawnweeks

Based on some testing against PostgreSQL 10.3, MariaDB 10.2 and Oracle 12.2, this optimization is already happening with "EXISTS" and "IN". I can post the test scripts for other folks to look at, but assuming you can use a relatively modern release of your database software, you're already getting the benefit of using a join.

shawnweeks avatar Apr 08 '18 14:04 shawnweeks

Thx for your input, interesting to hear.

A comparison of the 3 "bulk"/"batch" options (EXISTS, IN, JOIN) for EclipseLink JPA is available at https://java-persistence-performance.blogspot.co.uk/2010/08/batch-fetching-optimizing-object-graph.html

andyjefferson avatar Apr 08 '18 17:04 andyjefferson