perf(neo4j): improve neo4j query performance by using node labels
PR created and contributed by: MediamarktSaturn Technology GmbH, Analytics-Services Team. Special thanks to @raudzis for the finding and idea proposed.
PR Introduction: This PR introduces an optimization to the Neo4j querying process within our Datahub project. Previously, our Neo4j queries did not specify node labels during the match phase, which resulted in scanning all nodes in the database. This approach was inefficient, especially for large datasets. By integrating dynamic node labels into our match queries, we significantly improve query performance by leveraging Neo4j's ability to use indexes more effectively.
-
Node Label Integration: Modified the Neo4j queries wherever applicable and now, the query explicitly targets nodes with the specified label, reducing the search space and improving performance.
-
Performance: By applying node labels directly in our match clauses, the database engine can optimize node lookups using existing indexes, thus speeding up the query execution by reducing the number of nodes scanned.
-
Scalability: These improvements make our database queries more scalable, handling larger datasets more efficiently.
-
Maintainability: This change also enhances the clarity of our queries, making them more understandable at a glance, which benefits new contributors and maintainers alike.
Checklist
- [X] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
- ~[ ] Links to related issues (if applicable)~
- [X] Tests for the changes have been added/updated (if applicable)
- ~[ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.~
- ~[ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub~
You can find the screenshot attached comparing the profile result:
Profile WITHOUT NODELABEL Query
Query: PROFILE MATCH (src {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1
Profile WITH NODELABEL Query
Query: PROFILE MATCH (src:schemaField {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1
@david-leifker @RyanHolstien Could you please look into this PR as we have also seen significant improvement with these changes in read calls to Neo4j
@RyanHolstien @david-leifker Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.
Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.
@pashashaik-mms , as mentions here
the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.
MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"
Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty
metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"
@pashashaik-mms , as mentions here
the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.
MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty
metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"
FIXED NOW
@deepgarg-visa fixed now. Could you please check and approve it. Its been waiting a while now.
@pashashaik-mms are all metadata-io testcase passed ?
I guess this fix will not the solve the problem, as defualt value of variable "srcNodeLabel" is blank. Because of that
the variable matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s" generates below query:
MATCH (src: )-[r:DownstreamOf ]->(dest )
@deepgarg-visa I handled your scenario as well. Now the tests are running fine. Could you please check the same. removeEdgesFromNode() might be in balance.
@pashashaik-mms are all metadata-io testcase passed ? I guess this fix will not the solve the problem, as defualt value of variable "srcNodeLabel" is blank. Because of that the variable
matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s"generates below query:
MATCH (src: )-[r:DownstreamOf ]->(dest )
It seems it was already resolved. Is it not so? If everything is good, we can close this PR or merge it accordingly.
Looks like the changes were added in a different PR, closing this one. Thanks!