datahub icon indicating copy to clipboard operation
datahub copied to clipboard

perf(neo4j): improve neo4j query performance by using node labels

Open pashashaik-mms opened this issue 1 year ago • 2 comments

PR created and contributed by: MediamarktSaturn Technology GmbH, Analytics-Services Team. Special thanks to @raudzis for the finding and idea proposed.

PR Introduction: This PR introduces an optimization to the Neo4j querying process within our Datahub project. Previously, our Neo4j queries did not specify node labels during the match phase, which resulted in scanning all nodes in the database. This approach was inefficient, especially for large datasets. By integrating dynamic node labels into our match queries, we significantly improve query performance by leveraging Neo4j's ability to use indexes more effectively.

  • Node Label Integration: Modified the Neo4j queries wherever applicable and now, the query explicitly targets nodes with the specified label, reducing the search space and improving performance.

  • Performance: By applying node labels directly in our match clauses, the database engine can optimize node lookups using existing indexes, thus speeding up the query execution by reducing the number of nodes scanned.

  • Scalability: These improvements make our database queries more scalable, handling larger datasets more efficiently.

  • Maintainability: This change also enhances the clarity of our queries, making them more understandable at a glance, which benefits new contributors and maintainers alike.


Checklist

  • [X] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • ~[ ] Links to related issues (if applicable)~
  • [X] Tests for the changes have been added/updated (if applicable)
  • ~[ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.~
  • ~[ ] For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub~

pashashaik-mms avatar May 02 '24 10:05 pashashaik-mms

You can find the screenshot attached comparing the profile result:

Profile WITHOUT NODELABEL Query

Query: PROFILE MATCH (src {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1 image

Profile WITH NODELABEL Query

Query: PROFILE MATCH (src:schemaField {urn: "urn:li:schemaField:(%s),warp_file_name)"})-[r:DownstreamOf]->(dest) RETURN type(r), dest, 1 image

pashashaik-mms avatar May 02 '24 11:05 pashashaik-mms

@david-leifker @RyanHolstien Could you please look into this PR as we have also seen significant improvement with these changes in read calls to Neo4j

deepgarg760 avatar May 22 '24 15:05 deepgarg760

@RyanHolstien @david-leifker Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.

pashashaik-mms avatar May 23 '24 15:05 pashashaik-mms

Could you please approve it as I had fixed the changes. It was a formatting issue and hence the build was failing and so are the unit-tests. I fixed it now and would need your approval.

pashashaik-mms avatar May 23 '24 15:05 pashashaik-mms

@pashashaik-mms , as mentions here

the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.

MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty

metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

deepgarg760 avatar May 24 '24 10:05 deepgarg760

@pashashaik-mms , as mentions here

the below error is occurring because in testcases here, "sourceEntityFilter" is passed as null and because of that the variable "srcNodeLabel" in method "findRelatedEntities" is not setting having a default value of blank, which results in the below query which is not correct.

MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

Please also handle the case where variable "sourceEntityFilter" in method "findRelatedEntities" can be null or empty

metadata-io > nonsearch > com.linkedin.metadata.graph.dgraph.DgraphGraphServiceTest > testFindRelatedEntitiesDestinationType[11](dataset, [HasOwner], {or=[], direction=UNDIRECTED}, [RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset1,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset2,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset3,PROD), via=null), RelatedEntity(relationshipType=HasOwner, urn=urn:li:dataset:(urn:li:dataPlatform:type,SampleDataset4,PROD), via=null)]) STANDARD_ERROR org.neo4j.bolt.protocol.common.fsm.error.TransactionStateTransitionException: Invalid input ')': expected "%", "(" or an identifier (line 1, column 13 (offset: 12)) "MATCH (src: )-[r:DownstreamOf ]-(dest ) WHERE left(type(r), 2)<>'r_' RETURN dest, type(r) SKIP $offset LIMIT $count"

FIXED NOW

pashashaik-mms avatar May 24 '24 14:05 pashashaik-mms

@deepgarg-visa fixed now. Could you please check and approve it. Its been waiting a while now.

pashashaik-mms avatar May 24 '24 14:05 pashashaik-mms

@pashashaik-mms are all metadata-io testcase passed ? I guess this fix will not the solve the problem, as defualt value of variable "srcNodeLabel" is blank. Because of that the variable matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s" generates below query:

MATCH (src: )-[r:DownstreamOf ]->(dest )

deepgarg760 avatar May 24 '24 14:05 deepgarg760

@deepgarg-visa I handled your scenario as well. Now the tests are running fine. Could you please check the same. removeEdgesFromNode() might be in balance.

pashashaik-mms avatar May 24 '24 16:05 pashashaik-mms

@pashashaik-mms are all metadata-io testcase passed ? I guess this fix will not the solve the problem, as defualt value of variable "srcNodeLabel" is blank. Because of that the variable matchTemplate = "MATCH (src:%s %s)-[r%s %s]-(dest %s)%s" generates below query:

MATCH (src: )-[r:DownstreamOf ]->(dest )

It seems it was already resolved. Is it not so? If everything is good, we can close this PR or merge it accordingly.

pashashaik-mms avatar May 29 '24 07:05 pashashaik-mms

Looks like the changes were added in a different PR, closing this one. Thanks!

david-leifker avatar May 29 '24 14:05 david-leifker