amundsen
amundsen copied to clipboard
perf: Neo4j csv publisher using apoc library for performance improvements - CLEAN
Summary of Changes
New Neo4JCsvPublisher implementation using the APOC library for 5x performance improvement. This is a fresh PR because original one wasn't created on a clean branch.
Tests
I didn't see any automated tests for Neo4JCsvPublisher. This PR was tested manually by several community members.
Documentation
CheckList
Make sure you have checked all steps below to ensure a timely review.
- [ ] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
- [ ] PR includes a summary of changes.
- [ ] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain docstrings that explain what it does
Looking forward to having this PR merged! Would like to use this in the official build 🚀
Ahh looks like DCO
is blocked. Can you randomly add a small commit @zacr with format:
Signed-off-by: your_name_here <your_email_here>
(https://github.com/amundsen-io/amundsen/pull/1877/checks?check_run_id=6632089423)
I tested this in my environment and found out that the data is actually not ingested into neo4j. Good news is after a discussion and pair debugging with @zacr , we are able to locate the problem! The root cause is related to the neo4j version.
- Works with neo4j
3.5.26
- Fails with neo4j
3.3.0
Basically, there are three main problems
Publisher not showing neo4j commit error
We will get this error message while running the query on the neo4j browser
org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back.
However, this error is not showing on the publisher side. Therefore, it's hard to debug.
apoc.merge.node spec changed 3.3.0
and 3.5.26
- 3.5.26: take 4 input parameters
- 3.3.0: take 3 input parameters
Property key case sensitive
This is the query written in the source code
CALL apoc.periodic.iterate('UNWIND $rows AS row RETURN row', '
CALL apoc.merge.node([row.label], {key:row.key}, row, {published_tag:$tag,publisher_last_updated_epoch_ms:timestamp()}) YIELD node RETURN COUNT(*);
', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }})
This is the header of the csv file
"LABEL","KEY","name","is_view"
In 3.5.26
row.key
could be mapped to KEY
. In 3.3.0
, we think it's case sensitive and it's not able to find the uppercase "KEY" when written in lowercase in the query.
Ahh yep, that would make sense with the .lower()
I saw in the previous iteration. Thanks for investigating this as I'm not too familiar with Neo4J!
Thanks so much @chonyy. I have an update in progress that addresses all three, hope to have it committed in the next day or so.
Thanks so much @chonyy. I have an update in progress that addresses all three, hope to have it committed in the next day or so.
@zacr that's great news! I could also help with testing it in my environment once you are done 🚀
Hey @zacr ,
FYI, I tested your code in neo4j 3.5.26
and apoc 3.5.0.17
, it also doesn't work. The error message is like below.
After I changed all the key to uppercase, it works! Just want to let you know that case sensitivity seems to also be an issue in your provided environment.
Awesome @chonyy . Agreed on the upper case. Can we assume the headers of the CSV files will always be upper case?
Also, I am out of town this week, but here is the error handling code that goes right after session.run() in _execute_statement()
ret = [dict(i) for i in result]
if ret[0]['failedOperations'] > 0:
raise RuntimeError(f"Failed to executed statement: {stmt} with {ret[0]['errorMessages']}")
Was thinking of adding a flag to turn on\off this checking.
Any updates on this PR? Would be great to see it merged!
I'm thinking that maybe we don't even have to spend time supporting Neo4j 3.3.0
since it's already EOL. @zacr's version works great with some minor fixes related to case sensitivity.
What I can help here is to add the fix and also the extra part that zacr mentioned, in order to surface the error to the caller side. But we still need a commiter's help to look into this and approve the CI pipeline run 🥲
/rebase
Hello, I know it has been a while since there has been activity on this PR, but I wanted to follow up for those who were interested in a faster publisher. A few months ago I worked on one myself in this PR using the unwind
clause, which improves performance by allowing Neo4j to compile and cache the statements and reduces the overall amount of transactions (see it here). This publisher doesn't depend on the apoc library. We have been using it and seeing huge improvements in speed compared to the old one. I haven't tested it on all versions of Neo4j, but it may be sufficient for others to try and see if this in progress one using apoc is no longer required?
Closing as done.