amundsen icon indicating copy to clipboard operation
amundsen copied to clipboard

perf: Neo4j csv publisher using apoc library for performance improvements - CLEAN

Open zacr opened this issue 2 years ago • 11 comments

Summary of Changes

New Neo4JCsvPublisher implementation using the APOC library for 5x performance improvement. This is a fresh PR because original one wasn't created on a clean branch.

Tests

I didn't see any automated tests for Neo4JCsvPublisher. This PR was tested manually by several community members.

Documentation

CheckList

Make sure you have checked all steps below to ensure a timely review.

  • [ ] PR title addresses the issue accurately and concisely. Example: "Updates the version of Flask to v1.0.2"
  • [ ] PR includes a summary of changes.
  • [ ] PR adds unit tests, updates existing unit tests, OR documents why no test additions or modifications are needed.
  • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain docstrings that explain what it does

zacr avatar May 27 '22 21:05 zacr

Looking forward to having this PR merged! Would like to use this in the official build 🚀

chonyy avatar May 30 '22 05:05 chonyy

Ahh looks like DCO is blocked. Can you randomly add a small commit @zacr with format:

Signed-off-by: your_name_here <your_email_here>

(https://github.com/amundsen-io/amundsen/pull/1877/checks?check_run_id=6632089423)

VoLKyyyOG avatar May 31 '22 01:05 VoLKyyyOG

I tested this in my environment and found out that the data is actually not ingested into neo4j. Good news is after a discussion and pair debugging with @zacr , we are able to locate the problem! The root cause is related to the neo4j version.

  • Works with neo4j 3.5.26
  • Fails with neo4j 3.3.0

Basically, there are three main problems

Publisher not showing neo4j commit error

We will get this error message while running the query on the neo4j browser org.neo4j.graphdb.TransactionFailureException: Transaction was marked as successful, but unable to commit transaction so rolled back. However, this error is not showing on the publisher side. Therefore, it's hard to debug.

apoc.merge.node spec changed 3.3.0 and 3.5.26

  • 3.5.26: take 4 input parameters
  • 3.3.0: take 3 input parameters

Property key case sensitive

This is the query written in the source code

CALL apoc.periodic.iterate('UNWIND $rows AS row RETURN row', '
                        CALL apoc.merge.node([row.label], {key:row.key}, row, {published_tag:$tag,publisher_last_updated_epoch_ms:timestamp()}) YIELD node RETURN COUNT(*);
                    ', {batchSize: $batch_size, iterateList: True, parallel: False, params: { rows: $batch, tag: $publish_tag }})

This is the header of the csv file

"LABEL","KEY","name","is_view"

In 3.5.26 row.key could be mapped to KEY. In 3.3.0, we think it's case sensitive and it's not able to find the uppercase "KEY" when written in lowercase in the query.

chonyy avatar May 31 '22 18:05 chonyy

Ahh yep, that would make sense with the .lower() I saw in the previous iteration. Thanks for investigating this as I'm not too familiar with Neo4J!

VoLKyyyOG avatar Jun 01 '22 02:06 VoLKyyyOG

Thanks so much @chonyy. I have an update in progress that addresses all three, hope to have it committed in the next day or so.

zacr avatar Jun 09 '22 13:06 zacr

Thanks so much @chonyy. I have an update in progress that addresses all three, hope to have it committed in the next day or so.

@zacr that's great news! I could also help with testing it in my environment once you are done 🚀

chonyy avatar Jun 11 '22 12:06 chonyy

Hey @zacr ,

FYI, I tested your code in neo4j 3.5.26 and apoc 3.5.0.17, it also doesn't work. The error message is like below. image

After I changed all the key to uppercase, it works! Just want to let you know that case sensitivity seems to also be an issue in your provided environment.

chonyy avatar Jun 12 '22 14:06 chonyy

Awesome @chonyy . Agreed on the upper case. Can we assume the headers of the CSV files will always be upper case?

zacr avatar Jun 13 '22 13:06 zacr

Also, I am out of town this week, but here is the error handling code that goes right after session.run() in _execute_statement()

            ret = [dict(i) for i in result]

            if ret[0]['failedOperations'] > 0:
                raise RuntimeError(f"Failed to executed statement: {stmt} with {ret[0]['errorMessages']}")

Was thinking of adding a flag to turn on\off this checking.

zacr avatar Jun 13 '22 13:06 zacr

Any updates on this PR? Would be great to see it merged!

ozandogrultan avatar Aug 11 '22 19:08 ozandogrultan

I'm thinking that maybe we don't even have to spend time supporting Neo4j 3.3.0 since it's already EOL. @zacr's version works great with some minor fixes related to case sensitivity.

What I can help here is to add the fix and also the extra part that zacr mentioned, in order to surface the error to the caller side. But we still need a commiter's help to look into this and approve the CI pipeline run 🥲

chonyy avatar Aug 12 '22 01:08 chonyy

/rebase

Golodhros avatar Jan 31 '23 21:01 Golodhros

Hello, I know it has been a while since there has been activity on this PR, but I wanted to follow up for those who were interested in a faster publisher. A few months ago I worked on one myself in this PR using the unwind clause, which improves performance by allowing Neo4j to compile and cache the statements and reduces the overall amount of transactions (see it here). This publisher doesn't depend on the apoc library. We have been using it and seeing huge improvements in speed compared to the old one. I haven't tested it on all versions of Neo4j, but it may be sufficient for others to try and see if this in progress one using apoc is no longer required?

kristenarmes avatar Feb 01 '23 00:02 kristenarmes

Closing as done.

Golodhros avatar Feb 08 '23 00:02 Golodhros