minhash_spark.py [UNABLE_TO_INFER_SCHEMA]

Open Yang-QW opened this issue 2 months ago • 2 comments

When I run minhash_spark.py on a Spark cluster, I occasionally get [UNABLE_TO_INFER_SCHEMA] errors, as shown in the following figure. I am not sure whether the data is the problem. The workers run on different machines, so the data has to be copied to each of them. Files that trigger the error run normally after being retransmitted, but the error can recur after a while. Could moving or reading the files affect Spark? I have since set up an NFS server so that every worker reads the same files, but the problem still occurs. Can you help me work out where the problem lies?

Apr 18 '24 03:04 Yang-QW

I found a solution in the following issue. The cause seems to be an error when Spark reads the checkpoint.

https://github.com/graphframes/graphframes/issues/201

I changed the connectedComponents call to use algorithm="graphx", and it worked.

However, it takes longer than the default algorithm.
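For anyone applying the same workaround, here is a minimal sketch of the change. It assumes `graph` is a `graphframes.GraphFrame`; the helper name `connected_components` and its `use_graphx` flag are hypothetical and are not taken from minhash_spark.py.

```python
def connected_components(graph, use_graphx=False):
    """Run connected components on a GraphFrame.

    Hypothetical helper: `graph` is assumed to be a graphframes.GraphFrame.
    """
    if use_graphx:
        # Workaround from the linked issue: use the GraphX implementation.
        # It sidesteps the checkpoint read error but was slower in this thread.
        return graph.connectedComponents(algorithm="graphx")
    # Default ("graphframes") algorithm: distributed and iterative,
    # and the one that hit the checkpoint read error here.
    return graph.connectedComponents()
```

Keeping the switch behind a flag makes it easy to return to the faster default once the checkpoint location is sorted out.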

Apr 18 '24 06:04 Yang-QW

Thanks for sharing all the details. Could you verify that your checkpoint location is writable by Spark?

Based on the discussion in the linked issue, there does not seem to be anything I can do to "solve" this beyond checking write access to the checkpoint location. The default distributed, iterative algorithm was the whole reason I chose it in the first place, to speed things up.
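One rough way to check write access is sketched below. It assumes the checkpoint location is a local or NFS-mounted directory (not an `hdfs://` or `s3://` URI); the helper name `is_writable_dir` is hypothetical. It only tests the machine it runs on, so it would need to be run on the driver and on each worker.

```python
import os
import tempfile
import uuid


def is_writable_dir(path):
    """Return True if a probe file can be created and removed under `path`."""
    if not os.path.isdir(path):
        return False
    # Unique name so concurrent checks from several machines do not collide.
    probe = os.path.join(path, f".write_probe_{uuid.uuid4().hex}")
    try:
        with open(probe, "w") as fh:
            fh.write("ok")
        os.remove(probe)
        return True
    except OSError:
        return False
```

If the check passes everywhere, the same path can be passed to `spark.sparkContext.setCheckpointDir(path)` before running connected components.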

I will add this issue to a QA section in case anyone encounters the same problem in the future.

Apr 18 '24 17:04 ChenghaoMou
