algebird
MinHasher output depends on the number of nodes
Hi, I am using MinHasher32 to create clusters of similar records, tokenizing the record values to create the signatures (as I explained in https://github.com/twitter/algebird/issues/609), but it seems that the resulting buckets depend on the Spark configuration. I ran the same code several times on a single node of a cluster machine with 16 cores and always obtained the same number of buckets, X. Then, on the same machine, I increased the number of cores to 20 and the number of buckets changed to a different value, Y. I repeated the test and obtained Y again.
Is it possible that the MinHasher's output is influenced by the number of nodes? Can someone explain why?
Thanks
Regards Luca
I can confirm that bucket generation depends on the level of Spark parallelism. I ran a test on my laptop, repartitioning the tokens before initializing the MinHasher:
val attributeWithHashes: RDD[(String, Iterable[MinHashSignature])] =
  attributesToken.repartition(10).map {
    case (attribute, token) =>
      (attribute, minHasher.init(token))
  }.groupByKey()
With the same number of partitions I always obtain the same buckets; if I change it, I obtain different buckets.
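To narrow this down, one option is to check outside Spark whether the hasher itself is sensitive to the order in which token signatures are combined. The sketch below is only an assumption about how such a check could look: the object name `MinHasherOrderCheck`, the parameters `20` and `5`, and the token values are made up for illustration and are not taken from the job above. It uses `init`, `plus`, `zero` and `buckets` from `MinHasher32`, and it compares the bucket lists rather than the signatures, because `MinHashSignature` wraps an `Array[Byte]` and arrays compare by reference in Scala.

```scala
import com.twitter.algebird.MinHasher32

object MinHasherOrderCheck {
  // Hypothetical parameters (numHashes = 20, numBands = 5);
  // substitute the values used in the real job.
  val minHasher = new MinHasher32(20, 5)

  // Build one signature per token, merge them with the monoid's plus,
  // then derive the LSH buckets from the merged signature.
  def bucketsOf(tokens: Seq[String]): List[Long] = {
    val merged = tokens
      .map(t => minHasher.init(t))
      .foldLeft(minHasher.zero)((a, b) => minHasher.plus(a, b))
    minHasher.buckets(merged)
  }

  def main(args: Array[String]): Unit = {
    // Made-up tokens, combined in two different orders.
    val tokens = Seq("alpha", "beta", "gamma", "delta")
    val sameBuckets = bucketsOf(tokens) == bucketsOf(tokens.reverse)
    println(s"order-independent buckets: $sameBuckets")
  }
}
```

If the two orders yield the same buckets here, that would suggest looking at the steps that follow `groupByKey()` (for example, how the `Iterable[MinHashSignature]` is consumed or compared) rather than at the hasher itself.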