
The MinHasher execution depends on the number of nodes

Open Gaglia88 opened this issue 7 years ago • 1 comment

Hi, I am using the MinHasher32 to create clusters of similar records, tokenizing the record values to create the signatures (as I explained here https://github.com/twitter/algebird/issues/609), but it seems that the resulting buckets depend on the Spark configuration. I ran the same code several times on a single node of a cluster machine with 16 cores and I always obtained X buckets. Then, on the same machine, I increased the number of cores to 20 and the number of buckets changed to another value Y; I repeated the test and obtained Y again.
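For context, here is a minimal, self-contained sketch of the kind of pipeline described above: tokenize each record value, build one MinHash signature per record, and group records by LSH bucket. The sample data, the MinHasher32 parameters (0.5 target similarity threshold, 1024 bytes per signature) and the object/helper names are illustrative assumptions, not taken from the original code.

import com.twitter.algebird.{MinHasher32, MinHashSignature}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MinHashBucketsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("minhash-sketch").setMaster("local[4]"))

    // Assumed parameters: target similarity threshold 0.5, at most 1024 bytes per signature.
    val minHasher = new MinHasher32(0.5, 1024)

    // Illustrative records: (recordId, attribute value).
    val records: RDD[(String, String)] = sc.parallelize(Seq(
      ("r1", "john smith 1980 london"),
      ("r2", "john smyth 1980 london"),
      ("r3", "mary jones 1975 leeds")
    ))

    // Tokenize each value and combine the per-token signatures with the
    // MinHasher monoid to get one signature per record.
    val signatures: RDD[(String, MinHashSignature)] = records.mapValues { value =>
      value.split("\\s+").map(t => minHasher.init(t)).reduce((a, b) => minHasher.plus(a, b))
    }

    // LSH buckets: records that share a bucket id are candidate duplicates.
    val buckets: RDD[(Long, Iterable[String])] = signatures
      .flatMap { case (id, sig) => minHasher.buckets(sig).map(b => (b, id)) }
      .groupByKey()

    println(s"number of buckets: ${buckets.count()}")
    sc.stop()
  }
}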

Is it possible that the execution of the MinHasher is influenced by the number of nodes? Can someone explain why?

Thanks

Regards Luca

Gaglia88 avatar Jun 06 '17 13:06 Gaglia88

I confirm that the bucket generation depends on the level of Spark parallelism. I ran a test on my laptop, repartitioning the tokens before initializing the MinHasher:

val attributeWithHashes: RDD[(String, Iterable[MinHashSignature])] =
  attributesToken.repartition(10).map {
    case (attribute, token) =>
      (attribute, minHasher.init(token))
  }.groupByKey()

At the same repartition level I always obtain the same buckets; if I change it, I obtain different buckets.
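For what it's worth, here is a sketch of that check, assuming the same attributesToken: RDD[(String, String)] and minHasher as in the snippet above; bucketCount is a hypothetical helper (e.g. for spark-shell), not part of algebird.

import com.twitter.algebird.MinHasher32
import org.apache.spark.rdd.RDD

// Hypothetical helper: run the same grouping at a given parallelism level
// and count the distinct LSH bucket ids it produces.
def bucketCount(attributesToken: RDD[(String, String)],
                minHasher: MinHasher32,
                partitions: Int): Long =
  attributesToken
    .repartition(partitions)
    .map { case (attribute, token) => (attribute, minHasher.init(token)) }
    .groupByKey()
    .flatMap { case (_, sigs) =>
      // Combine the per-token signatures with the MinHasher monoid, then
      // emit the LSH bucket ids of the combined signature.
      minHasher.buckets(sigs.reduce((a, b) => minHasher.plus(a, b)))
    }
    .distinct()
    .count()

// Per the observation above, repeated runs at the same partition count give the
// same result, while different partition counts were reported to give different ones:
//   bucketCount(attributesToken, minHasher, 10)   // always the same value
//   bucketCount(attributesToken, minHasher, 20)   // reportedly a different value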

Gaglia88 avatar Jun 06 '17 18:06 Gaglia88