
The MinHasher execution depends on the number of nodes

Open Gaglia88 opened this issue 7 years ago • 1 comment

Hi, I am using the MinHasher32 to create clusters of similar records, tokenizing the record values to create the signatures (as I explained here https://github.com/twitter/algebird/issues/609), but it seems that the resulting buckets depend on the Spark configuration. I ran the same code several times on a single node of a cluster machine with 16 cores and I always obtained X buckets. Then, on the same machine, I increased the number of cores to 20 and the number of buckets changed to another value Y; I repeated the test and obtained Y again.
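For context, here is a minimal, self-contained sketch of the kind of pipeline described above: tokenize each record value, build one MinHash signature per record, and group records by LSH bucket. The sample data, the MinHasher32 parameters (0.5 target similarity threshold, 1024 bytes per signature) and the object/helper names are illustrative assumptions, not taken from the original code.

import com.twitter.algebird.{MinHasher32, MinHashSignature}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MinHashBucketsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("minhash-sketch").setMaster("local[4]"))

    // Assumed parameters: target similarity threshold 0.5, at most 1024 bytes per signature.
    val minHasher = new MinHasher32(0.5, 1024)

    // Illustrative records: (recordId, attribute value).
    val records: RDD[(String, String)] = sc.parallelize(Seq(
      ("r1", "john smith 1980 london"),
      ("r2", "john smyth 1980 london"),
      ("r3", "mary jones 1975 leeds")
    ))

    // Tokenize each value and combine the per-token signatures with the
    // MinHasher monoid to get one signature per record.
    val signatures: RDD[(String, MinHashSignature)] = records.mapValues { value =>
      value.split("\\s+").map(t => minHasher.init(t)).reduce((a, b) => minHasher.plus(a, b))
    }

    // LSH buckets: records that share a bucket id are candidate duplicates.
    val buckets: RDD[(Long, Iterable[String])] = signatures
      .flatMap { case (id, sig) => minHasher.buckets(sig).map(b => (b, id)) }
      .groupByKey()

    println(s"number of buckets: ${buckets.count()}")
    sc.stop()
  }
}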

Is it possible that the execution of the MinHasher is influenced by the number of nodes? Can someone explain why?

Thanks

Regards Luca

Gaglia88 avatar Jun 06 '17 13:06 Gaglia88

I confirm that the bucket generation depends on the level of Spark parallelism. I ran a test on my laptop, repartitioning the tokens before initializing the MinHasher:

val attributeWithHashes: RDD[(String, Iterable[MinHashSignature])] =
  attributesToken.repartition(10).map {
    case (attribute, token) =>
      (attribute, minHasher.init(token))
  }.groupByKey()

At the same repartition level I always obtain the same buckets; if I change it, I obtain different buckets.
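For what it's worth, here is a sketch of that check, assuming the same attributesToken: RDD[(String, String)] and minHasher as in the snippet above; bucketCount is a hypothetical helper (e.g. for spark-shell), not part of algebird.

import com.twitter.algebird.MinHasher32
import org.apache.spark.rdd.RDD

// Hypothetical helper: run the same grouping at a given parallelism level
// and count the distinct LSH bucket ids it produces.
def bucketCount(attributesToken: RDD[(String, String)],
                minHasher: MinHasher32,
                partitions: Int): Long =
  attributesToken
    .repartition(partitions)
    .map { case (attribute, token) => (attribute, minHasher.init(token)) }
    .groupByKey()
    .flatMap { case (_, sigs) =>
      // Combine the per-token signatures with the MinHasher monoid, then
      // emit the LSH bucket ids of the combined signature.
      minHasher.buckets(sigs.reduce((a, b) => minHasher.plus(a, b)))
    }
    .distinct()
    .count()

// Per the observation above, repeated runs at the same partition count give the
// same result, while different partition counts were reported to give different ones:
//   bucketCount(attributesToken, minHasher, 10)   // always the same value
//   bucketCount(attributesToken, minHasher, 20)   // reportedly a different value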

Gaglia88 avatar Jun 06 '17 18:06 Gaglia88