
Problem in the Gaussian Mixture clustering part

lockbro opened this issue 4 years ago · 5 comments

```python
# Gaussian Mixture clustering
# (seed, pca_slicer, pca and the scaled_*_df DataFrames come from
# earlier cells of the notebook)
from time import time

from pyspark.ml import Pipeline
from pyspark.ml.clustering import GaussianMixture

t0 = time()
gm = GaussianMixture(k=8, maxIter=150, seed=seed,
                     featuresCol="pca_features",
                     predictionCol="cluster",
                     probabilityCol="gm_prob")

gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```

When I run this part in a Jupyter notebook, an error appears:

```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
     14
     15 gm_params = (gm_model.stages[2].gaussiansDF.rdd
---> 16              .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
     17              .collect())
     18 gm_weights = gm_model.stages[2].weights

C:\Spark\python\pyspark\rdd.py in collect(self)
    813         to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817             return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73             SCCallSiteSync._spark_stack_depth += 1
     74

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```

I did some research but found few answers; some people said it's a bug in Spark itself. By the way, I didn't use the Docker image but built an "Anaconda 3.7.6 + PySpark 2.4.5" environment to run this code.
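For what it's worth, one direction I thought of trying (just a sketch based on the traceback, not a confirmed fix) is to collect `gaussiansDF` directly as a DataFrame instead of converting it to an RDD first, since the error is raised in the RDD `collect()` path:

```python
# Sketch of a possible workaround (untested): collect gaussiansDF as a
# DataFrame rather than via .rdd. DataFrame.collect() keeps the work on
# the JVM side, so no Python worker processes are launched for the map
# step.
gm_stage = gm_model.stages[2]
gm_params = [
    [row['mean'].toArray(), row['cov'].toArray()]
    for row in gm_stage.gaussiansDF.collect()
]
gm_weights = gm_stage.weights
```

That said, the traceback shows `self._context._jsc` is `None`, which is what `SparkContext.stop()` leaves behind, so if the context has already died this would fail the same way and restarting the kernel and rerunning the earlier cells would be the only option.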

Can you please help me solve this problem? I'd be very grateful!

lockbro avatar Apr 01 '20 09:04 lockbro

Hi @lockbro , I haven't been updating this repo for a while, but I just pushed a commit with simplified instructions for building and running a Docker image with PySpark installed. It uses PySpark 2.4.5 too. I've run the notebook end-to-end and haven't faced any issues, so I'd advise you to just use the Docker container for that.

So you just need to run the `make nsl-kdd-pyspark` command. It'll download the latest jupyter/pyspark-notebook Docker image, start a container with Jupyter on port 8889, and print the current Jupyter token after 15 seconds (to make sure that Jupyter has had enough time to get running).
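If you'd rather run the container directly instead of via make, something roughly like this should be equivalent (a sketch only; the exact flags live in the Makefile, so check there):

```sh
# Rough equivalent of the make target (assumed, not copied from the
# Makefile): run Jupyter on host port 8889, then print the token once
# the server has had time to start.
docker run -d --rm --name nsl-kdd-pyspark \
    -p 8889:8888 \
    -v "$PWD":/home/jovyan/work \
    jupyter/pyspark-notebook
sleep 15
docker logs nsl-kdd-pyspark 2>&1 | grep token
```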

Hope it'll help!

thinline72 avatar Apr 01 '20 21:04 thinline72

I haven't figured out how to use the Docker image so far, so I'm wondering why this code can't run smoothly in my own environment (Win7 32-bit + Anaconda 3.7 + PySpark 2.4.5). (My computer can't run Docker because the OS is Win7 32-bit :( )

lockbro avatar Apr 02 '20 09:04 lockbro

@lockbro I'm sorry, but I don't use Windows as my main platform for work, so I'm not able to help here. I'd suggest just skipping the Gaussian Mixture part if you cannot run it in the cloud or on a different machine.

thinline72 avatar Apr 02 '20 10:04 thinline72

Can I ask what platform and which specific edition you used? I want to use a virtual machine to run the code.

lockbro avatar Apr 02 '20 12:04 lockbro

As I mentioned above, I'm running it in the Docker container, which is already a kind of VM, so I can run it from my MacBook or from my Linux PC.

If you are interested in what OS is used inside that Docker image, please check here. It seems to be Ubuntu 18.04 (bionic).
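If you want to verify it yourself from any machine that has Docker, a one-liner like this should print the base OS of the image:

```sh
# Print the OS release info baked into the image; at that time it
# should have reported Ubuntu 18.04 (bionic).
docker run --rm jupyter/pyspark-notebook cat /etc/os-release
```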

thinline72 avatar Apr 02 '20 13:04 thinline72