nsl-kdd
Problem in the Gaussian Mixture clustering part
```python
# Gaussian Mixture clustering
from pyspark.ml.clustering import GaussianMixture

t0 = time()
gm = GaussianMixture(k=8, maxIter=150, seed=seed, featuresCol="pca_features",
                     predictionCol="cluster", probabilityCol="gm_prob")

gm_pipeline = Pipeline(stages=[pca_slicer, pca, gm])
gm_model = gm_pipeline.fit(scaled_train_df)

gm_train_df = gm_model.transform(scaled_train_df).cache()
gm_cv_df = gm_model.transform(scaled_cv_df).cache()
gm_test_df = gm_model.transform(scaled_test_df).cache()

gm_params = (gm_model.stages[2].gaussiansDF.rdd
             .map(lambda row: [row['mean'].toArray(), row['cov'].toArray()])
             .collect())
gm_weights = gm_model.stages[2].weights

print(gm_train_df.count())
print(gm_cv_df.count())
print(gm_test_df.count())
print(time() - t0)
```
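For context, `gm_params` collects each fitted component's mean vector and covariance matrix, and `gm_weights` holds the mixture weights. A minimal pure-Python sketch (1-D only, with illustrative made-up parameters, not values from the notebook) of how those pieces define the mixture density:

```python
import math

def gaussian_pdf(x, mean, var):
    # Density of a 1-D Gaussian N(mean, var).
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mixture_pdf(x, weights, means, variances):
    # A Gaussian mixture is a weighted sum of component densities --
    # the same structure the fitted GaussianMixture model represents,
    # shown here in 1-D for simplicity (Spark's components are multivariate).
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Hypothetical two-component mixture:
weights = [0.3, 0.7]
means = [0.0, 5.0]
variances = [1.0, 2.0]

density_at_zero = mixture_pdf(0.0, weights, means, variances)
```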
When I run this part in a Jupyter notebook, an error appears:
```
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
C:\Spark\python\pyspark\rdd.py in collect(self)
    813             to be small, as all the data is loaded into the driver's memory.
    814         """
--> 815         with SCCallSiteSync(self.context) as css:
    816             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    817         return list(_load_from_socket(sock_info, self._jrdd_deserializer))

C:\Spark\python\pyspark\traceback_utils.py in __enter__(self)
     70     def __enter__(self):
     71         if SCCallSiteSync._spark_stack_depth == 0:
---> 72             self._context._jsc.setCallSite(self._call_site)
     73         SCCallSiteSync._spark_stack_depth += 1
     74

AttributeError: 'NoneType' object has no attribute 'setCallSite'
```
I did some research, but there are only a few answers out there; some people say it's a bug in Spark itself. By the way, I didn't use the Docker image; I built an "Anaconda 3.7.6 + PySpark 2.4.5" environment to run this code.
Can you please help me solve this problem? Thank you very much!
Hi @lockbro, I haven't been updating this repo for a while, but I just pushed a commit with simplified instructions to build and run a Docker image with PySpark installed, etc. It uses PySpark 2.4.5 too. I've run the notebook end-to-end and haven't faced any issues, so I'd advise you to just use the Docker container for that.
So you just need to run the `make nsl-kdd-pyspark` command. It'll download the latest jupyter/pyspark-notebook Docker image, start a container with Jupyter on port 8889, and print your current Jupyter token after 15 seconds (to make sure that Jupyter has had enough time to get running).
Hope it'll help!
I haven't figured out how to use the Docker image yet, so I'm wondering why this code can't run smoothly in my own environment (Win7 32-bit + Anaconda 3.7 + PySpark 2.4.5). (My computer can't run Docker because the OS is Win7 32-bit :( )
@lockbro I'm sorry, but I don't use Windows as my main platform, so I'm not able to help here. I'd suggest just skipping the part with Gaussian Mixture models if you cannot run it in the cloud or on a different machine.
Can I ask what platform and which specific edition you used? I want to use a virtual machine to run the code.
As I mentioned above, I'm running it in the Docker container, which is already a kind of VM, so I can run it from my MacBook or from my PC with Linux.
If you are interested in which OS is used inside that Docker image, please check here. It seems to be Ubuntu 18.04 (bionic).