
[BUG] ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph when calling spark.sql()

Open DataBoyTX opened this issue 3 months ago • 1 comments

Describe the bug

The following code used to work but now throws an error. The likely cause is that the datatype of the resulting df changed from SparkDataFrame to pyspark.sql.connect.dataframe.DataFrame.

df = spark.sql("SELECT * FROM honeypot")

g2 = graphistry.edges(df, 'attackerIP', 'victimIP')

g2.plot()

Simply adding .toPandas() to the df passed to edges() fixes the problem, but we should handle this in the client.
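For anyone hitting this in the meantime, the workaround above can be wrapped in a small helper. This is only a sketch: `to_pandas_if_spark` is a hypothetical name, not part of pygraphistry, and it relies on duck typing (any object exposing a callable `toPandas`) rather than checking a specific Spark class.

```python
def to_pandas_if_spark(df):
    """Return df.toPandas() when df exposes that method, else df unchanged.

    Hypothetical helper, not a pygraphistry API. Duck typing is used so that
    both classic and Spark Connect DataFrames are covered without importing
    pyspark here.
    """
    to_pandas = getattr(df, "toPandas", None)
    if callable(to_pandas):
        return to_pandas()
    return df


# Intended usage in the notebook from this issue (needs a live Spark session):
#   df = spark.sql("SELECT * FROM honeypot")
#   g2 = graphistry.edges(to_pandas_if_spark(df), 'attackerIP', 'victimIP')
#   g2.plot()
```

Note that `toPandas()` collects the full result onto the driver, so this is only suitable when the query result fits in driver memory.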

error:


ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File <command-2934552628071172>, line 1
----> 1 g.plot()

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/PlotterBase.py:1404, in PlotterBase.plot(self, graph, nodes, name, description, render, skip_upload, as_files, memoize, extra_html, override_html_style)
   1401 PyGraphistry.refresh()
   1402 logger.debug("4. @PloatterBase plot: PyGraphistry.org_name(): {}".format(PyGraphistry.org_name()))
-> 1404 dataset = self._plot_dispatch(g, n, name, description, 'arrow', self._style, memoize)
   1405 if skip_upload:
   1406     return dataset

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/PlotterBase.py:1701, in PlotterBase._plot_dispatch(self, graph, nodes, name, description, mode, metadata, memoize)
   1698 except ImportError:
   1699     pass
-> 1701 error('Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.')

File /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.10/site-packages/graphistry/util.py:280, in error(msg)
    279 def error(msg):
--> 280     raise ValueError(msg)

ValueError: Expected Pandas/Arrow/cuDF/Spark dataframe(s) or igraph/NetworkX graph.

To Reproduce

Lab 2 - Data Preparation and Styling-ExpectedPandasArrowSparkDataframe.zip

DataBoyTX, Mar 15 '24 17:03

We should support multiple Spark versions. It sounds like this potentially impacts the following:

  • Spark availability sniffing: https://github.com/graphistry/pygraphistry/blob/2506b798ec723e906c1c5279f613fe0c37bdbad2/graphistry/PlotterBase.py#L80
  • Dispatch: https://github.com/graphistry/pygraphistry/blob/2506b798ec723e906c1c5279f613fe0c37bdbad2/graphistry/PlotterBase.py#L1682
  • Arrow coercion: https://github.com/graphistry/pygraphistry/blob/2506b798ec723e906c1c5279f613fe0c37bdbad2/graphistry/PlotterBase.py#L1901

lmeyerov, Mar 15 '24 18:03