DataFrame not working when using columns=[]

Open feliperegis opened this issue 4 years ago • 1 comments

Hey guys,

I usually work with pandas, and recently I've been exploring Koalas on Databricks to leverage big-data solutions. I'd like to share an issue I found when trying to update a PySpark DataFrame using Koalas.

Here is what happened when I used a PySpark df that I had previously loaded from a file (further down I use a sample dict instead, for troubleshooting purposes):

import databricks.koalas as ks

kdf = ks.DataFrame(data=df, columns=['a', 'b'])
display(kdf)

It throws an error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<command-3247900034274940> in <module>
---> 11 kdf = ks.DataFrame(df,columns=['a', 'b'])
     13 display(kdf)

/databricks/python/lib/python3.7/site-packages/databricks/koalas/usage_logging/__init__.py in wrapper(*args, **kwargs)
    178             start = time.perf_counter()
    179             try:
--> 180                 res = func(*args, **kwargs)
    181                 logger.log_success(
    182                     class_name, function_name, time.perf_counter() - start, signature

/databricks/python/lib/python3.7/site-packages/databricks/koalas/frame.py in __init__(self, data, index, columns, dtype, copy)
    469         elif isinstance(data, spark.DataFrame):
    470             assert index is None
--> 471             assert columns is None
    472             assert dtype is None
    473             assert not copy

I did some tests and it works when I use a dict instead of a df as above:

import databricks.koalas as ks

df = {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]}

kdf = ks.DataFrame(df, columns=['a','b'])
display(kdf)

I also tried with pandas, and it worked fine with the same PySpark df. The code is below:

import pandas as pd

pdf = df.toPandas()
pdf = pd.DataFrame(pdf, columns=['a','b'])
display(pdf)

I tried the filter method as below and it worked, but it's not the same thing, right?

import databricks.koalas as ks

kdf = ks.DataFrame(df)
kdf_filtered = kdf.filter(items=['a', 'b'])
display(kdf_filtered)
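
For reference, here is how the two approaches compare in plain pandas. This is only a small sketch with made-up sample data, and it assumes nothing about Koalas itself: whether Koalas' filter behaves identically is not checked here. The label 'z' is a hypothetical missing column used for illustration.

```python
import pandas as pd

data = {'a': [1, 2, 3],
        'b': [100, 200, 300],
        'c': ["one", "two", "three"]}

# When every requested column exists, both approaches keep the same columns.
via_constructor = pd.DataFrame(data, columns=['a', 'b'])
via_filter = pd.DataFrame(data).filter(items=['a', 'b'])
same_when_present = via_constructor.equals(via_filter)

# They differ when a requested label is missing ('z' is a made-up name):
# the constructor adds an all-NaN column, while filter() skips the label.
with_missing_ctor = pd.DataFrame(data, columns=['a', 'z'])
with_missing_filter = pd.DataFrame(data).filter(items=['a', 'z'])
```

So in pandas the results match as long as all the listed columns are present, and only diverge for labels that do not exist in the data.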

Does anybody know if I'm doing something wrong? Is this something still under construction that needs to be fixed in Koalas?

Thanks!

feliperegis, Feb 14 '21 13:02

Oh, Koalas supports either:

ks.DataFrame(spark_df)
ks.DataFrame(pandas_df)
ks.DataFrame(koalas_series)

or

ks.DataFrame(...)  # same as pandas

Mixed arguments are currently not supported.

You can do, for example, as below:

>>> kdf = ks.DataFrame(df)
>>> kdf.columns = ["foo"]
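
Since the constructor asserts that columns is None when it receives a Spark DataFrame, one way to keep only some columns is to select them on the Spark side before wrapping. A minimal sketch, assuming df is a PySpark DataFrame; koalas_from_spark is a hypothetical helper name (not part of Koalas), and the import is done lazily inside the function so the helper can be defined before Koalas is loaded:

```python
def koalas_from_spark(spark_df, columns=None):
    """Wrap a Spark DataFrame in a Koalas DataFrame, optionally keeping
    only the given columns.

    Hypothetical helper for illustration; it is not part of Koalas.
    """
    # Imported here so that defining the helper does not require Koalas.
    import databricks.koalas as ks

    if columns is not None:
        # Select on the Spark side, so ks.DataFrame() receives only `data`
        # and the `assert columns is None` check is never triggered.
        spark_df = spark_df.select(*columns)
    return ks.DataFrame(spark_df)
```

For the example in the question this would be called as koalas_from_spark(df, columns=['a', 'b']). Selecting after wrapping, as in ks.DataFrame(df)[['a', 'b']], should give the same columns.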

HyukjinKwon, Feb 17 '21 05:02