Can I define which characters should be treated as missing values?

Open Bowen0729 opened this issue 3 years ago • 14 comments

In some cases I treat "" or " " as a missing value. Can I define which characters should be treated as missing values?

Bowen0729 avatar Mar 16 '22 08:03 Bowen0729

Hi @Bowen0729 , you can set the na_values parameter when you use pd.read_csv to get the dataframe.
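
For illustration, here is a minimal sketch of that suggestion. The inline CSV text and the "?" token are made up for the example. pandas already reads an empty field as missing by default, so only the extra tokens need to be listed in na_values.

```python
import io

import pandas as pd

# Hypothetical sample data: row 2 has a single-space city and a "?" age,
# row 3 has an empty city.
csv_text = "name,city,age\nalice,Paris,30\nbob, ,?\ncarol,,25\n"

# Treat " " and "?" as missing, in addition to pandas' default NA tokens.
df = pd.read_csv(io.StringIO(csv_text), na_values=[" ", "?"])

print(df.isna())
```

This only helps when the data is loaded through pd.read_csv; other sources need a different approach.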

jinglinpeng avatar Mar 17 '22 04:03 jinglinpeng

Thank you for your reply. Sure I can, but my data comes from multiple sources, such as parquet, hive, and so on. Would it be possible for dataprep itself to let users configure which values count as missing?

Bowen0729 avatar Mar 17 '22 07:03 Bowen0729

Thanks for the reply! I now understand the use case. Yeah, that's definitely something useful. I'm considering what would be an efficient way to do this, as dataprep only processes dask or pandas dataframes. Are you using df.replace as your current solution?

jinglinpeng avatar Mar 17 '22 19:03 jinglinpeng

We use dataprep in a big data ecosystem. I previously contributed the doc for dataprep on YARN (https://github.com/sfu-db/dataprep/issues/771), but Pandas and Dask don't support some data sources, such as Apache Hudi.

Therefore, on this basis, I made dataprep support Spark dataframes with Ray, which helps integrate dataprep into the big data ecosystem.

This is my use case. I could add a doc if necessary.

And I think df.replace could actually solve my problem without modifying dataprep. I will do some tests, thank you.
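
A minimal sketch of the df.replace workaround, assuming a pandas dataframe. The helper name mark_missing and the sample data are hypothetical and not part of dataprep.

```python
import numpy as np
import pandas as pd


def mark_missing(df, tokens=("", " ")):
    """Return a copy of df with the given string tokens replaced by NaN."""
    return df.replace(list(tokens), np.nan)


# Hypothetical sample data with "" and " " standing in for missing values.
df = pd.DataFrame(
    {
        "name": ["alice", "", "carol"],
        "city": ["Paris", " ", "Lima"],
        "age": [30, 41, 25],
    }
)

cleaned = mark_missing(df)
print(cleaned.isna().sum())
```

The cleaned dataframe can then be passed to dataprep as usual, so the missing-value tokens are handled once regardless of where the data came from.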

Bowen0729 avatar Mar 18 '22 01:03 Bowen0729

@Bowen0729 Thanks for the info. May I know how you made dataprep support Spark dataframes with Ray? Did you modify the internal code?

jinglinpeng avatar Mar 18 '22 02:03 jinglinpeng

No, I didn't modify the internal code. I just used raydp (Spark on Ray, https://github.com/oap-project/raydp) to read a Spark dataframe, and then transformed the Spark dataframe into a Dask dataframe with the Ray API. It is simple.

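import ray
import raydp
from dataprep.eda import create_report

ray.init()
spark = raydp.init_spark()  # pass your cluster settings (app name, executors, memory) here
spark_df = spark.sql("")  # put your SQL query here
ray_df = ray.data.from_spark(spark_df)  # Spark dataframe -> Ray dataset
dask_df = ray_df.to_dask()  # Ray dataset -> Dask dataframe
create_report(dask_df)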

Bowen0729 avatar Mar 18 '22 02:03 Bowen0729

I see. Good to know this use case.

jinglinpeng avatar Mar 18 '22 02:03 jinglinpeng

Is it necessary to add a doc for this case?

Bowen0729 avatar Mar 18 '22 02:03 Bowen0729

Yeah, I think it would be nice to have a use case doc. If you would like to contribute, you can add a notebook named use_case.ipynb in https://github.com/sfu-db/dataprep/tree/develop/docs/source/user_guide/eda, where you can write down this use case :)

jinglinpeng avatar Mar 18 '22 03:03 jinglinpeng

@Bowen0729 let me know if you would like to create that doc together. I've got a Spark dataframe I could test this on, or I can help write functions to support this feature.

datatalking avatar Jun 20 '22 17:06 datatalking

Sure. The commit hasn't been merged yet. Is there anything wrong? @jinglinpeng

Bowen0729 avatar Jun 21 '22 12:06 Bowen0729

Hi @Bowen0729, thanks for the reminder, I've merged the PR. There are some problems in the doc-build workflow, and we're fixing them.

jinglinpeng avatar Jun 26 '22 02:06 jinglinpeng

Thanks! And what's your plan? Do you have some good ideas about this feature? I'd love to do it together @datatalking

Bowen0729 avatar Jun 26 '22 13:06 Bowen0729

@Bowen0729 I'm pretty new to the repo and still learning how things work. Is there a list, or should we start one in Discussions (https://github.com/sfu-db/dataprep/discussions) or perhaps in Projects (https://github.com/sfu-db/dataprep/projects?type=beta)? We can build on what was already done in the Titanic and house price use cases.

datatalking avatar Jun 28 '22 06:06 datatalking