dataprep: Can I define which characters should be treated as missing values?

In some cases I treat "" or " " as a missing value. Can I define which characters should count as missing values?

Mar 16 '22 08:03 Bowen0729

Hi @Bowen0729, you can set the na_values parameter when you use pd.read_csv to load the dataframe.
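A minimal sketch of that suggestion, assuming the data arrives as CSV. The column names and the markers "?" and "unknown" are made up for illustration; only pd.read_csv and its na_values parameter come from the reply above.

```python
import io

import pandas as pd

# Hypothetical CSV content; "?" and "unknown" are illustrative missing-value markers.
csv_text = "name,city\nalice,?\nbob,unknown\ncarol,Paris\n"

# na_values lists extra strings that pandas should parse as NaN,
# in addition to its built-in defaults.
df = pd.read_csv(io.StringIO(csv_text), na_values=["?", "unknown"])

# Count how many cells in the "city" column were parsed as missing.
missing_cities = int(df["city"].isna().sum())
print(missing_cities)
```

This only helps when the data is read through a pandas reader that accepts na_values.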

Mar 17 '22 04:03 jinglinpeng

Thank you for your reply. Sure I can, but my data comes from multiple sources, such as Parquet, Hive, and so on. I'm wondering whether dataprep itself could let me configure which values count as missing?

Mar 17 '22 07:03 Bowen0729

Thanks for the reply! I now understand the use case. Yeah, that's definitely something useful. I'm thinking about an efficient way to do this, as dataprep only processes Dask or pandas dataframes. Are you using df.replace as your current solution?

Mar 17 '22 19:03 jinglinpeng

We use dataprep in a big data ecosystem. I previously contributed the doc on running dataprep on YARN (https://github.com/sfu-db/dataprep/issues/771), but pandas and Dask don't support some data sources, such as Apache Hudi.

So, building on that, I made dataprep work with Spark dataframes through Ray, which helps dataprep integrate into the big data ecosystem.

This is my use case; I could add a doc for it if necessary.

And I think df.replace could actually solve my problem without modifying dataprep. I will run some tests, thank you.
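A rough sketch of the df.replace approach, assuming the data is already in a pandas dataframe regardless of where it was loaded from. The sample data and the missing_markers list are hypothetical; the markers "" and " " are the ones mentioned in the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which "" and " " stand for missing values.
df = pd.DataFrame(
    {
        "name": ["alice", "", "carol"],
        "city": [" ", "Paris", "Rome"],
    }
)

# Markers to treat as missing; adjust this list per data source.
missing_markers = ["", " "]

# Replace every exact match of a marker with NaN so that
# downstream tools see it as a missing value.
cleaned = df.replace(missing_markers, np.nan)

missing_name = int(cleaned["name"].isna().sum())
missing_city = int(cleaned["city"].isna().sum())
print(missing_name, missing_city)
```

Because the replacement runs on the dataframe rather than in the reader, it should not matter whether the data came from Parquet, Hive, or elsewhere. Dask dataframes offer a replace method as well, though I have not checked that it behaves identically here.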

Mar 18 '22 01:03 Bowen0729

@Bowen0729 Thanks for the info. May I know how you made dataprep support Spark dataframes with Ray? Did you modify the internal code?

Mar 18 '22 02:03 jinglinpeng

No, I didn't modify the internal code. I just used raydp (Spark on Ray, https://github.com/oap-project/raydp) to read a Spark dataframe, and then transformed the Spark dataframe into a Dask dataframe with the Ray API. It is simple.

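import ray
import raydp
from dataprep.eda import create_report

ray.init()
# Start Spark on Ray; the init_spark arguments (app name, executor settings) are omitted here.
spark = raydp.init_spark()
# The SQL query text is left empty in this snippet; put your own query here.
spark_df = spark.sql("")
# Convert: Spark dataframe -> Ray dataset -> Dask dataframe, then pass it to dataprep.
ray_df = ray.data.from_spark(spark_df)
dask_df = ray_df.to_dask()
create_report(dask_df)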

Mar 18 '22 02:03 Bowen0729

I see. Good to know this use case.

Mar 18 '22 02:03 jinglinpeng

Would it be worth adding a doc for this case?

Mar 18 '22 02:03 Bowen0729

Yeah, I think it would be nice to have a use case doc. If you would like to contribute, you can add a notebook named use_case.ipynb in https://github.com/sfu-db/dataprep/tree/develop/docs/source/user_guide/eda, where you can write down this use case :)

Mar 18 '22 03:03 jinglinpeng

@Bowen0729 let me know if you would like to create that doc together. I've got a Spark dataframe I could test it on, or I can help write functions to support this feature.

Jun 20 '22 17:06 datatalking

Sure. The commit hasn't been merged yet; is there anything wrong? @jinglinpeng

Jun 21 '22 12:06 Bowen0729

Hi @Bowen0729, thanks for the reminder, I've merged the PR. There are some problems in the doc-build workflow, and we're fixing them.

Jun 26 '22 02:06 jinglinpeng

Thanks! And what's your plan? Do you have some good ideas about this feature? I'd love to do it together @datatalking

Jun 26 '22 13:06 Bowen0729

@Bowen0729 I'm pretty new to the repo and still learning how to do things. Is there a list, or should we start one in Discussions (https://github.com/sfu-db/dataprep/discussions) or perhaps in Projects (https://github.com/sfu-db/dataprep/projects?type=beta)? We can build on what was already done in the Titanic and house price use cases.

Jun 28 '22 06:06 datatalking
