Potential Performance Issue: Slow read_csv() Function with pandas 2.0.0

Open TendouArisu opened this issue 1 year ago • 4 comments

Issue Description:

Hello. I have found a performance regression in the read_csv function in pandas versions below 2.0.1. Parts of this repository depend on pandas 2.0.0 (see environments/minimal_requires.txt), and some other dependencies require pandas below 2.0.1. I am not sure whether this pandas problem affects this repository. There are related discussions on the pandas GitHub, including #52546 and #52548. I also found that app.py and demos/data_process_loop/app.py use the affected API, and there may be more files that do.

Suggestion

I would recommend upgrading to pandas >= 2.0.1 or exploring other ways to improve read_csv performance. Any other workarounds or solutions would be greatly appreciated. Thank you!
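As a rough illustration of the suggestion above, the sketch below checks whether the installed pandas is older than 2.0.1. The helper names (parse_version, is_older_than, installed_pandas_version) are hypothetical and not part of this repository or of pandas; the 2.0.1 threshold is taken from the issue text, and importlib.metadata is assumed to be available (Python 3.8+).

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.0.3' into (2, 0, 3); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def is_older_than(installed, minimum="2.0.1"):
    """Return True if `installed` sorts before `minimum` (threshold from the issue)."""
    return parse_version(installed) < parse_version(minimum)


def installed_pandas_version():
    """Return the installed pandas version string, or None if pandas is absent."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_pandas_version()
    if version is None:
        print("pandas is not installed")
    elif is_older_than(version):
        print(f"pandas {version} is below 2.0.1; consider upgrading")
    else:
        print(f"pandas {version} is at or above 2.0.1")
```

This only compares version numbers; it does not measure read_csv itself, so timing a representative file before and after an upgrade would still be needed to confirm any difference.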

TendouArisu avatar Mar 02 '24 08:03 TendouArisu

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] avatar Mar 25 '24 09:03 github-actions[bot]

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] avatar Apr 16 '24 09:04 github-actions[bot]

Hi @TendouArisu, thanks for your attention and suggestions!

We ran a few experiments and confirmed what you reported. We had limited pandas to 2.0.0 mainly because:

  1. pandas >= 2.1.x and datasets==2.11.0 might raise a ValueError when exporting a dataset to a JSON file:
     ValueError: 'index=True' is only valid when 'orient' is 'split', 'table', 'index', or 'columns'.
  2. pandas >= 2.1.x requires Python >= 3.9, but we want to support Python 3.7/3.8 as well.
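To make the first point concrete, here is a minimal sketch of a guard that avoids passing index=True together with an orient the error message rejects. The helper to_json_kwargs is hypothetical; it is not how datasets or this repository handles the export, and the set of accepted orients is copied from the error message above rather than from the pandas source. It only covers the index=True case.

```python
# Orients for which index=True is accepted, copied from the error message above.
INDEX_TRUE_ORIENTS = {"split", "table", "index", "columns"}


def to_json_kwargs(orient, index=None):
    """Build keyword arguments for DataFrame.to_json.

    An index=True that the error message says would be rejected for this
    orient is dropped; every other value is passed through unchanged.
    """
    kwargs = {"orient": orient}
    if index is True and orient not in INDEX_TRUE_ORIENTS:
        # Omit the argument so the library default applies instead.
        return kwargs
    if index is not None:
        kwargs["index"] = index
    return kwargs
```

A caller could then write something like df.to_json(path, **to_json_kwargs("records", index=True)), where df and path are placeholders for a real DataFrame and output file.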

However, we found that pandas 2.0.1 through 2.0.3 avoid the read_csv slowdown and do not run into either of the two problems above. So we have updated pandas to 2.0.3 in the latest PR, #303.

Thanks again for your suggestion! Feel free to discuss with us if you have any further suggestions~

HYLcool avatar Apr 23 '24 11:04 HYLcool

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add a new comment, or this issue will be closed in 3 days.

github-actions[bot] avatar May 15 '24 09:05 github-actions[bot]

Close this stale issue.

github-actions[bot] avatar May 18 '24 09:05 github-actions[bot]