
modin on ray on aws glue performance disappoints

Open mooreniemi opened this issue 1 year ago • 11 comments

I'm not sure if this is a bug, operator error on my part, or something wrong on the AWS side (I can try to follow up there too).

I took the documented comparison test and ran it with AWS Glue example data, including a new section for Ray:

import os
import time

# note: engine selection must happen before modin.pandas is imported,
# otherwise it may not take effect
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray

import ray

ray.init("auto")

import modin.pandas as pd
import numpy as np
import pandas  # for comparisons

# note: depends on having s3fs==0.4.2
input_path = "s3://amazon-reviews-pds/parquet/product_category=Wireless/"

start = time.time()
pandas_df = pandas.read_parquet(input_path)
end = time.time()
pandas_duration = end - start
print("Time to read with pandas: {} seconds".format(round(pandas_duration, 3)))
pandas_df.head(1)

start = time.time()
df = pd.read_parquet(input_path)
end = time.time()
modin_duration = end - start
print("Time to read with modin: {} seconds".format(round(modin_duration, 3)))
df.head(1)


start = time.time()
ray_df = ray.data.read_parquet(input_path)
end = time.time()
ray_duration = end - start
print("Time to read with ray: {} seconds".format(round(ray_duration, 3)))
ray_df.show(1)

On this test, I saw:

Time to read with pandas: 60.291 seconds
Time to read with modin: 53.713 seconds
Time to read with ray: 6.197 seconds

On one of my datasets (not public), I did the same test and saw these timings:

Time to read with pandas: 16.89 seconds
Time to read with modin: 23.839 seconds
Time to read with ray: 2.548 seconds

I then noticed another suggested env var in the logs, so I added:

os.environ["__MODIN_AUTOIMPORT_PANDAS__"] = "1"

That's also in https://github.com/modin-project/modin/issues/5131 I guess.

Then I got these timings (on the public data set):

Time to read with pandas: 55.562 seconds
Time to read with modin: 26.646 seconds
Time to read with ray: 5.98 seconds

Is that about as good as I should expect?

This makes me suspect the AWS Glue Ray setup is not by itself adequate for Modin. Is there anything I can do to debug further?

mooreniemi avatar Jun 19 '23 02:06 mooreniemi

@mooreniemi thanks for opening the issue! Some of the read_parquet performance depends on a couple of things including how large the dataset is. Could you give some additional context?

  • What version of Modin are you running?
  • How many cores does your AWS machine have?
  • How large are the parquet files, how many row groups do they have (roughly; you can check this with parquet-tools, or see the pyarrow sketch below), and how many columns do they have?
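
For the row-group count, a minimal pyarrow sketch works too (the file path below is a placeholder):

import pyarrow.parquet as pq

# num_row_groups and num_columns come from the footer metadata, so this is cheap
pf = pq.ParquetFile("part-00000.snappy.parquet")  # placeholder path
print(pf.metadata.num_row_groups, "row groups,", pf.metadata.num_columns, "columns")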

pyrito avatar Jun 19 '23 14:06 pyrito

Thanks for the comment, @pyrito. @kukushking, can you tell why Modin is so much slower than Ray here?

mvashishtha avatar Jun 19 '23 15:06 mvashishtha

I'll share details below, but I'm especially curious how these factors change the comparison with Ray. I did notice that AWS worked on https://github.com/ray-project/ray/pull/23179 specifically to improve performance of read_parquet.

Latest version of Modin. AWS Glue for Ray scaling is a bit complicated. (Note: Ray on AWS is only up to 2.4.) It's measured in M-DPUs, which are a vCPU measure (that is, they count hyper-threads on a chip, not just physical cores).

Ray jobs currently have access to one worker type, Z.2X. The Z.2X worker maps to 2 M-DPUs (8 vCPUs, 64 GB of memory) and has 128 GB of disk space. By default, a Z.2X machine provides 8 Ray workers (one per vCPU). You can configure the number of vCPUs that are allocated to a Ray worker at the Ray worker level in your script by using the num_cpus key. For more information, see ray.init in the Ray documentation.

I can change how many workers I use.
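
To confirm what the cluster actually exposes, Ray's standard introspection calls can help (a quick sketch, run after connecting):

import ray

ray.init("auto")
# totals registered with the cluster (keys like 'CPU' and 'memory')
print(ray.cluster_resources())
# what is currently free
print(ray.available_resources())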

As for the data, s3://amazon-reviews-pds/parquet/product_category=Wireless/ is a public dataset.

88665a4a7a7f:~ amooren$ aws s3 ls --human s3://amazon-reviews-pds/parquet/product_category=Wireless/
2018-04-09 02:40:34  236.9 MiB part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:34  234.9 MiB part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:34  236.8 MiB part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:35  235.7 MiB part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:35  235.6 MiB part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:41  236.6 MiB part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:41  236.9 MiB part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:41  237.1 MiB part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:42  235.7 MiB part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
2018-04-09 02:40:42  235.9 MiB part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

2.3GiB total.

Using pqrs:

88665a4a7a7f:~ amooren$ pqrs rowcount /tmp/ds/*.parquet
File Name: /tmp/ds/part-00000-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 902936 rows
File Name: /tmp/ds/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 903814 rows
File Name: /tmp/ds/part-00002-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 904452 rows
File Name: /tmp/ds/part-00003-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 903561 rows
File Name: /tmp/ds/part-00004-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 903231 rows
File Name: /tmp/ds/part-00005-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 903310 rows
File Name: /tmp/ds/part-00006-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 904848 rows
File Name: /tmp/ds/part-00007-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 904240 rows
File Name: /tmp/ds/part-00008-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 903639 rows
File Name: /tmp/ds/part-00009-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet: 904218 rows

So roughly 9,000,000 rows.

And:

88665a4a7a7f:~ amooren$ pqrs head /tmp/ds/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
{marketplace: "US", customer_id: "34688639", review_id: "RY9F75PL8P3VE", product_id: "B00D7VRPLG", product_parent: "671909994", product_title: "Boilfish®,Smoked,Flip Cover,Card Holder,Wallet,Stand,PU Leather Case,Samsung Galaxy S3,S III,GT-i9300,Verizon,AT&T,T-Mobile,US Cellulan,Sprint", star_rating: 4, helpful_votes: 0, total_votes: 0, vine: "N", verified_purchase: "Y", review_headline: "Well made, well played", review_body: "Looking for a professional looking phone cover. Looks good, well made. Credit Card Holder and space for money. The stand works well but the magnet belt tends to get in the way.", review_date: 2014-04-10 +00:00, year: 2014}

And:

88665a4a7a7f:~ amooren$ pqrs schema /tmp/ds/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet
Metadata for file: /tmp/ds/part-00001-495c48e6-96d6-4650-aa65-3c36a3516ddd.c000.snappy.parquet

version: 1
num of rows: 903814
created by: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
metadata:
  org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"marketplace","type":"string","nullable":true,"metadata":{}},{"name":"customer_id","type":"string","nullable":true,"metadata":{}},{"name":"review_id","type":"string","nullable":true,"metadata":{}},{"name":"product_id","type":"string","nullable":true,"metadata":{}},{"name":"product_parent","type":"string","nullable":true,"metadata":{}},{"name":"product_title","type":"string","nullable":true,"metadata":{}},{"name":"star_rating","type":"integer","nullable":true,"metadata":{}},{"name":"helpful_votes","type":"integer","nullable":true,"metadata":{}},{"name":"total_votes","type":"integer","nullable":true,"metadata":{}},{"name":"vine","type":"string","nullable":true,"metadata":{}},{"name":"verified_purchase","type":"string","nullable":true,"metadata":{}},{"name":"review_headline","type":"string","nullable":true,"metadata":{}},{"name":"review_body","type":"string","nullable":true,"metadata":{}},{"name":"review_date","type":"date","nullable":true,"metadata":{}},{"name":"year","type":"integer","nullable":true,"metadata":{}}]}
message spark_schema {
  OPTIONAL BYTE_ARRAY marketplace (UTF8);
  OPTIONAL BYTE_ARRAY customer_id (UTF8);
  OPTIONAL BYTE_ARRAY review_id (UTF8);
  OPTIONAL BYTE_ARRAY product_id (UTF8);
  OPTIONAL BYTE_ARRAY product_parent (UTF8);
  OPTIONAL BYTE_ARRAY product_title (UTF8);
  OPTIONAL INT32 star_rating;
  OPTIONAL INT32 helpful_votes;
  OPTIONAL INT32 total_votes;
  OPTIONAL BYTE_ARRAY vine (UTF8);
  OPTIONAL BYTE_ARRAY verified_purchase (UTF8);
  OPTIONAL BYTE_ARRAY review_headline (UTF8);
  OPTIONAL BYTE_ARRAY review_body (UTF8);
  OPTIONAL INT32 review_date (DATE);
  OPTIONAL INT32 year;
}

mooreniemi avatar Jun 19 '23 19:06 mooreniemi

@mooreniemi thanks for the update! This was super useful. We took a deeper look into this, and it turns out the Ray dataset read just creates an Arrow dataset (you can see the docs here), so the actual parquet data isn't being materialized into the object store immediately. Now, if you want the ray_df to actually be usable for comparison with Modin and pandas, you should try ray_df.to_modin() or ray_df.to_pandas(), which seems to trigger materialization. Could you try this on your end and let us know how long it takes?
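
A sketch of a materialized comparison along those lines, reusing input_path from the script above (to_pandas may enforce a row limit on large datasets, so to_modin is used here):

start = time.time()
ray_df = ray.data.read_parquet(input_path)
materialized = ray_df.to_modin()  # forces the data into memory, like the pandas/Modin reads
end = time.time()
print("Time to read + materialize with ray: {} seconds".format(round(end - start, 3)))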

cc: @modin-project/modin-ray, if I misunderstood anything here, could folks chime in with a better explanation?

pyrito avatar Jun 20 '23 16:06 pyrito

Ok, that explains Ray's performance, and yes, I can try that. But irrespective of that, shouldn't I expect much better performance from Modin than pandas? (Not what I see.)

mooreniemi avatar Jun 20 '23 16:06 mooreniemi

@mooreniemi not necessarily. We haven't hyper-optimized our read_parquet implementation yet. At best, we parallelize reads across the N files and across their row groups (so the more of each, the better).

pyrito avatar Jun 21 '23 14:06 pyrito

Ok, is there a function you'd recommend I try to better demonstrate performance differences right now? I'd like to confirm some benefit on Glue.

mooreniemi avatar Jun 21 '23 14:06 mooreniemi

@mooreniemi Modin performs pretty well with applymap and apply.
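
For instance, a minimal sketch of the element-wise style of workload that tends to parallelize well, assuming the df from the earlier Modin read:

# element-wise transform; Modin can split this across partitions and workers
shouted = df[["review_headline", "review_body"]].applymap(lambda s: s.upper() if s else s)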

cc: @modin-project/modin-contributors

pyrito avatar Jun 21 '23 16:06 pyrito

Performance comparisons here are admittedly tricky.

Since Ray won't actually execute read_parquet or add_column until show, I grouped these operations together:

Time to apply with pandas: 61.675 seconds
Time to apply with modin: 39.899 seconds
Time to add_column/apply with ray: 92.79 seconds

I ran this just a couple of times and saw variance of +/- 5s for pandas, but more like +/- 18s for Modin.

Script run:

import os
import time

# note: these must be set before modin.pandas is imported to take effect
os.environ["__MODIN_AUTOIMPORT_PANDAS__"] = "1"
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray

import ray

# performs worse if I use address="auto"
ray.init("auto")

import modin.pandas as pd
import numpy as np
import pandas  # for comparisons

# note: depends on having s3fs==0.4.2
input_path = "s3://amazon-reviews-pds/parquet/product_category=Wireless/"

def upper(s):
    if s:
        return s.upper()
    else:
        return "EMPTY"

start = time.time()
pandas_df = pandas.read_parquet(input_path)
pandas_df['up'] = pandas_df['review_headline'].apply(upper)
pandas_df.head(1)
end = time.time()
pandas_duration = end - start
print("Time to apply with pandas: {} seconds".format(round(pandas_duration, 3)))

start = time.time()
df = pd.read_parquet(input_path)
df['up'] = df['review_headline'].apply(upper)
df.head(1)
end = time.time()
modin_duration = end - start
print("Time to apply with modin: {} seconds".format(round(modin_duration, 3)))

start = time.time()
ray_df = ray.data.read_parquet(input_path)
ray_df = ray_df.add_column('up', lambda df: df['review_headline'].apply(upper))
ray_df.show(1) # to trigger execution
end = time.time()
ray_duration = end - start
print("Time to add_column/apply with ray: {} seconds".format(round(ray_duration, 3)))

Just isolating the apply of pandas vs. Modin does make Modin look good (3.33 seconds vs. 0.145 seconds), but I don't think that's a fair comparison with Ray unless I trigger a show on Ray first (which still takes a terrible 22.282 seconds).

mooreniemi avatar Jun 22 '23 03:06 mooreniemi

Can you try read_parquet then show (so that reading the input files is presumably done), and then measure the time for apply and show for Ray?
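
A sketch of that measurement, reusing input_path and upper from the script above:

# read and show first, outside the timed region, so the input files are already consumed
ray_df = ray.data.read_parquet(input_path)
ray_df.show(1)

start = time.time()
ray_df = ray_df.add_column('up', lambda df: df['review_headline'].apply(upper))
ray_df.show(1)  # triggers execution of the apply only
print("Time to apply/show with ray: {} seconds".format(round(time.time() - start, 3)))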

Garra1980 avatar Jun 23 '23 17:06 Garra1980

A short summary:

  • "s3://amazon-reviews-pds/parquet/product_category=Wireless/" is no longer a public dataset

  • os.environ["__MODIN_AUTOIMPORT_PANDAS__"] = "1" - using this slows things down rather than speeding them up. More details.

  • I ran this just a couple of times and saw variance of +/- 5s for pandas, but more like +/- 18s for Modin.

    It looks very likely that this is the result of a double read from S3, but this is only a hypothesis, since I could not verify it even locally because the Amazon dataset is no longer public. At the moment, I think we need to write a draft implementation that transfers the read pieces directly to the workers instead of reading twice (a rough sketch follows below). This is similar to https://github.com/modin-project/modin/issues/4770
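
A rough sketch of that idea (not Modin's actual implementation; read_part and file_paths are hypothetical names):

import ray
import pyarrow.parquet as pq

@ray.remote
def read_part(path):
    # each worker reads its own piece once; nothing is re-read on the driver
    return pq.read_table(path)

file_paths = ["part-00000.snappy.parquet"]  # placeholder list of parquet part files
tables = ray.get([read_part.remote(p) for p in file_paths])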

anmyachev avatar Jan 08 '24 16:01 anmyachev