NVTabular
Update the `Categorify` operator to set the domain max correctly
Goal
Reduce the resulting `int_domain.max` property by one on a `ColumnSchema` after transforming with `Categorify`, so that it matches the maximum encoded value actually present in the data.
Motivation / Context
This PR was motivated by work on https://github.com/NVIDIA-Merlin/Merlin/issues/479
We use the `domain.max` value to compute the vocab size / cardinality when creating embedding tables in Merlin Models. The current off-by-one error causes confusion when constructing embedding tables of the correct shape from pretrained embedding data.
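As a rough illustration (plain Python only; the variable names are hypothetical and this is not Merlin Models' actual code), this is where the off-by-one shows up when the embedding table size is derived from the schema:

int_domain_max = 3        # value Categorify currently records for the example below
true_max_encoded_id = 2   # largest integer actually present after the transform

# One embedding row per possible encoded value, including 0 for out-of-vocabulary:
num_embeddings = int_domain_max + 1   # 4 today, i.e. one unused row
# After this change int_domain.max becomes 2, and max + 1 gives the true cardinality of 3.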
Example
import uuid
import pandas as pd
from merlin.io import Dataset
from nvtabular import Workflow
from nvtabular.ops import Categorify
df = pd.DataFrame({"id": [str(uuid.uuid4()) for _ in range(2)]})
dataset = Dataset(df)
dataset
id
0 fc5f18c4-919f-4496-9209-1ae34aa4230d
1 738f873f-5fa7-4345-9daa-b1f714c9f1aa
dataset.schema
name tags dtype is_list is_ragged
0 id () object False False
After the Categorify op, these ids are transformed to integers {1, 2} with 0 reserved for out-of-vocabulary. So we have a cardinality of 3 (including the zero).
workflow = Workflow(["id"] >> Categorify())
transformed_dataset = workflow.fit_transform(dataset)
transformed_dataset
id
0 2
1 1
transformed_dataset.schema
name tags dtype is_list is_ragged properties.num_buckets \
0 id (Tags.CATEGORICAL) int64 False False None
properties.freq_threshold properties.max_size properties.start_index \
0 0 0 0
properties.cat_path properties.domain.min \
0 .//categories/unique.id.parquet 0
properties.domain.max properties.domain.name \
0 3 id
properties.embedding_sizes.cardinality \
0 3
properties.embedding_sizes.dimension
0 16
However, with the current implementation the `int_domain.max` value after the transform in this example is 3, which is the same as the cardinality. The actual maximum integer value in the transformed column is one less than the cardinality, i.e. 2.
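To make the expected behaviour concrete, here is a short sketch of checking the recorded domain after running the workflow above (assuming the `Schema` of the transformed dataset can be indexed by column name, as written here):

col_schema = transformed_dataset.schema["id"]

print(col_schema.int_domain.max)
# Current behaviour: 3 (equal to the cardinality).
# With this change:  2 (the largest encoded value), so that
# col_schema.int_domain.max + 1 recovers the cardinality of 3.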
CI Results
GitHub pull request #1641 of commit 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4, no merge conflicts. Running as SYSTEM Setting status of 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4615/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4^{commit} # timeout=10 Checking out Revision 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10 Commit message: "Update `Categorify` operator to set the domain max correctly" > git rev-list --no-walk c2a5b743c7a0b458be7af4ca96da091887a044b9 # timeout=10 First time build. Skipping changelog. [nvtabular_tests] $ /bin/bash /tmp/jenkins14816013642204511087.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1432 itemstests/unit/test_dask_nvt.py ..........................F..F..........FF.F [ 3%] ...F..............................................................FFF... [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_s3.py FF [ 8%] tests/unit/test_tf4rec.py . [ 9%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ................................ [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 21%] ........................................s.. [ 24%] tests/unit/loader/test_torch_dataloader.py ............................. [ 26%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ 
[ 37%] ........................................... [ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py FF [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. [ 66%] .................... [ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]
=================================== FAILURES =================================== ____ test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1] ____
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = 'device', on_host = True, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:30,251 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-1852321342b43b35c1c4d664628b409a', 0) Function: subgraph_callable-11603efe-dc29-4a19-8a57-01e58196 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr26/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
___ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1] ___
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = 'device', on_host = True, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:32,258 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-7dbace1faf4aac58bf4b5a9808158f3e', 1) Function: subgraph_callable-ddf3578b-55e1-4bac-ad05-4a5c5fa7 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr29/processed/part_1.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
_______ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1] ________
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv' cat_cache = 'device', on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:38,567 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-60a11de382c15140e4b6043c3fd932a0', 0) Function: subgraph_callable-5674b3d5-3fa9-4e20-8649-a7fb3f72 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr40/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = 'device', on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:39,541 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-c58c6d8297d1363e166bbae0ba2b7cbc', 0) Function: subgraph_callable-d0c1022a-5b14-4dfd-96bc-4d734a7f args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
_________ test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1] __________
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:41,172 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4bd0270b5198abc9c84600ff73978b63', 0) Function: subgraph_callable-f60f3e2c-6e85-43cb-b01b-c5c57c37 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr43/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
___ test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1] ____
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = None, on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:18:43,498 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4b353e97dab5f3f10917661932e5e9dc', 0) Function: subgraph_callable-6a1bdeef-16c0-4d80-92c5-33cc15c9 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_workflow_api_dlrm_Tr47/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
___________________ test_dask_preproc_cpu[True-None-parquet] ___________________
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} engine = 'parquet', shuffle = None, cpu = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [None, True]) def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) if engine in ("parquet", "csv"): dataset = Dataset(paths, part_size="1MB", cpu=cpu) else: dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu) # Simple transform (normalize) cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] conts = cont_names >> ops.FillMissing() >> ops.Normalize() workflow = Workflow(conts + cat_names + label_name) transformed = workflow.fit_transform(dataset) # Write out dataset output_path = os.path.join(tmpdir, "processed") transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4) # Check the final result
df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()
tests/unit/test_dask_nvt.py:277:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( 2022-08-09 14:19:24,201 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 14) Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( 2022-08-09 14:19:24,204 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 13) Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:24,205 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-a7389f08cd92af659ecf786a270fd236', 15) Function: subgraph_callable-982cf7f6-ffd8-476b-8b05-9acf576e args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
_____________________ test_dask_preproc_cpu[True-None-csv] _____________________
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} engine = 'csv', shuffle = None, cpu = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [None, True]) def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) if engine in ("parquet", "csv"): dataset = Dataset(paths, part_size="1MB", cpu=cpu) else: dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu) # Simple transform (normalize) cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] conts = cont_names >> ops.FillMissing() >> ops.Normalize() workflow = Workflow(conts + cat_names + label_name) transformed = workflow.fit_transform(dataset) # Write out dataset output_path = os.path.join(tmpdir, "processed") transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4) # Check the final result
df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()
tests/unit/test_dask_nvt.py:277:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:19:25,124 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 13) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,128 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 2) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 12) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,129 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 14) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,135 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 0) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,136 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 1) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_0.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 11) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,137 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 10) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_2.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,138 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-6517feaf40a419ca65ce24008b83ee39', 15) Function: subgraph_callable-ab2d714b-24f8-4ab0-bbf4-3c7ea6f0 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non1/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
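Every Compute Failed entry above points at one of the freshly written part_*.parquet files, and pyarrow's complaint is specifically about the footer magic. A well-formed Parquet file begins and ends with the 4-byte magic PAR1, so a byte-level look at the written parts usually separates truncated writes from reader problems. The sketch below is a diagnostic aid only, not part of the test suite; the processed/part_*.parquet glob is an assumption based on the tmpdir paths in the log:

import glob
import os

# Diagnostic sketch: a valid Parquet file starts and ends with b"PAR1".
# The glob mirrors the tmpdir layout shown in the warnings above.
for path in sorted(glob.glob("processed/part_*.parquet")):
    size = os.path.getsize(path)
    if size < 8:
        print(path, f"only {size} bytes - looks like a truncated write")
        continue
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)
        tail = f.read(4)
    status = "ok" if head == tail == b"PAR1" else "missing PAR1 magic"
    print(path, status)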
________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________
client = <Client: 'tcp://127.0.0.1:35395' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} engine = 'csv-no-header', shuffle = None, cpu = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [None, True]) def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) if engine in ("parquet", "csv"): dataset = Dataset(paths, part_size="1MB", cpu=cpu) else: dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu) # Simple transform (normalize) cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] conts = cont_names >> ops.FillMissing() >> ops.Normalize() workflow = Workflow(conts + cat_names + label_name) transformed = workflow.fit_transform(dataset) # Write out dataset output_path = os.path.join(tmpdir, "processed") transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4) # Check the final result
df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()
tests/unit/test_dask_nvt.py:277:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute
    results = schedule(dsk, keys, **kwargs)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get
    results = self.gather(packed, asynchronous=asynchronous, direct=direct)
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather
    return self.sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync
    return sync(
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync
    raise exc.with_traceback(tb)
/usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f
    result = yield future
/usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run
    value = future.result()
/usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather
    raise exception.with_traceback(traceback)
/usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in __call__
    return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args)))
/usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get
    result = _execute_task(task, cache)
/usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task
    return func(*(_execute_task(a, cache) for a in args))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in __call__
    return read_parquet_part(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part
    dfs = [
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in <listcomp>
    func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw))
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition
    arrow_table = cls._read_table(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table
    arrow_table = _read_table_from_path(
/usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path
    return pq.ParquetFile(fil).read_row_groups(
/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in __init__
    self.reader.open(
pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open
    ???
    ???
E   pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid
----------------------------- Captured stderr call -----------------------------
2022-08-09 14:19:25,811 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 22) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,813 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 20) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,815 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 21) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 16) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,816 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 18) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,818 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 17) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,824 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 19) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_4.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
--------------------------- Captured stderr teardown ---------------------------
2022-08-09 14:19:25,837 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 26) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,842 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 28) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,845 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 24) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,847 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 23) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_5.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,849 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 27) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,852 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 25) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_6.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,864 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 30) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,868 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 29) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:19:25,870 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-4ecc968b0fcd42342c0f06c652e20bd6', 31) Function: subgraph_callable-cad5f648-f300-4fa5-8a1d-c0aa58a5 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-15/test_dask_preproc_cpu_True_Non2/processed/part_7.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
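As with the previous parameterization, every failing task here is a read of a part_N.parquet file that to_parquet(..., out_files_per_proc=4) just wrote with cpu=True, which makes the write path on the CPU backend the prime suspect rather than the reader. Complementing the byte-level check sketched earlier, reading each part on its own shows which files actually break the dd_read_parquet call; again a hedged diagnostic sketch, with the glob assumed from the log paths:

import glob

import dask.dataframe as dd

# Read each written part individually so one corrupted file does not mask the rest.
for path in sorted(glob.glob("processed/part_*.parquet")):
    try:
        n_rows = len(dd.read_parquet(path, engine="pyarrow").compute())
        print(path, f"ok ({n_rows} rows)")
    except Exception as exc:  # broad catch is acceptable for a one-off diagnostic
        print(path, "failed:", exc)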
___________________________ test_s3_dataset[parquet] ___________________________
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try:
conn = connection.create_connection(
(self._dns_host, self.port), self.timeout, **extra_kw )
/usr/lib/python3/dist-packages/urllib3/connection.py:159:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address) sock.connect(sa) return sock except socket.error as e: err = e if sock is not None: sock.close() sock = None if err is not None:
raise err
/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address)
sock.connect(sa)
E ConnectionRefusedError: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError
During handling of the above exception, another exception occurred:
self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url)
urllib_response = conn.urlopen(
method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), )
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7f6186e61220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object. httplib_response = self._make_request( conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, ) # If we're going to release the connection in ``finally:``, then # the response doesn't need to know about the connection. Otherwise # it will also try to release it and we'll have a double-release # mess. response_conn = conn if not release_conn else None # Pass method to Response for length checking response_kw["request_method"] = method # Import httplib's response into our own wrapper object response = self.ResponseCls.from_httplib( httplib_response, pool=self, connection=response_conn, retries=retries, **response_kw ) # Everything went great! clean_exit = True except queue.Empty: # Timed out by queue. raise EmptyPoolError(self, "No pool connections are available.") except ( TimeoutError, HTTPException, SocketError, ProtocolError, BaseSSLError, SSLError, CertificateError, ) as e: # Discard the connection for these exceptions. It will be # replaced during the next _get_conn() call. clean_exit = False if isinstance(e, (BaseSSLError, CertificateError)): e = SSLError(e) elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy: e = ProxyError("Cannot connect to proxy.", e) elif isinstance(e, (SocketError, HTTPException)): e = ProtocolError("Connection aborted.", e)
retries = retries.increment(
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:
self = Retry(total=False, connect=None, read=None, redirect=0, status=None) method = 'PUT', url = '/parquet', response = None error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused') _pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20> _stacktrace = <traceback object at 0x7f60989d9c40>
def increment( self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None, ): """ Return a new Retry object with incremented retry counters. :param response: A response object, or None, if the server did not return a response. :type response: :class:`~urllib3.response.HTTPResponse` :param Exception error: An error encountered during the request, or None if the response was received successfully. :return: A new ``Retry`` object. """ if self.total is False and error: # Disabled, indicate to re-raise the error.
raise six.reraise(type(error), error, _stacktrace)
/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None
def reraise(tp, value, tb=None): try: if value is None: value = tp() if value.__traceback__ is not tb: raise value.with_traceback(tb)
raise value
../../../.local/lib/python3.8/site-packages/six.py:703:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7f6186e61220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object.
httplib_response = self._make_request(
conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f609b6e6e20> conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> method = 'PUT', url = '/parquet' timeout = <urllib3.util.timeout.Timeout object at 0x7f609b4c3580> chunked = False httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}} timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f609aa49100>
def _make_request( self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw ): """ Perform a request on a given urllib connection object taken from our pool. :param conn: a connection from one of our connection pools :param timeout: Socket timeout in seconds for the request. This can be a float or integer, which will set the same timeout value for the socket connect and the socket read, or an instance of :class:`urllib3.util.Timeout`, which gives you more fine-grained control over your timeouts. """ self.num_requests += 1 timeout_obj = self._get_timeout(timeout) timeout_obj.start_connect() conn.timeout = timeout_obj.connect_timeout # Trigger any extra validation we need to do. try: self._validate_conn(conn) except (SocketTimeout, BaseSSLError) as e: # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout. self._raise_timeout(err=e, url=url, timeout_value=conn.timeout) raise # conn.request() calls httplib.*.request, not the method in # urllib3.request. It also calls makefile (recv) on the socket. if chunked: conn.request_chunked(method, url, **httplib_request_kw) else:
conn.request(method, url, **httplib_request_kw)
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
def request(self, method, url, body=None, headers={}, *, encode_chunked=False): """Send a complete request to the server."""
self._send_request(method, url, body, headers, encode_chunked)
/usr/lib/python3.8/http/client.py:1256:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} args = (False,), kwargs = {}
def _send_request(self, method, url, body, headers, *args, **kwargs): self._response_received = False if headers.get('Expect', b'') == b'100-continue': self._expect_header_set = True else: self._expect_header_set = False self.response_class = self._original_response_cls
rval = super()._send_request(
method, url, body, headers, *args, **kwargs )
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} encode_chunked = False
def _send_request(self, method, url, body, headers, encode_chunked): # Honor explicitly requested Host: and Accept-Encoding: headers. header_names = frozenset(k.lower() for k in headers) skips = {} if 'host' in header_names: skips['skip_host'] = 1 if 'accept-encoding' in header_names: skips['skip_accept_encoding'] = 1 self.putrequest(method, url, **skips) # chunked encoding will happen if HTTP/1.1 is used and either # the caller passes encode_chunked=True or the following # conditions hold: # 1. content-length has not been explicitly set # 2. the body is a file or iterable, but not a str or bytes-like # 3. Transfer-Encoding has NOT been explicitly set by the caller if 'content-length' not in header_names: # only chunk body if not explicitly set for backwards # compatibility, assuming the client code is already handling the # chunking if 'transfer-encoding' not in header_names: # if content-length cannot be automatically determined, fall # back to chunked encoding encode_chunked = False content_length = self._get_content_length(body, method) if content_length is None: if body is not None: if self.debuglevel > 0: print('Unable to determine size of %r' % body) encode_chunked = True self.putheader('Transfer-Encoding', 'chunked') else: self.putheader('Content-Length', str(content_length)) else: encode_chunked = False for hdr, value in headers.items(): self.putheader(hdr, value) if isinstance(body, str): # RFC 2616 Section 3.7.1 says that text default has a # default charset of iso-8859-1. body = _encode(body, 'body')
self.endheaders(body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1302:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> message_body = None
def endheaders(self, message_body=None, *, encode_chunked=False): """Indicate that the last header line has been sent to the server. This method sends the request to the server. The optional message_body argument can be used to pass a message body associated with the request. """ if self.__state == _CS_REQ_STARTED: self.__state = _CS_REQ_SENT else: raise CannotSendHeader()
self._send_output(message_body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1251:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> message_body = None, args = (), kwargs = {'encode_chunked': False} msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def _send_output(self, message_body=None, *args, **kwargs): self._buffer.extend((b"", b"")) msg = self._convert_to_bytes(self._buffer) del self._buffer[:] # If msg and message_body are sent in a single send() call, # it will avoid performance problems caused by the interaction # between delayed ack and the Nagle algorithm. if isinstance(message_body, bytes): msg += message_body message_body = None
self.send(msg)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, str): if self._response_received: logger.debug( "send() called, but reseponse already received. " "Not sending data." ) return
return super().send(str)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0> data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 9063600a-b349-4012-a1b6-e82a82b2bbd1\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, data): """Send `data' to the server. ``data`` can be a string object, a bytes object, an array object, a file-like object that supports a .read() method, or an iterable object. """ if self.sock is None: if self.auto_open:
self.connect()
/usr/lib/python3.8/http/client.py:951:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
def connect(self):
conn = self._new_conn()
/usr/lib/python3/dist-packages/urllib3/connection.py:187:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try: conn = connection.create_connection( (self._dns_host, self.port), self.timeout, **extra_kw ) except SocketTimeout: raise ConnectTimeoutError( self, "Connection to %s timed out. (connect timeout=%s)" % (self.host, self.timeout), ) except SocketError as e:
raise NewConnectionError(
self, "Failed to establish a new connection: %s" % e )
E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f609aa49dc0>: Failed to establish a new connection: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
During handling of the above exception, another exception occurred:
s3_base = 'http://127.0.0.1:5000/' s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}} paths = ['/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-15/parquet0/dataset-1.parquet'] datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} engine = 'parquet' df = name-cat name-string id label x y 0 Ingrid Hannah 1031 999 -0.076963 0.314008 ...la 1062 1029 0.995636 0.555042 4320 Charlie Dan 992 976 -0.958343 0.245327
[4321 rows x 6 columns] patch_aiobotocore = None
@pytest.mark.parametrize("engine", ["parquet", "csv"]) def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore): # Copy files to mock s3 bucket files = {} for i, path in enumerate(paths): with open(path, "rb") as f: fbytes = f.read() fn = path.split(os.path.sep)[-1] files[fn] = BytesIO() files[fn].write(fbytes) files[fn].seek(0) if engine == "parquet": # Workaround for nvt#539. In order to avoid the # bug in Dask's `create_metadata_file`, we need # to manually generate a "_metadata" file here. # This can be removed after dask#7295 is merged # (see https://github.com/dask/dask/pull/7295) fn = "_metadata" files[fn] = BytesIO() meta = create_metadata_file( paths, engine="pyarrow", out_dir=False, ) meta.write_metadata_file(files[fn]) files[fn].seek(0)
with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:
tests/unit/test_s3.py:97:
/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)
self = <botocore.httpsession.URLLib3Session object at 0x7f6098bd2d00> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'9063600a-b349-4012-a1b6-e82a82b2bbd1', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url) urllib_response = conn.urlopen( method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), ) http_response = botocore.awsrequest.AWSResponse( request.url, urllib_response.status, urllib_response.headers, urllib_response, ) if not request.stream_output: # Cause the raw stream to be exhausted immediately. We do it # this way instead of using preload_content because # preload_content will never buffer chunked responses http_response.content return http_response except URLLib3SSLError as e: raise SSLError(endpoint_url=request.url, error=e) except (NewConnectionError, socket.gaierror) as e:
raise EndpointConnectionError(endpoint_url=request.url, error=e)
E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
  File "/usr/local/bin/moto_server", line 5, in <module>
    from moto.server import main
  File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
    from moto.moto_server.werkzeug_app import (
  File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
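The captured setup stderr above appears to be the real root cause of the test_s3_dataset failures: the moto_server process that should mock S3 on 127.0.0.1:5000 dies at import time, most likely a Flask/Werkzeug version mismatch (the installed Flask passes unbound_message to a LocalProxy that does not accept it), so every subsequent create_bucket call is refused. A trivial pre-flight check along these lines (a sketch, not part of the suite) makes that failure mode obvious before boto3 burns through its retries:

import socket

# Hedged pre-flight check: is anything listening on the mocked S3 endpoint?
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(1.0)
    result = sock.connect_ex(("127.0.0.1", 5000))
print("moto endpoint reachable" if result == 0 else f"moto endpoint not reachable (errno {result})")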
_____________________________ test_s3_dataset[csv] _____________________________
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try:
conn = connection.create_connection(
(self._dns_host, self.port), self.timeout, **extra_kw )
/usr/lib/python3/dist-packages/urllib3/connection.py:159:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address) sock.connect(sa) return sock except socket.error as e: err = e if sock is not None: sock.close() sock = None if err is not None:
raise err
/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address)
sock.connect(sa)
E ConnectionRefusedError: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError
During handling of the above exception, another exception occurred:
self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url)
urllib_response = conn.urlopen(
method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), )
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7f6186e61220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object. httplib_response = self._make_request( conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, ) # If we're going to release the connection in ``finally:``, then # the response doesn't need to know about the connection. Otherwise # it will also try to release it and we'll have a double-release # mess. response_conn = conn if not release_conn else None # Pass method to Response for length checking response_kw["request_method"] = method # Import httplib's response into our own wrapper object response = self.ResponseCls.from_httplib( httplib_response, pool=self, connection=response_conn, retries=retries, **response_kw ) # Everything went great! clean_exit = True except queue.Empty: # Timed out by queue. raise EmptyPoolError(self, "No pool connections are available.") except ( TimeoutError, HTTPException, SocketError, ProtocolError, BaseSSLError, SSLError, CertificateError, ) as e: # Discard the connection for these exceptions. It will be # replaced during the next _get_conn() call. clean_exit = False if isinstance(e, (BaseSSLError, CertificateError)): e = SSLError(e) elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy: e = ProxyError("Cannot connect to proxy.", e) elif isinstance(e, (SocketError, HTTPException)): e = ProtocolError("Connection aborted.", e)
retries = retries.increment(
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:
self = Retry(total=False, connect=None, read=None, redirect=0, status=None) method = 'PUT', url = '/csv', response = None error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused') _pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040> _stacktrace = <traceback object at 0x7f609855e040>
def increment( self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None, ): """ Return a new Retry object with incremented retry counters. :param response: A response object, or None, if the server did not return a response. :type response: :class:`~urllib3.response.HTTPResponse` :param Exception error: An error encountered during the request, or None if the response was received successfully. :return: A new ``Retry`` object. """ if self.total is False and error: # Disabled, indicate to re-raise the error.
raise six.reraise(type(error), error, _stacktrace)
/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None
def reraise(tp, value, tb=None): try: if value is None: value = tp() if value.__traceback__ is not tb: raise value.with_traceback(tb)
raise value
../../../.local/lib/python3.8/site-packages/six.py:703:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7f6186e61220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object.
httplib_response = self._make_request(
conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7f60c054e040> conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> method = 'PUT', url = '/csv' timeout = <urllib3.util.timeout.Timeout object at 0x7f6098ce1970> chunked = False httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}} timeout_obj = <urllib3.util.timeout.Timeout object at 0x7f60c050eee0>
def _make_request( self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw ): """ Perform a request on a given urllib connection object taken from our pool. :param conn: a connection from one of our connection pools :param timeout: Socket timeout in seconds for the request. This can be a float or integer, which will set the same timeout value for the socket connect and the socket read, or an instance of :class:`urllib3.util.Timeout`, which gives you more fine-grained control over your timeouts. """ self.num_requests += 1 timeout_obj = self._get_timeout(timeout) timeout_obj.start_connect() conn.timeout = timeout_obj.connect_timeout # Trigger any extra validation we need to do. try: self._validate_conn(conn) except (SocketTimeout, BaseSSLError) as e: # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout. self._raise_timeout(err=e, url=url, timeout_value=conn.timeout) raise # conn.request() calls httplib.*.request, not the method in # urllib3.request. It also calls makefile (recv) on the socket. if chunked: conn.request_chunked(method, url, **httplib_request_kw) else:
conn.request(method, url, **httplib_request_kw)
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
def request(self, method, url, body=None, headers={}, *, encode_chunked=False): """Send a complete request to the server."""
self._send_request(method, url, body, headers, encode_chunked)
/usr/lib/python3.8/http/client.py:1256:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} args = (False,), kwargs = {}
def _send_request(self, method, url, body, headers, *args, **kwargs): self._response_received = False if headers.get('Expect', b'') == b'100-continue': self._expect_header_set = True else: self._expect_header_set = False self.response_class = self._original_response_cls
rval = super()._send_request(
method, url, body, headers, *args, **kwargs )
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} encode_chunked = False
def _send_request(self, method, url, body, headers, encode_chunked): # Honor explicitly requested Host: and Accept-Encoding: headers. header_names = frozenset(k.lower() for k in headers) skips = {} if 'host' in header_names: skips['skip_host'] = 1 if 'accept-encoding' in header_names: skips['skip_accept_encoding'] = 1 self.putrequest(method, url, **skips) # chunked encoding will happen if HTTP/1.1 is used and either # the caller passes encode_chunked=True or the following # conditions hold: # 1. content-length has not been explicitly set # 2. the body is a file or iterable, but not a str or bytes-like # 3. Transfer-Encoding has NOT been explicitly set by the caller if 'content-length' not in header_names: # only chunk body if not explicitly set for backwards # compatibility, assuming the client code is already handling the # chunking if 'transfer-encoding' not in header_names: # if content-length cannot be automatically determined, fall # back to chunked encoding encode_chunked = False content_length = self._get_content_length(body, method) if content_length is None: if body is not None: if self.debuglevel > 0: print('Unable to determine size of %r' % body) encode_chunked = True self.putheader('Transfer-Encoding', 'chunked') else: self.putheader('Content-Length', str(content_length)) else: encode_chunked = False for hdr, value in headers.items(): self.putheader(hdr, value) if isinstance(body, str): # RFC 2616 Section 3.7.1 says that text default has a # default charset of iso-8859-1. body = _encode(body, 'body')
self.endheaders(body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1302:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> message_body = None
def endheaders(self, message_body=None, *, encode_chunked=False): """Indicate that the last header line has been sent to the server. This method sends the request to the server. The optional message_body argument can be used to pass a message body associated with the request. """ if self.__state == _CS_REQ_STARTED: self.__state = _CS_REQ_SENT else: raise CannotSendHeader()
self._send_output(message_body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1251:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> message_body = None, args = (), kwargs = {'encode_chunked': False} msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def _send_output(self, message_body=None, *args, **kwargs): self._buffer.extend((b"", b"")) msg = self._convert_to_bytes(self._buffer) del self._buffer[:] # If msg and message_body are sent in a single send() call, # it will avoid performance problems caused by the interaction # between delayed ack and the Nagle algorithm. if isinstance(message_body, bytes): msg += message_body message_body = None
self.send(msg)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, str): if self._response_received: logger.debug( "send() called, but reseponse already received. " "Not sending data." ) return
return super().send(str)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0> data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: 2de35b00-cfdf-4f44-9946-8f466bfa6571\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, data): """Send `data' to the server. ``data`` can be a string object, a bytes object, an array object, a file-like object that supports a .read() method, or an iterable object. """ if self.sock is None: if self.auto_open:
self.connect()
/usr/lib/python3.8/http/client.py:951:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
def connect(self):
conn = self._new_conn()
/usr/lib/python3/dist-packages/urllib3/connection.py:187:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try: conn = connection.create_connection( (self._dns_host, self.port), self.timeout, **extra_kw ) except SocketTimeout: raise ConnectTimeoutError( self, "Connection to %s timed out. (connect timeout=%s)" % (self.host, self.timeout), ) except SocketError as e:
raise NewConnectionError(
self, "Failed to establish a new connection: %s" % e )
E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7f60c050e3a0>: Failed to establish a new connection: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
During handling of the above exception, another exception occurred:
s3_base = 'http://127.0.0.1:5000/' s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}} paths = ['/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-15/csv0/dataset-1.csv'] datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-15/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-15/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-15/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-15/parquet0')} engine = 'csv' df = name-string id label x y 0 Hannah 1031 999 -0.076963 0.314008 1 Sarah ... Ursula 1062 1029 0.995636 0.555042 2160 Dan 992 976 -0.958343 0.245327
[4321 rows x 5 columns] patch_aiobotocore = None
@pytest.mark.parametrize("engine", ["parquet", "csv"]) def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore): # Copy files to mock s3 bucket files = {} for i, path in enumerate(paths): with open(path, "rb") as f: fbytes = f.read() fn = path.split(os.path.sep)[-1] files[fn] = BytesIO() files[fn].write(fbytes) files[fn].seek(0) if engine == "parquet": # Workaround for nvt#539. In order to avoid the # bug in Dask's `create_metadata_file`, we need # to manually generate a "_metadata" file here. # This can be removed after dask#7295 is merged # (see https://github.com/dask/dask/pull/7295) fn = "_metadata" files[fn] = BytesIO() meta = create_metadata_file( paths, engine="pyarrow", out_dir=False, ) meta.write_metadata_file(files[fn]) files[fn].seek(0)
with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:
tests/unit/test_s3.py:97:
/usr/lib/python3.8/contextlib.py:113: in enter return next(self.gen) /usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context client.create_bucket(Bucket=bucket, ACL="public-read-write") /usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call return self._make_api_call(operation_name, kwargs) /usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call http, parsed_response = self._make_request( /usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request return self._endpoint.make_request(operation_model, request_dict) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request return self._send_request(request_dict, operation_model) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request while self._needs_retry( /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry responses = self._event_emitter.emit( /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit return self._emitter.emit(aliased_event_name, **kwargs) /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit return self._emit(event_name, kwargs) /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit response = handler(**kwargs) /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in call if self._checker(**checker_kwargs): /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in call should_retry = self._should_retry( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry return self._checker(attempt_number, response, caught_exception) /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in call checker_response = checker( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in call return self._check_caught_exception( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception raise caught_exception /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response http_response = self._send(request) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send return self.http_session.send(request)
self = <botocore.httpsession.URLLib3Session object at 0x7f609b7bbcd0> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'2de35b00-cfdf-4f44-9946-8f466bfa6571', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url) urllib_response = conn.urlopen( method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), ) http_response = botocore.awsrequest.AWSResponse( request.url, urllib_response.status, urllib_response.headers, urllib_response, ) if not request.stream_output: # Cause the raw stream to be exhausted immediately. We do it # this way instead of using preload_content because # preload_content will never buffer chunked responses http_response.content return http_response except URLLib3SSLError as e: raise SSLError(endpoint_url=request.url, error=e) except (NewConnectionError, socket.gaierror) as e:
raise EndpointConnectionError(endpoint_url=request.url, error=e)
E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
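
The test_s3_dataset failures above are connection errors (Errno 111, connection refused) against the mock S3 endpoint at http://127.0.0.1:5000 rather than assertion failures, so they point at the test environment rather than at this change. A small standalone snippet for checking whether anything is listening on that endpoint (host and port are taken from the log above; this is a local debugging aid, not part of the test suite):

import socket

host, port = "127.0.0.1", 5000
try:
    # Same kind of TCP connection botocore is attempting in the frames above.
    with socket.create_connection((host, port), timeout=5):
        print(f"something is listening on {host}:{port}")
except OSError as exc:
    # [Errno 111] Connection refused means the mock S3 server never came up.
    print(f"could not connect to {host}:{port}: {exc}")
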
_______________________ test_drop_low_cardinality[True] ________________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_True0') cpu = True
@pytest.mark.parametrize("cpu", _CPU) def test_drop_low_cardinality(tmpdir, cpu): df = pd.DataFrame() if not cpu: df = cudf.DataFrame(df) df["col1"] = ["a", "a", "a", "a", "a"] df["col2"] = ["a", "a", "a", "a", "b"] df["col3"] = ["a", "a", "b", "b", "c"] features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality() workflow = nvt.Workflow(features) transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
assert workflow.output_schema.column_names == ["col2", "col3"]
E   AssertionError: assert ['col3'] == ['col2', 'col3']
E     At index 0 diff: 'col3' != 'col2'
E     Right contains one more item: 'col3'
E     Full diff:
E     - ['col2', 'col3']
E     + ['col3']
tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
_______________________ test_drop_low_cardinality[False] _______________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_drop_low_cardinality_Fals0') cpu = False
@pytest.mark.parametrize("cpu", _CPU) def test_drop_low_cardinality(tmpdir, cpu): df = pd.DataFrame() if not cpu: df = cudf.DataFrame(df) df["col1"] = ["a", "a", "a", "a", "a"] df["col2"] = ["a", "a", "a", "a", "b"] df["col3"] = ["a", "a", "b", "b", "c"] features = list(df.columns) >> nvt.ops.Categorify() >> nvt.ops.DropLowCardinality() workflow = nvt.Workflow(features) transformed = workflow.fit_transform(nvt.Dataset(df)).to_ddf().compute()
assert workflow.output_schema.column_names == ["col2", "col3"]
E   AssertionError: assert ['col3'] == ['col2', 'col3']
E     At index 0 diff: 'col3' != 'col2'
E     Right contains one more item: 'col3'
E     Full diff:
E     - ['col2', 'col3']
E     + ['col3']
tests/unit/ops/test_drop_low_cardinality.py:45: AssertionError
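
The two test_drop_low_cardinality failures above are the assertions most directly tied to this change: with the corrected Categorify output schema, int_domain.max now reports the largest encoded id rather than the vocabulary size, so any check that reads int_domain.max as a cardinality sees a value one lower than before. A minimal sketch of the adjustment a consumer would make under the new semantics (categorify_cardinality is an illustrative helper name, not an NVTabular API):

def categorify_cardinality(col_schema) -> int:
    # With the corrected domain, int_domain.max is the largest encoded id,
    # so the number of distinct ids (including the reserved 0) is max + 1.
    return col_schema.int_domain.max + 1

In the failing test, col2's id space is {0, 1, 2} (two observed categories plus the reserved 0), so its int_domain.max drops from 3 to 2 under this change; a cardinality threshold written against the old value would now also drop col2, which is consistent with the assertion diff above. Either DropLowCardinality or the test expectation needs a matching update.
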
_____________________ test_cpu_workflow[True-True-parquet] _____________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0') df = name-cat name-string id label x y 0 Ingrid Hannah 1031 999 -0.076963 0.314008 ...la 1062 1029 0.995636 0.555042 4320 Charlie Dan 992 976 -0.958343 0.245327
[4321 rows x 6 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f5ff87cc1c0>, cpu = True engine = 'parquet', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
_______________________ test_cpu_workflow[True-True-csv] _______________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0') df = name-string id label x y 0 Hannah 1031 999 -0.076963 0.314008 1 Sarah ... Ursula 1062 1029 0.995636 0.555042 2160 Dan 992 976 -0.958343 0.245327
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f601464dbe0>, cpu = True engine = 'csv', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
__________________ test_cpu_workflow[True-True-csv-no-header] __________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1') df = name-string id label x y 0 Hannah 1031 999 -0.076963 0.314008 1 Sarah ... Ursula 1062 1029 0.995636 0.555042 2160 Dan 992 976 -0.958343 0.245327
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f60642f3100>, cpu = True engine = 'csv-no-header', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
____________________ test_cpu_workflow[True-False-parquet] _____________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0') df = name-cat name-string id label x y 0 Ingrid Hannah 1031 999 -0.076963 0.314008 ...la 1062 1029 0.995636 0.555042 4320 Charlie Dan 992 976 -0.958343 0.245327
[4321 rows x 6 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f6014783a30>, cpu = True engine = 'parquet', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
______________________ test_cpu_workflow[True-False-csv] _______________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0') df = name-string id label x y 0 Hannah 1031 999 -0.076963 0.314008 1 Sarah ... Ursula 1062 1029 0.995636 0.555042 2160 Dan 992 976 -0.958343 0.245327
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f6014644790>, cpu = True engine = 'csv', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
_________________ test_cpu_workflow[True-False-csv-no-header] __________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1') df = name-string id label x y 0 Hannah 1031 999 -0.076963 0.314008 1 Sarah ... Ursula 1062 1029 0.995636 0.555042 2160 Dan 992 976 -0.958343 0.245327
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7f6014744df0>, cpu = True engine = 'csv-no-header', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid
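
The test_cpu_workflow failures above all break in the same place: the workflow writes shuffled parquet output, and pyarrow then refuses to read it back because the parquet magic bytes are missing from the footer, which suggests the written part_*.parquet files are truncated or otherwise not valid parquet rather than anything schema-related. A quick standalone check of a suspect file (the path is copied from the first failure above; a valid parquet file begins and ends with the 4-byte magic b"PAR1"):

path = "/tmp/pytest-of-jenkins/pytest-15/test_cpu_workflow_True_True_pa0/part_0.parquet"

with open(path, "rb") as f:
    head = f.read(4)   # should be b"PAR1"
    f.seek(-4, 2)      # seek to the last four bytes (the footer magic)
    tail = f.read(4)   # should also be b"PAR1"

print(head, tail)
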
../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)
nvtabular/loader/init.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/init.py:19: DeprecationWarning: The
nvtabular.loader
module has moved tomerlin.models.loader
. Support for importing fromnvtabular.loader
is deprecated, and will be removed in a future version. Please update your imports to refer tomerlin.models.loader
. warnings.warn(tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(
tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(
tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(
tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(
tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(
tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)
tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(
tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-None-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv] - py...
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[True]
FAILED tests/unit/ops/test_drop_low_cardinality.py::test_drop_low_cardinality[False]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 19 failed, 1412 passed, 1 skipped, 617 warnings in 747.37s (0:12:27) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script  : #!/bin/bash
cd /var/jenkins_home/
CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins11395841751843227978.sh
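The CPU-mode failures above all share the same symptom: pyarrow rejects the part files written by the workflow with "Parquet magic bytes not found in footer". As a triage aid only (not part of this PR or its test suite), a written part file can be checked for the Parquet footer magic directly; the path below is a placeholder, not a file from these runs.

# Illustrative triage helper (assumption: not from this PR). A structurally
# valid Parquet file both starts and ends with the 4-byte magic b"PAR1".
from pathlib import Path

def has_parquet_magic(path: str) -> bool:
    data = Path(path).read_bytes()
    # 12 bytes is below the minimum size of any valid Parquet file
    return len(data) > 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

print(has_parquet_magic("processed/part_0.parquet"))  # placeholder path

If the magic bytes are missing, the file was truncated or never finalized during the write, which matches the ArrowInvalid errors reported by the failing tests.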
Click to view CI Results
GitHub pull request #1641 of commit 729eb88f3ebd2064c0eea2acb040ed23aa0e5191, no merge conflicts. Running as SYSTEM Setting status of 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4616/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 729eb88f3ebd2064c0eea2acb040ed23aa0e5191^{commit} # timeout=10 Checking out Revision 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 729eb88f3ebd2064c0eea2acb040ed23aa0e5191 # timeout=10 Commit message: "Update `DropLowCardinality` to handle changes to `Categorify` domain" > git rev-list --no-walk 25ee26cc2117bec7b2f5c1085a9b2fed77a140a4 # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins1109161135988901750.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1432 itemstests/unit/test_dask_nvt.py ............................F..F......F..F.. [ 3%] ..................................................................F.F... [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_s3.py FF [ 8%] tests/unit/test_tf4rec.py . [ 9%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ................................ [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 21%] ........................................s.. [ 24%] tests/unit/loader/test_torch_dataloader.py ............................. [ 26%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ [ 37%] ........................................... 
[ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py .. [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. [ 66%] .................... [ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py FFFFFF [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]
=================================== FAILURES ===================================
________ test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1] ________
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv' cat_cache = 'device', on_host = True, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:36:35,353 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-2ff0376c0374f06523b9f25395b72dfc', 1) Function: subgraph_callable-0d5ad759-7370-49ea-a9f7-33f00b22 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr28/processed/part_1.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
__________ test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1] __________
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv', cat_cache = None on_host = True, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:36:37,385 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-a5a644fc8c79cdf9ae2635ed2b300f6c', 1) Function: subgraph_callable-62f18cdd-3485-404a-8218-65bd48c6 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr31/processed/part_1.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
___ test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1] ____
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} freq_threshold = 0, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = 'device', on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:36:41,594 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-e664934af21c7d272636a2d73892785d', 0) Function: subgraph_callable-d16b9c8a-2683-4a79-84d1-534bcf89 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr38/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
__ test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1] ___
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} freq_threshold = 150, part_mem_fraction = 0.1, engine = 'csv-no-header' cat_cache = 'device', on_host = False, shuffle = None, cpu = True
@pytest.mark.parametrize("part_mem_fraction", [0.1]) @pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("freq_threshold", [0, 150]) @pytest.mark.parametrize("cat_cache", ["device", None]) @pytest.mark.parametrize("on_host", [True, False]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [True, False]) def test_dask_workflow_api_dlrm( client, tmpdir, datasets, freq_threshold, part_mem_fraction, engine, cat_cache, on_host, shuffle, cpu, ): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) paths = sorted(paths) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) df0 = df0.to_pandas() if cpu else df0 if engine == "parquet": cat_names = ["name-cat", "name-string"] else: cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] cats = cat_names >> ops.Categorify( freq_threshold=freq_threshold, out_path=str(tmpdir), cat_cache=cat_cache, on_host=on_host ) conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> ops.LogOp() workflow = Workflow(cats + conts + label_name) if engine in ("parquet", "csv"): dataset = Dataset(paths, cpu=cpu, part_mem_fraction=part_mem_fraction) else: dataset = Dataset(paths, cpu=cpu, names=allcols_csv, part_mem_fraction=part_mem_fraction) output_path = os.path.join(tmpdir, "processed") transformed = workflow.fit_transform(dataset) transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=1) result = transformed.to_ddf().compute() assert len(df0) == len(result) assert result["x"].min() == 0.0 assert result["x"].isna().sum() == 0 assert result["y"].min() == 0.0 assert result["y"].isna().sum() == 0 # Check categories. Need to sort first to make sure we are comparing # "apples to apples" expect = df0.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() got = result.sort_values(["label", "x", "y", "id"]).reset_index(drop=True).reset_index() dfm = expect.merge(got, on="index", how="inner")[["name-string_x", "name-string_y"]] dfm_gb = dfm.groupby(["name-string_x", "name-string_y"]).agg( {"name-string_x": "count", "name-string_y": "count"} ) if freq_threshold: dfm_gb = dfm_gb[dfm_gb["name-string_x"] >= freq_threshold] assert_eq(dfm_gb["name-string_x"], dfm_gb["name-string_y"], check_names=False) # Read back from disk if cpu:
df_disk = dd_read_parquet(output_path).compute()
tests/unit/test_dask_nvt.py:130:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:36:43,398 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-94ec7dad78e51dd2f6113a2a4ddd9178', 0) Function: subgraph_callable-30db34bb-e4a3-4f14-af64-4e18b807 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_workflow_api_dlrm_Tr41/processed/part_0.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
___________________ test_dask_preproc_cpu[True-None-parquet] ___________________
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} engine = 'parquet', shuffle = None, cpu = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [None, True]) def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) if engine in ("parquet", "csv"): dataset = Dataset(paths, part_size="1MB", cpu=cpu) else: dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu) # Simple transform (normalize) cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] conts = cont_names >> ops.FillMissing() >> ops.Normalize() workflow = Workflow(conts + cat_names + label_name) transformed = workflow.fit_transform(dataset) # Write out dataset output_path = os.path.join(tmpdir, "processed") transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4) # Check the final result
df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()
tests/unit/test_dask_nvt.py:277:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( 2022-08-09 14:37:28,440 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 10) Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [2], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
/usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn( 2022-08-09 14:37:28,445 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 11) Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_2.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
--------------------------- Captured stderr teardown --------------------------- 2022-08-09 14:37:28,450 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-2fe44ae7e99effe9a18b5b20fbe1fa99', 15) Function: subgraph_callable-5bb5e98d-7fa4-48a5-a761-d37567c3 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non0/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
________________ test_dask_preproc_cpu[True-None-csv-no-header] ________________
client = <Client: 'tcp://127.0.0.1:44759' processes=2 threads=16, memory=125.83 GiB> tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2') datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} engine = 'csv-no-header', shuffle = None, cpu = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("shuffle", [Shuffle.PER_WORKER, None]) @pytest.mark.parametrize("cpu", [None, True]) def test_dask_preproc_cpu(client, tmpdir, datasets, engine, shuffle, cpu): set_dask_client(client=client) paths = glob.glob(str(datasets[engine]) + "/*." + engine.split("-")[0]) if engine == "parquet": df1 = cudf.read_parquet(paths[0])[mycols_pq] df2 = cudf.read_parquet(paths[1])[mycols_pq] elif engine == "csv": df1 = cudf.read_csv(paths[0], header=0)[mycols_csv] df2 = cudf.read_csv(paths[1], header=0)[mycols_csv] else: df1 = cudf.read_csv(paths[0], names=allcols_csv)[mycols_csv] df2 = cudf.read_csv(paths[1], names=allcols_csv)[mycols_csv] df0 = cudf.concat([df1, df2], axis=0) if engine in ("parquet", "csv"): dataset = Dataset(paths, part_size="1MB", cpu=cpu) else: dataset = Dataset(paths, names=allcols_csv, part_size="1MB", cpu=cpu) # Simple transform (normalize) cat_names = ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] conts = cont_names >> ops.FillMissing() >> ops.Normalize() workflow = Workflow(conts + cat_names + label_name) transformed = workflow.fit_transform(dataset) # Write out dataset output_path = os.path.join(tmpdir, "processed") transformed.to_parquet(output_path=output_path, shuffle=shuffle, out_files_per_proc=4) # Check the final result
df_disk = dd_read_parquet(output_path, engine="pyarrow").compute()
tests/unit/test_dask_nvt.py:277:
/usr/local/lib/python3.8/dist-packages/dask/base.py:288: in compute (result,) = compute(self, traverse=False, **kwargs) /usr/local/lib/python3.8/dist-packages/dask/base.py:571: in compute results = schedule(dsk, keys, **kwargs) /usr/local/lib/python3.8/dist-packages/distributed/client.py:3015: in get results = self.gather(packed, asynchronous=asynchronous, direct=direct) /usr/local/lib/python3.8/dist-packages/distributed/client.py:2167: in gather return self.sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:309: in sync return sync( /usr/local/lib/python3.8/dist-packages/distributed/utils.py:376: in sync raise exc.with_traceback(tb) /usr/local/lib/python3.8/dist-packages/distributed/utils.py:349: in f result = yield future /usr/local/lib/python3.8/dist-packages/tornado/gen.py:762: in run value = future.result() /usr/local/lib/python3.8/dist-packages/distributed/client.py:2030: in _gather raise exception.with_traceback(traceback) /usr/local/lib/python3.8/dist-packages/dask/optimization.py:969: in call return core.get(self.dsk, self.outkey, dict(zip(self.inkeys, args))) /usr/local/lib/python3.8/dist-packages/dask/core.py:149: in get result = _execute_task(task, cache) /usr/local/lib/python3.8/dist-packages/dask/core.py:119: in _execute_task return func(*(_execute_task(a, cache) for a in args)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:87: in call return read_parquet_part( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:431: in read_parquet_part dfs = [ /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/core.py:432: in
func(fs, rg, columns.copy(), index, **toolz.merge(kwargs, kw)) /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:466: in read_partition arrow_table = cls._read_table( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:1606: in _read_table arrow_table = _read_table_from_path( /usr/local/lib/python3.8/dist-packages/dask/dataframe/io/parquet/arrow.py:277: in _read_table_from_path return pq.ParquetFile(fil).read_row_groups( /usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py:230: in init self.reader.open( pyarrow/_parquet.pyx:972: in pyarrow._parquet.ParquetReader.open ???
??? E pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
pyarrow/error.pxi:99: ArrowInvalid ----------------------------- Captured stderr call ----------------------------- 2022-08-09 14:37:29,740 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 12) Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [0], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:37:29,741 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 13) Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [1], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
2022-08-09 14:37:29,746 - distributed.worker - WARNING - Compute Failed Key: ('read-parquet-51ae06915442aa05c68392572c80ee96', 15) Function: subgraph_callable-86da12c7-32ae-4da9-a6dc-a9ace8b6 args: ({'piece': ('/tmp/pytest-of-jenkins/pytest-18/test_dask_preproc_cpu_True_Non2/processed/part_3.parquet', [3], [])}) kwargs: {} Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
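The test_s3_dataset failures that follow are a different failure mode: the client cannot reach 127.0.0.1:5000, which appears to be a local mock S3 endpoint used by the test environment. As an illustrative check only (not part of the test suite), one can probe whether anything is listening on that address before digging into the botocore traceback; host and port are taken from the ConnectionRefusedError below.

# Minimal sketch (assumption: host/port copied from the error below).
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    # Attempt a plain TCP connection; refused or timed-out connections return False.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_open("127.0.0.1", 5000))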
___________________________ test_s3_dataset[parquet] ___________________________
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try:
conn = connection.create_connection(
(self._dns_host, self.port), self.timeout, **extra_kw )
/usr/lib/python3/dist-packages/urllib3/connection.py:159:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address) sock.connect(sa) return sock except socket.error as e: err = e if sock is not None: sock.close() sock = None if err is not None:
raise err
/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address)
sock.connect(sa)
E ConnectionRefusedError: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError
During handling of the above exception, another exception occurred:
self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url)
urllib_response = conn.urlopen(
method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), )
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7fe561827220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object. httplib_response = self._make_request( conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, ) # If we're going to release the connection in ``finally:``, then # the response doesn't need to know about the connection. Otherwise # it will also try to release it and we'll have a double-release # mess. response_conn = conn if not release_conn else None # Pass method to Response for length checking response_kw["request_method"] = method # Import httplib's response into our own wrapper object response = self.ResponseCls.from_httplib( httplib_response, pool=self, connection=response_conn, retries=retries, **response_kw ) # Everything went great! clean_exit = True except queue.Empty: # Timed out by queue. raise EmptyPoolError(self, "No pool connections are available.") except ( TimeoutError, HTTPException, SocketError, ProtocolError, BaseSSLError, SSLError, CertificateError, ) as e: # Discard the connection for these exceptions. It will be # replaced during the next _get_conn() call. clean_exit = False if isinstance(e, (BaseSSLError, CertificateError)): e = SSLError(e) elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy: e = ProxyError("Cannot connect to proxy.", e) elif isinstance(e, (SocketError, HTTPException)): e = ProtocolError("Connection aborted.", e)
retries = retries.increment(
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:
self = Retry(total=False, connect=None, read=None, redirect=0, status=None) method = 'PUT', url = '/parquet', response = None error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused') _pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0> _stacktrace = <traceback object at 0x7fe457dc5b00>
def increment( self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None, ): """ Return a new Retry object with incremented retry counters. :param response: A response object, or None, if the server did not return a response. :type response: :class:`~urllib3.response.HTTPResponse` :param Exception error: An error encountered during the request, or None if the response was received successfully. :return: A new ``Retry`` object. """ if self.total is False and error: # Disabled, indicate to re-raise the error.
raise six.reraise(type(error), error, _stacktrace)
/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None
def reraise(tp, value, tb=None): try: if value is None: value = tp() if value.__traceback__ is not tb: raise value.with_traceback(tb)
raise value
../../../.local/lib/python3.8/site-packages/six.py:703:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7fe561827220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object.
httplib_response = self._make_request(
conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe5682854f0> conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> method = 'PUT', url = '/parquet' timeout = <urllib3.util.timeout.Timeout object at 0x7fe4886d2910> chunked = False httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}} timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe457beb2e0>
def _make_request( self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw ): """ Perform a request on a given urllib connection object taken from our pool. :param conn: a connection from one of our connection pools :param timeout: Socket timeout in seconds for the request. This can be a float or integer, which will set the same timeout value for the socket connect and the socket read, or an instance of :class:`urllib3.util.Timeout`, which gives you more fine-grained control over your timeouts. """ self.num_requests += 1 timeout_obj = self._get_timeout(timeout) timeout_obj.start_connect() conn.timeout = timeout_obj.connect_timeout # Trigger any extra validation we need to do. try: self._validate_conn(conn) except (SocketTimeout, BaseSSLError) as e: # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout. self._raise_timeout(err=e, url=url, timeout_value=conn.timeout) raise # conn.request() calls httplib.*.request, not the method in # urllib3.request. It also calls makefile (recv) on the socket. if chunked: conn.request_chunked(method, url, **httplib_request_kw) else:
conn.request(method, url, **httplib_request_kw)
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
def request(self, method, url, body=None, headers={}, *, encode_chunked=False): """Send a complete request to the server."""
self._send_request(method, url, body, headers, encode_chunked)
/usr/lib/python3.8/http/client.py:1256:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} args = (False,), kwargs = {}
def _send_request(self, method, url, body, headers, *args, **kwargs): self._response_received = False if headers.get('Expect', b'') == b'100-continue': self._expect_header_set = True else: self._expect_header_set = False self.response_class = self._original_response_cls
rval = super()._send_request(
method, url, body, headers, *args, **kwargs )
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> method = 'PUT', url = '/parquet', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} encode_chunked = False
def _send_request(self, method, url, body, headers, encode_chunked): # Honor explicitly requested Host: and Accept-Encoding: headers. header_names = frozenset(k.lower() for k in headers) skips = {} if 'host' in header_names: skips['skip_host'] = 1 if 'accept-encoding' in header_names: skips['skip_accept_encoding'] = 1 self.putrequest(method, url, **skips) # chunked encoding will happen if HTTP/1.1 is used and either # the caller passes encode_chunked=True or the following # conditions hold: # 1. content-length has not been explicitly set # 2. the body is a file or iterable, but not a str or bytes-like # 3. Transfer-Encoding has NOT been explicitly set by the caller if 'content-length' not in header_names: # only chunk body if not explicitly set for backwards # compatibility, assuming the client code is already handling the # chunking if 'transfer-encoding' not in header_names: # if content-length cannot be automatically determined, fall # back to chunked encoding encode_chunked = False content_length = self._get_content_length(body, method) if content_length is None: if body is not None: if self.debuglevel > 0: print('Unable to determine size of %r' % body) encode_chunked = True self.putheader('Transfer-Encoding', 'chunked') else: self.putheader('Content-Length', str(content_length)) else: encode_chunked = False for hdr, value in headers.items(): self.putheader(hdr, value) if isinstance(body, str): # RFC 2616 Section 3.7.1 says that text default has a # default charset of iso-8859-1. body = _encode(body, 'body')
self.endheaders(body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1302:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> message_body = None
def endheaders(self, message_body=None, *, encode_chunked=False): """Indicate that the last header line has been sent to the server. This method sends the request to the server. The optional message_body argument can be used to pass a message body associated with the request. """ if self.__state == _CS_REQ_STARTED: self.__state = _CS_REQ_SENT else: raise CannotSendHeader()
self._send_output(message_body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1251:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> message_body = None, args = (), kwargs = {'encode_chunked': False} msg = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def _send_output(self, message_body=None, *args, **kwargs): self._buffer.extend((b"", b"")) msg = self._convert_to_bytes(self._buffer) del self._buffer[:] # If msg and message_body are sent in a single send() call, # it will avoid performance problems caused by the interaction # between delayed ack and the Nagle algorithm. if isinstance(message_body, bytes): msg += message_body message_body = None
self.send(msg)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> str = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, str): if self._response_received: logger.debug( "send() called, but reseponse already received. " "Not sending data." ) return
return super().send(str)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430> data = b'PUT /parquet HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-A...-invocation-id: 2a749262-9b81-4314-9328-f469716a81ab\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, data): """Send `data' to the server. ``data`` can be a string object, a bytes object, an array object, a file-like object that supports a .read() method, or an iterable object. """ if self.sock is None: if self.auto_open:
self.connect()
/usr/lib/python3.8/http/client.py:951:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
def connect(self):
conn = self._new_conn()
/usr/lib/python3/dist-packages/urllib3/connection.py:187:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try: conn = connection.create_connection( (self._dns_host, self.port), self.timeout, **extra_kw ) except SocketTimeout: raise ConnectTimeoutError( self, "Connection to %s timed out. (connect timeout=%s)" % (self.host, self.timeout), ) except SocketError as e:
raise NewConnectionError(
self, "Failed to establish a new connection: %s" % e )
E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe457beb430>: Failed to establish a new connection: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
During handling of the above exception, another exception occurred:
s3_base = 'http://127.0.0.1:5000/'
s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}}
paths = ['/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-0.parquet', '/tmp/pytest-of-jenkins/pytest-18/parquet0/dataset-1.parquet']
datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')}
engine = 'parquet'
df = name-cat name-string id label x y 0 Alice Victor 973 995 -0.613973 -0.434246 ...dy 964 1065 -0.263394 -0.013804 4320 Jerry Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 6 columns]
patch_aiobotocore = None
@pytest.mark.parametrize("engine", ["parquet", "csv"])
def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore):
    # Copy files to mock s3 bucket
    files = {}
    for i, path in enumerate(paths):
        with open(path, "rb") as f:
            fbytes = f.read()
        fn = path.split(os.path.sep)[-1]
        files[fn] = BytesIO()
        files[fn].write(fbytes)
        files[fn].seek(0)
    if engine == "parquet":
        # Workaround for nvt#539. In order to avoid the
        # bug in Dask's `create_metadata_file`, we need
        # to manually generate a "_metadata" file here.
        # This can be removed after dask#7295 is merged
        # (see https://github.com/dask/dask/pull/7295)
        fn = "_metadata"
        files[fn] = BytesIO()
        meta = create_metadata_file(
            paths,
            engine="pyarrow",
            out_dir=False,
        )
        meta.write_metadata_file(files[fn])
        files[fn].seek(0)
    with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:
tests/unit/test_s3.py:97:
/usr/lib/python3.8/contextlib.py:113: in __enter__
    return next(self.gen)
/usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context
    client.create_bucket(Bucket=bucket, ACL="public-read-write")
/usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call
    return self._make_api_call(operation_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call
    http, parsed_response = self._make_request(
/usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request
    return self._send_request(request_dict, operation_model)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request
    while self._needs_retry(
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry
    responses = self._event_emitter.emit(
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit
    return self._emit(event_name, kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit
    response = handler(**kwargs)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in __call__
    if self._checker(**checker_kwargs):
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in __call__
    should_retry = self._should_retry(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry
    return self._checker(attempt_number, response, caught_exception)
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in __call__
    checker_response = checker(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in __call__
    return self._check_caught_exception(
/usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception
    raise caught_exception
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response
    http_response = self._send(request)
/usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send
    return self.http_session.send(request)
self = <botocore.httpsession.URLLib3Session object at 0x7fe4907e4370> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/parquet, headers={'x-amz-acl': b'public...nvocation-id': b'2a749262-9b81-4314-9328-f469716a81ab', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url) urllib_response = conn.urlopen( method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), ) http_response = botocore.awsrequest.AWSResponse( request.url, urllib_response.status, urllib_response.headers, urllib_response, ) if not request.stream_output: # Cause the raw stream to be exhausted immediately. We do it # this way instead of using preload_content because # preload_content will never buffer chunked responses http_response.content return http_response except URLLib3SSLError as e: raise SSLError(endpoint_url=request.url, error=e) except (NewConnectionError, socket.gaierror) as e:
raise EndpointConnectionError(endpoint_url=request.url, error=e)
E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/parquet"
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError
---------------------------- Captured stderr setup -----------------------------
Traceback (most recent call last):
  File "/usr/local/bin/moto_server", line 5, in <module>
    from moto.server import main
  File "/usr/local/lib/python3.8/dist-packages/moto/server.py", line 7, in <module>
    from moto.moto_server.werkzeug_app import (
  File "/usr/local/lib/python3.8/dist-packages/moto/moto_server/werkzeug_app.py", line 6, in <module>
    from flask import Flask
  File "/usr/local/lib/python3.8/dist-packages/flask/__init__.py", line 4, in <module>
    from . import json as json
  File "/usr/local/lib/python3.8/dist-packages/flask/json/__init__.py", line 8, in <module>
    from ..globals import current_app
  File "/usr/local/lib/python3.8/dist-packages/flask/globals.py", line 56, in <module>
    app_ctx: "AppContext" = LocalProxy(  # type: ignore[assignment]
TypeError: __init__() got an unexpected keyword argument 'unbound_message'
_____________________________ test_s3_dataset[csv] _____________________________
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try:
conn = connection.create_connection(
(self._dns_host, self.port), self.timeout, **extra_kw )
/usr/lib/python3/dist-packages/urllib3/connection.py:159:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address) sock.connect(sa) return sock except socket.error as e: err = e if sock is not None: sock.close() sock = None if err is not None:
raise err
/usr/lib/python3/dist-packages/urllib3/util/connection.py:84:
address = ('127.0.0.1', 5000), timeout = 60, source_address = None socket_options = [(6, 1, 1)]
def create_connection( address, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, source_address=None, socket_options=None, ): """Connect to *address* and return the socket object. Convenience function. Connect to *address* (a 2-tuple ``(host, port)``) and return the socket object. Passing the optional *timeout* parameter will set the timeout on the socket instance before attempting to connect. If no *timeout* is supplied, the global default timeout setting returned by :func:`getdefaulttimeout` is used. If *source_address* is set it must be a tuple of (host, port) for the socket to bind as a source address before making the connection. An host of '' or port 0 tells the OS to use the default. """ host, port = address if host.startswith("["): host = host.strip("[]") err = None # Using the value from allowed_gai_family() in the context of getaddrinfo lets # us select whether to work with IPv4 DNS records, IPv6 records, or both. # The original create_connection function always returns all records. family = allowed_gai_family() for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM): af, socktype, proto, canonname, sa = res sock = None try: sock = socket.socket(af, socktype, proto) # If provided, set socket level options before connecting. _set_socket_options(sock, socket_options) if timeout is not socket._GLOBAL_DEFAULT_TIMEOUT: sock.settimeout(timeout) if source_address: sock.bind(source_address)
sock.connect(sa)
E ConnectionRefusedError: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/util/connection.py:74: ConnectionRefusedError
During handling of the above exception, another exception occurred:
self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url)
urllib_response = conn.urlopen(
method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), )
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:448:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7fe561827220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object. httplib_response = self._make_request( conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, ) # If we're going to release the connection in ``finally:``, then # the response doesn't need to know about the connection. Otherwise # it will also try to release it and we'll have a double-release # mess. response_conn = conn if not release_conn else None # Pass method to Response for length checking response_kw["request_method"] = method # Import httplib's response into our own wrapper object response = self.ResponseCls.from_httplib( httplib_response, pool=self, connection=response_conn, retries=retries, **response_kw ) # Everything went great! clean_exit = True except queue.Empty: # Timed out by queue. raise EmptyPoolError(self, "No pool connections are available.") except ( TimeoutError, HTTPException, SocketError, ProtocolError, BaseSSLError, SSLError, CertificateError, ) as e: # Discard the connection for these exceptions. It will be # replaced during the next _get_conn() call. clean_exit = False if isinstance(e, (BaseSSLError, CertificateError)): e = SSLError(e) elif isinstance(e, (SocketError, NewConnectionError)) and self.proxy: e = ProxyError("Cannot connect to proxy.", e) elif isinstance(e, (SocketError, HTTPException)): e = ProtocolError("Connection aborted.", e)
retries = retries.increment(
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2] )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:719:
self = Retry(total=False, connect=None, read=None, redirect=0, status=None) method = 'PUT', url = '/csv', response = None error = NewConnectionError('<botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused') _pool = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20> _stacktrace = <traceback object at 0x7fe457dbb880>
def increment( self, method=None, url=None, response=None, error=None, _pool=None, _stacktrace=None, ): """ Return a new Retry object with incremented retry counters. :param response: A response object, or None, if the server did not return a response. :type response: :class:`~urllib3.response.HTTPResponse` :param Exception error: An error encountered during the request, or None if the response was received successfully. :return: A new ``Retry`` object. """ if self.total is False and error: # Disabled, indicate to re-raise the error.
raise six.reraise(type(error), error, _stacktrace)
/usr/lib/python3/dist-packages/urllib3/util/retry.py:376:
tp = <class 'urllib3.exceptions.NewConnectionError'>, value = None, tb = None
def reraise(tp, value, tb=None): try: if value is None: value = tp() if value.__traceback__ is not tb: raise value.with_traceback(tb)
raise value
../../../.local/lib/python3.8/site-packages/six.py:703:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} retries = Retry(total=False, connect=None, read=None, redirect=0, status=None) redirect = True, assert_same_host = False timeout = <object object at 0x7fe561827220>, pool_timeout = None release_conn = False, chunked = False, body_pos = None response_kw = {'decode_content': False, 'preload_content': False}, conn = None release_this_conn = True, err = None, clean_exit = False timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0> is_new_proxy_conn = False
def urlopen( self, method, url, body=None, headers=None, retries=None, redirect=True, assert_same_host=True, timeout=_Default, pool_timeout=None, release_conn=None, chunked=False, body_pos=None, **response_kw ): """ Get a connection from the pool and perform an HTTP request. This is the lowest level call for making a request, so you'll need to specify all the raw details. .. note:: More commonly, it's appropriate to use a convenience method provided by :class:`.RequestMethods`, such as :meth:`request`. .. note:: `release_conn` will only behave as expected if `preload_content=False` because we want to make `preload_content=False` the default behaviour someday soon without breaking backwards compatibility. :param method: HTTP request method (such as GET, POST, PUT, etc.) :param body: Data to send in the request body (useful for creating POST requests, see HTTPConnectionPool.post_url for more convenience). :param headers: Dictionary of custom headers to send, such as User-Agent, If-None-Match, etc. If None, pool headers are used. If provided, these headers completely replace any pool-specific headers. :param retries: Configure the number of retries to allow before raising a :class:`~urllib3.exceptions.MaxRetryError` exception. Pass ``None`` to retry until you receive a response. Pass a :class:`~urllib3.util.retry.Retry` object for fine-grained control over different types of retries. Pass an integer number to retry connection errors that many times, but no other types of errors. Pass zero to never retry. If ``False``, then retries are disabled and any exception is raised immediately. Also, instead of raising a MaxRetryError on redirects, the redirect response will be returned. :type retries: :class:`~urllib3.util.retry.Retry`, False, or an int. :param redirect: If True, automatically handle redirects (status codes 301, 302, 303, 307, 308). Each redirect counts as a retry. Disabling retries will disable redirect, too. :param assert_same_host: If ``True``, will make sure that the host of the pool requests is consistent else will raise HostChangedError. When False, you can use the pool on an HTTP proxy and request foreign hosts. :param timeout: If specified, overrides the default timeout for this one request. It may be a float (in seconds) or an instance of :class:`urllib3.util.Timeout`. :param pool_timeout: If set and the pool is set to block=True, then this method will block for ``pool_timeout`` seconds and raise EmptyPoolError if no connection is available within the time period. :param release_conn: If False, then the urlopen call will not release the connection back into the pool once a response is received (but will release if you read the entire contents of the response such as when `preload_content=True`). This is useful if you're not preloading the response's content immediately. You will need to call ``r.release_conn()`` on the response ``r`` to return the connection back into the pool. If None, it takes the value of ``response_kw.get('preload_content', True)``. :param chunked: If True, urllib3 will send the body using chunked transfer encoding. Otherwise, urllib3 will send the body using the standard content-length form. Defaults to False. :param int body_pos: Position to seek to in file-like body in the event of a retry or redirect. Typically this won't need to be set because urllib3 will auto-populate the value when needed. 
:param \\**response_kw: Additional parameters are passed to :meth:`urllib3.response.HTTPResponse.from_httplib` """ if headers is None: headers = self.headers if not isinstance(retries, Retry): retries = Retry.from_int(retries, redirect=redirect, default=self.retries) if release_conn is None: release_conn = response_kw.get("preload_content", True) # Check host if assert_same_host and not self.is_same_host(url): raise HostChangedError(self, url, retries) # Ensure that the URL we're connecting to is properly encoded if url.startswith("/"): url = six.ensure_str(_encode_target(url)) else: url = six.ensure_str(parse_url(url).url) conn = None # Track whether `conn` needs to be released before # returning/raising/recursing. Update this variable if necessary, and # leave `release_conn` constant throughout the function. That way, if # the function recurses, the original value of `release_conn` will be # passed down into the recursive call, and its value will be respected. # # See issue #651 [1] for details. # # [1] <https://github.com/urllib3/urllib3/issues/651> release_this_conn = release_conn # Merge the proxy headers. Only do this in HTTP. We have to copy the # headers dict so we can safely change it without those changes being # reflected in anyone else's copy. if self.scheme == "http": headers = headers.copy() headers.update(self.proxy_headers) # Must keep the exception bound to a separate variable or else Python 3 # complains about UnboundLocalError. err = None # Keep track of whether we cleanly exited the except block. This # ensures we do proper cleanup in finally. clean_exit = False # Rewind body position, if needed. Record current position # for future rewinds in the event of a redirect/retry. body_pos = set_file_position(body, body_pos) try: # Request a connection from the queue. timeout_obj = self._get_timeout(timeout) conn = self._get_conn(timeout=pool_timeout) conn.timeout = timeout_obj.connect_timeout is_new_proxy_conn = self.proxy is not None and not getattr( conn, "sock", None ) if is_new_proxy_conn: self._prepare_proxy(conn) # Make the request on the httplib connection object.
httplib_response = self._make_request(
conn, method, url, timeout=timeout_obj, body=body, headers=headers, chunked=chunked, )
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:665:
self = <botocore.awsrequest.AWSHTTPConnectionPool object at 0x7fe457b21b20> conn = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> method = 'PUT', url = '/csv' timeout = <urllib3.util.timeout.Timeout object at 0x7fe48ec749a0> chunked = False httplib_request_kw = {'body': None, 'headers': {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}} timeout_obj = <urllib3.util.timeout.Timeout object at 0x7fe48d102f70>
def _make_request( self, conn, method, url, timeout=_Default, chunked=False, **httplib_request_kw ): """ Perform a request on a given urllib connection object taken from our pool. :param conn: a connection from one of our connection pools :param timeout: Socket timeout in seconds for the request. This can be a float or integer, which will set the same timeout value for the socket connect and the socket read, or an instance of :class:`urllib3.util.Timeout`, which gives you more fine-grained control over your timeouts. """ self.num_requests += 1 timeout_obj = self._get_timeout(timeout) timeout_obj.start_connect() conn.timeout = timeout_obj.connect_timeout # Trigger any extra validation we need to do. try: self._validate_conn(conn) except (SocketTimeout, BaseSSLError) as e: # Py2 raises this as a BaseSSLError, Py3 raises it as socket timeout. self._raise_timeout(err=e, url=url, timeout_value=conn.timeout) raise # conn.request() calls httplib.*.request, not the method in # urllib3.request. It also calls makefile (recv) on the socket. if chunked: conn.request_chunked(method, url, **httplib_request_kw) else:
conn.request(method, url, **httplib_request_kw)
/usr/lib/python3/dist-packages/urllib3/connectionpool.py:387:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}
def request(self, method, url, body=None, headers={}, *, encode_chunked=False): """Send a complete request to the server."""
self._send_request(method, url, body, headers, encode_chunked)
/usr/lib/python3.8/http/client.py:1256:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} args = (False,), kwargs = {}
def _send_request(self, method, url, body, headers, *args, **kwargs): self._response_received = False if headers.get('Expect', b'') == b'100-continue': self._expect_header_set = True else: self._expect_header_set = False self.response_class = self._original_response_cls
rval = super()._send_request(
method, url, body, headers, *args, **kwargs )
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:94:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> method = 'PUT', url = '/csv', body = None headers = {'x-amz-acl': b'public-read-write', 'User-Agent': b'Boto3/1.17.0 Python/3.8.10 Linux/4.15.0-108-generic Botocore/1.27....invocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'} encode_chunked = False
def _send_request(self, method, url, body, headers, encode_chunked): # Honor explicitly requested Host: and Accept-Encoding: headers. header_names = frozenset(k.lower() for k in headers) skips = {} if 'host' in header_names: skips['skip_host'] = 1 if 'accept-encoding' in header_names: skips['skip_accept_encoding'] = 1 self.putrequest(method, url, **skips) # chunked encoding will happen if HTTP/1.1 is used and either # the caller passes encode_chunked=True or the following # conditions hold: # 1. content-length has not been explicitly set # 2. the body is a file or iterable, but not a str or bytes-like # 3. Transfer-Encoding has NOT been explicitly set by the caller if 'content-length' not in header_names: # only chunk body if not explicitly set for backwards # compatibility, assuming the client code is already handling the # chunking if 'transfer-encoding' not in header_names: # if content-length cannot be automatically determined, fall # back to chunked encoding encode_chunked = False content_length = self._get_content_length(body, method) if content_length is None: if body is not None: if self.debuglevel > 0: print('Unable to determine size of %r' % body) encode_chunked = True self.putheader('Transfer-Encoding', 'chunked') else: self.putheader('Content-Length', str(content_length)) else: encode_chunked = False for hdr, value in headers.items(): self.putheader(hdr, value) if isinstance(body, str): # RFC 2616 Section 3.7.1 says that text default has a # default charset of iso-8859-1. body = _encode(body, 'body')
self.endheaders(body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1302:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> message_body = None
def endheaders(self, message_body=None, *, encode_chunked=False): """Indicate that the last header line has been sent to the server. This method sends the request to the server. The optional message_body argument can be used to pass a message body associated with the request. """ if self.__state == _CS_REQ_STARTED: self.__state = _CS_REQ_SENT else: raise CannotSendHeader()
self._send_output(message_body, encode_chunked=encode_chunked)
/usr/lib/python3.8/http/client.py:1251:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> message_body = None, args = (), kwargs = {'encode_chunked': False} msg = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def _send_output(self, message_body=None, *args, **kwargs): self._buffer.extend((b"", b"")) msg = self._convert_to_bytes(self._buffer) del self._buffer[:] # If msg and message_body are sent in a single send() call, # it will avoid performance problems caused by the interaction # between delayed ack and the Nagle algorithm. if isinstance(message_body, bytes): msg += message_body message_body = None
self.send(msg)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:123:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> str = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, str): if self._response_received: logger.debug( "send() called, but reseponse already received. " "Not sending data." ) return
return super().send(str)
/usr/local/lib/python3.8/dist-packages/botocore/awsrequest.py:218:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0> data = b'PUT /csv HTTP/1.1\r\nHost: 127.0.0.1:5000\r\nAccept-Encoding: identity\r\nx-amz-acl: public-read-write\r\nUser-Agent...-invocation-id: d8a43e95-257d-4027-8530-783e346bcd62\r\namz-sdk-request: attempt=5; max=5\r\nContent-Length: 0\r\n\r\n'
def send(self, data): """Send `data' to the server. ``data`` can be a string object, a bytes object, an array object, a file-like object that supports a .read() method, or an iterable object. """ if self.sock is None: if self.auto_open:
self.connect()
/usr/lib/python3.8/http/client.py:951:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
def connect(self):
conn = self._new_conn()
/usr/lib/python3/dist-packages/urllib3/connection.py:187:
self = <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>
def _new_conn(self): """ Establish a socket connection and set nodelay settings on it. :return: New socket connection. """ extra_kw = {} if self.source_address: extra_kw["source_address"] = self.source_address if self.socket_options: extra_kw["socket_options"] = self.socket_options try: conn = connection.create_connection( (self._dns_host, self.port), self.timeout, **extra_kw ) except SocketTimeout: raise ConnectTimeoutError( self, "Connection to %s timed out. (connect timeout=%s)" % (self.host, self.timeout), ) except SocketError as e:
raise NewConnectionError(
self, "Failed to establish a new connection: %s" % e )
E urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPConnection object at 0x7fe48d1023a0>: Failed to establish a new connection: [Errno 111] Connection refused
/usr/lib/python3/dist-packages/urllib3/connection.py:171: NewConnectionError
During handling of the above exception, another exception occurred:
s3_base = 'http://127.0.0.1:5000/' s3so = {'client_kwargs': {'endpoint_url': 'http://127.0.0.1:5000/'}} paths = ['/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-0.csv', '/tmp/pytest-of-jenkins/pytest-18/csv0/dataset-1.csv'] datasets = {'cats': local('/tmp/pytest-of-jenkins/pytest-18/cats0'), 'csv': local('/tmp/pytest-of-jenkins/pytest-18/csv0'), 'csv-...ocal('/tmp/pytest-of-jenkins/pytest-18/csv-no-header0'), 'parquet': local('/tmp/pytest-of-jenkins/pytest-18/parquet0')} engine = 'csv' df = name-string id label x y 0 Victor 973 995 -0.613973 -0.434246 1 Bob ... Wendy 964 1065 -0.263394 -0.013804 2160 Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 5 columns] patch_aiobotocore = None
@pytest.mark.parametrize("engine", ["parquet", "csv"]) def test_s3_dataset(s3_base, s3so, paths, datasets, engine, df, patch_aiobotocore): # Copy files to mock s3 bucket files = {} for i, path in enumerate(paths): with open(path, "rb") as f: fbytes = f.read() fn = path.split(os.path.sep)[-1] files[fn] = BytesIO() files[fn].write(fbytes) files[fn].seek(0) if engine == "parquet": # Workaround for nvt#539. In order to avoid the # bug in Dask's `create_metadata_file`, we need # to manually generate a "_metadata" file here. # This can be removed after dask#7295 is merged # (see https://github.com/dask/dask/pull/7295) fn = "_metadata" files[fn] = BytesIO() meta = create_metadata_file( paths, engine="pyarrow", out_dir=False, ) meta.write_metadata_file(files[fn]) files[fn].seek(0)
with s3_context(s3_base=s3_base, bucket=engine, files=files) as s3fs:
tests/unit/test_s3.py:97:
/usr/lib/python3.8/contextlib.py:113: in enter return next(self.gen) /usr/local/lib/python3.8/dist-packages/dask_cudf/io/tests/test_s3.py:96: in s3_context client.create_bucket(Bucket=bucket, ACL="public-read-write") /usr/local/lib/python3.8/dist-packages/botocore/client.py:508: in _api_call return self._make_api_call(operation_name, kwargs) /usr/local/lib/python3.8/dist-packages/botocore/client.py:898: in _make_api_call http, parsed_response = self._make_request( /usr/local/lib/python3.8/dist-packages/botocore/client.py:921: in _make_request return self._endpoint.make_request(operation_model, request_dict) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:119: in make_request return self._send_request(request_dict, operation_model) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:202: in _send_request while self._needs_retry( /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:354: in _needs_retry responses = self._event_emitter.emit( /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:412: in emit return self._emitter.emit(aliased_event_name, **kwargs) /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:256: in emit return self._emit(event_name, kwargs) /usr/local/lib/python3.8/dist-packages/botocore/hooks.py:239: in _emit response = handler(**kwargs) /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:207: in call if self._checker(**checker_kwargs): /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:284: in call should_retry = self._should_retry( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:320: in _should_retry return self._checker(attempt_number, response, caught_exception) /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:363: in call checker_response = checker( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:247: in call return self._check_caught_exception( /usr/local/lib/python3.8/dist-packages/botocore/retryhandler.py:416: in _check_caught_exception raise caught_exception /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:281: in _do_get_response http_response = self._send(request) /usr/local/lib/python3.8/dist-packages/botocore/endpoint.py:377: in _send return self.http_session.send(request)
self = <botocore.httpsession.URLLib3Session object at 0x7fe45554bdf0> request = <AWSPreparedRequest stream_output=False, method=PUT, url=http://127.0.0.1:5000/csv, headers={'x-amz-acl': b'public-rea...nvocation-id': b'd8a43e95-257d-4027-8530-783e346bcd62', 'amz-sdk-request': b'attempt=5; max=5', 'Content-Length': '0'}>
def send(self, request): try: proxy_url = self._proxy_config.proxy_url_for(request.url) manager = self._get_connection_manager(request.url, proxy_url) conn = manager.connection_from_url(request.url) self._setup_ssl_cert(conn, request.url, self._verify) if ensure_boolean( os.environ.get('BOTO_EXPERIMENTAL__ADD_PROXY_HOST_HEADER', '') ): # This is currently an "experimental" feature which provides # no guarantees of backwards compatibility. It may be subject # to change or removal in any patch version. Anyone opting in # to this feature should strictly pin botocore. host = urlparse(request.url).hostname conn.proxy_headers['host'] = host request_target = self._get_request_target(request.url, proxy_url) urllib_response = conn.urlopen( method=request.method, url=request_target, body=request.body, headers=request.headers, retries=Retry(False), assert_same_host=False, preload_content=False, decode_content=False, chunked=self._chunked(request.headers), ) http_response = botocore.awsrequest.AWSResponse( request.url, urllib_response.status, urllib_response.headers, urllib_response, ) if not request.stream_output: # Cause the raw stream to be exhausted immediately. We do it # this way instead of using preload_content because # preload_content will never buffer chunked responses http_response.content return http_response except URLLib3SSLError as e: raise SSLError(endpoint_url=request.url, error=e) except (NewConnectionError, socket.gaierror) as e:
raise EndpointConnectionError(endpoint_url=request.url, error=e)
E botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "http://127.0.0.1:5000/csv"
/usr/local/lib/python3.8/dist-packages/botocore/httpsession.py:477: EndpointConnectionError _____________________ test_cpu_workflow[True-True-parquet] _____________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0') df = name-cat name-string id label x y 0 Alice Victor 973 995 -0.613973 -0.434246 ...dy 964 1065 -0.263394 -0.013804 4320 Jerry Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 6 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe42818a0a0>, cpu = True engine = 'parquet', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_pa0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid _______________________ test_cpu_workflow[True-True-csv] _______________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0') df = name-string id label x y 0 Victor 973 995 -0.613973 -0.434246 1 Bob ... Wendy 964 1065 -0.263394 -0.013804 2160 Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe40c68ee50>, cpu = True engine = 'csv', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid __________________ test_cpu_workflow[True-True-csv-no-header] __________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1') df = name-string id label x y 0 Victor 973 995 -0.613973 -0.434246 1 Bob ... Wendy 964 1065 -0.263394 -0.013804 2160 Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe45514a1c0>, cpu = True engine = 'csv-no-header', dump = True
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_True_cs1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid ____________________ test_cpu_workflow[True-False-parquet] _____________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0') df = name-cat name-string id label x y 0 Alice Victor 973 995 -0.613973 -0.434246 ...dy 964 1065 -0.263394 -0.013804 4320 Jerry Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 6 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe4890b4160>, cpu = True engine = 'parquet', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_p0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid ______________________ test_cpu_workflow[True-False-csv] _______________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0') df = name-string id label x y 0 Victor 973 995 -0.613973 -0.434246 1 Bob ... Wendy 964 1065 -0.263394 -0.013804 2160 Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe40c2770d0>, cpu = True engine = 'csv', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c0/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid _________________ test_cpu_workflow[True-False-csv-no-header] __________________
tmpdir = local('/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1') df = name-string id label x y 0 Victor 973 995 -0.613973 -0.434246 1 Bob ... Wendy 964 1065 -0.263394 -0.013804 2160 Ursula 970 1009 -0.394831 -0.651957
[4321 rows x 5 columns] dataset = <merlin.io.dataset.Dataset object at 0x7fe4280583d0>, cpu = True engine = 'csv-no-header', dump = False
@pytest.mark.parametrize("engine", ["parquet", "csv", "csv-no-header"]) @pytest.mark.parametrize("dump", [True, False]) @pytest.mark.parametrize("cpu", [True]) def test_cpu_workflow(tmpdir, df, dataset, cpu, engine, dump): # Make sure we are in cpu formats if cudf and isinstance(df, cudf.DataFrame): df = df.to_pandas() if cpu: dataset.to_cpu() cat_names = ["name-cat", "name-string"] if engine == "parquet" else ["name-string"] cont_names = ["x", "y", "id"] label_name = ["label"] norms = ops.Normalize() conts = cont_names >> ops.FillMissing() >> ops.Clip(min_value=0) >> norms cats = cat_names >> ops.Categorify() workflow = nvt.Workflow(conts + cats + label_name) workflow.fit(dataset) if dump: workflow_dir = os.path.join(tmpdir, "workflow") workflow.save(workflow_dir) workflow = None workflow = Workflow.load(workflow_dir) def get_norms(tar: pd.Series): df = tar.fillna(0) df = df * (df >= 0).astype("int") return df assert math.isclose(get_norms(df.x).mean(), norms.means["x"], rel_tol=1e-4) assert math.isclose(get_norms(df.y).mean(), norms.means["y"], rel_tol=1e-4) assert math.isclose(get_norms(df.x).std(), norms.stds["x"], rel_tol=1e-3) assert math.isclose(get_norms(df.y).std(), norms.stds["y"], rel_tol=1e-3) # Check that categories match if engine == "parquet": cats_expected0 = df["name-cat"].unique() cats0 = get_cats(workflow, "name-cat", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected0.tolist()) for cat in cats0.tolist()) assert len(cats0.tolist()) == len(cats_expected0.tolist() + [None]) cats_expected1 = df["name-string"].unique() cats1 = get_cats(workflow, "name-string", cpu=True) # adding the None entry as a string because of move from gpu assert all(cat in [None] + sorted(cats_expected1.tolist()) for cat in cats1.tolist()) assert len(cats1.tolist()) == len(cats_expected1.tolist() + [None]) # Write to new "shuffled" and "processed" dataset workflow.transform(dataset).to_parquet( output_path=tmpdir, out_files_per_proc=10, shuffle=nvt.io.Shuffle.PER_PARTITION )
dataset_2 = Dataset(glob.glob(str(tmpdir) + "/*.parquet"), cpu=cpu)
tests/unit/workflow/test_cpu_workflow.py:76:
/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:303: in init self.engine = ParquetDatasetEngine( /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:313: in init self._path0, /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:338: in _path0 return next(self._dataset.get_fragments()).path /usr/local/lib/python3.8/dist-packages/merlin/io/parquet.py:365: in _dataset dataset = pa_ds.dataset(paths, filesystem=fs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:683: in dataset return _filesystem_dataset(source, **kwargs) /usr/local/lib/python3.8/dist-packages/pyarrow/dataset.py:435: in _filesystem_dataset return factory.finish(schema) pyarrow/_dataset.pyx:2473: in pyarrow._dataset.DatasetFactory.finish ??? pyarrow/error.pxi:143: in pyarrow.lib.pyarrow_internal_check_status ???
??? E pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Could not open Parquet input source '/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_True_False_c1/part_0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.. Is this a 'parquet' file?
pyarrow/error.pxi:99: ArrowInvalid =============================== warnings summary =============================== ../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33 /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. DASK_VERSION = LooseVersion(dask.version)
../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)
nvtabular/loader/__init__.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader. warnings.warn(
tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(
tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(
tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(
tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(
tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(
tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)
tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(
tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-device-150-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-True-None-0-csv-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-0-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-None-False-device-150-csv-no-header-0.1]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-parquet]
FAILED tests/unit/test_dask_nvt.py::test_dask_preproc_cpu[True-None-csv-no-header]
FAILED tests/unit/test_s3.py::test_s3_dataset[parquet] - botocore.exceptions....
FAILED tests/unit/test_s3.py::test_s3_dataset[csv] - botocore.exceptions.Endp...
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-True-csv-no-header]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-parquet]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv]
FAILED tests/unit/workflow/test_cpu_workflow.py::test_cpu_workflow[True-False-csv-no-header]
===== 14 failed, 1417 passed, 1 skipped, 617 warnings in 722.15s (0:12:02) =====
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins2653689168443043854.sh
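All 14 failures in this run appear to trace back to two infrastructure problems rather than the Categorify change itself: the mock S3 endpoint at 127.0.0.1:5000 refuses connections, and the Parquet part files written by test_cpu_workflow are rejected by pyarrow. As a minimal sketch, the S3 symptom can be checked with a plain socket probe before the suite runs; the host and port are taken from the EndpointConnectionError above, everything else is illustrative.

# Minimal sketch: probe whether anything is listening on the mock S3 endpoint
# that the failing tests try to reach. Host and port come from the
# EndpointConnectionError in the log; the probe itself is illustrative only.
import socket

def endpoint_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError and socket timeouts alike.
        return False

print("mock S3 endpoint reachable:", endpoint_reachable("127.0.0.1", 5000))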
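The test_cpu_workflow failures, in turn, all report that the Parquet magic bytes are missing from the footer of the written part files, which means the files were truncated or never finalized. A structurally valid Parquet file begins and ends with the 4-byte magic PAR1, so a quick diagnostic sketch looks like this (the glob pattern mirrors the temp paths printed in the failures and is purely illustrative):

# Minimal sketch: check whether each written part file carries the Parquet
# magic bytes "PAR1" at both ends, which is what pyarrow's "Parquet magic
# bytes not found in footer" error is complaining about.
import glob
import os

def has_parquet_magic(path: str) -> bool:
    if os.path.getsize(path) < 8:  # too small to hold header and footer magic
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # last four bytes of the file
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"

# Illustrative path pattern, taken from the temp directories in the log above.
for part in sorted(glob.glob("/tmp/pytest-of-jenkins/pytest-18/test_cpu_workflow_*/*.parquet")):
    print(part, has_parquet_magic(part))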
rerun tests
Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4629/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list --no-walk 9df466c566c9f80b1282693baecbd07c6a2d6bb6 # timeout=10 First time build. Skipping changelog. [nvtabular_tests] $ /bin/bash /tmp/jenkins5697026500764221364.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skippedtests/unit/test_dask_nvt.py ............................................ [ 3%] ........................................................................ [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_tf4rec.py . [ 8%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py Build timed out (after 60 minutes). Marking the build as failed. Build was aborted Performing Post build task... Match found for : : True Logical operation result is TRUE Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log" [nvtabular_tests] $ /bin/bash /tmp/jenkins6758279656828584530.sh
rerun tests
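One recurring item in the warnings summaries of these runs is the note that nvtabular.loader has moved to merlin.models.loader. A minimal sketch of the import migration the warning asks for is below; only the package rename comes from the warning text, while the tensorflow submodule and the KerasSequenceLoader name are assumptions used for illustration.

# Sketch of migrating off the deprecated nvtabular.loader package, with a
# fallback for environments that still ship the loader inside NVTabular.
# The submodule path and class name below are assumptions for illustration;
# only the nvtabular.loader -> merlin.models.loader rename comes from the warning.
try:
    from merlin.models.loader.tensorflow import KerasSequenceLoader  # assumed new location
except ImportError:
    from nvtabular.loader.tensorflow import KerasSequenceLoader  # deprecated location

print(KerasSequenceLoader.__module__)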
Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4630/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins11669948025439148038.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skippedtests/unit/test_dask_nvt.py ............................................ [ 3%] ........................................................................ [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_tf4rec.py . [ 8%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ................................ [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ..................................................F [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 20%] ........................................s.. [ 23%] tests/unit/loader/test_torch_dataloader.py ............................. [ 25%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................ 
[ 37%] ........................................... [ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py .. [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. [ 66%] .................... [ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py ...... [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]
=================================== FAILURES =================================== ___________________________ test_multihot_empty_rows ___________________________
def test_multihot_empty_rows(): multi_hot = tf.feature_column.categorical_column_with_identity("multihot", 5) multi_hot_embedding = tf.feature_column.embedding_column(multi_hot, 8, combiner="sum") embedding_layer = layers.DenseFeatures([multi_hot_embedding]) inputs = { "multihot": ( tf.keras.Input(name="multihot__values", shape=(1,), dtype=tf.int64), tf.keras.Input(name="multihot__nnzs", shape=(1,), dtype=tf.int64), ) } output = embedding_layer(inputs) model = tf.keras.Model(inputs=inputs, outputs=output) model.compile("sgd", "binary_crossentropy") multi_hot_values = np.array([0, 2, 1, 4, 1, 3, 1]) multi_hot_nnzs = np.array([1, 0, 2, 4, 0]) x = {"multihot": (multi_hot_values[:, None], multi_hot_nnzs[:, None])} multi_hot_embedding_table = embedding_layer.embedding_tables["multihot"].numpy() multi_hot_embedding_rows = _compute_expected_multi_hot( multi_hot_embedding_table, multi_hot_values, multi_hot_nnzs, "sum" ) y_hat = model(x).numpy()
np.testing.assert_allclose(y_hat, multi_hot_embedding_rows, rtol=1e-06)
E AssertionError: E Not equal to tolerance rtol=1e-06, atol=0 E
E Mismatched elements: 1 / 40 (2.5%) E Max absolute difference: 1.1920929e-07 E Max relative difference: 1.502241e-06 E x: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 , E -0.253522, -0.074231], E [ 0. , 0. , 0. , 0. , 0. , 0. ,... E y: array([[-0.29789 , -0.016212, -0.051031, -0.248089, 0.250163, -0.30276 , E -0.253522, -0.074231], E [ 0. , 0. , 0. , 0. , 0. , 0. ,...tests/unit/framework_utils/test_tf_layers.py:321: AssertionError =============================== warnings summary =============================== ../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33 /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. DASK_VERSION = LooseVersion(dask.version)
../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)
nvtabular/loader/__init__.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The nvtabular.loader module has moved to merlin.models.loader. Support for importing from nvtabular.loader is deprecated, and will be removed in a future version. Please update your imports to refer to merlin.models.loader. warnings.warn(
tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1] /usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first self.make_current()
tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(
tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(
tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(
tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(
tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(
tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)
tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(
tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/unit/framework_utils/test_tf_layers.py::test_multihot_empty_rows
===== 1 failed, 1428 passed, 2 skipped, 618 warnings in 722.63s (0:12:02) ======
Build step 'Execute shell' marked build as failure
Performing Post build task...
Match found for : : True
Logical operation result is TRUE
Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log"
[nvtabular_tests] $ /bin/bash /tmp/jenkins4753025854975341293.sh
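The single failure in this run is a 1.1920929e-07 absolute difference on one of forty elements in test_multihot_empty_rows, i.e. float32 rounding noise whose relative error (about 1.5e-06) just exceeds the test's rtol=1e-06. A minimal sketch of how np.testing.assert_allclose combines the two tolerances, using invented values of a similar magnitude:

# assert_allclose passes when |actual - desired| <= atol + rtol * |desired|.
# The values below are invented, sized to mimic a ~1.2e-07 absolute
# difference on an element of magnitude ~0.05.
import numpy as np

desired = np.array([0.05], dtype=np.float32)
actual = desired + np.float32(1.1920929e-07)

try:
    # atol defaults to 0, so this is a purely relative check and fails here.
    np.testing.assert_allclose(actual, desired, rtol=1e-06)
except AssertionError:
    print("rtol-only check fails on float32 rounding noise")

# A small absolute tolerance (or a slightly looser rtol) absorbs the noise.
np.testing.assert_allclose(actual, desired, rtol=1e-06, atol=1e-06)
print("rtol + atol check passes")

Whether the test should loosen rtol or add a small atol is a call for the test author; the sketch only illustrates the tolerance formula.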
rerun tests
Click to view CI Results
GitHub pull request #1641 of commit 5e149c8a6f16a47cd99a23f4c060318f247fca7b, no merge conflicts. Running as SYSTEM Setting status of 5e149c8a6f16a47cd99a23f4c060318f247fca7b to PENDING with url http://10.20.17.181:8080/job/nvtabular_tests/4631/ and message: 'Build started for merge commit.' Using context: Jenkins Unit Test Run Building on master in workspace /var/jenkins_home/workspace/nvtabular_tests using credential nvidia-merlin-bot Cloning the remote Git repository Cloning repository https://github.com/NVIDIA-Merlin/NVTabular.git > git init /var/jenkins_home/workspace/nvtabular_tests/nvtabular # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git > git --version # timeout=10 using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 > git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10 > git config remote.origin.url https://github.com/NVIDIA-Merlin/NVTabular.git # timeout=10 Fetching upstream changes from https://github.com/NVIDIA-Merlin/NVTabular.git using GIT_ASKPASS to set credentials This is the bot credentials for our CI/CD > git fetch --tags --force --progress -- https://github.com/NVIDIA-Merlin/NVTabular.git +refs/pull/1641/*:refs/remotes/origin/pr/1641/* # timeout=10 > git rev-parse 5e149c8a6f16a47cd99a23f4c060318f247fca7b^{commit} # timeout=10 Checking out Revision 5e149c8a6f16a47cd99a23f4c060318f247fca7b (detached) > git config core.sparsecheckout # timeout=10 > git checkout -f 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 Commit message: "Merge branch 'main' into categorify-domain-max" > git rev-list --no-walk 5e149c8a6f16a47cd99a23f4c060318f247fca7b # timeout=10 [nvtabular_tests] $ /bin/bash /tmp/jenkins9182884185066325902.sh ============================= test session starts ============================== platform linux -- Python 3.8.10, pytest-7.1.2, pluggy-1.0.0 rootdir: /var/jenkins_home/workspace/nvtabular_tests/nvtabular, configfile: pyproject.toml plugins: anyio-3.6.1, xdist-2.5.0, forked-1.4.0, cov-3.0.0 collected 1430 items / 1 skipped tests/unit/test_dask_nvt.py ............................................ [ 3%] ........................................................................ [ 8%] .... [ 8%] tests/unit/test_notebooks.py ...... [ 8%] tests/unit/test_tf4rec.py . [ 8%] tests/unit/test_tools.py ...................... [ 10%] tests/unit/test_triton_inference.py ................................ [ 12%] tests/unit/framework_utils/test_tf_feature_columns.py . [ 12%] tests/unit/framework_utils/test_tf_layers.py ........................... [ 14%] ................................................... [ 18%] tests/unit/framework_utils/test_torch_layers.py . [ 18%] tests/unit/loader/test_dataloader_backend.py ...... [ 18%] tests/unit/loader/test_tf_dataloader.py ................................ [ 20%] ........................................s.. [ 23%] tests/unit/loader/test_torch_dataloader.py ............................. [ 25%] ...................................................... [ 29%] tests/unit/ops/test_categorify.py ...................................... [ 32%] ........................................................................
[ 37%] ........................................... [ 40%] tests/unit/ops/test_column_similarity.py ........................ [ 42%] tests/unit/ops/test_drop_low_cardinality.py .. [ 42%] tests/unit/ops/test_fill.py ............................................ [ 45%] ........ [ 45%] tests/unit/ops/test_groupyby.py ..................... [ 47%] tests/unit/ops/test_hash_bucket.py ......................... [ 49%] tests/unit/ops/test_join.py ............................................ [ 52%] ........................................................................ [ 57%] .................................. [ 59%] tests/unit/ops/test_lambda.py .......... [ 60%] tests/unit/ops/test_normalize.py ....................................... [ 63%] .. [ 63%] tests/unit/ops/test_ops.py ............................................. [ 66%] .................... [ 67%] tests/unit/ops/test_ops_schema.py ...................................... [ 70%] ........................................................................ [ 75%] ........................................................................ [ 80%] ........................................................................ [ 85%] ....................................... [ 88%] tests/unit/ops/test_reduce_dtype_size.py .. [ 88%] tests/unit/ops/test_target_encode.py ..................... [ 89%] tests/unit/workflow/test_cpu_workflow.py ...... [ 90%] tests/unit/workflow/test_workflow.py ................................... [ 92%] .......................................................... [ 96%] tests/unit/workflow/test_workflow_chaining.py ... [ 96%] tests/unit/workflow/test_workflow_node.py ........... [ 97%] tests/unit/workflow/test_workflow_ops.py ... [ 97%] tests/unit/workflow/test_workflow_schemas.py ........................... [ 99%] ... [100%]
=============================== warnings summary =============================== ../../../../../usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33 /usr/local/lib/python3.8/dist-packages/dask_cudf/core.py:33: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. DASK_VERSION = LooseVersion(dask.__version__)
../../../.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: 34 warnings /var/jenkins_home/.local/lib/python3.8/site-packages/setuptools/_distutils/version.py:346: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead. other = LooseVersion(other)
nvtabular/loader/__init__.py:19 /var/jenkins_home/workspace/nvtabular_tests/nvtabular/nvtabular/loader/__init__.py:19: DeprecationWarning: The `nvtabular.loader` module has moved to `merlin.models.loader`. Support for importing from `nvtabular.loader` is deprecated, and will be removed in a future version. Please update your imports to refer to `merlin.models.loader`. warnings.warn(
tests/unit/test_dask_nvt.py::test_dask_workflow_api_dlrm[True-Shuffle.PER_WORKER-True-device-0-parquet-0.1] /usr/local/lib/python3.8/dist-packages/tornado/ioloop.py:350: DeprecationWarning: make_current is deprecated; start the event loop first self.make_current()
tests/unit/test_dask_nvt.py: 1 warning tests/unit/test_tf4rec.py: 1 warning tests/unit/test_tools.py: 5 warnings tests/unit/test_triton_inference.py: 8 warnings tests/unit/loader/test_dataloader_backend.py: 6 warnings tests/unit/loader/test_tf_dataloader.py: 66 warnings tests/unit/loader/test_torch_dataloader.py: 67 warnings tests/unit/ops/test_categorify.py: 69 warnings tests/unit/ops/test_drop_low_cardinality.py: 2 warnings tests/unit/ops/test_fill.py: 8 warnings tests/unit/ops/test_hash_bucket.py: 4 warnings tests/unit/ops/test_join.py: 88 warnings tests/unit/ops/test_lambda.py: 1 warning tests/unit/ops/test_normalize.py: 9 warnings tests/unit/ops/test_ops.py: 11 warnings tests/unit/ops/test_ops_schema.py: 17 warnings tests/unit/workflow/test_workflow.py: 27 warnings tests/unit/workflow/test_workflow_chaining.py: 1 warning tests/unit/workflow/test_workflow_node.py: 1 warning tests/unit/workflow/test_workflow_schemas.py: 1 warning /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:384: UserWarning: The deep parameter is ignored and is only included for pandas compatibility. warnings.warn(
tests/unit/test_dask_nvt.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 8 files. warnings.warn(
tests/unit/test_dask_nvt.py::test_merlin_core_execution_managers /usr/local/lib/python3.8/dist-packages/merlin/core/utils.py:431: UserWarning: Existing Dask-client object detected in the current context. New cuda cluster will not be deployed. Set force_new to True to ignore running clusters. warnings.warn(
tests/unit/test_notebooks.py: 1 warning tests/unit/test_tools.py: 17 warnings tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 54 warnings /usr/local/lib/python3.8/dist-packages/cudf/core/frame.py:2940: FutureWarning: Series.ceil and DataFrame.ceil are deprecated and will be removed in the future warnings.warn(
tests/unit/loader/test_tf_dataloader.py: 2 warnings tests/unit/loader/test_torch_dataloader.py: 12 warnings tests/unit/workflow/test_workflow.py: 9 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 2 files. warnings.warn(
tests/unit/ops/test_fill.py::test_fill_missing[True-True-parquet] tests/unit/ops/test_fill.py::test_fill_missing[True-False-parquet] tests/unit/ops/test_ops.py::test_filter[parquet-0.1-True] /usr/local/lib/python3.8/dist-packages/pandas/core/indexing.py:1732: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)
tests/unit/workflow/test_cpu_workflow.py: 6 warnings tests/unit/workflow/test_workflow.py: 12 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 1 files did not have enough partitions to create 10 files. warnings.warn(
tests/unit/workflow/test_workflow.py: 48 warnings /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 20 files. warnings.warn(
tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_parquet_output[True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_parquet_output[True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[True-True-None] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_WORKER] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-Shuffle.PER_PARTITION] tests/unit/workflow/test_workflow.py::test_workflow_apply[False-True-None] /usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py:862: UserWarning: Only created 2 files did not have enough partitions to create 4 files. warnings.warn(
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ========== 1429 passed, 2 skipped, 618 warnings in 709.60s (0:11:49) =========== Performing Post build task... Match found for : : True Logical operation result is TRUE Running script : #!/bin/bash cd /var/jenkins_home/ CUDA_VISIBLE_DEVICES=1 python test_res_push.py "https://api.GitHub.com/repos/NVIDIA-Merlin/NVTabular/issues/$ghprbPullId/comments" "/var/jenkins_home/jobs/$JOB_NAME/builds/$BUILD_NUMBER/log" [nvtabular_tests] $ /bin/bash /tmp/jenkins5371199130175045927.sh
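For reference, the `nvtabular.loader` DeprecationWarning in the warnings summary above only calls for an import-path change. A minimal sketch of that change, based solely on the module path named in the warning (no specific loader class or submodule is assumed):

# Deprecated path: importing through nvtabular emits the DeprecationWarning seen in the CI logs
import nvtabular.loader  # noqa: F401

# New location named in the warning: point imports here instead
import merlin.models.loader  # noqa: F401

Submodule imports would follow the same prefix swap, assuming the package layout was carried over unchanged.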