
bug: COPY INTO GCS location seems to duplicate path

Open rad-pat opened this issue 1 year ago • 2 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Version

v1.2.618-nightly

What's Wrong?

When issuing a COPY INTO command for GCS, the path is duplicated in the resulting GCS object location.

How to Reproduce?

CREATE table t1 (c1 int null);
INSERT INTO t1 values (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1'
CONNECTION = (
	CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);

Look in GCS and see that the path is bucket/tables/t1/tables/t1

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

rad-pat avatar Aug 21 '24 10:08 rad-pat

So it seems that including a trailing slash at the end of the path makes it behave correctly. I can include the slash, but since COPY INTO always exports one or more Parquet files to the location, shouldn't the location always be treated as a path? Or at least, /tables/ would be the path and t1 the file (?? for the one or many files).

Works correctly:

CREATE table t1 (c1 int null);
INSERT INTO t1 values (1), (2), (3);

COPY INTO 'gcs://bucket/tables/t1/'
CONNECTION = (
	CREDENTIAL = '<snip>'
)
FROM default.t1
FILE_FORMAT = (TYPE = PARQUET);

rad-pat avatar Aug 22 '24 10:08 rad-pat

@rad-pat thank you. It is a bug.

youngsofun avatar Aug 23 '24 10:08 youngsofun

@youngsofun, I presume this is fixed now with #16321?

Was this affecting internal storage if GCS is used, or would that have remained unaffected?

rad-pat avatar Aug 30 '24 14:08 rad-pat

It should be fixed now, please give it a try.

youngsofun avatar Aug 30 '24 15:08 youngsofun

Yes, it seems fixed for COPY INTO, thanks. I just wondered whether there was any effect on the Parquet files stored by the system while this bug was happening?

rad-pat avatar Aug 30 '24 16:08 rad-pat

The behavior of the bug is as follows:

If your location string does not end with a /, copying into bucket/<path> will result in bucket/<path>/<path>/<file_name_containing_uuid> instead of bucket/<path>/<file_name_containing_uuid>
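The behavior described above can be modelled with a small Python sketch. This is only an illustration of the observed paths, not Databend's actual implementation; the function names and the example file name are hypothetical.

```python
def buggy_object_key(bucket: str, path: str, file_name: str) -> str:
    """Model the object key observed while the bug was present.

    Illustrative only: this mimics the reported behavior and is not
    taken from Databend's source code.
    """
    clean = path.strip("/")
    if path.endswith("/"):
        # A trailing slash avoided the duplication.
        return f"{bucket}/{clean}/{file_name}"
    # Without a trailing slash, the path appeared twice.
    return f"{bucket}/{clean}/{clean}/{file_name}"


def expected_object_key(bucket: str, path: str, file_name: str) -> str:
    """Model the intended key: the path appears exactly once."""
    clean = path.strip("/")
    return f"{bucket}/{clean}/{file_name}"


# Hypothetical file name standing in for <file_name_containing_uuid>.
name = "data_uuid.parquet"
print(buggy_object_key("bucket", "tables/t1", name))
print(buggy_object_key("bucket", "tables/t1/", name))
print(expected_object_key("bucket", "tables/t1", name))
```

Under this model, only locations without a trailing slash are affected, which matches the workaround reported earlier in the thread.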

While it's unfortunate to make this mistake, I don't think it's a major issue in practice, especially if you are only using it for unloading. The additional <path>/ can be considered part of the randomly generated path created by Databend.

youngsofun avatar Aug 31 '24 08:08 youngsofun