hazelcast/hazelcast: Row count is exactly 4 times higher from S3 than original row count in the file (Issue #21854)

Describe the bug: When the SQL file connector reads a CSV file from S3, the reported row count is exactly 4 times higher than the actual row count of the file.

Expected behavior: The row count should always be the same for the same file.

To Reproduce

Steps to reproduce the behavior:

1. Upload a CSV file with ~2.7 million rows to an S3 bucket.
2. Create a file mapping locally that points to the file in S3.
3. Select the row count from the mapping.

Additional context: SQL to use:

CREATE MAPPING ustest
TYPE File
OPTIONS (
    'path' = 's3a://my-bucket/',
    'sharedFileSystem' = 'true',
    'format' = 'csv',
    'glob' = 'ustest.csv',
    'fs.s3a.endpoint' = 's3.region.amazonaws.com',
    'fs.s3a.access.key' = 'abc',
    'fs.s3a.secret.key' = 'efg'
);
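The last reproduction step (selecting the row count) can be sketched as the query below, run against the `ustest` mapping. The exact query used in the report is not shown, so this form is an assumption.

```sql
-- Count the rows visible through the S3 file mapping.
-- In the report, this count came back as 4 times the number of rows in ustest.csv.
SELECT COUNT(*) FROM ustest;
```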

[Two screenshots attached: "Screenshot 2022-07-27 195245" and "Screenshot 2022-07-27 195221"]

Jul 27 '22 10:07 macsir

When I create an IMap mapping with the same structure and sink the file data into it, the row count looks right.

Jul 27 '22 10:07 macsir

@macsir version?

Aug 01 '22 11:08 frant-hartm

The latest one, 5.1.2.

Aug 01 '22 11:08 macsir

I am not able to reproduce this.

What's your cluster size? Is there anything special about the file? Would you be able to share the file? Or create a dummy file that would reproduce this issue?

Also, are you sure the file doesn't contain duplicates (the same data 4 times)?

Aug 01 '22 14:08 frant-hartm

Hi @frant-hartm, thanks for looking into it. Here is the file I am using in S3: us_simplified2 (1).zip.

Aug 03 '22 08:08 macsir

@macsir I can confirm this is reproducible with the given file. We will look into it.

Thank you for taking the time to report this.

Aug 03 '22 10:08 frant-hartm

As a temporary workaround you can increase the minimum split size to a value larger than your maximum file size, e.g. for 1 GB add the following to your mapping:

'mapreduce.input.fileinputformat.split.minsize'='1073741824'
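As a sketch of how this might look in practice, the mapping from the original report could be recreated with the extra option appended. The bucket, endpoint, and key values are the reporter's placeholders, and the use of `CREATE OR REPLACE` to overwrite the existing mapping is an assumption about how one would apply the change.

```sql
-- Recreate the mapping with the minimum split size raised to 1 GiB
-- (1073741824 bytes), so a file smaller than that is read as a single split.
CREATE OR REPLACE MAPPING ustest
TYPE File
OPTIONS (
    'path' = 's3a://my-bucket/',
    'sharedFileSystem' = 'true',
    'format' = 'csv',
    'glob' = 'ustest.csv',
    'fs.s3a.endpoint' = 's3.region.amazonaws.com',
    'fs.s3a.access.key' = 'abc',
    'fs.s3a.secret.key' = 'efg',
    'mapreduce.input.fileinputformat.split.minsize' = '1073741824'
);
```

The value must stay larger than the largest file matched by the mapping; for bigger files, raise it accordingly.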

Aug 03 '22 10:08 frant-hartm

Thanks, I will give it a go later.

Aug 03 '22 10:08 macsir

Internal Jira issue: HZ-1405

Aug 26 '22 11:08 github-actions[bot]

Thanks for working on this!

Aug 26 '22 12:08 macsir
