Row count is exactly 4 times higher from S3 than original row count in the file

Open macsir opened this issue 2 years ago • 8 comments

Describe the bug The row count is exactly 4 times higher than the original row count in the file when the SQL file connector is used to read the file from S3.

Expected behavior The row count should always be the same for the same file.

To Reproduce

Steps to reproduce the behavior:

  1. Upload a CSV file with ~2.7 million rows to an S3 bucket.
  2. Create a file mapping locally that connects to the file in S3.
  3. Select the row count from the mapping.

Additional context SQL used:

CREATE MAPPING ustest
TYPE File
OPTIONS (
    'path' = 's3a://my-bucket/',
    'sharedFileSystem' = 'true',
    'format' = 'csv',
    'glob' = 'ustest.csv',
    'fs.s3a.endpoint' = 's3.region.amazonaws.com',
    'fs.s3a.access.key' = 'abc',
    'fs.s3a.secret.key' = 'efg'
);
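
Step 3 above is not spelled out in the report. A minimal sketch of the count query, assuming it is run against the `ustest` mapping defined above, would be:

```sql
-- Count the rows visible through the file mapping.
-- With the reported bug, this returns 4 times the number of rows in the file.
SELECT COUNT(*) FROM ustest;
```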

Screenshot 2022-07-27 195245 Screenshot 2022-07-27 195221

macsir avatar Jul 27 '22 10:07 macsir

I created an IMap mapping with the same structure and sank the data into it. The row count looks right there. [image]

macsir avatar Jul 27 '22 10:07 macsir

@macsir version?

frant-hartm avatar Aug 01 '22 11:08 frant-hartm

The latest one, 5.1.2.

macsir avatar Aug 01 '22 11:08 macsir

I am not able to reproduce this.

What's your cluster size? Is there anything special about the file? Would you be able to share the file? Or create a dummy file that would reproduce this issue?

Also are you sure the file doesn't contain duplicates (same data 4 times)?

frant-hartm avatar Aug 01 '22 14:08 frant-hartm

Hi @frant-hartm, thanks for looking into it. Here is the file I am using in S3: us_simplified2 (1).zip.

macsir avatar Aug 03 '22 08:08 macsir

@macsir I can confirm this is reproducible with the given file. We will look into it.

Thank you for taking the time to report this.

frant-hartm avatar Aug 03 '22 10:08 frant-hartm

As a temporary workaround you can increase the minimum split size to a value larger than your maximum file size, e.g. for 1 GB add the following to your mapping:

'mapreduce.input.fileinputformat.split.minsize'='1073741824'
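
For illustration, the mapping from the original report with this option added might look like the sketch below. The bucket, endpoint, and keys are the placeholder values from the report, and `CREATE OR REPLACE` is used here only on the assumption that the `ustest` mapping already exists. The value 1073741824 is 1 GiB in bytes (1024 × 1024 × 1024); it should be larger than the largest file being read.

```sql
-- Mapping from the report plus the suggested minimum split size (1 GiB).
CREATE OR REPLACE MAPPING ustest
TYPE File
OPTIONS (
    'path' = 's3a://my-bucket/',
    'sharedFileSystem' = 'true',
    'format' = 'csv',
    'glob' = 'ustest.csv',
    'fs.s3a.endpoint' = 's3.region.amazonaws.com',
    'fs.s3a.access.key' = 'abc',
    'fs.s3a.secret.key' = 'efg',
    'mapreduce.input.fileinputformat.split.minsize' = '1073741824'
);

-- Re-check the count after recreating the mapping.
SELECT COUNT(*) FROM ustest;
```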

frant-hartm avatar Aug 03 '22 10:08 frant-hartm

Thanks, I will give it a go later.

macsir avatar Aug 03 '22 10:08 macsir

Internal Jira issue: HZ-1405

github-actions[bot] avatar Aug 26 '22 11:08 github-actions[bot]

Thanks for working on this!

macsir avatar Aug 26 '22 12:08 macsir