hazelcast
Row count is exactly 4 times higher from S3 than original row count in the file
Describe the bug
The row count is exactly 4 times higher than the original row count in the file when the SQL file connector reads the file from S3.
Expected behavior
The row count should always be the same for the same file.
To Reproduce
Steps to reproduce the behavior:
- Upload a CSV file with ~2.7 million rows into an S3 bucket.
- Create a file mapping locally to connect to the file in S3.
- Select the row count from the mapping.
Additional context
SQL to use:
CREATE MAPPING ustest
TYPE File
OPTIONS (
'path' = 's3a://my-bucket/',
'sharedFileSystem' = 'true',
'format' = 'csv',
'glob' = 'ustest.csv',
'fs.s3a.endpoint' = 's3.region.amazonaws.com',
'fs.s3a.access.key' = 'abc',
'fs.s3a.secret.key' = 'efg'
);
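The row count from the last reproduction step can be read with a plain aggregate over the mapping. This is a minimal sketch that assumes only the mapping name ustest from the statement above:

```sql
-- Count the rows the file connector reads through the ustest mapping.
SELECT COUNT(*) FROM ustest;
```

With the affected file, this query reports four times the number of rows that ustest.csv actually contains.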
If I create an IMap mapping with the same structure and sink the file data into it, the row count looks right.
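For readers who want to repeat that check, the following is only a sketch. The column names (id, city, state), the map name ustest_map, and the choice of id as the key are hypothetical, because the header of ustest.csv is not shown in this issue; adjust them to the real file.

```sql
-- Hypothetical columns: the real structure of ustest.csv is not shown in this issue.
CREATE MAPPING ustest_map (
    __key BIGINT,
    city VARCHAR,
    state VARCHAR
)
TYPE IMap
OPTIONS (
    'keyFormat' = 'bigint',
    'valueFormat' = 'json-flat'
);

-- Copy the rows read from S3 into the map; "id" is an assumed unique column.
SINK INTO ustest_map
SELECT CAST(id AS BIGINT), city, state FROM ustest;

-- Count the entries that ended up in the map.
SELECT COUNT(*) FROM ustest_map;
```

Keep in mind that an IMap holds one entry per key, so if the same row were read more than once, the repeats would overwrite each other. A correct-looking count in the map therefore does not by itself rule out duplicated reads from the file.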
@macsir version?
Latest one 5.1.2
I am not able to reproduce this.
What's your cluster size? Is there anything special about the file? Would you be able to share the file? Or create a dummy file that would reproduce this issue?
Also are you sure the file doesn't contain duplicates (same data 4 times)?
Hi @frant-hartm, thanks for looking into it. Here is the file I am using in S3: us_simplified2 (1).zip.
@macsir I can confirm this is reproducible with the given file. We will look into it.
Thank you for taking time to report this.
As a temporary workaround you can increase the minimum split size to a value larger than your maximum file size, e.g. for 1 GB add the following to your mapping:
'mapreduce.input.fileinputformat.split.minsize'='1073741824'
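Put together with the mapping from the original report, the workaround might look like the sketch below. The bucket, endpoint, and credentials are the placeholders from the report, and the use of CREATE OR REPLACE to redefine the existing mapping is an assumption about how you would apply it.

```sql
-- 1073741824 bytes = 1024 * 1024 * 1024 = 1 GB.
-- Choose a value larger than the biggest file the mapping will read.
CREATE OR REPLACE MAPPING ustest
TYPE File
OPTIONS (
    'path' = 's3a://my-bucket/',
    'sharedFileSystem' = 'true',
    'format' = 'csv',
    'glob' = 'ustest.csv',
    'fs.s3a.endpoint' = 's3.region.amazonaws.com',
    'fs.s3a.access.key' = 'abc',
    'fs.s3a.secret.key' = 'efg',
    'mapreduce.input.fileinputformat.split.minsize' = '1073741824'
);
```

After recreating the mapping, running the row count query again should show whether the count now matches the file.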
Thanks, I will give it a go later.
Internal Jira issue: HZ-1405
Thanks for working on this!