
Secor uploads (different) files with the same "name" into different "days".

Open glebsam opened this issue 3 years ago • 2 comments

Secor uploads files with different content but the same "name" into different days when one day ends and the next begins.

In the example below, I have two files:

/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz
/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz

2021-06-16 00:01:05,444 [Thread-4] (com.pinterest.secor.uploader.S3UploadManager) INFO  uploading file /mnt/secor_data/message_logs/partition/9_13/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz to s3://kafka-backup.s3.domain/dumps/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz with no encryption
2021-06-16 00:01:05,444 [Thread-4] (com.pinterest.secor.uploader.S3UploadManager) INFO  uploading file /mnt/secor_data/message_logs/partition/9_13/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz to s3://kafka-backup.s3.domain/dumps/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz with no encryption

Is this OK? Where can I read more about this behaviour? At the least, I need to know which offsets each file contains, and whether there is a way to set that explicitly in the file name (I expected the first offset in a file to be the offset given in its name, but in the case described that is not true).

glebsam, Jun 21 '21 17:06

This is expected. You most likely have messages whose timestamps jump back and forth between the old and new dates. In that case multiple data files are created (one per time bucket), and each record is written to the file for the time bucket it belongs to. There are no duplicate records between those files. The file name convention is . When we get enough content in the files, we will upload all of them at once.
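To make the bucketing described above concrete, here is a minimal Python sketch. It is not Secor's code: `route_batch`, the opaque `1_0` prefix, and the `base_offset` parameter are hypothetical names chosen for illustration, and the sketch assumes day buckets are computed in UTC. It only shows how one batch of records with interleaved timestamps can end up in two files that share a name and differ only in the `dt=` directory.

```python
from collections import defaultdict
from datetime import datetime, timezone


def route_batch(records, base_offset, prefix="1_0"):
    """Group (offset, unix_timestamp) records into one file per day bucket.

    Illustrative model only. Every file in the batch gets the same name,
    built from base_offset (assumed here to be a single offset shared by
    the whole batch), so only the dt=... directory tells the files apart.
    """
    file_name = f"{prefix}_{base_offset:020d}.gz"
    files = defaultdict(list)
    for offset, ts in records:
        # Assumption: the time bucket is the UTC calendar day of the record.
        day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
        files[f"dt={day}/{file_name}"].append(offset)
    return dict(files)


# Records whose timestamps straddle midnight between 2021-06-15 and 2021-06-16.
midnight = int(datetime(2021, 6, 16, tzinfo=timezone.utc).timestamp())
batch = [
    (4302033536, midnight - 2),
    (4302033537, midnight + 1),
    (4302033538, midnight - 1),
    (4302033539, midnight + 2),
]
for path, offsets in route_batch(batch, base_offset=4302033536).items():
    print(path, offsets)
```

In this model each offset lands in exactly one file, so there are no duplicates, but the offset in the name is not necessarily the first offset stored in every file of the batch.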

HenryCaiHaiying, Jun 21 '21 22:06

@HenryCaiHaiying thank you for the answer, but I still don't understand why <previous-persisted-kafka-offset> is in the convention, when I can see that this offset is the first offset the file contains, not the last offset of the previous file. Btw, I have a topic with a single partition and we use an idempotent producer for this topic.

glebsam, Jun 22 '21 09:06