secor
Secor uploads different files with the same name into different "days".
Secor uploads files with different content but the same name into different days when one day ends and the next begins.
In the example below, I have two files:
/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz
/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz
2021-06-16 00:01:05,444 [Thread-4] (com.pinterest.secor.uploader.S3UploadManager) INFO uploading file /mnt/secor_data/message_logs/partition/9_13/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz to s3://kafka-backup.s3.domain/dumps/topic-name/dt=2021-06-15/1_0_00000000004302033536.gz with no encryption
2021-06-16 00:01:05,444 [Thread-4] (com.pinterest.secor.uploader.S3UploadManager) INFO uploading file /mnt/secor_data/message_logs/partition/9_13/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz to s3://kafka-backup.s3.domain/dumps/topic-name/dt=2021-06-16/1_0_00000000004302033536.gz with no encryption
Is this OK? Where can I read more about this behaviour? At a minimum, I need to know which offsets each file contains, and whether there is a way to set this explicitly in the file name. (I expected the first offset in a file to be the offset given in its name, but that is not true in the case described.)
This is expected. You most likely have messages with timestamps jumping back and forth between the old and new dates. In that case multiple data files are created (one for each time bucket), and each record is written to the file for the time bucket it belongs to. There are no duplicate records between those files. The file name convention is
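To make the time-bucket explanation concrete, here is a minimal sketch of how messages whose timestamps straddle midnight could end up in two files with the same name under different dt= directories. This is not Secor's code: the function `bucket_by_day`, the `name_prefix` and `name_offset` parameters, and the sample timestamps are all hypothetical, and the sketch takes no position on what the offset in the file name means.

```python
from collections import defaultdict
from datetime import datetime, timezone


def to_ms(iso):
    """Convert an ISO timestamp (interpreted as UTC) to epoch milliseconds."""
    return int(datetime.fromisoformat(iso).replace(tzinfo=timezone.utc).timestamp() * 1000)


def bucket_by_day(messages, name_prefix, name_offset):
    """Group (offset, timestamp_ms) pairs into one file path per UTC day.

    Every path gets the same file name; only the dt= directory differs.
    name_prefix and name_offset are illustrative stand-ins for the
    components seen in the log lines, not Secor parameters.
    """
    files = defaultdict(list)
    for offset, ts_ms in messages:
        day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).strftime("%Y-%m-%d")
        path = f"dt={day}/{name_prefix}_{name_offset:020d}.gz"
        files[path].append(offset)
    return dict(files)


# Hypothetical messages around midnight with timestamps out of order.
messages = [
    (4302033536, to_ms("2021-06-15T23:59:58")),
    (4302033537, to_ms("2021-06-16T00:00:01")),
    (4302033538, to_ms("2021-06-15T23:59:59")),
    (4302033539, to_ms("2021-06-16T00:00:02")),
]

files = bucket_by_day(messages, name_prefix="1_0", name_offset=4302033536)
for path, offsets in sorted(files.items()):
    print(path, offsets)
```

In this sketch each offset lands in exactly one file, so the two files share a name but have no records in common, which matches the "no duplicate records" statement above.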
@HenryCaiHaiying thank you for the answer, but I still don't understand why <previous-persisted-kafka-offset> is in the convention, when I can see that this offset is the first offset the file contains, not the last offset of the previous file. By the way, the topic has a single partition and we use an idempotent producer for it.