
How to speed up "File Tail"?

Open dongbin86 opened this issue 8 years ago • 5 comments

I have a log file named access.log, about 3.8 GB in size.

I created a simple pipeline: File Tail → Trash.

It is not rate limited, but the Record Throughput is less than 20 records/sec.

The pipeline uses the default configuration.

How can I optimize it?

dongbin86 avatar Mar 24 '17 12:03 dongbin86

What version of SDC? Can you export the pipeline and post it here? You should get thousands of records/sec!

metadaddy avatar Mar 24 '17 17:03 metadaddy

7 seconds to ingest 9890 records on my laptop:

(screenshot of the pipeline run)
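That works out to roughly 9890 / 7 ≈ 1,400 records/sec, versus the ~20 records/sec you're seeing.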

metadaddy avatar Mar 24 '17 20:03 metadaddy

SDC version 2.4.0.0. Pipeline export attached: f056e6c0-40bd-4cdc-bb4a-8df2a53576c2.txt (GitHub does not accept .json attachments, so I renamed it to .txt; download it and rename it back).

Yesterday I used a script to write lines to a file at about 10,000 lines/sec, and File Tail was able to keep up with that rate. So I wonder whether the problem is the file size: does every batch rewrite the offset, and does the next batch have to seek from the top of the file back to that offset? I need your help, @metadaddy
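For context, a generator like that can be as simple as the sketch below (a hypothetical example; the file name and line format are placeholders, not taken from the actual script or log):

```python
#!/usr/bin/env python3
# Hypothetical sketch: append roughly 10,000 lines per second to a log file.
import time

TARGET_LINES_PER_SEC = 10_000
LOG_PATH = "access.log"  # placeholder path

with open(LOG_PATH, "a", buffering=1) as log:
    line_no = 0
    while True:
        start = time.time()
        for _ in range(TARGET_LINES_PER_SEC):
            line_no += 1
            log.write(f"test line {line_no}\n")
        # Sleep out the remainder of the second to keep an approximate rate.
        elapsed = time.time() - start
        if elapsed < 1.0:
            time.sleep(1.0 - elapsed)
```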

dongbin86 avatar Mar 25 '17 02:03 dongbin86

Also, I would like to know when File Tail is triggered to collect the log file. If I have a file but no new lines are appended, will File Tail never be triggered?

dongbin86 avatar Mar 25 '17 03:03 dongbin86

I looked at your pipeline - I don't see anything that would slow it down.

The File Tail reader only seeks at the beginning of each batch, so that shouldn't impact performance much. You could test this by changing the batch size. Note: you will need to edit sdc.properties to increase the batch size beyond 1000 - see https://streamsets.com/documentation/datacollector/latest/help/#Troubleshooting/Troubleshooting_title.html#concept_ay2_w1l_2s
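For example, something like this in sdc.properties (assuming the property name is still production.maxBatchSize in your version - check the linked docs), followed by an SDC restart:

```
# sdc.properties - raise the maximum records per batch (default is 1000)
production.maxBatchSize=5000
```

You would then also raise the batch size setting on the File Tail origin itself.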

File Tail will read all of the existing data and then wait for new data, so it should work for you. If the file will not be changing, the Directory origin might be a better choice.

metadaddy avatar Mar 28 '17 16:03 metadaddy