24.75MB PDF results in 820MB 'data scanned' but only 24.75MB 'data read'

Open henricook opened this issue 2 years ago • 9 comments

This PDF file has been taking a long time to scan using clamav, more than I think a file of this size usually would.

Do you think this is a PDF parsing error that's making clam do a lot more work than it should have to do to scan it?

Attached are the file itself and the clamscan results, which show it took a very long 60 seconds:

Loading:     6s, ETA:   0s [========================>]    8.68M/8.68M sigs       
Compiling:   1s, ETA:   0s [========================>]       41/41 tasks 

/home/henricook/Downloads/big pdf.pdf: OK

----------- SCAN SUMMARY -----------
Known viruses: 8676590
Engine version: 1.0.2
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 820.26 MB
Data read: 24.75 MB (ratio 33.14:1)
Time: 66.143 sec (1 m 6 s)
Start Date: 2023:10:23 13:40:00
End Date:   2023:10:23 13:41:06

📎 big pdf.pdf

The Data scanned: 820.26 MB is suspicious - it might be good background to know how Clam determines that number

Edit: This might be my ignorance of PDF standards. Can a 24MB file really contain 820MB of compressed images? Or is this some quirk?

henricook avatar Oct 23 '23 12:10 henricook
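On the question of whether 24 MB can really hold 820 MB: PDF streams are commonly compressed with FlateDecode (zlib/deflate), and highly repetitive data deflates at ratios far beyond 33:1. The snippet below is only a rough illustration using Python's standard `zlib` module. The `expansion_ratio` helper and the all-zero test buffer are made up for this example, and it says nothing about how ClamAV actually arrives at its "Data scanned" figure.

```python
import zlib


def expansion_ratio(raw: bytes, level: int = 9) -> float:
    """Return len(raw) / len(compressed) for a zlib (deflate) round trip."""
    compressed = zlib.compress(raw, level)
    # Sanity check: decompressing must give back the original bytes.
    assert zlib.decompress(compressed) == raw
    return len(raw) / len(compressed)


if __name__ == "__main__":
    # Hypothetical stand-in for a blank bitmap: 8 MiB of zero bytes.
    blank_image = bytes(8 * 1024 * 1024)
    print(f"expansion ratio: {expansion_ratio(blank_image):.0f}:1")
```

So a 33:1 ratio is plausible from compression alone, although it does not rule out the same data being counted more than once.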

Hi,

This file has a lot of files inside it, which would cause it to take longer to scan since each file has to be unpacked and scanned individually. I will open a ticket internally to investigate whether there is an issue parsing this file or not.

Thank you for the report,
Andy

ragusaa avatar Oct 24 '23 20:10 ragusaa

Thanks, if anything I'm amazed that there might really be 820MB of stuff in a file that's only 24MB

henricook avatar Oct 24 '23 20:10 henricook

I am actually scanning with 1.3.0-devel (see the engine version in the output below), and I am seeing a longer scan time but less reported data.

`$ ~/install/bin/clamscan --leave-temps --tempdir=TD big.pdf.pdf -d ~/sigs.downloaded`

LibClamAV Warning: **************************************************
LibClamAV Warning: *** The virus database is older than 7 days! ***
LibClamAV Warning: *** Please update it as soon as possible. ***
LibClamAV Warning: **************************************************
Loading:    18s, ETA:   0s [========================>]    8.67M/8.67M sigs
Compiling:   3s, ETA:   0s [========================>]       41/41 tasks

/home/aragusa/githubeval/big.pdf.pdf: OK

----------- SCAN SUMMARY -----------
Known viruses: 8669832
Engine version: 1.3.0-devel-20231024
Scanned directories: 0
Scanned files: 1
Infected files: 0
Data scanned: 655.34 MB
Data read: 24.75 MB (ratio 26.47:1)
Time: 143.124 sec (2 m 23 s)
Start Date: 2023:10:24 13:08:33
End Date:   2023:10:24 13:10:56

ragusaa avatar Oct 24 '23 20:10 ragusaa
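To compare runs like the two in this thread, it can help to pull the sizes out of the scan summary and recompute the ratio. This is a minimal sketch that assumes the summary uses the `Data scanned: … MB` / `Data read: … MB` wording shown in this thread; `scan_ratio` is a hypothetical helper, not part of ClamAV.

```python
import re

# Matches the two size lines of a clamscan SCAN SUMMARY as printed in this thread.
SIZE_RE = re.compile(r"Data (scanned|read): ([\d.]+) MB")


def scan_ratio(summary: str) -> float:
    """Return 'Data scanned' divided by 'Data read' for one scan summary."""
    sizes = {kind: float(mb) for kind, mb in SIZE_RE.findall(summary)}
    return sizes["scanned"] / sizes["read"]


if __name__ == "__main__":
    first_run = "Data scanned: 820.26 MB\nData read: 24.75 MB (ratio 33.14:1)"
    second_run = "Data scanned: 655.34 MB\nData read: 24.75 MB (ratio 26.47:1)"
    for label, text in (("1.0.2", first_run), ("1.3.0-devel", second_run)):
        print(f"{label}: {scan_ratio(text):.2f}:1")
```

Both runs read the same 24.75 MB, so the difference lies entirely in how much unpacked data each run counted as scanned.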

Even more curious! What's your max scan size set to?

henricook avatar Oct 24 '23 20:10 henricook

Just the default (400M)

ragusaa avatar Oct 24 '23 20:10 ragusaa

Mine's set at 1GB. I suspect that makes a difference to what you see, based on what I saw when looking at this issue.

henricook avatar Oct 24 '23 20:10 henricook
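For anyone wanting to reproduce the comparison with different limits, clamscan takes the limit on the command line as `--max-scansize`. The sketch below only builds the argument list; `clamscan_command` is a hypothetical helper, the 400M default is the value quoted in this thread, and `1024M` is an assumed way of writing the 1GB setting.

```python
import shlex
from typing import List


def clamscan_command(path: str, max_scansize: str = "400M") -> List[str]:
    """Build a clamscan argument list with an explicit --max-scansize limit."""
    return ["clamscan", f"--max-scansize={max_scansize}", path]


if __name__ == "__main__":
    # Print one command per limit so the two summaries can be compared by hand.
    for limit in ("400M", "1024M"):
        print(shlex.join(clamscan_command("big pdf.pdf", limit)))
```

The resulting list can be passed to `subprocess.run` on a machine with ClamAV installed. Whether the limit explains the 820 MB versus 655 MB difference is still only a guess at this point in the thread.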

Sounds good, I'll play with it.

We have some other issues we are currently looking at, so I am not sure when this ticket will be scheduled, but I will keep you up to date with it.

ragusaa avatar Oct 24 '23 22:10 ragusaa

When we do go to investigate this, we should probably debug it at the same time as https://github.com/Cisco-Talos/clamav/issues/590#issuecomment-1262116323, which has another large PDF that takes forever to scan.

Note that we've not gotten around to this in quite some time. It's just hard to prioritize with lots of other things going on.

But yeah, PDFs do have a lot of compressed stuff. Still, I suspect we may be double-scanning some bits or something.

val-ms avatar Oct 25 '23 17:10 val-ms

I am also facing the same issue. I tried changing the config to the old values, but still no success. I am scanning the file using the INSTREAM command over a TCP connection. @micahsnyder, any update on this ticket?

amansaxo avatar Jan 18 '24 11:01 amansaxo
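For reference when reproducing this over TCP: an INSTREAM request is the command followed by length-prefixed chunks (a 4-byte unsigned length in network byte order, then the data), ended by a zero-length chunk. The sketch below is a minimal client under stated assumptions. `instream_frames` and `scan_bytes` are hypothetical names, and the host, port 3310, chunk size, and 300-second timeout are assumptions to adjust for a real setup.

```python
import socket
import struct
from typing import Iterator


def instream_frames(payload: bytes, chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield the byte sequences that make up one INSTREAM request."""
    yield b"zINSTREAM\0"  # null-terminated form of the command
    for offset in range(0, len(payload), chunk_size):
        chunk = payload[offset:offset + chunk_size]
        # Each chunk is prefixed with its length as a 4-byte big-endian integer.
        yield struct.pack("!I", len(chunk)) + chunk
    yield struct.pack("!I", 0)  # a zero-length chunk ends the stream


def scan_bytes(payload: bytes, host: str = "127.0.0.1", port: int = 3310,
               timeout: float = 300.0) -> str:
    """Send payload to a clamd TCP socket and return its reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        for frame in instream_frames(payload):
            sock.sendall(frame)
        reply = b""
        # Read until the null terminator or until the server closes the socket.
        while not reply.endswith(b"\0"):
            data = sock.recv(4096)
            if not data:
                break
            reply += data
    return reply.rstrip(b"\0").decode("utf-8", errors="replace")
```

With a slow-scanning PDF like the one in this issue, the client-side timeout is likely the first thing to hit, so it is set generously here. clamd also caps stream size through its `StreamMaxLength` setting, which is worth checking for files in the 25 MB range.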