makezip improvements
Now that we know more about zips, there's much room for improvement:
- It seems that LZMA compression at level 3 takes the same amount of time as uncompressed (probably due to overheads and USB3 bottlenecks), with a compression ratio of about 62%. A higher LZMA level has about 6% better compression but takes about 50% more time.
- File splitting needs to be reduced from 500 gigs to something more like 100 gigs, as LTO tapes can be filled more completely this way.
- Looks like the logging of makezip within sipcreator has repetition and some minor errors (like two verifications, but saying status=finished).
- It's worth investigating if 7za actually does the Test at the end of the zipping process. There's a very long stall at the end of a large zipping process, perhaps this is a fixity check, or maybe it is generating the hashes? It's worth figuring this out by intentionally damaging files with a hex editor and also trying to analyse the source code.
- the mediatrace/mediainfo and dfxml all seem to hang around outside the SIP folder - these should be deleted
- the extra manifest files in the logs need to be deleted too.
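
The "intentionally damage a file" idea from the list above can be tried on a small scale before touching a real SIP. This is only a sketch: it uses Python's standard `zipfile` module rather than 7za, so it shows that a single flipped byte in a stored member is caught by the CRC-32 check, but says nothing about what 7za does during its final stall. The function names and the test payload are made up for illustration.

```python
import io
import zipfile


def build_stored_zip(name: str, payload: bytes) -> bytes:
    """Return the bytes of a zip holding one uncompressed (stored) member."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
        archive.writestr(name, payload)
    return buffer.getvalue()


def flip_byte_in_payload(zip_bytes: bytes, payload: bytes) -> bytes:
    """Simulate hex-editor damage by inverting one byte inside the member data."""
    start = zip_bytes.find(payload)
    if start == -1:
        raise ValueError("payload not found; is the member stored uncompressed?")
    damaged = bytearray(zip_bytes)
    damaged[start + len(payload) // 2] ^= 0xFF
    return bytes(damaged)


def first_bad_member(zip_bytes: bytes):
    """Return the name of the first member failing its CRC-32 check, else None."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.testzip()


if __name__ == "__main__":
    payload = b"frame data " * 1000
    clean = build_stored_zip("reel2/frame_000001.tif", payload)
    damaged = flip_byte_in_payload(clean, payload)
    print("clean archive:", first_bad_member(clean))
    print("damaged archive:", first_bad_member(damaged))
```

On a real archive the comparable check would be 7za's own `t` (test) command, run once on an untouched zip and once on a copy damaged in the same way.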
@stephenmcconnachie curious to know about any compression experiments ye did. I'm going to write a blog on my benchmarking - using a Mac Pro with 12 threads, USB3 source, Pegasus Thunderbolt RAID destination and a 1.6 TB 2K DCDM as input - it takes a similar amount of time (about 16 hours) to compress with LZMA level 3. Maybe this is because of USB3/150K files overhead etc. - I usually get a flurry of files being zipped, then it stalls, possibly to gather the CRC32 checksums?
Here are some benchmarks with reel 2 of a DCDM (no easily compressed opening credits). I found that when I tested with the whole DCDM, the compression ratios for LZMA were much better (14% better) than the single reel values here - probably because of more instances of easily compressible data like closing and opening credits, fades to black, etc.
I'm running further tests now on the speeds for uncompressed whole zips just to see how similar it is.
compression_type | compression_level | duration (h:mm:ss) | source_folder_size (bytes) | zip_file_size (bytes) | compression_ratio
--- | --- | --- | --- | --- | ---
Copy | 1 | 0:06:57.940210 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 3 | 0:06:27.893932 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 5 | 0:05:59.902307 | 24450057988 | 24450600882 | 1.00002220420092
Copy | 7 | 0:05:05.365766 | 24450057988 | 24450600882 | 1.00002220420092
Deflate | 1 | 0:05:22.166162 | 24450057988 | 21961904137 | 0.898235257674187
Deflate | 3 | 0:04:42.248375 | 24450057988 | 21961904137 | 0.898235257674187
Deflate | 5 | 0:05:03.380526 | 24450057988 | 21831391361 | 0.892897324485478
Deflate | 7 | 0:08:05.586272 | 24450057988 | 21761943237 | 0.890056917152535
Deflate64 | 1 | 0:05:17.997120 | 24450057988 | 21564308133 | 0.881973700986054
Deflate64 | 3 | 0:05:03.372004 | 24450057988 | 21564308133 | 0.881973700986054
Deflate64 | 5 | 0:05:09.055852 | 24450057988 | 21393258699 | 0.874977830706975
Deflate64 | 7 | 0:09:01.028490 | 24450057988 | 21346412968 | 0.873061854433095
BZip2 | 1 | 0:13:45.623366 | 24450057988 | 20338580411 | 0.831841806714
BZip2 | 3 | 0:15:25.845792 | 24450057988 | 18796070689 | 0.768753624151936
BZip2 | 5 | 0:14:04.008644 | 24450057988 | 18337629336 | 0.750003511034618
BZip2 | 7 | 0:46:38.657094 | 24450057988 | 18337511965 | 0.749998710596105
LZMA | 1 | 0:13:59.295197 | 24450057988 | 20206549136 | 0.82644176737402
LZMA | 3 | 0:12:38.996991 | 24450057988 | 18621387158 | 0.76160912040124
LZMA | 5 | 0:18:08.816210 | 24450057988 | 16949567062 | 0.693232182529742
LZMA | 7 | 0:17:11.254498 | 24450057988 | 16953373782 | 0.693387876230014
PPMd | 1 | 0:16:43.967409 | 24450057988 | 20482188787 | 0.837715345994377
PPMd | 3 | 0:18:22.252274 | 24450057988 | 19155556252 | 0.783456475293493
PPMd | 5 | 0:20:37.762131 | 24450057988 | 17747657942 | 0.725873858896796
PPMd | 7 | 0:30:14.558052 | 24450057988 | 16221178580 | 0.663441313225568
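
For anyone re-deriving the figures in the table, here is a small sketch of the arithmetic. It assumes compression_ratio is simply zip_file_size divided by source_folder_size, and that the duration column holds h:mm:ss strings. The function names are illustrative and not taken from makezip.

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an 'h:mm:ss.ffffff' string such as '0:12:38.996991'."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def compression_ratio(source_bytes: int, zip_bytes: int) -> float:
    """Zip size as a fraction of the source size (lower means a smaller zip)."""
    return zip_bytes / source_bytes


def throughput_mb_per_second(source_bytes: int, duration: timedelta) -> float:
    """Source data processed per second, in units of 10**6 bytes."""
    return source_bytes / duration.total_seconds() / 1_000_000


if __name__ == "__main__":
    source = 24450057988
    # Two rows copied from the table: LZMA level 3 and Copy level 3.
    rows = {
        "LZMA 3": ("0:12:38.996991", 18621387158),
        "Copy 3": ("0:06:27.893932", 24450600882),
    }
    for label, (duration_text, zip_size) in rows.items():
        duration = parse_duration(duration_text)
        ratio = compression_ratio(source, zip_size)
        print(f"{label}: ratio {ratio:.4f}, saving {1 - ratio:.1%}, "
              f"{throughput_mb_per_second(source, duration):.1f} MB/s")
```

A ratio slightly above 1 for the Copy rows is expected, since the zip container adds its own headers on top of the stored data.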
To clarify - I would have expected more of a gap in zipping time between LZMA level 3 and uncompressed - it's the uncompressed that does the stalling after about 800 megs or so are processed (probably some I/O overhead and bottlenecks due to image sequence silliness).
Hi Kieran, we've never actually used any compression method for our DCDMs (or DCP or DPX either) - we use TAR only, not ZIP. I'm interested in your tests for ZIP modelling though, for DCDM, as obviously the data storage overheads are huge...
Yeah, we defo don't do it for DCP. For large DCDMs, I keep seeing that the uncompressed 7za processes take longer than Deflate and even LZMA level 3. So the zip comes out at roughly 60% of the original size for even less time than uncompressed. I think it all depends on the I/O, and the >100k small files are definitely a factor here, but we are moving ahead with LZMA level 3 for DCDM now anyhow - until we figure out how to get RAWcooked to play nice with the folder structure of DCDM - and the confusion that will arise from having audio stems.
Keep us posted, I’d love to learn how that ZIP workflow pans out
In reply to kieranjol, 04 November 2019 09:36, Re: [kieranjol/IFIscripts] makezip improvements (#366)
Most of these have been added. I need to add a -uncompressed option to both sipcreator and makezip, as there are some DCDMs that have terrible compression ratios and the maxed out CPUs and added verification time are simply not worth the hassle. It's totally worth it for the times when we get significant reductions in size though.
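
A rough sketch of how an uncompressed switch might map onto the 7za command line, assuming makezip shells out to 7za. The function name `build_7za_command`, its parameters and its defaults are hypothetical and not makezip's actual code. The switches themselves (`-tzip`, `-mm=`, `-mx=`, `-v`) are standard 7-Zip ones: `-mm=Copy` stores files without compression and `-v100g` splits the output into volumes of roughly 100 gigs.

```python
def build_7za_command(source_folder, zip_path, uncompressed=False,
                      level=3, volume_size="100g"):
    """Build (but do not run) a 7za argument list for creating a zip archive."""
    command = ["7za", "a", "-tzip"]
    if uncompressed:
        # Store files as-is; avoids the CPU cost when the material barely compresses.
        command.append("-mm=Copy")
    else:
        command.extend(["-mm=LZMA", f"-mx={level}"])
    if volume_size:
        # Split the output into volumes of this size, e.g. 100g.
        command.append(f"-v{volume_size}")
    command.extend([zip_path, source_folder])
    return command


if __name__ == "__main__":
    print(" ".join(build_7za_command("sip_folder", "sip.zip")))
    print(" ".join(build_7za_command("sip_folder", "sip.zip", uncompressed=True)))
```

The returned list could be handed to `subprocess.check_call`; keeping the command construction separate from execution makes the two modes easy to compare and test.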