Large size of the tar files generated by TarWriter
- Problem
I tried transforming the ImageNet-C dataset (an image classification dataset) into the webdataset tar file format. The original dataset consists of 4 tar files that store the image samples. The total size of these 4 files (blur.tar, digital.tar, noise.tar, and weather.tar) is 46.7 GB, and they contain 3,750,000 image samples.
After the transformation, the total size of the webdataset shards is 60.7 GB (+14 GB). From issue #115, I know that each entry in a tar file takes at least 1024 bytes because of the tar header and the padding. In my understanding, the major difference between the original 4 image tar files and the webdataset shards is that the shards also contain a .cls file for every sample. Each .cls file is tiny (< 4 bytes) but still occupies a lot of disk space because of its tar header and padding.
However, when I calculate the average disk usage of each .cls file, 14 GB / 3,750,000 files ≈ 4008 bytes, it turns out to be much larger than the 1024 bytes mentioned in issue #115.
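For reference, the arithmetic behind that number (assuming the 14 GB growth is measured in GiB) is just the size growth divided by the number of extra .cls entries:

```python
# Back-of-the-envelope check of the per-sample overhead added by the conversion,
# assuming the "14GB" growth is measured in GiB (2**30 bytes).
extra_bytes = 14 * 2**30
num_samples = 3_750_000           # one extra .cls entry per sample
print(extra_bytes / num_samples)  # ~4008 bytes per sample, well above 1024
```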
Then I picked one webdataset shard, 000000.tar (18 MB), unpacked it, and repacked it into a new tar file, example.tar (15 MB), in the shell (see the supplementary section below). example.tar is 3 MB smaller than 000000.tar, which may account for the size growth after converting to webdataset. As far as I can tell, the two tar files contain exactly the same contents, and I have no idea why 000000.tar is larger than example.tar.
- Supplementary:
I have uploaded both files (000000.tar and example.tar) to Google Drive, and the commands I used to generate example.tar are shown below.
(base) {22-09-01 2:25}vm:~/playground chenyaofo% ls -lha
total 18M
drwxrwxr-x 2 chenyaofo chenyaofo 4.0K Sep 1 02:25 .
drwxr-xr-x 39 chenyaofo chenyaofo 4.0K Sep 1 02:25 ..
-rw-rw-r-- 1 chenyaofo chenyaofo 18M Sep 1 02:19 000000.tar
(base) {22-09-01 2:25}vm:~/playground chenyaofo% mkdir images
(base) {22-09-01 2:25}vm:~/playground chenyaofo% tar -xf 000000.tar -C images
(base) {22-09-01 2:26}vm:~/playground chenyaofo% ls -lha images | head
total 22M
drwxrwxr-x 2 chenyaofo chenyaofo 160K Sep 1 02:26 .
drwxrwxr-x 3 chenyaofo chenyaofo 4.0K Sep 1 02:25 ..
-r--r--r-- 1 chenyaofo chenyaofo 3 Aug 31 08:43 ILSVRC2012_val_00000012.cls
-r--r--r-- 1 chenyaofo chenyaofo 9.5K Aug 31 08:43 ILSVRC2012_val_00000012.jpg
-r--r--r-- 1 chenyaofo chenyaofo 3 Aug 31 08:43 ILSVRC2012_val_00000063.cls
-r--r--r-- 1 chenyaofo chenyaofo 2.6K Aug 31 08:43 ILSVRC2012_val_00000063.jpg
-r--r--r-- 1 chenyaofo chenyaofo 3 Aug 31 08:43 ILSVRC2012_val_00000085.cls
-r--r--r-- 1 chenyaofo chenyaofo 5.0K Aug 31 08:43 ILSVRC2012_val_00000085.jpg
-r--r--r-- 1 chenyaofo chenyaofo 3 Aug 31 08:43 ILSVRC2012_val_00000108.cls
(base) {22-09-01 2:26}vm:~/playground chenyaofo% tar -cf example.tar -C images .
(base) {22-09-01 2:26}vm:~/playground chenyaofo% tar -tvf 000000.tar | head
-r--r--r-- bigdata/bigdata 3 2022-08-31 08:43 ILSVRC2012_val_00014375.cls
-r--r--r-- bigdata/bigdata 8948 2022-08-31 08:43 ILSVRC2012_val_00014375.jpg
-r--r--r-- bigdata/bigdata 3 2022-08-31 08:43 ILSVRC2012_val_00049752.cls
-r--r--r-- bigdata/bigdata 7118 2022-08-31 08:43 ILSVRC2012_val_00049752.jpg
-r--r--r-- bigdata/bigdata 3 2022-08-31 08:43 ILSVRC2012_val_00008913.cls
-r--r--r-- bigdata/bigdata 7901 2022-08-31 08:43 ILSVRC2012_val_00008913.jpg
-r--r--r-- bigdata/bigdata 3 2022-08-31 08:43 ILSVRC2012_val_00016187.cls
-r--r--r-- bigdata/bigdata 9061 2022-08-31 08:43 ILSVRC2012_val_00016187.jpg
-r--r--r-- bigdata/bigdata 3 2022-08-31 08:43 ILSVRC2012_val_00039550.cls
-r--r--r-- bigdata/bigdata 6127 2022-08-31 08:43 ILSVRC2012_val_00039550.jpg
(base) {22-09-01 2:26}vm:~/playground chenyaofo% tar -tvf example.tar | head
drwxrwxr-x chenyaofo/chenyaofo 0 2022-09-01 02:26 ./
-r--r--r-- chenyaofo/chenyaofo 3 2022-08-31 08:43 ./ILSVRC2012_val_00020387.cls
-r--r--r-- chenyaofo/chenyaofo 12795 2022-08-31 08:43 ./ILSVRC2012_val_00017458.jpg
-r--r--r-- chenyaofo/chenyaofo 3 2022-08-31 08:43 ./ILSVRC2012_val_00020643.cls
-r--r--r-- chenyaofo/chenyaofo 9208 2022-08-31 08:43 ./ILSVRC2012_val_00029569.jpg
-r--r--r-- chenyaofo/chenyaofo 7825 2022-08-31 08:43 ./ILSVRC2012_val_00031200.jpg
-r--r--r-- chenyaofo/chenyaofo 6799 2022-08-31 08:43 ./ILSVRC2012_val_00041945.jpg
-r--r--r-- chenyaofo/chenyaofo 3 2022-08-31 08:43 ./ILSVRC2012_val_00046556.cls
-r--r--r-- chenyaofo/chenyaofo 3 2022-08-31 08:43 ./ILSVRC2012_val_00026275.cls
-r--r--r-- chenyaofo/chenyaofo 3 2022-08-31 08:43 ./ILSVRC2012_val_00003294.cls
(base) {22-09-01 2:26}vm:~/playground chenyaofo% ls -lha
total 33M
drwxrwxr-x 3 chenyaofo chenyaofo 4.0K Sep 1 02:26 .
drwxr-xr-x 39 chenyaofo chenyaofo 4.0K Sep 1 02:26 ..
-rw-rw-r-- 1 chenyaofo chenyaofo 18M Sep 1 02:19 000000.tar
-rw-rw-r-- 1 chenyaofo chenyaofo 15M Sep 1 02:26 example.tar
drwxrwxr-x 2 chenyaofo chenyaofo 160K Sep 1 02:26 images
In addition, here is some system and library information.
I use Ubuntu 22.04 (Linux vm 5.15.0-39-generic #42-Ubuntu SMP Thu Jun 9 23:42:32 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux). I use python=3.9.13 and webdataset=0.2.20 to generate the webdataset shards. The version of the tar command is tar (GNU tar) 1.34.
The code used to generate the webdataset shards is available in a GitHub Gist.
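For context, here is a minimal sketch of the kind of conversion script used (the actual code is in the Gist; the shard pattern, maxcount, and the iterate_imagenet_c helper are placeholders):

```python
import webdataset as wds

# Minimal sketch of the conversion, not the actual Gist code.
# iterate_imagenet_c() is a hypothetical helper yielding (key, jpeg_bytes, label).
with wds.ShardWriter("imagenet-c-%06d.tar", maxcount=10000) as sink:
    for key, jpeg_bytes, label in iterate_imagenet_c():
        sink.write({
            "__key__": key,              # e.g. "ILSVRC2012_val_00000012"
            "jpg": jpeg_bytes,           # raw JPEG bytes, written as-is
            "cls": str(label).encode(),  # the tiny .cls entry holding the class index
        })
```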
I think I may have found the reason for the large tar file size.
I found a similar question on Stack Overflow. In short, Python's built-in tarfile library (Python >= 3.8) uses tarfile.PAX_FORMAT by default when writing tar files. In contrast, the default format of the tar command on Linux is the GNU format (tarfile.GNU_FORMAT). See more about these two formats here and here. PAX_FORMAT writes longer tar headers than GNU_FORMAT, which leads to larger tar files.
In my opinion, the extra information in the PAX headers is useless and unnecessary for machine learning / deep learning training. Therefore, I recommend using tarfile.GNU_FORMAT to generate webdataset shards; smaller shards help the efficiency of data loading. I suggest adding a format parameter when opening the tar file in TarWriter.
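As an illustration of what the GNU format saves, an existing shard can be repacked with the standard tarfile module and an explicit format argument (file names below are placeholders):

```python
import tarfile

# Repack an existing shard with GNU headers instead of the PAX headers that
# Python's tarfile module writes by default since Python 3.8.
# "000000.tar" and "000000-gnu.tar" are placeholder file names.
with tarfile.open("000000.tar", "r") as src, \
     tarfile.open("000000-gnu.tar", "w", format=tarfile.GNU_FORMAT) as dst:
    for member in src:
        dst.addfile(member, src.extractfile(member))
```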
Thanks for tracking this down. I wasn't aware they had changed the default; that is good to know. I will look into fixing this.
A couple of reasons why we haven't noticed this before are that (1) we usually transform data with tarp after writing it with ShardWriter, and that "resets" the format, and (2) if there are many small files, we use compressed tar files (.tgz, .tar.gz), which removes the overhead of the tar headers.
If you have a lot of small files in each sample, there is also an alternative file format in which each individual sample is represented as a single MessagePack file containing a dictionary.
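Loosely, such a layout might look like the sketch below; the "msgpack" key, the samples iterable, and the reader-side unpacking are assumptions for illustration rather than webdataset defaults:

```python
import msgpack           # third-party msgpack package
import webdataset as wds

# Sketch of a "one packed file per sample" layout: each sample becomes a single
# tar entry whose payload is a MessagePack-encoded dict, so the many tiny .cls
# entries (and their per-entry tar headers) disappear.
# `samples` is a placeholder iterable of (key, jpeg_bytes, label) tuples.
with wds.TarWriter("shard-000000.tar") as sink:
    for key, jpeg_bytes, label in samples:
        payload = msgpack.packb({"jpg": jpeg_bytes, "cls": label}, use_bin_type=True)
        sink.write({"__key__": key, "msgpack": payload})
```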
Default is now USTAR_FORMAT, which should be more compact.
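For a rough, illustrative comparison of the per-member header cost of the formats (standalone tarfile code, not part of webdataset; the float mtime mimics a typical writer and is what makes CPython's PAX writer add an extended-header record):

```python
import tarfile
import time

# Compare the header bytes each tar format emits for one small member.
# The float mtime mimics a typical writer; CPython's tarfile then stores the
# full-precision value in an extra PAX extended-header record.
info = tarfile.TarInfo("ILSVRC2012_val_00000012.cls")
info.size = 3
info.mtime = time.time()
for name, fmt in [("ustar", tarfile.USTAR_FORMAT),
                  ("gnu", tarfile.GNU_FORMAT),
                  ("pax", tarfile.PAX_FORMAT)]:
    print(name, len(info.tobuf(format=fmt)))  # ustar/gnu: 512 bytes, pax: more
```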