
Generation of a high number of files on disk

Open arita37 opened this issue 9 years ago • 5 comments

When saving an array from CSV files with bcolz, it generates an enormous number of files inside the bcolz folder: 300 MB of CSV generates 300,000 bcolz sub-files on disk.

Is there a way to reduce the number of files?

arita37 avatar Dec 09 '16 21:12 arita37

You can give a target length when creating a new bcolz ctable or carray. There's an explanation in the docs, but basically every carray is split into small chunks, with the chunk size depending on the carray's data type and the total expected length. It's good practice to set this to at least the length of the data you are putting in, but if you really want fewer chunks you can even give a much higher expected length (which makes the chunks longer).

BR

CarstVaartjes avatar Dec 10 '16 14:12 CarstVaartjes
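The effect of the expected length described above can be sketched with a toy model. This is an illustrative approximation, not bcolz's exact chunk-sizing heuristic; the function name and the 64 KB target are assumptions made up for the example:

```python
import math

def estimated_chunk_files(n_items, itemsize, expectedlen=None,
                          target_chunk_bytes=64 * 1024):
    """Toy model of a chunked column store (NOT bcolz's real heuristic):
    each chunk holds roughly target_chunk_bytes of raw data, and declaring
    a larger expected length scales the chunk length up proportionally."""
    # Declaring a higher expected length lets the store pick longer chunks.
    scale = 1 if expectedlen is None else max(1, expectedlen // n_items)
    chunklen = max(1, (target_chunk_bytes * scale) // itemsize)
    return math.ceil(n_items / chunklen)

n = 10_000_000  # ten million int64 values (itemsize 8 bytes)
print(estimated_chunk_files(n, 8))                      # default sizing
print(estimated_chunk_files(n, 8, expectedlen=10 * n))  # ~10x fewer chunks
```

The point is the ratio: declaring a 10x larger expected length yields roughly 10x fewer chunk files on disk.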

Having a large number of files on disk is not good for disk performance... (300k files x 1000).

Is there a way, then, to encapsulate the folder into a single zip/file format?

Even when we just want to move the data, copying takes a long time.

The docs were not clear that the format is a folder and that the number of files can be adjusted.

Thanks


arita37 avatar Dec 10 '16 14:12 arita37

No, you cannot encapsulate it; the smaller chunks have many advantages (performance, appending chunks when streaming data into bcolz, etc.). The expected-length parameter will make the carray use far fewer chunks, though, so that should already help. For moving the files between systems you can do something like this: http://superuser.com/questions/529926/how-can-i-use-tar-command-to-group-files-without-compression

CarstVaartjes avatar Dec 10 '16 14:12 CarstVaartjes
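The linked tar approach can also be done from Python with the standard-library tarfile module. A minimal sketch, assuming a directory of chunk files (the directory and file names here are made up for the example):

```python
import pathlib
import tarfile
import tempfile

# Hypothetical bcolz rootdir; replace with your actual table's directory.
rootdir = pathlib.Path(tempfile.mkdtemp()) / "mytable.bcolz"
rootdir.mkdir()
(rootdir / "chunk_0000.blp").write_bytes(b"...")  # stand-in for a chunk file

archive = rootdir.with_suffix(".tar")
# Mode "w" writes a plain uncompressed tar: bcolz chunks are already
# blosc-compressed, so adding gzip/bz2 would cost CPU for little gain.
with tarfile.open(archive, "w") as tar:
    tar.add(rootdir, arcname=rootdir.name)

print(archive)  # one file to copy between systems instead of thousands
```

Copying the single archive avoids the per-file overhead that makes moving hundreds of thousands of small files so slow.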

You can give a target length when generating a new bcolz ctable or carray.

I'm creating a ctable from a DataFrame, and the number of files is an issue with my storage type. I'd like to try adjusting the destination length to measure the performance impact. Is there an argument I can pass to fromdataframe()? I read the documentation but don't see anything obvious.

fredfortier avatar Jan 18 '18 01:01 fredfortier

I was able to resolve my issue by converting the numpy dtypes in the source DataFrame like so: http://danielhnyk.cz/numpy-pandas-reducing-dtype-size/. Now converting to/from a ctable is an order of magnitude faster, and it uses a limited number of files.

fredfortier avatar Feb 07 '18 22:02 fredfortier
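The dtype-reduction technique from the linked post can be sketched with pandas' `to_numeric` and its `downcast` option. The column names and data here are made up for the example; smaller itemsizes mean more values fit per chunk, so fewer chunk files end up on disk:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(1000) * 100,        # float64 by default
    "qty": np.random.randint(0, 255, size=1000) # int64 by default
})

# Downcast each column to the smallest dtype that can hold its values:
# float64 -> float32 and int64 -> uint8 here, a 2x-8x itemsize reduction.
df["price"] = pd.to_numeric(df["price"], downcast="float")
df["qty"] = pd.to_numeric(df["qty"], downcast="unsigned")

print(df.dtypes)
```

Note that `downcast="float"` can lose precision (float64 to float32), so this is only safe when the reduced precision is acceptable for your data.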