File management options with RocksDB

Open mdcallag opened this issue 7 years ago • 14 comments

This is a discussion. Eventually it might become a feature request.

The old InnoDB used a few files for all tables. InnoDB compression requires file-per-table and with that there is one *.ibd file per table and a separate directory per database so users rarely end up with thousands of files in a directory.

TokuDB used files-per-table (or maybe file-per-index) with all files in one directory, which means there will be many more files in one directory compared to InnoDB file-per-table. Some TokuDB users requested files in the database directory to avoid too many files in one directory. I don't know whether that was ever implemented for TokuDB.

By database directory I mean $data_dir/$database_name

RocksDB puts all files in one directory, and there can be many files given that a typical SST file is smaller than 128MB. MyRocks will never support file-per-table like InnoDB, but I think it might be possible to support a directory per column family (files for each column family in different directories). The question then is whether we make it easier to do column-family-per-database, including use of the database directories for the LSM files.
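
To give a rough sense of scale, the file count can be estimated from the data size and the SST file size. This is only a back-of-the-envelope sketch: the 1 TiB database size and the 64 MiB SST size are assumed values chosen for illustration, not measurements, and `estimate_sst_count` is a hypothetical helper.

```python
import math

MiB = 1024 ** 2
TiB = 1024 ** 4

def estimate_sst_count(db_bytes: int, sst_bytes: int) -> int:
    """Rough number of SST files if every file were filled to sst_bytes."""
    return math.ceil(db_bytes / sst_bytes)

# Assumed values: 1 TiB of data on disk, 64 MiB per SST file.
print(estimate_sst_count(1 * TiB, 64 * MiB))  # 16384
```

Real SST files are often smaller than the target size, so the actual count in one directory would tend to be higher than this estimate.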

mdcallag avatar Sep 06 '16 22:09 mdcallag

This is more like a feature request for RocksDB. I talked with Siying about this. Currently RocksDB does not have an option to change directories by column family, but it could be extended to support that. One potential concern for me is handling online binary backups: a backup tool like myrocks_hotbackup will have to be smart enough to traverse all directories.
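
The traversal part of that concern can be sketched as follows. This is not how myrocks_hotbackup works; it only illustrates walking a directory tree to collect SST files. The `cf_a`/`cf_b` directory names and both function names are hypothetical, and a real hot backup also has to copy other RocksDB files and keep the copy consistent, which this sketch ignores.

```python
import os
import tempfile
from typing import List

def find_sst_files(root: str) -> List[str]:
    """Return every *.sst file under root, searching subdirectories too."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".sst"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)

def demo() -> List[str]:
    # Build a throwaway layout: one hypothetical directory per column family.
    with tempfile.TemporaryDirectory() as root:
        for cf_dir, sst in [("cf_a", "000010.sst"), ("cf_b", "000011.sst")]:
            os.makedirs(os.path.join(root, cf_dir))
            open(os.path.join(root, cf_dir, sst), "w").close()
        # A non-SST file that the search should skip.
        open(os.path.join(root, "cf_a", "LOG"), "w").close()
        return [os.path.relpath(p, root) for p in find_sst_files(root)]

print(demo())
```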

yoshinorim avatar Sep 07 '16 20:09 yoshinorim

Yes, it is a feature request. "enhancement" seems like the label that fits best. We don't have to implement it, but I am sure this request will be repeated many times over many years, so I prefer we keep something open to explain our plans.

mdcallag avatar Sep 07 '16 20:09 mdcallag

Well, I guess it will be too much to ask, but I wanted this for many years for InnoDB, and we still do not have it, so I will ask it for RocksDB.

I want support for separate storage classes (i.e. slow storage and fast storage). This could be done per "column family", or it might even be interesting to do it "per level". That is, we might want all Level 7 blocks to be stored on a slow but large disk.
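
As a rough illustration of why per-level placement could matter, here is a minimal sketch of how data is spread across levels. It assumes an idealized leveled LSM in which every level is full and each level is 10 times larger than the previous one (10 is the RocksDB default for max_bytes_for_level_multiplier). The six-level count is an assumption, `level_fractions` is a hypothetical helper, and real trees are rarely this tidy.

```python
from typing import List

def level_fractions(num_levels: int, multiplier: int = 10) -> List[float]:
    """Fraction of total data in each level, assuming every level is full and
    each level is `multiplier` times larger than the one before it."""
    sizes = [multiplier ** i for i in range(num_levels)]
    total = sum(sizes)
    return [s / total for s in sizes]

fractions = level_fractions(num_levels=6)
print(f"last level holds {fractions[-1]:.1%} of the data")  # 90.0%
```

Under those assumptions, moving only the last level to slow storage would move roughly nine tenths of the data.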

vadimtk avatar Sep 07 '16 20:09 vadimtk

RocksDB has a db_paths option to put database files in multiple directories. It is not widely used today and there isn't an option for per-column-family db_paths, but it might be feasible. https://github.com/facebook/rocksdb/blob/master/include/rocksdb/options.h#L951

mdcallag avatar Sep 07 '16 20:09 mdcallag

If I want to remove all the "old" data from all my rocksdb tables, is it safe to just remove all .sst files, older than that "old" date? Will I get into some data corruption after I do so?

mickvav avatar Sep 12 '17 18:09 mickvav

While I don't understand why @mdcallag proposes a directory per column family (perhaps he suggests an easier-to-implement hack), I'd like to comment on the problem with MySQL/MariaDB schemas, aka "databases". The other storage engines I have used store the data for each schema in a separate directory, so if you want to make a binary backup of one schema, you can. MyRocks still does not have this level of integration with MySQL/MariaDB. There are other considerations. You might want to limit the size of an individual database; today that means limiting the whole MySQL instance. But if we could do this per schema, we could shard by simply making several schemas under one LRU cache. This would allow better RAM utilization, because it would save us from running several instances, each with its own non-shared RAM cache.

AGenchev avatar Dec 13 '20 02:12 AGenchev

RocksDB storage is in terms of column families -- there is an LSM tree per column family. If support were added for column-families to use different directories, then someone could add support to use a column family per schema.

WRT binary backup per schema, I am not sure what you expect. Does InnoDB hot backup (from upstream or Percona) support that? For what I propose above, hot backup per schema is feasible once schemas use different column families. Otherwise I wonder if by binary backup you mean to put a server into read-only mode, then do a file copy, then take it out of read-only mode.

mdcallag avatar Dec 13 '20 15:12 mdcallag

When you create multiple CFs, the problem is that MyRocks hard-divides your block cache among them. In general this is not what you want (well, unless you have so many cores and so much RAM that the shared cache becomes a bottleneck on NUMA SMP). Then, if you have, say, 5 CFs for 5 databases, the load varies between them, and the data does not fit in RAM, aren't you going to have much worse (select) performance compared to having 5 databases with only one CF (with the same amount of memory for block cache)? Or can we have multiple CF/LSM trees which dynamically share the same block cache, like the people at rockset.com claim to have implemented?

AGenchev avatar Dec 18 '20 20:12 AGenchev

I support having an option to use a directory per schema, I just don't think it will happen. I don't speak for the RocksDB team and don't know their plans.

Within a single RocksDB instance, the block cache is shared across all CFs so I am confused by what you write about Rockset.

There are options to share things (thread pools, maybe the block cache) across multiple RocksDB instances running in the same Linux process. But I don't know much about that as I never run RocksDB in that manner.

mdcallag avatar Dec 18 '20 20:12 mdcallag

Well, I read it here: https://minervadb.com/index.php/2018/11/02/tuning-myrocks-for-performance/ He says "Do not set block_cache at rocksdb_default_cf_options (block_based_table_factory). If you do provide a block cache size on a default column family, the same cache is NOT reused for all such column families."

AGenchev avatar Dec 18 '20 22:12 AGenchev

I set rocksdb_block_cache_size in my.cnf and AFAIK the cache is shared by all CFs

mdcallag avatar Dec 18 '20 22:12 mdcallag

my.cnf configuration samples can be found here -- https://github.com/facebook/mysql-5.6/wiki/my.cnf-tuning There is a dedicated MyRocks parameter rocksdb_block_cache_size. As Mark said, the block cache size is shared across all column families.

yoshinorim avatar Dec 18 '20 23:12 yoshinorim

OK, that was my big misunderstanding. With that cleared up, what remains to discuss is the "option to use a directory per schema", like the rest of the storage engines in MySQL/MariaDB. @mdcallag suggested that these can be separate CFs. Ideally a "schema" should have its own directory where the SST files of all its CFs reside. What do you think?

AGenchev avatar Dec 19 '20 02:12 AGenchev

I agree with you that directory per schema would be great, I am just skeptical that it will get implemented.

mdcallag avatar Dec 19 '20 03:12 mdcallag