
* ScaleStore

This is the source code for our (Tobias Ziegler, Carsten Binnig and Viktor Leis) paper published at SIGMOD'22: ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA. The paper can be found here: [[https://dl.acm.org/doi/10.1145/3514221.3526187][Paper Link]]

** Abstract

In this paper, we propose ScaleStore, a novel distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability at the same time. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore integrates seamlessly with NVMe SSDs, lowering the overall hardware cost significantly. The core of ScaleStore is a distributed caching strategy that dynamically decides which data to keep in memory (and which on SSDs) based on the workload. The caching protocol also provides strong consistency in the presence of concurrent data modifications. In our YCSB-based evaluation, we show that ScaleStore can provide high performance for various types of workloads (read/write-dominated, uniform/skewed) even when the data size is larger than the aggregated memory of all nodes. We further show that ScaleStore can efficiently handle dynamic workload changes and support elasticity.

** Citation

#+begin_src bibtex
@inproceedings{DBLP:conf/sigmod/0001BL22,
  author    = {Tobias Ziegler and Carsten Binnig and Viktor Leis},
  title     = {ScaleStore: {A} Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and {RDMA}},
  booktitle = {{SIGMOD} '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12 - 17, 2022},
  pages     = {685--699},
  publisher = {{ACM}},
  year      = {2022},
  url       = {https://doi.org/10.1145/3514221.3526187},
  doi       = {10.1145/3514221.3526187}
}
#+end_src

** Setup

*** Cluster Setup

All experiments were conducted on a 5-node cluster running Ubuntu 18.04.1 LTS with a Linux 4.15.0 kernel. Each node is equipped with two Intel(R) Xeon(R) Gold 5120 CPUs (14 cores each), 512 GB of main memory split between both sockets, and four Samsung SSD 980 Pro M.2 1 TB drives connected via PCIe through one ASRock Hyper Quad M.2 PCIe card. The nodes of the cluster are connected with an InfiniBand network using one Mellanox ConnectX-5 MT27800 NIC (InfiniBand EDR 4x, 100 Gbps) per node.

*** Mellanox RDMA

We used the following Mellanox OFED installation:

**** ofed_info

#+begin_src shell
MLNX_OFED_LINUX-5.1-2.5.8.0 (OFED-5.1-2.5.8):
Installed Packages:

ii ar-mgr 1.0-0.3.MLNX20200824.g8577618.51258 amd64 Adaptive Routing Manager
ii dapl2-utils 2.1.10.1.mlnx-OFED.51258 amd64 Utilities for use with the DAPL libraries
ii dpcp 1.1.0-1.51258 amd64 Direct Packet Control Plane (DPCP) is a library to use Devx
ii dump-pr 1.0-0.3.MLNX20200824.g8577618.51258 amd64 Dump PathRecord Plugin
ii hcoll 4.6.3125-1.51258 amd64 Hierarchical collectives (HCOLL)
ii ibacm 51mlnx1-1.51258 amd64 InfiniBand Communication Manager Assistant (ACM)
ii ibdump 6.0.0-1.51258 amd64 Mellanox packets sniffer tool
ii ibsim 0.9-1.51258 amd64 InfiniBand fabric simulator for management
ii ibsim-doc 0.9-1.51258 all documentation for ibsim
ii ibutils2 2.1.1-0.126.MLNX20200721.gf95236b.51258 amd64 OpenIB Mellanox InfiniBand Diagnostic Tools
ii ibverbs-providers:amd64 51mlnx1-1.51258 amd64 User space provider drivers for libibverbs
ii ibverbs-utils 51mlnx1-1.51258 amd64 Examples for the libibverbs library
ii infiniband-diags 51mlnx1-1.51258 amd64 InfiniBand diagnostic programs
ii iser-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo iser kernel modules
ii isert-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo isert kernel modules
ii kernel-mft-dkms 4.15.1-100 all DKMS support for kernel-mft kernel modules
ii knem 1.1.4.90mlnx1-OFED.5.1.2.5.0.1 amd64 userspace tools for the KNEM kernel module
ii knem-dkms 1.1.4.90mlnx1-OFED.5.1.2.5.0.1 all DKMS support for mlnx-ofed kernel modules
ii libdapl-dev 2.1.10.1.mlnx-OFED.51258 amd64 Development files for the DAPL libraries
ii libdapl2 2.1.10.1.mlnx-OFED.51258 amd64 The Direct Access Programming Library (DAPL)
ii libibmad-dev:amd64 51mlnx1-1.51258 amd64 Development files for libibmad
ii libibmad5:amd64 51mlnx1-1.51258 amd64 Infiniband Management Datagram (MAD) library
ii libibnetdisc5:amd64 51mlnx1-1.51258 amd64 InfiniBand diagnostics library
ii libibumad-dev:amd64 51mlnx1-1.51258 amd64 Development files for libibumad
ii libibumad3:amd64 51mlnx1-1.51258 amd64 InfiniBand Userspace Management Datagram (uMAD) library
ii libibverbs-dev:amd64 51mlnx1-1.51258 amd64 Development files for the libibverbs library
ii libibverbs1:amd64 51mlnx1-1.51258 amd64 Library for direct userspace use of RDMA (InfiniBand/iWARP)
ii libibverbs1-dbg:amd64 51mlnx1-1.51258 amd64 Debug symbols for the libibverbs library
ii libopensm 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Infiniband subnet manager libraries
ii libopensm-devel 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Developement files for OpenSM
ii librdmacm-dev:amd64 51mlnx1-1.51258 amd64 Development files for the librdmacm library
ii librdmacm1:amd64 51mlnx1-1.51258 amd64 Library for managing RDMA connections
ii mlnx-ethtool 5.4-1.51258 amd64 This utility allows querying and changing settings such as speed,
ii mlnx-iproute2 5.6.0-1.51258 amd64 This utility allows querying and changing settings such as speed,
ii mlnx-ofed-kernel-dkms 5.1-OFED.5.1.2.5.8.1 all DKMS support for mlnx-ofed kernel modules
ii mlnx-ofed-kernel-utils 5.1-OFED.5.1.2.5.8.1 amd64 Userspace tools to restart and tune mlnx-ofed kernel modules
ii mpitests 3.2.20-5d20b49.51258 amd64 Set of popular MPI benchmarks and tools IMB 2018 OSU benchmarks ver 4.0.1 mpiP-3.3 IPM-2.0.6
ii mstflint 4.14.0-3.51258 amd64 Mellanox firmware burning application
ii openmpi 4.0.4rc3-1.51258 all Open MPI
ii opensm 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 An Infiniband subnet manager
ii opensm-doc 5.7.3.MLNX20201102.e56fd90-0.1.51258 amd64 Documentation for opensm
ii perftest 4.4+0.5-1 amd64 Infiniband verbs performance tests
ii rdma-core 51mlnx1-1.51258 amd64 RDMA core userspace infrastructure and documentation
ii rdmacm-utils 51mlnx1-1.51258 amd64 Examples for the librdmacm library
ii sharp 2.2.2.MLNX20201102.b26a0fd-1.51258 amd64 SHArP switch collectives
ii srp-dkms 5.1-OFED.5.1.2.5.3.1 all DKMS support fo srp kernel modules
ii srptools 51mlnx1-1.51258 amd64 Tools for Infiniband attached storage (SRP)
ii ucx 1.9.0-1.51258 amd64 Unified Communication X
#+end_src
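After installing the OFED stack, the setup can be sanity-checked from the shell. This is a suggestion rather than part of the original instructions; ~ofed_info -s~ and ~ibv_devinfo~ are standard tools shipped with the OFED/rdma-core packages listed above:

#+begin_src shell
# Print the installed MLNX_OFED version (should match MLNX_OFED_LINUX-5.1-2.5.8.0)
ofed_info -s

# List RDMA devices and port state; the ConnectX-5 port should be PORT_ACTIVE
ibv_devinfo
#+end_src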

*** SSD

Each node has four Samsung SSD 980 Pro M.2 1 TB drives connected via PCIe through one ASRock Hyper Quad M.2 PCIe card. All SSDs are used as block devices and organized as a RAID 0 via:

#+begin_src shell
sudo mdadm --create /dev/md0 --auto md --level=0 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
#+end_src
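To check that the array was assembled correctly, the standard md tools can be used (a suggestion, not part of the original instructions):

#+begin_src shell
# Show the state of all software RAID arrays
cat /proc/mdstat

# Print detailed information about the newly created array
sudo mdadm --detail /dev/md0
#+end_src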

*** Huge pages

We are using huge pages for the memory buffers:

#+begin_src shell
echo N | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
#+end_src
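Here, ~N~ is the number of 2 MB huge pages to reserve. As a rough guide (our assumption, not taken from the original instructions), it should cover the buffer size passed via ~-dramGB~; for the 150 GB buffer used in the YCSB example below, that works out to:

#+begin_src shell
# 150 GiB of buffer memory / 2 MiB per huge page = 76800 huge pages
echo $((150 * 1024 / 2))   # prints 76800
echo 76800 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
#+end_src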

*** CMake build

To build ScaleStore we use CMake. First, we create a build folder in the top-level folder of scalestore:

#+begin_src shell
mkdir build
cd build
#+end_src

Afterwards, we can build the executables either in debug mode with address sanitizers enabled:

#+begin_src shell
cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Debug -DSANI=On .. && make -j
#+end_src

or in release mode:

#+begin_src shell
cmake -D CMAKE_C_COMPILER=gcc-10 -D CMAKE_CXX_COMPILER=g++-10 -DCMAKE_BUILD_TYPE=Release .. && make -j
#+end_src

*** Libraries

ScaleStore depends on the following libraries (an example install command is sketched below):

  • gflags
  • libaio
  • ibverbs
  • tabulate
  • rdma cm (librdmacm)
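On Ubuntu, most of these can be installed from the distribution packages. The package names below are assumptions based on common Ubuntu naming and are not taken from the original instructions; ~tabulate~ is a header-only library and may instead be fetched by the build system or vendored manually:

#+begin_src shell
# Assumed Ubuntu package names for the dependencies listed above
sudo apt-get install -y libgflags-dev libaio-dev libibverbs-dev librdmacm-dev
#+end_src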

** Run executable

All executables can be found in ~scalestore/build/frontend~. For instance, the following command can be used to run YCSB in a single-node setup:

#+begin_src shell
make -j && numactl --membind=0 --cpunodebind=0 ./ycsb -ownIp=172.18.94.80 -nodes=1 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -csvFile=singlenode_oom_scalestore_ycsb_zipf.csv -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 --ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf
#+end_src
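For a multi-node run, the same binary is started on every participating server with ~-nodes~ set to the cluster size and ~-ownIp~ set to that server's own IP, matching the ~NODES~ list in ~Defs.hpp~ (see the configuration section below). A minimal two-node sketch reusing the flags from the single-node example above; whether additional flags such as ~-csvFile~ are needed depends on your setup:

#+begin_src shell
# On node 1 (172.18.94.80):
numactl --membind=0 --cpunodebind=0 ./ycsb -ownIp=172.18.94.80 -nodes=2 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 -ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf

# On node 2 (172.18.94.70):
numactl --membind=0 --cpunodebind=0 ./ycsb -ownIp=172.18.94.70 -nodes=2 -YCSB_all_workloads -worker=20 -YCSB_tuple_count=1000000000 -dramGB=150 -YCSB_run_for_seconds=60 -ssd_path=/dev/md0 -ssd_gib=400 -pageProviderThreads=4 -YCSB_all_zipf
#+end_src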

** Configuration

The main configuration file for ScaleStore is ~shared-headers/Defs.hpp~.

*** IPs

To configure the servers and their IPs, the following configuration needs to be adapted:

#+begin_src cpp
const std::vector<std::vector<std::string>> NODES{
    {""},                                                                                               // 0 to allow direct offset
    {"172.18.94.80"},                                                                                   // 1
    {"172.18.94.80", "172.18.94.70"},                                                                   // 2
    {"172.18.94.80", "172.18.94.70", "172.18.94.10"},                                                   // 3
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20"},                                   // 4
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40"},                   // 5
    {"172.18.94.80", "172.18.94.70", "172.18.94.10", "172.18.94.20", "172.18.94.40", "172.18.94.30"},   // 6
};
#+end_src

*** CPU Cores

We implemented a very simple ~CoreManager~, which can be found in ~scalestore/backend/threads/CoreManager.hpp~. All configurations are hard-coded to fit our servers (2 NUMA nodes) and might need to be adapted to fit yours.
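To see the topology that ~CoreManager~ needs to match, the NUMA layout and core numbering of your machine can be inspected with standard tools (a suggestion, not part of the original instructions):

#+begin_src shell
# Show NUMA nodes, their CPUs, and memory sizes
numactl --hardware

# Alternative view of sockets, cores, and NUMA nodes
lscpu | grep -E "NUMA|Socket|Core"
#+end_src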

** Gflags help

Besides the ~Defs.hpp~ file, there are gflags parameters. Most of them are stored in ~backend/ScaleStore/Config.hpp~. However, some are attached to the main executable file, e.g. ycsb has the ~YCSB_tuple_count~ flag. To see all (custom) gflags parameters and their descriptions, one can run:

#+begin_src shell
./exe --help
#+end_src

** Paper Benchmarks

The paper benchmark implementations can be found in ~frontend/ycsb~. The distributed experiment runner scripts can be found in ~distexperiments/experiments~. In order to run them, please consult the following GitHub page: [[https://github.com/mjasny/distexprunner]]

** Benchmark Runners

  • YCSB runner
  • OLAP scan queries

** Tests

  • consistency checks
  • TPC-C consistency checks

** Known Issues/Bugs

*** Startup

If you see the following exception at the startup of ScaleStore:

#+BEGIN_SRC "Consider adjusting BATCH_SIZE and PARTITIONS" in /home/tziegler/ScaleStore/backend/scalestore/storage/buffermanager/Buffermanager.cpp:62 #+END_SRC

You need to change the ~PARTITIONS~ and ~BATCH_SIZE~ variables in the ~Defs.hpp~ file. The reason is that we use a partitioned queue of batches to reduce contention in the free lists and accesses to the latch. To calculate the number of batches per partition we use:

#+BEGIN_SRC
NUMBER_BATCHES = (DRAM_SIZE / PAGE_SIZE) / PARTITIONS / BATCH_SIZE
#+END_SRC

Therefore, adjusting these values may be needed if the DRAM size is small or the page size has been changed, since the computed number of batches per partition can otherwise become too small (see the worked example below).
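As a worked example (the concrete values are illustrative assumptions, not defaults taken from the code): with a 1 GiB buffer, a 4 KiB page size, 64 partitions, and a batch size of 128, the formula yields 32 batches per partition; shrinking the buffer or increasing PARTITIONS/BATCH_SIZE quickly drives this value toward zero, which triggers the exception above.

#+begin_src shell
# Illustrative numbers only: 1 GiB DRAM, 4 KiB pages, 64 partitions, batch size 128
echo $(( (1 * 1024 * 1024 * 1024 / 4096) / 64 / 128 ))   # -> 32 batches per partition
#+end_src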