Simplified distributed block storage with strong consistency, like in Ceph
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

5.7 KiB

Documentation → Introduction → Architecture

Читать на русском


Basic concepts

  • OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
  • PG (Placement Group) is a "shard" of the cluster, group of data stored on one set of replicas.
  • Pool is a container for data that has equal redundancy scheme and placement rules.
  • Monitor is a separate daemon that watches cluster state and handles failures.
  • Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
  • Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.

Similarities to Ceph

  • Vitastor also has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree:
  • Vitastor also distributes every image data across the whole cluster.
  • Vitastor is also transactional. Even though there's a "lazy fsync mode" which doesn't implicitly flush every operation to disks, every write to the cluster is atomic.
  • OSDs also have journal and metadata and they can also be put on separate drives.
  • Just like in Ceph, client library attempts to recover from any cluster failure so you can basically reboot the whole cluster and only pause, but not crash, your clients.

Differences from Ceph

  • Vitastor's primary focus is on SSDs: SSD-only and SSD+HDD clusters.
  • The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with rich semantics like in Ceph (RADOS).
  • PGs are ephemeral in Vitastor. This means that they aren't stored on data disks and only exist in memory while OSDs are running.
  • Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core per drive you should run multiple OSDs each on a different partition of the drive. Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
  • Metadata is always kept in memory which removes the need for extra disk reads. Metadata size depends linearly on drive capacity and data store block size which is 128 KB by default. With 128 KB blocks metadata takes around 512 MB per 1 TB (which is still less than Ceph wants). Journal is also kept in memory by default, but in SSD-only clusters it's only 32 MB, and in SSD+HDD clusters, where it's beneficial to increase it, inmemory_journal can be disabled.
  • Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe it's possible to create a good copy-on-write storage, but it's much harder and makes performance less deterministic, so CoW isn't used in Vitastor.
  • There's a "lazy fsync" mode which allows to batch writes before flushing them to the disk. This allows to use Vitastor with desktop SSDs, but still lowers performance due to additional network roundtrips, so use server SSDs with capacitor-based power loss protection ("Advanced Power Loss Protection") for best performance.
  • Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
  • Monitors don't store data. Cluster configuration and state is stored in etcd in simple human-readable JSON structures. Monitors only watch cluster state and handle data movement. Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager. Vitastor's Monitor is implemented in node.js.
  • PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd. Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem is reduced to a linear programming problem and solved by lp_solve. This allows for almost perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication (on average, OSDs have fewer peers) and less data movement. It also probably has a drawback - this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine. It's also easy to add consistent hashes in the future if something proves their necessity.
  • There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain and so on directly in pool configuration.

Implementation Principles

  • I like architecturally simple solutions. Vitastor is and will always be designed exactly like that.
  • I also like reinventing the wheel to some extent, like writing my own HTTP client for etcd interaction instead of using prebuilt libraries, because in this case I'm confident about what my code does and what it doesn't do.
  • I don't care about C++ "best practices" like RAII or proper inheritance or usage of smart pointers or whatever and I don't intend to change my mind, so if you're here looking for ideal reference C++ code, this probably isn't the right place.
  • I like node.js better than any other dynamically-typed language interpreter because it's faster than any other interpreter in the world, has neutral C-like syntax and built-in event loop. That's why Monitor is implemented in node.js.

Known Problems

  • Deleting images in a degraded cluster may currently lead to objects reappearing after dead OSDs come back, and in case of erasure-coded pools, they may even reappear as incomplete. Just repeat the removal request again in this case. This problem will be fixed in the nearest future, the fix is already implemented in the "epoch-deletions" branch.