---
title: Architecture
weight: 3
---

For people familiar with Ceph, Vitastor is quite similar:

  • Vitastor also has Pools, PGs, OSDs, Monitors, Failure Domains and a Placement Tree:
    • OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
    • PG (Placement Group) is a container for data that (normally) shares the same replicas.
    • Pool is a container for data that has the same redundancy scheme and placement rules.
    • Monitor is a separate daemon that watches cluster state and controls data distribution.
    • Failure Domain is a group of OSDs that you allow to fail together. It's "host" by default.
    • Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains (see the sketch after this list).
  • Vitastor also distributes every image's data across the whole cluster.
  • Vitastor is also transactional (every write to the cluster is atomic).
  • OSDs also have journal and metadata and they can also be put on separate drives.
  • Just like in Ceph, the client library attempts to recover from any cluster failure, so you can basically reboot the whole cluster and your clients will only pause, not crash (please report a bug if a client does crash in that case).

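The relationships between these concepts can be sketched roughly as follows. This is an illustrative sketch only, not Vitastor source code; the type and field names are invented for this example.

```ts
// Illustrative sketch only - not actual Vitastor code; names are invented for this example.
type OsdId = number;

// Placement Tree: groups OSDs into a hierarchy, e.g. datacenter -> rack -> host -> osd.
interface PlacementNode {
  level: 'dc' | 'rack' | 'host' | 'osd';
  parent?: string;              // name of the parent node, absent for a root node
}

// Pool: one redundancy scheme and one set of placement rules for all data inside it.
interface Pool {
  name: string;
  scheme: 'replicated' | 'ec';  // redundancy scheme
  pg_size: number;              // copies (or EC parts) per PG
  pg_count: number;
  failure_domain: string;       // e.g. 'host': a PG never uses two OSDs from the same host
}

// PG: data inside one PG (normally) shares the same set of OSDs.
type PgOsdSet = OsdId[];        // e.g. [3, 7, 12] for pg_size = 3
```
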
However, there are also differences:

  • Vitastor's main focus is on SSDs. Hybrid SSD+HDD setups are also possible.
  • Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core per drive, you should run multiple OSDs, each on a different partition of the drive. Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
  • Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity and on the data store block size, which is 128 KB by default. With 128 KB blocks, metadata should occupy around 512 MB per 1 TB of data (which is still less than Ceph wants); see the worked estimate after this list. The journal doesn't have to be big: the example test below was conducted with only a 16 MB journal. A big journal is probably even harmful, as dirty write metadata also takes some memory.
  • Vitastor's storage layer doesn't have internal copy-on-write or redirect-write. I know it may be possible to create good copy-on-write storage, but it's much harder and makes performance less deterministic, so CoW isn't used in Vitastor.
  • The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with rich semantics like in Ceph (RADOS).
  • There's a "lazy fsync" mode which allows batching writes before flushing them to the disk. This makes it possible to use Vitastor with desktop SSDs, but it still lowers performance due to additional network round trips, so use server SSDs with capacitor-based power loss protection ("Advanced Power Loss Protection") for best performance.
  • PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory while OSDs are running.
  • The recovery process is per-object (per-block), not per-PG. There are also no PGLOGs.
  • Monitors don't store data. Cluster configuration and state are stored in etcd as simple human-readable JSON structures. Monitors only watch cluster state and handle data movement. Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager than to Ceph's Monitor. Vitastor's Monitor is implemented in node.js.
  • PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd. Rebalancing PGs between OSDs is done by mathematical optimization: the data distribution problem is reduced to a linear programming problem and solved with lp_solve. This allows for almost perfect data distribution in most cases (96-99% uniformity compared to Ceph's 80-90%), the ability to map PGs by hand without breaking the rebalancing logic, reduced OSD peer-to-peer communication (on average, OSDs have fewer peers), and less data movement. It probably also has a drawback: this method may not scale to very large clusters, but it works perfectly well for up to several hundred OSDs. It would also be easy to add consistent hashing in the future if it ever proves necessary.
  • There's no separate CRUSH layer. You select the pool redundancy scheme, placement root, failure domain and so on directly in the pool configuration (see the example after this list).
  • Image names are global, i.e. you can't create multiple images with the same name in different pools.
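
To make the metadata memory estimate above concrete, here is the arithmetic behind the "around 512 MB per 1 TB" figure. The ~64 bytes per block entry is an inference from those two numbers, not a description of the exact metadata format.

```ts
// Rough estimate of in-memory metadata size for the default 128 KB block size.
const driveCapacity = 1024 ** 4;                   // 1 TB of data (binary, i.e. 1 TiB)
const blockSize     = 128 * 1024;                  // 128 KB data store block
const blockCount    = driveCapacity / blockSize;   // 8,388,608 blocks
const bytesPerEntry = 64;                          // assumed average metadata entry size
const metadataBytes = blockCount * bytesPerEntry;
console.log(metadataBytes / 1024 ** 2, 'MB');      // prints: 512 MB
```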

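For example, pool configuration lives in etcd as plain JSON. The snippet below is only an illustration of what such a pool definition might look like; the key path (something like /vitastor/config/pools) and the exact field names are assumptions here and should be checked against the configuration reference.

```json
{
  "1": {
    "name": "ssd-pool",
    "scheme": "replicated",
    "pg_size": 3,
    "pg_minsize": 2,
    "pg_count": 256,
    "failure_domain": "host"
  }
}
```

With failure_domain "host" and pg_size 3, each PG gets three copies placed on OSDs from three different hosts, and the monitor's LP solver distributes the pg_count PGs across the cluster as uniformly as it can.
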
Implementation Principles

  • I like architecturally simple solutions. Vitastor is and will always be designed exactly like that.
  • I also like reinventing the wheel to some extent, like writing my own HTTP client for etcd interaction instead of using prebuilt libraries, because in this case I'm confident about what my code does and what it doesn't do.
  • I don't care about C++ "best practices" like RAII or proper inheritance or usage of smart pointers or whatever and I don't intend to change my mind, so if you're here looking for ideal reference C++ code, this probably isn't the right place.
  • I like node.js better than any other dynamically-typed language interpreter because it's faster than any other interpreter in the world and has a neutral C-like syntax and a built-in event loop. That's why the Monitor is implemented in node.js.