# Vitastor

[Read this in Russian](README-ru.md)

## The Idea

Make Software-Defined Block Storage Great Again.

Vitastor is a small, simple and fast clustered block storage system (storage for VM drives),
architecturally similar to Ceph, which means strong consistency, primary replication, symmetric
clustering and automatic data distribution over any number of drives of any size
with configurable redundancy (replication or erasure codes/XOR).

## Features

Vitastor is currently a pre-release; a lot of features are missing, and you can still expect
breaking changes in the future. However, the following is implemented:

0.5.x (stable):
- Basic part: highly available block storage with symmetric clustering and no SPOF
- Performance ;-D
- Multiple redundancy schemes: replication, XOR n+1 and Reed-Solomon erasure codes
  based on the jerasure library, with any number of data and parity drives in a group
- Configuration via simple JSON data structures in etcd (see the example after this list)
- Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for better uniformity and less data movement
  - Multiple pools
  - Placement tree, OSD selection by tags (device classes) and placement root
  - Configurable failure domains
- Recovery of degraded blocks
- Rebalancing (data movement between OSDs)
- Lazy fsync support
- I/O statistics reporting to etcd
- Generic user-space client library
- QEMU driver (built out-of-tree)
- Loadable fio engine for benchmarks (also built out-of-tree)
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS

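As an illustration of the "configuration via simple JSON in etcd" point above, a pool
definition might look like the following sketch. The etcd key path, the JSON field names
and the endpoint address are assumptions for illustration, not normative documentation:

```
# Define pool 1: 2 replicas, 256 PGs, host-level failure domain
# (key layout and field names are assumptions, shown for illustration only)
ETCDCTL_API=3 etcdctl --endpoints=http://10.115.0.10:2379 put /vitastor/config/pools \
  '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'
```
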
0.6.x (master):
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth out random write workloads in SSD+HDD configurations

## Roadmap

- Better OSD creation and auto-start tools
- Other administrative tools
- Plugins for OpenStack, Kubernetes, OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
- Tiered storage
- RDMA and NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using the system page cache (possibly)

## Architecture

Similarities:

- Just like Ceph, Vitastor has Pools, PGs, OSDs, Monitors, Failure Domains and a Placement Tree.
- Just like Ceph, Vitastor is transactional (even though there's a "lazy fsync mode" which
  doesn't implicitly flush every operation to disk).
- OSDs also have a journal and metadata, and they can also be put on separate drives.
- Just like in Ceph, the client library attempts to recover from any cluster failure, so
  you can basically reboot the whole cluster and your clients will only pause, but not crash
  (a client crashing in that case is considered a bug).

Some basic terms for people not familiar with Ceph:

- OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
- PG (Placement Group) is a container for data that (normally) shares the same replicas.
- Pool is a container for data that has the same redundancy scheme and placement rules.
- Monitor is a separate daemon that watches cluster state and handles failures.
- Failure Domain is a group of OSDs that you allow to fail together. It's "host" by default.
- Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.

Architectural differences from Ceph:

- Vitastor's primary focus is on SSDs. Proper SSD+HDD optimizations may be added in the future, though.
- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
  per drive you should run multiple OSDs, each on a different partition of the drive.
  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
  and data store block size, which is 128 KB by default. With 128 KB blocks, metadata should occupy
  around 512 MB per 1 TB (which is still less than Ceph wants); see the back-of-the-envelope
  calculation after this list. The journal doesn't have to be big: the example test below was
  conducted with only a 16 MB journal. A big journal is probably even harmful, as dirty write
  metadata also takes some memory.
- Vitastor's storage layer doesn't have internal copy-on-write or redirect-write. Maybe
  it's possible to create a good copy-on-write storage, but it's much harder and makes performance
  less deterministic, so CoW isn't used in Vitastor.
- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
  rich semantics like in Ceph (RADOS).
- There's a "lazy fsync" mode which batches writes before flushing them to disk.
  This makes it possible to use Vitastor with desktop SSDs, but it still lowers performance due to
  additional network roundtrips, so use server SSDs with capacitor-based power loss protection
  ("Advanced Power Loss Protection") for best performance.
- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
  while OSDs are running.
- The recovery process is per-object (per-block), not per-PG. Also, there are no PG logs.
- Monitors don't store data. Cluster configuration and state are stored in etcd as simple
  human-readable JSON structures. Monitors only watch cluster state and handle data movement.
  Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's
  Manager. Vitastor's Monitor is implemented in node.js.
- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
  Rebalancing PGs between OSDs is done by mathematical optimization: the data distribution problem
  is reduced to a linear programming problem and solved with lp_solve. This allows for almost
  perfect (96-99% uniformity, compared to Ceph's 80-90%) data distribution in most cases, the ability
  to map PGs by hand without breaking the rebalancing logic, reduced OSD peer-to-peer communication
  (on average, OSDs have fewer peers) and less data movement. It probably also has a drawback:
  this method may fail in very large clusters, but up to several hundred OSDs it's perfectly fine.
  It's also easy to add consistent hashes in the future if they ever prove necessary.
- There's no separate CRUSH layer. You select the pool redundancy scheme, placement root, failure
  domain and so on directly in the pool configuration.
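
As promised above, a quick sanity check of the metadata sizing claim. The ~64 bytes per
block is inferred here from the figures quoted above, not taken from the code:

```
# 1 TiB of data at the default 128 KiB block size:
echo $((1024*1024*1024*1024 / (128*1024)))   # => 8388608 blocks
# 512 MiB of metadata spread over those blocks:
echo $((512*1024*1024 / 8388608))            # => 64 bytes of in-memory metadata per block
```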

## Understanding Storage Performance

The most important thing for fast storage is latency, not parallel iops.

The best possible latency is achieved with one thread and a queue depth of 1, which basically means
"client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
scale with the number of servers, drives, server processes, threads and so on.
Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.
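
For example, at 0.5 ms of end-to-end latency a single-threaded client is capped at 2000 iops,
no matter how large the cluster is:

```
# IOPS at queue depth 1 is simply 1/latency (latency in seconds)
awk 'BEGIN { print 1 / 0.0005 }'   # => 2000 iops at 0.5 ms per operation
```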

Why is this important? Because some applications *can't* use a queue depth greater than 1:
their task simply isn't parallelizable. A notable example is any ACID DBMS, because they all
write their WALs sequentially with fsync()s.

fsync, by the way, is another important thing often missing in benchmarks. The point is
that drives have cache buffers and don't guarantee that your data is actually persisted
until you call fsync(), which is translated into a FLUSH CACHE command by the OS.

Desktop SSDs are very fast without fsync: NVMes, for example, can process ~80000 write
operations per second with a queue depth of 1 and no fsync. But they're really slow with
fsync, because then they have to actually write data to the flash chips. A typical
number is around 1000-2000 iops with fsync.

Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
to flush its DRAM cache to persistent flash storage when a power loss occurs.
This makes them perform equally well with and without fsync. Intel calls this feature
"Advanced Power Loss Protection"; other vendors use similar names or call it directly
"Full Capacitor-Based Power Loss Protection".

All software-defined storage systems that I currently know of are slow in terms of latency.
Notable examples are Ceph and the internal SDSes used by cloud providers like Amazon, Google,
Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
even with best-in-class hardware.

And that's in the SSD era, when you can buy an SSD with ~0.04ms latency for $100.

I use the following 6 commands, with small variations, to benchmark any storage:

- Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Random write latency (T1Q1, this hurts storages the most):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Random read latency (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
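
The same patterns can be pointed at a Vitastor cluster through the loadable fio engine from
the feature list above. A sketch only: the engine file name (libfio_vitastor.so) and its
-etcd/-pool/-inode/-size options, as well as the etcd address, are assumptions here:

```
# T1Q1 random write against a Vitastor inode instead of a raw device
fio -thread -ioengine=./libfio_vitastor.so -name=test -bs=4k -direct=1 \
    -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 \
    -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G
```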

## Vitastor's Theoretical Maximum Random Access Performance

Replicated setups:
- Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 2 network roundtrips + 1 disk write.
  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).
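
Plugging the drive figures from the benchmark below into the write formula (4 nodes with
6 SSDs each, so 24 drives at ~60000 T1Q32 write iops, 2 replicas, and a write amplification
of ~4, mid-range of the 3-5 quoted below) gives a rough ceiling; the measured 162000 T2Q64
write iops lands in the same ballpark:

```
# sum(disk write iops) / number of replicas / write amplification
echo $((24 * 60000 / 2 / 4))   # => 180000 theoretical saturated replicated write iops
```
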
EC/XOR setups:
- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
  - The 0.5 is actually (k-1)/k, which means that the additional roundtrip doesn't happen when
    the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
  In fact, you should use disk write iops measured under ~10% reads / ~90% writes in this formula.

Write amplification for 4 KB blocks is usually 3-5 in Vitastor:
1. Journal block write
2. Journal data write
3. Metadata block write
4. Another journal block write for EC/XOR setups
5. Data block write

If you manage to get an SSD that handles 512-byte blocks well (Optane?), you may
lower writes 1, 3 and 4 to 512 bytes (1/8 of the data size) and get WA as low as 2.375,
as the calculation below shows.
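
A quick check of that 2.375 figure: steps 1, 3 and 4 shrink to 1/8 of the 4 KB data size,
while steps 2 and 5 stay at full size:

```
awk 'BEGIN { print 3 * (512/4096) + 2 }'   # => 2.375
```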

Lazy fsync also reduces WA for parallel workloads, because journal blocks are only
written when they fill up or an fsync is requested.

## Example Comparison with Ceph

Hardware configuration: 4 nodes, each with:
- 6x SATA SSD Intel D3-4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch

CPU power saving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per SSD.

All of the results below are for 4 KB blocks and random access (unless indicated otherwise).

Raw drive performance:
- T1Q1 write: ~27000 iops (~0.037ms latency)
- T1Q1 read: ~9800 iops (~0.101ms latency)
- T1Q32 write: ~60000 iops
- T1Q32 read: ~81700 iops

Ceph 15.2.4 (Bluestore):
- T1Q1 write: ~1000 iops (~1ms latency)
- T1Q1 read: ~1750 iops (~0.57ms latency)
- T8Q64 write: ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
- T8Q64 read: ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node

T8Q64 tests were conducted over eight 400 GB RBD images from all hosts (every host was running 2 instances of fio),
because Ceph incurs performance penalties when multiple clients run over a single RBD image.

cephx_sign_messages was set to false during the tests; RocksDB and Bluestore settings were left at their defaults.

In fact, these results are not that bad for Ceph: these servers are an example of well-balanced Ceph nodes.
However, CPU usage and I/O latency were still through the roof, as usual.

Vitastor:
- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s

The T8Q64 read test was conducted over one larger inode (3.2 TB) from all hosts (every host was running 2 instances of fio),
as Vitastor has no performance penalties related to running multiple clients over a single inode.
When conducted from one node with all primary OSDs moved to other nodes, the result was slightly lower (689000 iops),
because all operations resulted in network roundtrips between the client and the primary OSD.
When fio was colocated with OSDs (as in the Ceph benchmarks above), 1/4 of the read workload actually
used the loopback network.

Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.
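
For context, a sketch of how such options might be passed on an OSD command line. The
vitastor-osd binary name and the --osd_num/--etcd_address/--data_device options are
assumptions here; only the tuning flags are taken from the actual test setup above:

```
vitastor-osd --osd_num 1 --etcd_address 10.115.0.10:2379/v3 --data_device /dev/sdX \
    --disable_data_fsync true --immediate_commit all --flusher_count 8 \
    --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 \
    --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 \
    --journal_size 16777216
```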

### EC/XOR 2+1

Vitastor:
- T1Q1 write: 2808 iops (~0.355ms latency)
- T1Q1 read: 6190 iops (~0.16ms latency)
- T2Q64 write: 85500 iops, total CPU usage by OSDs about 3.4 virtual cores on each node
- T8Q64 read: 812000 iops, total CPU usage by OSDs about 4.7 virtual cores on each node
- Linear write (4M T1Q32): 3200 MB/s
- Linear read (4M T1Q32): 1800 MB/s

Ceph:
- T1Q1 write: 730 iops (~1.37ms latency)
- T1Q1 read: 1500 iops with a cold cache (~0.66ms latency), 2300 iops after a 2-minute metadata cache warmup (~0.435ms latency)
- T4Q128 write (4 RBD images): 45300 iops, total CPU usage by OSDs about 30 virtual cores on each node
- T8Q64 read (4 RBD images): 278600 iops, total CPU usage by OSDs about 40 virtual cores on each node
- Linear write (4M T1Q32): 1950 MB/s before preallocation, 2500 MB/s after preallocation
- Linear read (4M T1Q32): 2400 MB/s

### NBD

NBD is currently required to mount Vitastor via the kernel, but it imposes additional overhead
due to extra copying between the kernel and user space. This mostly hurts linear
bandwidth, not iops.
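
A sketch of mapping an image through the NBD proxy from the feature list above. The
vitastor-nbd subcommand and option names, as well as the etcd address, are assumptions here:

```
# Expose a Vitastor inode as a local /dev/nbdX block device
vitastor-nbd map --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --size 400G
```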

Vitastor with single-threaded NBD on the same hardware:
- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)