Make Software-Defined Block Storage Great Again.
Vitastor is a small, simple and fast clustered block storage system (storage for VM drives), architecturally similar to Ceph, which means strong consistency, primary replication, symmetric clustering and automatic data distribution over any number of drives of any size with configurable redundancy (replication or erasure codes/XOR).
Vitastor is currently a pre-release: a lot of features are missing, and you can still expect breaking changes in the future. However, the following is already implemented:
Similarities:
Some basic terms for people not familiar with Ceph:
Architectural differences from Ceph:
The most important thing for fast storage is latency, not parallel iops.
The best possible latency is achieved with one thread and a queue depth of 1, which basically means “client load as low as possible”. In this case IOPS = 1/latency, and this number doesn’t scale with the number of servers, drives, server processes or threads and so on. Single-threaded IOPS and latency numbers only depend on how fast a single daemon is.
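A quick back-of-the-envelope sketch of that relationship (the latencies used here are illustrative values, not measurements from any particular setup):

```python
# At queue depth 1 there is never more than one request in flight,
# so throughput is entirely determined by per-request latency.
def qd1_iops(latency_seconds: float) -> float:
    return 1.0 / latency_seconds

print(qd1_iops(0.0004))   # 0.4 ms per write  -> 2500 iops, no matter how many servers you add
print(qd1_iops(0.00004))  # 0.04 ms (a fast local SSD) -> 25000 iops
```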
Why is this important? Because some applications can’t use a queue depth greater than 1: their workload isn’t parallelizable. A notable example is any ACID DBMS, because all of them write their WALs sequentially with fsync()s.
fsync, by the way, is another important thing that is often missing from benchmarks. The point is that drives have cache buffers and don’t guarantee that your data is actually persisted until you call fsync(), which the OS translates into a FLUSH CACHE command.
Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write operations per second with a queue depth of 1 without fsync - but they’re really slow with fsync because they have to actually write data to the flash chips when you call fsync. A typical number is around 1000-2000 iops with fsync.
Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive to flush its DRAM cache to the persistent flash storage when power is lost. This makes them perform equally well with and without fsync. This feature is called “Advanced Power Loss Protection” by Intel; other vendors use similar names or simply call it “Full Capacitor-Based Power Loss Protection”.
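If you want a rough feel for the fsync penalty without installing fio, a minimal probe along these lines works (a sketch only; the target path is a placeholder, and it just rewrites one 4 KB block with fdatasync() after every write):

```python
# Rough T1Q1 "write + fdatasync" latency probe - a sketch, not a replacement for fio.
import os, time

TARGET = "/mnt/test/fsync-probe.bin"   # placeholder path on the filesystem/drive you want to test
N = 1000
block = b"\xa5" * 4096

fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT, 0o600)
start = time.monotonic()
for _ in range(N):
    os.pwrite(fd, block, 0)   # rewrite the same 4 KB block
    os.fdatasync(fd)          # forces the drive to actually persist it (FLUSH CACHE)
elapsed = time.monotonic() - start
os.close(fd)
print(f"{N / elapsed:.0f} fsync'd 4 KB writes/s, {elapsed * 1000 / N:.2f} ms each")
```

Drives with power loss protection report numbers close to their plain write latency here; consumer drives usually drop to the 1000-2000 iops range mentioned above.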
All software-defined storage systems that I currently know of are slow in terms of latency. Notable examples are Ceph and the internal SDSes used by cloud providers like Amazon, Google, Yandex and so on. They’re all slow and can only reach ~0.3 ms read and ~0.6 ms 4 KB write latency even with best-in-slot hardware.
And that’s in the SSD era, when you can buy an SSD with ~0.04 ms latency for $100.
I use the following 6 commands, with small variations, to benchmark any storage:

- Linear write: `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read: `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Single-threaded random write latency with fsync (T1Q1): `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Single-threaded random read latency (T1Q1): `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel random write iops: `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel random read iops: `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`
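If you want to script these runs, a small wrapper along the following lines can collect the headline numbers; it assumes a reasonably recent fio with JSON output (`--output-format=json`) and the field names that format emits, so treat it as a sketch:

```python
# Run one fio job and print iops plus average latency from its JSON output.
import json, subprocess

def run_fio(filename, rw, bs, iodepth, fsync=0, runtime=60):
    cmd = ["fio", "--ioengine=libaio", "--direct=1", "--invalidate=1", "--name=test",
           f"--bs={bs}", f"--iodepth={iodepth}", f"--rw={rw}", f"--runtime={runtime}",
           f"--fsync={fsync}", f"--filename={filename}", "--output-format=json"]
    data = json.loads(subprocess.run(cmd, capture_output=True, check=True).stdout)
    job = data["jobs"][0]
    side = "write" if "write" in rw else "read"
    print(f"{rw} bs={bs} QD{iodepth}: "
          f"{round(job[side]['iops'])} iops, "
          f"{round(job[side]['lat_ns']['mean'] / 1000)} us avg latency")

# Example: T1Q1 random write latency with fsync, as in the third command above.
# run_fio("/dev/sdX", "randwrite", "4k", 1, fsync=1)
```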
Replicated setups:
EC/XOR setups:
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:

1. Journal block write
2. Journal data write
3. Metadata block write
4. Another journal block write (for EC/XOR setups)
5. Data block write

If you manage to get an SSD which handles 512-byte blocks well (Optane?), you may lower writes 1, 3 and 4 to 512 bytes (1/8 of the data size) and get a WA as low as 2.375.
Lazy fsync also reduces WA for parallel workloads because journal blocks are only written when they fill up or fsync is requested.
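The 2.375 figure is just arithmetic over those five writes; here is a minimal sketch of the calculation, assuming writes 1, 3 and 4 shrink from 4 KB blocks to 512 bytes while the two data writes (2 and 5) stay at 4 KB:

```python
# Worked write-amplification numbers for one 4 KB client write.
DATA = 4096

def wa(meta_block_size):
    block_writes = 3 * meta_block_size  # writes 1, 3, 4: journal block, metadata block, another journal block
    data_writes = 2 * DATA              # writes 2, 5: journal copy of the data and the data block itself
    return (block_writes + data_writes) / DATA

print(wa(4096))  # 5.0   - every write is a full 4 KB block
print(wa(512))   # 2.375 - 512-byte journal/metadata block writes (e.g. on Optane)
```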
Hardware configuration: 4 nodes, each with:
CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.
All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).
Raw drive performance:
Ceph 15.2.4 (Bluestore):
T8Q64 tests were conducted over eight 400 GB RBD images from all hosts (every host was running 2 instances of fio), because Ceph has performance penalties related to running multiple clients over a single RBD image.
cephx_sign_messages was set to false during the tests; RocksDB and Bluestore settings were left at their defaults.
In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes. However, CPU usage and I/O latency were through the roof, as usual.
Vitastor:
T8Q64 read test was conducted over 1 larger inode (3.2 TB) from all hosts (every host was running 2 instances of fio). Vitastor has no performance penalties related to running multiple clients over a single inode. When the test was run from one node with all primary OSDs moved to other nodes, the result was slightly lower (689000 iops); this is because every operation then required a network round trip between the client and the primary OSD. When fio was colocated with OSDs (as in the Ceph benchmarks above), 1/4 of the read workload actually used the loopback network.
Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8 --disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096 --journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024 --journal_size 16777216`.
Vitastor:
Ceph:
NBD is currently required to mount Vitastor via the kernel, but it imposes extra overhead due to additional copying between the kernel and userspace. This mostly hurts linear bandwidth, not iops.
Vitastor with single-thread NBD on the same hardware:
To install on Debian:

- Add the Vitastor package signing key: `wget -q -O - https://vitastor.io/debian/pubkey | sudo apt-key add -`
- Add the Vitastor repository to /etc/apt/sources.list:
  - Debian 11 (Bullseye): `deb https://vitastor.io/debian bullseye main`
  - Debian 10 (Buster): `deb https://vitastor.io/debian buster main`
- For Debian 10 also enable the backports repository: `deb http://deb.debian.org/debian buster-backports main`
- Install the packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64`
To install on CentOS:

- Add the Vitastor package repository:
  - CentOS 7: `yum install https://vitastor.io/rpms/centos/7/vitastor-release-1.0-1.el7.noarch.rpm`
  - CentOS 8: `dnf install https://vitastor.io/rpms/centos/8/vitastor-release-1.0-1.el8.noarch.rpm`
- Enable EPEL: `yum/dnf install epel-release`
- Enable additional repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
- Enable elrepo-kernel:
  - CentOS 7: `yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm`
  - CentOS 8: `dnf install https://www.elrepo.org/elrepo-release-8.el8.elrepo.noarch.rpm`
- Install the packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`
To build from source, Vitastor needs QEMU and fio headers. Take them from a QEMU source/build tree and copy:

- `<qemu>/include` → `<vitastor>/qemu/include`
- `<qemu>/b/qemu/config-host.h` → `<vitastor>/qemu/b/qemu/config-host.h`
- `<qemu>/b/qemu/qapi` → `<vitastor>/qemu/b/qemu/qapi`
- On CentOS 8 you can install QEMU from the Advanced Virtualization repository (`yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`); in that case the files come straight from the QEMU tree: `<qemu>/config-host.h` → `<vitastor>/qemu/b/qemu/config-host.h`, `<qemu>/qapi` → `<vitastor>/qemu/b/qemu/qapi`, and for older QEMU versions `<qemu>/qapi-types.h` → `<vitastor>/qemu/b/qemu/qapi-types.h`

`config-host.h` and `qapi` are required because they contain generated headers. The repository also ships QEMU driver patches (`qemu-*.*-vitastor.patch`) for several QEMU versions.

Then:

- Get the fio source and symlink it into `<vitastor>/fio`.
- Build with `make -j8` and install with `make install` (optionally with `LIBDIR=/usr/lib64 QEMU_PLUGINDIR=/usr/lib64/qemu-kvm` if you’re using an RPM-based distro).

Please note that the startup procedure isn’t currently simple - you specify the configuration and calculate disk offsets almost by hand. This will be fixed in the near future.
- Disable CPU power saving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Install etcd and run it with the `--max-txn-ops=100000 --auto-compaction-retention=10 --auto-compaction-mode=revision` options.
- Create the global configuration: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'` (if all your drives have capacitors).
- Create the pool configuration: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`. For jerasure pools the configuration should look like the following: `2:{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- Calculate disk offsets with `node /usr/lib/vitastor/mon/simple-offsets.js --device /dev/sdX`.
- Create systemd units for your OSDs; look at `/usr/lib/vitastor/mon/make-units.sh` for example.
Notable configuration variables from the example:

- `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
- `immediate_commit all` - use this if all your drives are server-grade.
- `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
- `flusher_count 16` - a flusher is a micro-thread that removes old data from the journal. More flushers mean more aggressive journal flushing, which allows for more throughput but slightly hurts latency under lower load. Flushing will probably be improved in the future, because currently high queue depths sometimes lead to performance degradation.
- `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal block size of your SSDs, which is 4096 on most drives.
- `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector. Most (99%) SSDs don’t need this option, but the Intel D3-4510 does, because it doesn’t like it when you overwrite the same sector twice in a short period of time. The setting forces Vitastor to never overwrite the same journal sector twice in a row, which makes the D3-4510 almost happy. Not totally happy, because overwrites of the same block can still happen in the metadata area... When this setting is enabled, you also need to raise the `journal_sector_buffer_count` setting, which is the number of dirty journal sectors that may be written to at the same time.

Then:

- Run `systemctl start vitastor.target` everywhere.
- Start monitors: `node /usr/lib/vitastor/mon/mon-main.js --etcd_url 'http://10.115.0.10:2379,http://10.115.0.11:2379,http://10.115.0.12:2379,http://10.115.0.13:2379' --etcd_prefix '/vitastor' --etcd_start_timeout 5`.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become ‘active’.
- Run a test write load with fio, for example: `fio -thread -ioengine=/usr/lib/x86_64-linux-gnu/vitastor/libfio_cluster.so -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd=10.115.0.10:2379/v3 -pool=1 -inode=1 -size=400G`.
- Upload a VM image into Vitastor with qemu-img, for example: `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648'`
- Run QEMU with the Vitastor driver, for example: `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/block-vitastor.so qemu-system-x86_64 -enable-kvm -m 1024 -drive 'file=vitastor:etcd_host=10.115.0.10\:2379/v3:pool=1:inode=1:size=2147483648',format=raw,if=none,id=drive-virtio-disk0,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 -vnc 0.0.0.0:0`
- Remove an image (inode), for example: `vitastor-rm --etcd_address 10.115.0.10:2379/v3 --pool 1 --inode 1 --parallel_osds 16 --iodepth 32`
Copyright (c) Vitaliy Filippov (vitalif [at] yourcmc.ru), 2019+
You can also find me in the Russian Telegram Ceph chat: https://t.me/ceph_ru
All server-side code (OSD, Monitor and so on) is licensed under the terms of Vitastor Network Public License 1.0 (VNPL 1.0), a copyleft license based on GNU GPLv3.0 with the additional “Network Interaction” clause which requires opensourcing all programs directly or indirectly interacting with Vitastor through a computer network (“Proxy Programs”). Proxy Programs may be made public not only under the terms of the same license, but also under the terms of any GPL-Compatible Free Software License, as listed by the Free Software Foundation. This is a stricter copyleft license than the Affero GPL.
Basically, you can’t use the software in a proprietary environment to provide its functionality to users without opensourcing all intermediary components standing between the user and Vitastor or purchasing a commercial license from the author 😀.
Client libraries (cluster_client and so on) are dual-licensed under the same VNPL 1.0 and also GNU GPL 2.0 or later to allow for compatibility with GPLed software like QEMU and fio.
You can find the full text of VNPL-1.0 in the file VNPL-1.0.txt. GPL 2.0 is also included in this repository as GPL-2.0.txt.