
Ceph performance

7817 bytes added, 21:37, 30 May 2022


[[File:Ceph-funnel-en.svg|500px|right]] [[ru:Производительность Ceph]]

Ceph is a Software-Defined Storage system. It’s very feature-rich: it provides object storage, VM disk storage, a shared cluster filesystem and a lot of additional features. In some ways, it’s even unique.

It could be an excellent solution which you could take for free, immediately solve all your problems, become a cloud provider and earn piles of money. However, there is a subtle problem: PERFORMANCE. Rational people rarely want to lower the performance by 95 % in production. It seems cloud providers like AWS, GCP and Yandex don’t care: all of them run their clouds on top of their own in-house SDSes (not even Ceph), and all these SDSes are just as slow. :-) We don’t judge them, of course; that’s their own business.

This article describes which performance numbers you can achieve with Ceph and how. But I warn you: you won’t catch up with local SSDs. Local SSDs (especially NVMe) are REALLY fast right now; their latency is about 0.05 ms. It’s very hard for an SDS to achieve the same result, and beating it is almost impossible. The network alone eats those 0.05 ms...

'''UPDATE: It’s possible to achieve good latency with an SDS. I did it in my own project — Vitastor: https://vitastor.io :-) it's a block SDS architecturally similar to Ceph, but FAST. It achieved 0.14 ms latency (both read and write) in a cluster with SATA SSDs. Ceph only achieved 1 ms for writes and 0.57 ms for reads on the same hardware. See [https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md README] for details.'''

== General benchmarking principles ==

[[File:Warning icon.svg|32px|link=]] A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.

==== Lyrical digression ====

Why use this approach in benchmarking? After all, disk performance depends on many parameters, such as:

* Block size;

* Mode — read, write, or various mixed read/write modes;

* Parallelism — queue depth and the number of threads, in other words, the number of parallel I/O requests;

* Test duration;

* Initial disk state — empty, filled linearly, filled randomly, randomly written over a specific period of time;

* Data distribution — for example, 10% of hot data and 90% of cold data or hot data located in a certain place (e.g., at the beginning of the disk);

* Other mixed test modes, e.g., benchmarking using different block sizes at the same time.

The results can also be presented with varying levels of detail — you can provide graphs, histograms, percentiles, and so on in addition to mere average operation count or megabytes per second. This, of course, can reveal more information about the behavior of the disk under test.

Benchmarking also contains a bit of philosophy. For example, some manufacturers of server SSDs argue that you must do preconditioning by randomly overwriting the disk at least twice to fill translation tables before testing. I rather believe that it puts the SSD in unrealistically bad conditions rarely seen in real life.

Others say you should plot a graph of latency against the number of operations per second, but my opinion is that it’s also a bit strange because it implies that you plot a graph of F1(q) against F2(q) instead of «q» itself.

In short, benchmarking can be a never-ending process. It can take quite a few days to get a complete view. This is usually what resources like 3dnews do in their SSD reviews. But we don’t want to waste several days. We need a test that allows us to estimate performance quickly.

Therefore we isolate a few «extreme» modes, check the disk in them and pretend that other results are somewhere between these «extreme points», forming some kind of smooth function of the parameters. It’s also handy that each of these modes corresponds to a valid use case:

* Applications that mainly use linear or large-block access. For such applications, the crucial characteristic is the linear I/O speed in megabytes per second. Therefore, the first test mode is linear read/write with 4 MB blocks and medium queue depth — 16-32 operations. Test results should be in MB/s.

* Applications that use random small-block access and support parallelism. This leads us to 4 KB random I/O modes with a large queue depth of at least 128 operations. 4 KB is the standard block size for most filesystems and DBMSs. Multiple (2-4-8) CPU threads should be used if a single thread can’t saturate the drive during the test. Test results should include iops (I/O operations per second), but not latency. Latency is meaningless in this test because it can be arbitrarily increased just by increasing queue depth: latency is directly related to iops by the formula latency=queue/iops.

* Applications that use random small-block access and DO NOT support parallelism. There are more such applications than you might think; regarding writes, all transactional DBMSs are a notable example. This leads us to 4 KB random I/O test with queue depth of 1 and, for writes, with an fsync after each operation to prevent the disk or storage system from «cheating» by writing the data into a volatile cache. Results should include either iops or latency, but not both because, as already said, they directly relate to each other.
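The latency=queue/iops relation used in the last two modes can be sketched in a few lines. This is only an illustration: the function names and the 100k iops figure are made up for the example, not measured on any drive.

```python
def latency_ms(queue_depth: int, iops: float) -> float:
    """Average latency (ms) of one request when queue_depth requests are in flight."""
    return 1000.0 * queue_depth / iops


def iops_from_latency(queue_depth: int, latency_ms_value: float) -> float:
    """Inverse relation: iops achievable at a given queue depth and latency."""
    return 1000.0 * queue_depth / latency_ms_value


# Hypothetical drive that is already saturated at 100k iops:
# a deeper queue only makes each request wait longer.
for qd in (128, 256, 512):
    print(f"QD={qd}: {latency_ms(qd, 100_000):.2f} ms")

# With queue depth 1, iops and latency carry the same information:
# a 0.05 ms latency means 20000 iops.
print(f"{iops_from_latency(1, 0.05):.0f} iops")
```

This is why the parallel test reports only iops, and the queue-depth-1 test reports either iops or latency but not both.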

=== Test your Ceph cluster ===

*: Or https://github.com/vitalif/ceph-bench. The original idea comes from «Mark’s bench» from the Russian Ceph chat ([https://github.com/socketpair/ceph-bench the original, outdated tool was here]). Both use a non-replicated Ceph pool (size=1), create several 4 MB objects (16 by default) in each separate OSD and do random single-threaded 4 KB writes to randomly selected objects within one OSD. This mimics random writes to RBD and lets you identify problematic OSDs by benchmarking them separately.

*: To create the non-replicated benchmark pool use {{Cmd|ceph osd pool create bench 128 replicated; ceph osd pool set bench size 1; ceph osd pool set bench min_size 1}}. Just note that 128 (PG count) should be enough for all OSDs to get at least one PG each.

* S3 (rgw):

** [https://github.com/intel-cloud/cosbench cosbench]

** [https://github.com/markhpc/hsbench hsbench]

** [https://github.com/minio/warp minio warp]

Notes:

sockperf. On the first node, run <tt>sockperf sr -i IP --tcp</tt>. On the second, run <tt>sockperf pp -i SERVER_IP --tcp -m 4096</tt>. A decent average is around 0.05-0.07 ms.

<s>qperf. On the first node, just run <tt>qperf</tt>. On the second, <tt>qperf -vvs SERVER_IP tcp_lat -m 4096</tt>.</s> Don’t use qperf. It is super-stupid: it doesn’t disable Nagle (no TCP_NODELAY) and it ignores the <tt>-m 4096</tt> parameter, so the message size is always 1 BYTE in latency tests.

[[File:Warning icon.svg|32px|link=]] Warning: Ubuntu has AppArmor enabled by default and it affects network latency adversely. Disable it if you want good performance. The effect of AppArmor looks like this (Intel X520-DA2):

* centos 3.10: rtt min/avg/max/mdev = 0.039/0.053/0.132/0.012 ms

* ubuntu 4.x + apparmor: rtt min/avg/max/mdev = 0.068/0.163/0.230/0.029 ms

* ubuntu 4.x: rtt min/avg/max/mdev = 0.037/0.071/0.157/0.018 ms
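To compare such ping summaries without eyeballing them, a small parser is enough. This is a minimal sketch: the three strings are the ones quoted above, and the helper names are made up for the example.

```python
import re

# Ping summaries quoted in the text above (Intel X520-DA2).
PING_SUMMARIES = {
    "centos 3.10": "rtt min/avg/max/mdev = 0.039/0.053/0.132/0.012 ms",
    "ubuntu 4.x + apparmor": "rtt min/avg/max/mdev = 0.068/0.163/0.230/0.029 ms",
    "ubuntu 4.x": "rtt min/avg/max/mdev = 0.037/0.071/0.157/0.018 ms",
}

RTT_RE = re.compile(r"rtt min/avg/max/mdev = ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms")


def parse_rtt(summary: str) -> dict:
    """Extract min/avg/max/mdev (in ms) from the last line of ping output."""
    match = RTT_RE.search(summary)
    if match is None:
        raise ValueError(f"not a ping rtt summary: {summary!r}")
    lo, avg, hi, mdev = (float(x) for x in match.groups())
    return {"min": lo, "avg": avg, "max": hi, "mdev": mdev}


avg_rtt = {name: parse_rtt(line)["avg"] for name, line in PING_SUMMARIES.items()}
slowdown = avg_rtt["ubuntu 4.x + apparmor"] / avg_rtt["ubuntu 4.x"]
print(f"average RTT with AppArmor is {slowdown:.1f}x the RTT without it")
```

On these numbers the average round trip with AppArmor is roughly 2.3 times the one without it.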

== Why is it so slow ==

=== Expected performance ===

Estimating the cluster performance based on disks’ performance is absolutely wrong.

The real expected performance for Bluestore is like the following (iops applies to random 4KB reads/writes):

TODO: This section lacks random read performance comparisons.

Bluestore is the «new» storage layer of Ceph. All presentations and documents say it’s better in all ways, which indeed seems reasonable for something «new».

Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes — big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.

=== About block.db sizing ===

Who’s tired of spillovers? Everyone’s tired of spillovers! A spillover is when you use Bluestore in an SSD+HDD configuration and put Bluestore’s database (block.db) on the SSD partition, but it constantly spills over to the HDD. This often happens even though the SSD partition is much larger than the actual DB. Spillovers show up as a warning in <tt>ceph -s</tt> starting with Ceph 14 Nautilus. There’s also an attempt to fix them with additional RocksDB «allocation hints» in Ceph 15 Octopus; however, the situation is generally still the same as before.

Official documents say that you should allocate 4 % of the slow device space for block.db (Bluestore’s metadata partition). This is a lot; Bluestore rarely needs that amount of space.
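To see how much space the 4 % rule asks for, here is a quick calculation. The HDD sizes are hypothetical examples, not recommendations.

```python
def block_db_size_gb(slow_device_gb: float, percent: float = 4.0) -> float:
    """SSD space the 4 % rule requests for block.db, in GB."""
    return slow_device_gb * percent / 100.0


# Hypothetical HDD sizes in GB (decimal units, as on drive labels).
for hdd_gb in (4_000, 8_000, 16_000):
    print(f"{hdd_gb} GB HDD -> {block_db_size_gb(hdd_gb):.0f} GB block.db")
```

So a node with twelve hypothetical 8 TB HDDs would need 3840 GB of SSD for block.db alone under this rule.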

* …But trying to tune them is pointless, default configuration (1x5 for HDDs and 2x8 for SSDs) is optimal. The problem is that all worker threads still serialize writes into a single kv_sync_thread, and the whole scheme only scales up to ~6 worker threads.

* There is one thing that decreases latency 2-3 times at once. It’s disabling all power-save functions of CPUs:

** <tt>cpupower idle-set -D 10</tt> — this disables C-States (or you can pass <tt>processor.max_cstate=1 intel_idle.max_cstate=0</tt> to the kernel command-line)

** <tt>cpupower frequency-set -g performance</tt> or (for older versions) <tt>for i in $(seq 0 $((`nproc`-1))); do cpufreq-set -c $i -g performance; done</tt> — this disables frequency scaling.

* With power-save disabled, the CPU heats up like a GTX, but you get 2-3 times more iops.

* Drive cache in qemu is controlled by the <tt>cache</tt> option (surprise). It can be <missing>, writethrough, writeback, none, unsafe, directsync. With RBD this option also affects rbd cache, which is the cache on the Ceph client library (librbd) side.

* But cache=unsafe doesn’t work with RBD, it still waits for write confirmations. And writethrough, <missing> and directsync are basically equivalent.

* RBD cache helps a lot on HDDs, but it slows everything down in all-flash clusters. Something is implemented with locks, something is single-threaded, somebody tries to optimize it all, but the work isn’t done yet.

* There are the following drive emulation options: lsi (slowest), virtio-scsi (fast), virtio (fastest, but can’t do TRIM until QEMU 4.0). virtio-scsi can use multiple queues and thus should be the fastest with fast underlying storage (with a local NVMe?) — but it seems it doesn’t matter with Ceph.

* The filesystem also slows things down! Specifically it updates inode mtime on each small write if you don’t have lazytime enabled. mtime is part of the metadata, so this change is journaled, which makes the <tt>fio -sync=1 -iodepth=1 -direct=1</tt> test result 3-4 times worse when you run it over a file in FS.

== Quick insight into SSD and flash memory organization ==

The distinctive feature of NAND flash memory is that you can write it in small blocks, but erase only big block groups at once, and you must erase any block before overwriting it. The write unit is called a «page», the erase unit is called a «block». Actual NAND chips have 16 KB pages and 16-24 MB blocks (1024 pages for Micron MLC and 1536 pages for Micron TLC). This is probably because erasing is slow compared to writing, but it can be done for a lot of blocks at once (common sense suggests that erasing is ~1000 times slower than writing). Another distinctive feature is that the total number of erase/program cycles is physically limited: after several thousand cycles (a usual number for MLC memory) the block becomes faulty and stops accepting new writes or even loses the data previously written to it. Denser and cheaper (MLC/TLC/QLC, 2/3/4 bits per cell) memory chips have smaller erase limits, while sparser and more expensive ones (SLC, 1 bit per cell) have bigger limits (up to 100000 rewrites). However, all limits are still finite, so stupidly overwriting the same block would be very slow and would break the SSD very rapidly.
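The page and block sizes above are easy to check, and the same arithmetic shows why naive in-place overwrites would be so bad. This is a rough sketch: the «naive» model, in which every small write erases and rewrites a whole block, is an assumption for illustration and not how real SSDs work.

```python
PAGE_KB = 16  # page size quoted above


def block_size_mb(pages_per_block: int, page_kb: int = PAGE_KB) -> float:
    """Erase block size in MB for a given number of pages per block."""
    return pages_per_block * page_kb / 1024


def naive_write_amplification(write_kb: int, pages_per_block: int,
                              page_kb: int = PAGE_KB) -> float:
    """Bytes physically rewritten per byte requested if each small write
    forced an erase and rewrite of the whole block (naive model)."""
    return pages_per_block * page_kb / write_kb


print(block_size_mb(1024))  # Micron MLC: 1024 pages per block
print(block_size_mb(1536))  # Micron TLC: 1536 pages per block
print(naive_write_amplification(4, 1024))  # 4 KB writes into MLC blocks
```

With 1024 pages per block, a naive 4 KB overwrite would rewrite 16 MB, that is 4096 times more data than requested, and burn one erase cycle of the block each time.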

But that’s not the case with modern SSDs: even cheap models are very fast and usually very durable. But why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which means they’re as powerful as mobile phone processors. All that power is required to make FTL firmware run smoothly. FTL stands for «Flash Translation Layer» and it is the firmware responsible for translating addresses of small blocks into physical addresses on flash memory chips. Every write request is always put into a space freed in advance, and FTL just remembers the new physical location of the data. This makes writes very fast. FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells. This feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned, for example, Micron 5100 Max has 37.5 % of physical memory reserved (that is, raw capacity exceeds the user-visible capacity by 60 %).
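The two overprovisioning percentages are the same quantity measured against different bases. A short check follows; the 960 GB capacity is used only as an example input, and the resulting raw capacity is an estimate that assumes the 37.5 % figure applies to that model.

```python
def extra_over_user(reserved_fraction_of_raw: float) -> float:
    """Reserved space expressed as a fraction of user-visible capacity."""
    return reserved_fraction_of_raw / (1.0 - reserved_fraction_of_raw)


def raw_capacity_gb(user_gb: float, reserved_fraction_of_raw: float) -> float:
    """Raw flash capacity implied by user capacity and the reserved fraction."""
    return user_gb / (1.0 - reserved_fraction_of_raw)


reserved = 0.375  # 37.5 % of physical memory is reserved
print(f"extra over user capacity: {extra_over_user(reserved):.0%}")
print(f"raw flash behind 960 GB:  {raw_capacity_gb(960, reserved):.0f} GB")
```

In other words, 37.5 % of raw capacity reserved and 60 % extra on top of the user capacity describe the same drive.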

When I tried to lecture someone in the mailing list about «all SSDs doing fsyncs correctly» I got this as the reply: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf. Long story short, it says that in 2013 a common scenario was SSDs not syncing metadata on fsync calls at all which led to all kinds of funny things on a power loss, up to (!!!) total failures of some SSDs.

There also exist some very old SSDs without capacitors (OCZ Vector/Vertex) which are capable of very large sync iops numbers. How do they work? Nobody knows, but I suspect that they just don’t do safe writes :). The core principle of flash memory overwrites hasn’t changed in recent years. 5 years ago SSDs were also FTL-based, just as now.

So it seems there are two kinds of «power loss protection»: simple PLP means «we do fsyncs and don’t die or lose your data when a power loss occurs», and advanced PLP means that fsync’ed writes are just as fast as non-fsynced ones. It also seems that in recent years (2018-2019) simple PLP has become standard and most SSDs don’t lose data on power failure.

This limit is always sufficient to copy big files to a flash drive formatted with any common filesystem. One opened block receives metadata and another receives data, then it just moves on. But if you start doing random writes you stop hitting the opened blocks, and this is where lags come in.

== Bonus: Micron vSAN reference architecture ==

[https://media-www.micron.com/-/media/client/global/documents/products/other-documents/micron_vsan_6,-d-,7_on_x86_smc_reference_architecture.pdf Micron Accelerated All-Flash SATA vSAN 6.7 Solution]

Node configuration:

* 384 GB RAM 2667 MHz

* 2x Micron 5100 MAX 960 GB (randread: 93k iops, randwrite: 74k iops)

* 8x Micron 5200 ECO 3.84 TB (randread: 95k iops, randwrite: 17k iops)

* 2x Xeon Gold 6142 (16c 2.6GHz)

* Mellanox ConnectX-4 Lx

* Connected to 2x Mellanox SN2410 25GbE switches

«Aligns with VMware AF-6, aims up to 50K read iops per node»

* 2 replicas (like Ceph size=2)

* 4 nodes

* 4 VMs on each node

* 8 vmdk per VM

* 4 threads per vmdk

Total I/O parallelism: 512

100%/70%/50%/30%/0% write

* «Baseline» (fits in cache): 121k/178k/249k/314k/486k iops

* «Capacity» (doesn’t): 51k/66k/90k/134k/363k

* Latency is 1000*512/IOPS ms in all tests (1000ms * parallelism / iops)

* '''No latency tests with low parallelism'''

* '''No linear read/write tests'''

Conclusion:

* ~3800 write iops per drive

* ~11343 read iops per drive

* ~1600 write iops per drive when not in cache

* Parallel workload doesn’t look better than Ceph. vSAN is hyperconverged, though.
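The arithmetic behind these conclusions can be reproduced as follows. This is a sketch under one assumption: the per-drive figures are obtained by dividing cluster-wide iops by the 32 capacity-tier drives (4 nodes with 8 drives each), which matches the numbers listed above but is inferred rather than stated in the reference architecture.

```python
NODES = 4
VMS_PER_NODE = 4
VMDKS_PER_VM = 8
THREADS_PER_VMDK = 4
CAPACITY_DRIVES_PER_NODE = 8  # assumption: only the 8 capacity drives are counted

WRITE_PERCENT = (100, 70, 50, 30, 0)
BASELINE_IOPS = (121_000, 178_000, 249_000, 314_000, 486_000)
CAPACITY_IOPS = (51_000, 66_000, 90_000, 134_000, 363_000)

parallelism = NODES * VMS_PER_NODE * VMDKS_PER_VM * THREADS_PER_VMDK
drives = NODES * CAPACITY_DRIVES_PER_NODE


def latency_ms(iops: float) -> float:
    """Average latency implied by total iops at the test's parallelism."""
    return 1000 * parallelism / iops


def iops_per_drive(iops: float) -> float:
    """Cluster-wide iops divided evenly across the capacity drives."""
    return iops / drives


print(f"parallelism={parallelism}, drives={drives}")
for pct, base, cap in zip(WRITE_PERCENT, BASELINE_IOPS, CAPACITY_IOPS):
    print(f"{pct:3d}% write: baseline {latency_ms(base):5.2f} ms, "
          f"capacity {latency_ms(cap):5.2f} ms, "
          f"{iops_per_drive(cap):8.1f} iops/drive (capacity)")
```

Note that the read figure per drive comes from the «Capacity» run at 0 % write, while the first write figure comes from the «Baseline» run at 100 % write.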

== Good SSD models ==

* Micron 5100/5200, 9300. Maybe 5300, 7300 too

* Seagate Nytro 1351/1551

* HGST SN260

* At least until Nautilus: <tt>[global] debug objecter = 0/0</tt> (there is a big client-side slowdown)

* Try disabling rbd cache in the userspace driver (QEMU option cache=none)

* <s>For HDD-only or Bad-SSD-only setups, at least until it’s backported (it is): remove the handbrake https://github.com/ceph/ceph/pull/26909</s>