Ceph performance
This article describes what performance you can achieve with Ceph and how. But I warn you: you won’t catch up with local SSDs. Local SSDs (especially NVMe) are REALLY fast right now; their latency is about 0.05 ms. It’s very hard for an SDS to achieve the same result, and beating it is almost impossible. The network alone eats up those 0.05 ms...
 
'''UPDATE: It’s possible to achieve good latency with an SDS. I did it in my own project — Vitastor: https://vitastor.io :-) it's a block SDS architecturally similar to Ceph, but FAST. It achieved 0.14 ms latency (both read and write) in a cluster with SATA SSDs. Ceph only achieved 1 ms for writes and 0.57 ms for reads on the same hardware. See [https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md README] for details.'''
== General benchmarking principles ==
[[File:Warning icon.svg|32px|link=]] A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.
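For example, with an LVM-based deployment you could reserve a small logical volume for benchmarks before handing the rest of the drive to ceph-volume. This is only a sketch: the device, VG and LV names here are made up, and the exact layout depends on your deployment tooling.

 # Reserve a small LV for future fio runs, give the rest to the OSD
 pvcreate /dev/nvme0n1
 vgcreate ceph-nvme0 /dev/nvme0n1
 lvcreate -n bench -L 10G ceph-nvme0           # spare volume kept for benchmarking
 lvcreate -n osd-block -l 100%FREE ceph-nvme0  # everything else goes to Bluestore
 ceph-volume lvm create --data ceph-nvme0/osd-block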
 
==== Lyrical digression ====
 
Why use this approach in benchmarking? After all, disk performance depends on many parameters, such as:
* Block size;
* Mode — read, write, or various mixed read/write modes;
* Parallelism — queue depth and the number of threads, in other words, the number of parallel I/O requests;
* Test duration;
* Initial disk state — empty, filled linearly, filled randomly, randomly written over a specific period of time;
* Data distribution — for example, 10% of hot data and 90% of cold data or hot data located in a certain place (e.g., at the beginning of the disk);
* Other mixed test modes, e.g., benchmarking using different block sizes at the same time.
 
The results can also be presented with varying levels of detail — you can provide graphs, histograms, percentiles, and so on in addition to mere average operation count or megabytes per second. This, of course, can reveal more information about the behavior of the disk under test.
 
Benchmarking also contains a bit of philosophy. For example, some manufacturers of server SSDs argue that before testing you must do preconditioning by randomly overwriting the disk at least twice to fill the translation tables. I tend to believe that it puts the SSD into unrealistically harsh conditions rarely seen in real life.
 
Others say you should plot a graph of latency against the number of operations per second, but in my opinion that’s also a bit strange, because both latency and iops are functions of the same parameter, the queue depth q, so such a plot shows F1(q) against F2(q) instead of plotting each of them against «q» itself.
 
In short, benchmarking can be a never-ending process. It can take quite a few days to get a complete view. This is usually what resources like 3dnews do in their SSD reviews. But we don’t want to waste several days. We need a test that allows us to estimate performance quickly.
 
Therefore we isolate a few «extreme» modes, check the disk in them and assume that the other results lie somewhere between these «extreme points», forming a more or less smooth function of the parameters. It’s also handy that each of these modes corresponds to a valid use case (example fio commands for each mode are sketched after the list):
 
* Applications that mainly use linear or large-block access. For such applications, the crucial characteristic is the linear I/O speed in megabytes per second. Therefore, the first test mode is linear read/write with 4 MB blocks and medium queue depth — 16-32 operations. Test results should be in MB/s.
* Applications that use random small-block access and support parallelism. This leads us to 4 KB random I/O modes with a large queue depth of at least 128 operations. 4 KB is the standard block size for most filesystems and DBMSes. Multiple (2-4-8) CPU threads should be used if a single thread can’t saturate the drive during the test. Test results should include iops (I/O operations per second), but not latency. Latency is meaningless in this test because it can be increased arbitrarily just by increasing the queue depth: latency is directly related to iops by the formula latency = queue depth / iops.
* Applications that use random small-block access and DO NOT support parallelism. There are more such applications than you might think; regarding writes, all transactional DBMSes are a notable example. This leads us to a 4 KB random I/O test with a queue depth of 1 and, for writes, with an fsync after each operation to prevent the disk or storage system from «cheating» by writing the data into a volatile cache. Results should include either iops or latency, but not both, because, as already said, they are directly related to each other.
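For reference, here is a sketch of fio commands matching these three modes. /dev/sdX is a placeholder; the write tests are destructive, so run them against an unused disk or the spare partition mentioned above, and adjust runtime and numjobs to your environment.

 # 1. Linear read and write, 4 MB blocks, queue depth 32 (results in MB/s)
 fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX
 fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX
 # 2. Peak parallel random I/O, 4 KB blocks, queue depth 128 (results in iops);
 #    add -numjobs=4 -group_reporting if a single thread can't saturate the drive
 fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randread -runtime=60 -filename=/dev/sdX
 fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX
 # 3. Single-queue random I/O, 4 KB blocks, queue depth 1, fsync after each write
 #    (the "transactional DBMS" case; results in iops or latency)
 fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX
 fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX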
=== Test your Ceph cluster ===
=== About block.db sizing ===
As usual, there’s something wrong with it. Who’s tired of spillovers? Everyone’s tired of spillovers! A spillover is when you use Bluestore in an SSD+HDD configuration, putting Bluestore’s database (block.db) on the SSD partition, but it constantly spills over to the HDD. This often happens even though the SSD partition is much larger than the actual DB. Spillovers are reported with a warning in `ceph -s` starting with Ceph 14 Nautilus. There’s also an attempt to fix them with additional RocksDB «allocation hints» in Ceph 15 Octopus; however, the situation generally remains the same as before.
Official documents say that you should allocate 4% of the slow device space for block.db (Bluestore’s metadata partition). This is a lot (480 GB for a 12 TB HDD, for example), and Bluestore rarely needs that much space.
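As a quick sanity check, you can look at the BlueFS usage counters of a running OSD to see whether its DB has already spilled over to the slow device. This is a sketch: the counter names below are those of Nautilus/Octopus-era releases and may differ in newer ones.

 # Dump BlueFS space usage for osd.0 via the admin socket
 ceph daemon osd.0 perf dump bluefs
 # db_total_bytes / db_used_bytes  - size and usage of the block.db partition
 # slow_used_bytes                 - bytes of BlueFS data placed on the slow (HDD)
 #                                   device; anything non-zero means a spillover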
== Bonus: Micron vSAN reference architecture ==
 
[https://media-www.micron.com/-/media/client/global/documents/products/other-documents/micron_vsan_6,-d-,7_on_x86_smc_reference_architecture.pdf Micron Accelerated All-Flash SATA vSAN 6.7 Solution]
Node configuration:
* 2x Xeon Gold 6142 (16c 2.6GHz)
* Mellanox ConnectX-4 Lx
* Connected to 2x Mellanox SN2410 25GbE switches
«Aligns with VMWare AF-6, aims up to 50K read iops per node»
* 2 replicas (like Ceph size=2)
* 4 nodes
* 4 VMs on each node
* 8 vmdk per VM
* 4 threads per vmdk
Total I/O parallelism: 512
Results for 100%/70%/50%/30%/0% write mixes:
* «Baseline» (fits in cache): 121k/178k/249k/314k/486k iops
* «Capacity» (doesn’t): 51k/66k/90k/134k/363k
* Latency is 1000*512/IOPS ms in all tests (latency in ms = 1000 * parallelism / iops), e.g. roughly 10 ms at 51k iops
* '''No latency tests with low parallelism'''
* '''No linear read/write tests'''
Conclusion: