[[File:Ceph-funnel-en.svg|500px|right]]
[[ru:Производительность Ceph]]

Ceph is a Software-Defined Storage system. It is very feature-rich: it provides object storage, VM disk storage, a shared cluster filesystem and a lot of additional features. In some ways it is even unique.

It could be an excellent solution which you could take for free, immediately solve all your problems, become a cloud provider and earn piles of money. However, there is a subtle problem: PERFORMANCE. Rational people rarely want to lose 95 % of their performance in production. It seems cloud providers like AWS, GCP, Yandex don’t care — all of them run their clouds on top of their own crafted SDSes (not even Ceph) and all these SDSes are just as slow. :-) We don’t judge them, of course; that’s their own business.

This article describes which performance numbers you can achieve with Ceph and how. But I warn you: you won’t catch up with local SSDs. Local SSDs (especially NVMe) are REALLY fast right now, with latency around 0.05 ms. It’s very hard for an SDS to achieve the same result, and beating it is almost impossible. The network alone eats up those 0.05 ms...

'''UPDATE: It’s possible to achieve good latency with an SDS. I did it in my own project — Vitastor: https://vitastor.io :-) It’s a block SDS architecturally similar to Ceph, but FAST. It achieved 0.14 ms latency (both read and write) in a cluster with SATA SSDs, while Ceph only achieved 1 ms for writes and 0.57 ms for reads on the same hardware. See the [https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md README] for details.'''
== General benchmarking principles ==
Main test cases for benchmarking are:
* Linear read and write (big blocks, big long queue) in MB/s
* Highly parallel random read and write of small blocks (4-8kb, iodepth=32-128) in IOPS (Input/Output ops per second)
* Single-threaded transactional random write (4-8kb, iodepth=1) and read (though single-threaded reads are more rare) in IOPS
=== Test your disks ===
[https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc SSD Bench Google Docs]
Run `fio` on your drives before deploying Ceph:
{{Box|[[File:Warning icon.svg|32px|link=]] {{red|WARNING!}} For those living under a rock — the fio write test is DESTRUCTIVE. Don’t dare to run it on disks which contain important data… for example, on OSD journals (I’ve seen such cases).}}
* Try to disable drive cache before testing: {{Cmd|hdparm -W 0 /dev/sdX}} (SATA drives), {{Cmd|1=sdparm --set WCE=0 /dev/sdX}} (SAS drives). This is usually ABSOLUTELY required for server SSDs like Micron 5100 or Seagate Nytro (see [[#Drive cache is slowing you down]]), because it increases random write iops ''by more than two orders of magnitude'' (from 288 iops to 18000 iops!). In some cases it may not improve anything, so try both options: -W0 and -W1.
* Linear read: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX}}
* Linear write: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX}}
* Peak parallel random read: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randread -runtime=60 -filename=/dev/sdX}}
* Single-threaded read latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX}}
* Peak parallel random write: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX}}
* Journal write latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=write -runtime=60 -filename=/dev/sdX}}. Also try it with <tt>-fsync=1</tt> instead of <tt>-sync=1</tt> and write down the worst result, because sometimes one of sync or fsync is ignored by messy hardware.
* Single-threaded random write latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX}}
[[File:Warning icon.svg|32px|link=]] A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.
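For convenience, all of the above tests can be wrapped into a small script. This is only a minimal sketch, assuming bash and fio are installed and that the device passed as the first argument may be safely destroyed:
<pre>
#!/bin/bash
# Minimal sketch: run the fio tests listed above against one drive and print IOPS/bandwidth.
# WARNING: the write tests are DESTRUCTIVE, the device must not contain any needed data.
# Consider disabling the drive write cache first (hdparm -W 0 / sdparm --set WCE=0, see above).
DEV=${1:?usage: $0 /dev/sdX}

run() { echo "=== $1 ==="; shift; fio -ioengine=libaio -direct=1 -invalidate=1 -name=test \
        -runtime=60 -filename="$DEV" "$@" | grep -E 'IOPS|BW='; }

run "linear read"                 -bs=4M -iodepth=32  -rw=read
run "linear write"                -bs=4M -iodepth=32  -rw=write
run "peak random read"            -bs=4k -iodepth=128 -rw=randread
run "single-thread read latency"  -bs=4k -iodepth=1   -rw=randread
run "peak random write"           -bs=4k -iodepth=128 -rw=randwrite
run "journal write latency"       -bs=4k -iodepth=1 -fsync=1 -rw=write
run "single-thread write latency" -bs=4k -iodepth=1 -fsync=1 -rw=randwrite
</pre>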
==== Lyrical digression ====
Why use this approach in benchmarking? After all, disk performance depends on many parameters, such as:
* Block size;
* Mode — read, write, or various mixed read/write modes;
* Parallelism — queue depth and the number of threads, in other words, the number of parallel I/O requests;
* Test duration;
* Initial disk state — empty, filled linearly, filled randomly, randomly written over a specific period of time;
* Data distribution — for example, 10% of hot data and 90% of cold data or hot data located in a certain place (e.g., at the beginning of the disk);
* Other mixed test modes, e.g., benchmarking using different block sizes at the same time.
The results can also be presented with varying levels of detail — you can provide graphs, histograms, percentiles, and so on in addition to mere average operation count or megabytes per second. This, of course, can reveal more information about the behavior of the disk under test.
Benchmarking also contains a bit of philosophy. For example, some manufacturers of server SSDs argue that you must do preconditioning by randomly overwriting the disk at least twice to fill translation tables before testing. I rather believe that it puts the SSD in unrealistically bad conditions rarely seen in real life.
Others say you should plot a graph of latency against the number of operations per second, but my opinion is that it’s also a bit strange because it implies that you plot a graph of F1(q) against F2(q) instead of «q» itself.
In short, benchmarking can be a never-ending process. It can take quite a few days to get a complete view. This is usually what resources like 3dnews do in their SSD reviews. But we don’t want to waste several days. We need a test that allows us to estimate performance quickly.
Therefore we isolate a few «extreme» modes, check the disk in them and pretend that other results are somewhere between these «extreme points», forming some kind of a smooth function depending on the parameters. It’s also handy that each of these modes also corresponds to a valid use case:
* Applications that mainly use linear or large-block access. For such applications, the crucial characteristic is the linear I/O speed in megabytes per second. Therefore, the first test mode is linear read/write with 4 MB blocks and medium queue depth — 16-32 operations. Test results should be in MB/s.
* Applications that use random small-block access and support parallelism. This leads us to 4 KB random I/O modes with a large queue depth — at least 128 operations. 4 KB is the standard block size for most filesystems and DBMSs. Multiple (2-4-8) CPU threads should be used if a single thread can’t saturate the drive during the test. Test results should include iops (I/O operations per second), but not latency. Latency is meaningless in this test because it can be increased arbitrarily just by increasing the queue depth — latency is directly related to iops by the formula latency = queue depth / iops. For example, at queue depth 128 and 20000 iops the average latency will be 128/20000 s = 6.4 ms, regardless of how fast the drive actually is.
* Applications that use random small-block access and DO NOT support parallelism. There are more such applications than you might think; regarding writes, all transactional DBMSs are a notable example. This leads us to 4 KB random I/O test with queue depth of 1 and, for writes, with an fsync after each operation to prevent the disk or storage system from «cheating» by writing the data into a volatile cache. Results should include either iops or latency, but not both because, as already said, they directly relate to each other.
=== Test your Ceph cluster ===
Recommended benchmarking tools:
* The first recommended tool is again `fio`, this time with `-ioengine=rbd`. Run the following:
*# fio -ioengine=rbd -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=rpool_hdd -runtime=60 -rbdname=testimg
*# fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg
*# fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg
*: (replace rpool_hdd and testimg with your pool and image names). Then repeat with rw=read/randread. The idea is to test a) the best possible latency, b) linear bandwidth, c) random access iops.
*: Reading from an empty RBD image is very fast :) so pre-fill it before testing.
*: Run the tests from the node(s) where your actual RBD users will reside. The results are usually slightly better when you run the tests from a separate physical server.
* The same from inside a VM or through the kernel RBD driver (krbd):
*# fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -runtime=60 -filename=/dev/rbdX
*# fio -ioengine=libaio -direct=1 -sync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/rbdX
*# fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/rbdX
*: Don’t miss the added -sync=1 option. It is added on purpose, to match the ioengine=rbd test: ioengine=rbd has no concept of «sync» — all operations are always «sync» with it. And there’s no page cache involved either, so «direct» doesn’t mean anything for it. Overall this write pattern — transactional single-threaded write — corresponds to a DBMS.
*: Note that regardless of the supposed overhead of moving data in and out of the kernel, the kernel client is actually faster.
* ceph-gobench
*: https://github.com/rumanzov/ceph-gobench or https://github.com/vitalif/ceph-bench. The original idea comes from «Mark’s bench» from the russian Ceph chat ([https://github.com/socketpair/ceph-bench the original, now outdated, tool was here]). Both create several 4 MB objects (16 by default) on each separate OSD in a non-replicated (size=1) Ceph pool and do random single-thread 4 kb writes into randomly selected objects within one OSD. This mimics random writes to RBD and allows to find problematic OSDs by benchmarking them separately.
*: To create the non-replicated benchmark pool use {{Cmd|ceph osd pool create bench 128 replicated; ceph osd pool set bench size 1; ceph osd pool set bench min_size 1}}. Just note that 128 (PG count) should be enough for all OSDs to get at least one PG each.
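As noted above, reading from an empty RBD image is unrealistically fast, so create and pre-fill the test image before running the read tests. A minimal sketch, assuming the placeholder pool and image names (rpool_hdd, testimg) from the examples above; the size is arbitrary and the syntax may differ slightly between Ceph releases:

{{Cmd|1=rbd create -s 10G rpool_hdd/testimg}}
{{Cmd|1=fio -ioengine=rbd -direct=1 -name=fill -bs=4M -iodepth=16 -rw=write -pool=rpool_hdd -rbdname=testimg}}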
* S3 (rgw):
** [https://github.com/intel-cloud/cosbench cosbench]
** [https://github.com/markhpc/hsbench hsbench]
** [https://github.com/minio/warp minio warp]

Notes:
* Never use dd to test disk performance.
* Don’t use `rados bench`. It creates a small number of objects (1-2 per thread), so all of them always reside in cache and the results come out far better than they should be.
* You can also use the simple `rbd bench`, but fio is better.

=== Test your network ===

* ping -f (flood ping).
* sockperf. On the first node, run <tt>sockperf sr -i IP --tcp</tt>. On the second, run <tt>sockperf pp -i SERVER_IP --tcp -m 4096</tt>. A decent average number is around 0.05-0.07 ms.
* <s>qperf. On the first node, just run <tt>qperf</tt>. On the second, <tt>qperf -vvs SERVER_IP tcp_lat -m 4096</tt>.</s> Don’t use qperf. It is super-stupid: it doesn’t disable Nagle (no TCP_NODELAY) and it doesn’t honor the <tt>-m 4096</tt> parameter — message size is always set to 1 BYTE in latency tests.

[[File:Warning icon.svg|32px|link=]] Warning: Ubuntu has AppArmor enabled by default and it affects network latency adversely. Disable it if you want good performance. The effect of AppArmor looks like the following (Intel X520-DA2):
* centos 3.10: rtt min/avg/max/mdev = 0.039/0.053/0.132/0.012 ms
* ubuntu 4.x + apparmor: rtt min/avg/max/mdev = 0.068/0.163/0.230/0.029 ms
* ubuntu 4.x: rtt min/avg/max/mdev = 0.037/0.071/0.157/0.018 ms
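A minimal sketch of disabling AppArmor on Ubuntu (an assumption on my part; check the exact procedure for your release, and note that a reboot or <tt>aa-teardown</tt> may be needed to unload already loaded profiles):

{{Cmd|systemctl stop apparmor}}
{{Cmd|systemctl disable apparmor}}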
== Why is it so slow ==
# Ceph isn’t slow for linear reads and writes.
# Ceph isn’t slow on HDDs: the theoretical single-thread random write performance of Bluestore is 66 % (2/3) of your drive’s IOPS (currently it’s 33 % in practice, but if you push this handbrake down: https://github.com/ceph/ceph/pull/26909 it goes back to 66 %), and multi-threaded read/write performance reaches almost 100 % of the raw drive speed.
'''However''', the naive expectation is that if you replace your HDDs with SSDs and use a fast network, Ceph should become almost as fast as the underlying hardware. Everyone is used to the idea that I/O is slow and software is fast. And this is generally NOT true with Ceph.
Ceph is a Software-Defined Storage system, and its «software» is a significant overhead. The general rule currently is: with Ceph it’s hard to achieve random read latencies below 0.5 ms and random write latencies below 1 ms, '''no matter what drives or network you use'''. With one thread, this stands for only 2000 random read iops and 1000 random write iops, and even if you manage to achieve this result you’re already in good shape. With best-in-slot hardware and some tuning you may be able to improve it further, but only by a factor of two or so.
But does latency matter? Yes, it does, when it comes to single-threaded (synchronous) random reads or writes. Basically, all software that wants the data to be durable does fsync() calls, which serialize writes. For example, all DBMSs do. So to understand the performance limit of these applications you should benchmark your cluster with iodepth=1.
Latency doesn’t scale with the number of servers, or with OSDs-per-SSD, or with two-RBD-in-RAID0. When you benchmark your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (a PG is a triplet or a pair of OSDs). The result is only affected by how fast a single OSD processes a single request. In fact, with iodepth=1, IOPS = 1/latency. There is Nick Fisk’s presentation titled «Low-latency Ceph»; by «low latency» he means 0.7 ms, which is only ~1500 iops.
== CAPACITORS! ==
One important thing to note is that '''all writes in Ceph are transactional''', even ones that aren’t specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent [[#RAID WRITE HOLE]]-like situations.
To make it clearer: this means that Ceph '''does not use any drive write buffers'''. It does quite the opposite — it clears all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.
This makes typical desktop SSDs perform absolutely terribly as a Ceph journal in terms of write IOPS. The numbers you can expect are somewhere between 100 and 1000 (or 500—2000) iops, while you’d probably like to see at least 10000 (even a noname Chinese SSD can do 10000 iops without fsync).
So your disks should also be benchmarked with '''-iodepth=1 -fsync=1''' (or '''-sync=1''', see [[#O_SYNC vs fsync vs hdparm -W 0]]).
The thing that will really help us build our lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs:
[[File:Micron 5100 sata m2.jpg]]
Supercaps work like a built-in UPS for an SSD and allow it to flush the DRAM cache into persistent (flash) memory when a power loss occurs. Thus the cache becomes effectively «non-volatile», and thus the SSD can safely ignore fsync (FLUSH CACHE) requests, because it’s confident that the cache contents will always make their way to persistent memory.
And this increases '''transactional write IOPS, making it equal to non-transactional'''.
Supercaps are usually called «enhanced/advanced power loss protection» in datasheets. This feature is almost exclusively found in «server-grade» SSDs (and not even in all of them). For example, Intel DC S4600 has supercaps and Intel DC S3100 doesn’t.
{{Note}} This is the main difference between server and desktop SSDs. An average user doesn’t need transactions, but servers run DBMSs, and DBMSs want them really, really bad.
And… Ceph wants them too. :) So you should '''only''' buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe — an NVMe without capacitors is WORSE than a SATA drive with them. Desktop NVMes do 150000+ write iops without syncs, but only 600—1000 iops with them.
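You can check this behaviour on any drive yourself by comparing the same single-threaded random write test with and without fsync. A minimal sketch (destructive, /dev/sdX is a placeholder); if the two numbers are close, the cache is non-volatile (or the drive simply ignores flushes, which is dangerous):

{{Cmd|1=fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=30 -filename=/dev/sdX}}
{{Cmd|1=fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=30 -filename=/dev/sdX}}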
Another option is Intel Optane. Optane drives are also SSDs, but based on different physics — Phase-Change Memory instead of flash. Specs say these drives deliver up to 550000 iops; PCM has no need to erase blocks, and thus no need for a write cache and supercaps. But even if Optane’s latency is 0.005 ms (it is), Ceph’s latency is still 0.5 ms, so it’s pointless to use them with Ceph — you get the same performance for a lot more money compared to usual server SSDs/NVMes.
== Bluestore vs Filestore ==
Bluestore is the «new» storage layer of Ceph. All presentations and documents say it’s better in all ways, which indeed seems reasonable for something «new». Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes — big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.
Bluestore is also more feature-rich: it has checksums, compression, erasure-coded overwrites and virtual clones. Checksums allow 2x-replicated pools to self-heal better, erasure-coded overwrites make EC usable for RBD and CephFS, and virtual clones make VMs run faster after taking a snapshot.
In HDD-only (or bad-SSD-only) setups Bluestore is also 2x faster than Filestore for random writes. This is again because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact, it’s fair to say that Bluestore’s deferred write implementation is really optimal for transactional writes to slow drives.
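A minimal sketch of enabling that option, assuming a release where bluefs_preextend_wal_files exists and the monitor-based config store is available (Nautilus and later); on older versions put it into the [osd] section of ceph.conf instead:

{{Cmd|1=ceph config set osd bluefs_preextend_wal_files true}}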
=== HDD for data + SSD for journal ===
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percentage. This is very convenient because it makes the journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on an SSD. It also has a sort of «journal» called the «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush thread. You can increase the maximum number of deferred requests, but after the queue fills up the performance drops until the OSD is restarted.
So, Bluestore’s performance is very consistent, but it’s worse than peak performance of Filestore for the same hardware. In other words, Bluestore OSD refuses to do random writes faster than the HDD can do them on average.
With Filestore you easily get 1000—2000 iops while the journal is not full. With Bluestore you only get 100—300 iops regardless of the SSD journal, but these are absolutely stable over time and never drop.
In All-Flash clusters Bluestore’s own latency is usually 30-50 % greater than Filestore’s. However, this only refers to the latency of Bluestore itself, so in absolute terms these 30-50 % amount to something around 0.1 ms, which is hard to notice against Ceph’s total latency. And even though the latency is greater, peak parallel throughput is usually slightly better (+5..10 %) and peak CPU usage is slightly lower (-5..10 %).
But it’s still a shame that the increase is only 5-10 % for that amount of architectural effort.
=== RAM usage ===
Bluestore uses a lot more RAM than Filestore, because it stores all metadata in RocksDB, additionally caches some of it by itself, and also tries to cache some data blocks to compensate for the lack of page cache usage. The general rule of thumb is 1 GB of RAM per 1 TB of storage, but not less than 2 GB in total.
== RAID WRITE HOLE ==
Even if we’re talking about RAID, the thing that is much simpler than distributed software-defined storage like Ceph, we’re still talking about a distributed storage system — every system that has multiple physical drives is distributed, because each drive behaves and commits the data (or doesn’t commit it) independently of others.
Write Hole is the name for several situations in RAID arrays where drives go out of sync. Suppose you have a simple RAID1 array of two disks. You write a sector. You send the write command to both drives. And then a power failure occurs before the commands finish. Now, after the system boots again, you don’t know whether your replicas contain the same data, because you don’t know which drive succeeded in writing it and which didn’t.
You say OK, I don’t care. I’ll just read from both drives and if I encounter different data I’ll just pick one of the copies, and I’ll either get the old data or the new.
But then imagine that you have RAID 5. Now you have three drives: two for data and one for parity. Now suppose that you overwrite a sector again. Before the write, your disks contain: (A1), (B1) and (A1 XOR B1). You want to overwrite (B1) with (B2). To do so you write (B2) to the second disk and (A1 XOR B2) to the third. A power failure occurs again… and then, at the next boot, you also find out that disk 1 (the one you didn’t write anything to) is dead. You might think that you can still reconstruct your data because you have RAID 5 and 2 disks out of 3 are still alive.
But imagine that disk 2 succeeded in writing the new data while disk 3 failed (or vice versa). Now you have: (lost disk), (B2) and (A1 XOR B1). If you try to reconstruct A from these copies you’ll get (A1 XOR B1 XOR B2), which is obviously not equal to A1. Bang! Your RAID5 has corrupted data that you weren’t even writing at the time of the power loss.
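A quick numeric illustration of the above, as plain shell arithmetic with arbitrarily chosen single-byte «sectors»:
<pre>
# A1=0x05, B1=0x03, so the parity disk holds A1 XOR B1 = 0x06.
# Disk 2 gets the new B2=0x08, the parity update is lost, disk 1 dies.
A1=0x05; B1=0x03; B2=0x08
P=$(( A1 ^ B1 ))                                    # old parity, still on disk 3
printf 'reconstructed A = 0x%02x\n' $(( P ^ B2 ))   # prints 0x0e, not the original 0x05
</pre>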
Because of this problem, Linux `mdadm` refuses to start an incomplete array after an unclean shutdown at all. There’s no solution to this problem except full data journaling at the level of each disk drive. And this is… exactly what Ceph does! So Ceph is actually safer than RAID. Slower, but safer :)
== Quick insight into SSD and flash memory organization ==
Writing to flash memory in small random blocks ought to be slow, because flash can only be erased in large groups of blocks before being rewritten. But that’s not the case with modern SSDs — even cheap models are very fast and usually very durable. Why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which means they’re as powerful as mobile phones’ processors. All that power is required to make the FTL firmware run smoothly. FTL stands for «Flash Translation Layer»; it is the firmware responsible for translating addresses of small blocks into physical addresses on the flash memory chips. Every write request is always put into space freed in advance, and the FTL just remembers the new physical location of the data. This makes writes very fast. The FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells. This feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned; for example, the Micron 5100 Max has 37.5 % of its physical memory reserved (an extra 60 % on top of the user-visible capacity).
And it is also the FTL that makes power loss protection a problem. Mapping tables are metadata which must also be flushed to non-volatile memory when you flush the cache, and this is what makes desktop SSDs slow with fsync… In fact, as I was writing this I thought that they could use RocksDB or a similar LSM-tree based system to store the mapping tables, and that could make fsyncs fast even without capacitors. It would lead to some waste of journal space and some extra write amplification (as every journal block would only contain 1 write), but it would still make writes fast. So… either they don’t know about LSM trees, or FTL metadata is not the only problem for fsync.
When I tried to lecture someone on the mailing list about «all SSDs doing fsyncs correctly», I got this as the reply: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf. Long story short, it says that in 2013 a common scenario was SSDs not syncing metadata on fsync calls at all, which led to all kinds of funny things on a power loss, up to (!!!) total failures of some SSDs.
There also exist some very old SSDs without capacitors (OCZ Vector/Vertex) which are capable of very high sync iops numbers. How do they work? Nobody knows, but I suspect that they just don’t do safe writes :). The core principle of flash memory overwrites hasn’t changed in recent years: 5 years ago SSDs were also based on FTLs, just as they are now.
So it seems there are two kinds of «power loss protection»: simple PLP means «we do fsyncs and don’t die or lose your data when a power loss occurs», and advanced PLP means that fsync’ed writes are just as fast as non-fsynced ones. It also seems that nowadays (2018—2019) simple PLP is already the standard and most SSDs don’t lose data on power failure.
Why are USB flash drives so slow then? In terms of small random writes they usually only deliver 2-3 operations per second, while being powered by similar flash memory chips — maybe slightly cheaper and worse ones, but obviously not 1000 times worse.
The answer also lies in the FTL. Thumb drives also have an FTL and they even have some Wear Leveling, but it’s very small and dumb compared to SSD FTLs. It has a slow CPU and only a little memory. Thus it doesn’t have room to store a full mapping table for small blocks, so it translates the positions of big blocks (1-2 megabytes or even bigger) instead. Writes are buffered and then flushed one block at a time; there is a small limit on the number of blocks that can be open at once, usually only between 3 and 6.
This limit is always sufficient for copying big files to a flash drive formatted in any common filesystem. One opened block receives the metadata and another receives the data, then it just moves on. But if you start doing random writes, you stop hitting the opened blocks, and this is where the lags come in.
== Bonus: Micron vSAN reference architecture ==
[https://media-www.micron.com/-/media/client/global/documents/products/other-documents/micron_vsan_6,-d-,7_on_x86_smc_reference_architecture.pdf Micron Accelerated All-Flash SATA vSAN 6.7 Solution]
Node configuration:
* 384 GB RAM 2667 MHz
* 2X Micron 5100 MAX 960 GB (randread: 93k iops, randwrite: 74k iops)
* 8X Micron 5200 ECO 3.84TB (randread: 95k iops, randwrite: 17k iops)
* 2x Xeon Gold 6142 (16c 2.6GHz)
* Mellanox ConnectX-4 Lx
* Connected to 2x Mellanox SN2410 25GbE switches
«Aligns with VMWare AF-6, aims up to 50K read iops per node»
* 2 replicas (like Ceph size=2)
* 4 nodes
* 4 VMs on each node
* 8 vmdk per VM
* 4 threads per vmdk
Total I/O parallelism: 512
100%/70%/50%/30%/0% write
* «Baseline» (fits in cache): 121k/178k/249k/314k/486k iops
* «Capacity» (doesn’t): 51k/66k/90k/134k/363k
* Latency is 1000*512/IOPS ms in all tests (1000ms * parallelism / iops)
* '''No latency tests with low parallelism'''
* '''No linear read/write tests'''
Conclusion:
* ~3800 write iops per drive
* ~11343 read iops per drive
* ~1600 write iops per drive when not in cache
* Parallel workload doesn’t look better than Ceph. vSAN is hyperconverged, though.
== Good SSD models ==
* Micron 5100/5200, 9300. Maybe 5300, 7300 too
* Seagate Nytro 1351/1551
* HGST SN260
* Intel P4500
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc
== Conclusion ==
Quick guide for optimizing Ceph for random reads/writes:
* Only use SSDs and NVMe with supercaps. A hint: 99 % of desktop SSDs/NVMe don’t have supercaps.
* Disable their cache with hdparm -W 0.
* Disable powersave: governor=performance, cpupower idle-set -D 0
* Disable signatures:
*: <tt>cephx_require_signatures = false</tt>
*: <tt>cephx_cluster_require_signatures = false</tt>
*: <tt>cephx_sign_messages = false</tt>
*: (and use <tt>-o nocephx_require_signatures,nocephx_sign_messages</tt> for rbd map and cephfs kernel mounts)
* For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs)
* At least until Nautilus: <tt>[global] debug objecter = 0/0</tt> (there is a big client-side slowdown)
* Try to disable rbd cache in the userspace driver (QEMU options cache=none)
* <s>For HDD-only or Bad-SSD-Only and at least until it’s backported (it is) — remove the handbrake https://github.com/ceph/ceph/pull/26909</s>
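A consolidated ceph.conf sketch of the settings above. This is only a sketch: the bluestore_* names are my assumption about the full option names behind the short ones used above, availability and defaults vary between releases, and the allocation-size options only take effect for newly created OSDs:
<pre>
[global]
debug_objecter = 0/0
cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false

[osd]
# assumed full names of «min_alloc_size» and «prefer_deferred_size_ssd» mentioned above
bluestore_min_alloc_size_ssd = 4096
bluestore_prefer_deferred_size_ssd = 0
</pre>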