Ceph performance

The latency doesn’t scale with the number of servers, with OSDs per SSD, or with two RBDs in RAID0. When you benchmark your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (a PG is a triplet or a pair of OSDs), so the result is only determined by how fast a single OSD responds to a single request. In fact, with iodepth=1, IOPS = 1/latency. There is Nick Fisk’s presentation titled «Low-latency Ceph»; by «low-latency» he means 0.7 ms, which is only ~1500 iops.
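For example, a single-thread latency test against an RBD image can be run like this (just a sketch: the pool and image names «rbd» and «testimg» are placeholders, and the test overwrites data in that image):

<pre>
fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -pool=rbd -rbdname=testimg
</pre>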
=== Micron setup example ===

Here’s an example setup from Micron. They used 2x replication, very costly CPUs (2x Xeon Gold per server), a very fast network (100G) and 10x of the best NVMes they had in each of the 4 nodes: https://www.micron.com/resource-details/30c00464-e089-479c-8469-5ecb02cfe06f

They only got 350000 peak write iops with high parallelism, at 100 % CPU load. It may seem a lot, but if you divide it by the number of NVMes — 350000/40 NVMe — it’s only 8750 iops per NVMe. If we account for 2 replicas and WAL we get 8750*2*2 = 35000 iops per drive. So… Ceph only squeezed 35000 iops out of an NVMe '''that can deliver 260000 iops alone'''. That’s what the Ceph overhead is. Also, there are no single-thread latency tests in that PDF, which would have been very interesting to see.

== CAPACITORS! ==

One important thing to note is that '''all writes in Ceph are transactional''', even ones that aren’t specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent [[#RAID WRITE HOLE]]-like situations.
To make it clear: this means that Ceph '''does not use any drive write buffers'''. It does quite the opposite — it clears all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.
So your disks should also be benchmarked with '''-iodepth=1 -fsync=1''' (or '''-sync=1''', see [[#O_SYNC vs fsync vs hdparm -W 0]]).
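A minimal sketch of such a test for a raw drive (replace /dev/sdX with the actual device; the test destroys data on it):

<pre>
fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX
</pre>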
 
The thing that will really help us build our lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs.
RocksDB puts L2 files into block.db only if it’s at least 2560+256+1024 MB (almost 4 GB). It puts L3 there only if it’s at least 25600+2560+256+1024 MB (almost 30 GB), and L4 files only if it’s at least 256000+25600+2560+256+1024 MB (roughly 286 GB).
In other words, all block.db sizes except 4, 30 and 286 GB are pointless: Bluestore won’t use anything above the previous «round» size, at least if you don’t change RocksDB settings. And of these, 4 GB is too small and 286 GB is too big.
So just stick with 30 GB for all Bluestore OSDs :)
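For example, an OSD with a separate ~30 GB block.db partition could be created with ceph-volume roughly like this (a sketch: /dev/sdb and /dev/nvme0n1p1 are placeholder device names):

<pre>
ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/nvme0n1p1
</pre>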
=== O_SYNC vs fsync vs hdparm -W 0 ===
SATA and SCSI drives have two ways of flushing cache: the FLUSH CACHE command and the FUA (Force Unit Access) flag for writes. The first is an explicit flush, the second is an instruction to write the data directly to media. To be more precise, there is a FUA flag in SCSI/SAS, but the situation is unclear with SATA: it’s there in the NCQ spec, but in practice most drives don’t support it.
It seems that fsync() sends the FLUSH CACHE command and opening a file with O_SYNC sets the FUA bit.
Does it make any difference? Usually no. But sometimes, depending on the exact controller and/or its configuration, there may be a difference. In this case '''fio -sync=1''' and '''fio -fsync=1''' start to give different results. In some cases drives just ignore one of the flush methods.
In addition to that, SATA/SAS drives also have a cache disable command. When you disable the cache, Linux stops sending flushes at all. It may seem that this should also result in the same performance as fsync/O_SYNC, but that’s not the case either! SSDs with supercaps give '''much''' better performance with the cache disabled. For example, a Seagate Nytro 1351 gives you 288 iops with the cache enabled and 18000 iops with it disabled (!).
Why? It seems that’s because FLUSH CACHE is interpreted by the drive as a «please flush all caches, including the non-volatile cache» command, while «disable cache» is interpreted as «please disable the volatile cache, but you may leave the non-volatile one on if you want to». This makes writes with a flush after every write slower than writes with the cache disabled.
What about NVMe? NVMe has slightly less variability: there is no «disable cache» command in the NVMe spec at all, but just as in the SATA spec there are the FLUSH CACHE command and the FUA bit. However, based on personal experience it seems that FUA is often ignored with NVMe either by Linux or by the drive itself, so '''fio -sync=1''' gives the same results as '''fio -direct=1''' without any sync flags. '''-fsync=1''' works correctly and brings performance down to where it belongs (1000-2000 iops for desktop NVMes).
P.S.: Bluestore uses fsync, Filestore uses O_SYNC.
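A sketch of how the three variants can be compared on a single drive (/dev/sdX is a placeholder; all three tests overwrite data on it):

<pre>
# 1. Flush after every write (FLUSH CACHE):
fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX

# 2. O_SYNC writes (FUA, when supported):
fio -ioengine=libaio -direct=1 -sync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX

# 3. Disable the volatile write cache, then write without flushes:
hdparm -W 0 /dev/sdX
fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX
</pre>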
'''Seagate Nytro 1351 XA3840LE10063'''

The disk was filled to 90-100 % before the test.

Reads don’t differ between hdparm -W 0 and -W 1.
This limit (the number of simultaneously «opened» blocks) is always sufficient to copy big files to a flash drive formatted in any of the common filesystems: one opened block receives metadata and another receives data, and then it just moves on. But when you start doing random writes you stop hitting the opened blocks, and this is where the lags come in.
 
== Good SSD models ==
 
* Micron 5100/5200
* Seagate Nytro 1351/1551
* HGST SN260
* Intel P4500
 
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc
 
== Conclusion ==
 
Quick guide for optimizing Ceph for random reads/writes (a combined ceph.conf sketch follows the list):
 
* Only use SSDs and NVMe with supercaps. A hint: 99 % of desktop SSDs/NVMe don’t have supercaps.
* Disable their cache with hdparm -W 0.
* Disable powersave: set the cpufreq governor to performance, run cpupower idle-set -D 0
* Disable signatures:
*: <tt>cephx_require_signatures = false</tt>
*: <tt>cephx_cluster_require_signatures = false</tt>
*: <tt>cephx_sign_messages = false</tt>
*: (and use <tt>-o nocephx_require_signatures,nocephx_sign_messages</tt> for rbd map and cephfs kernel mounts)
* For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs)
* At least until Nautilus: <tt>[global] debug objecter = 0/0</tt> (otherwise there is a big client-side slowdown)
* Try to disable rbd cache in the userspace driver (QEMU options cache=none)
* For HDD-only or Bad-SSD-Only and at least until it’s backported — remove the handbrake https://github.com/ceph/ceph/pull/26909
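Put together, the ceph.conf part of this list might look roughly like the sketch below. The bluestore_* option names are an assumption based on recent Bluestore releases: check them against your Ceph version, and remember that the allocation settings only apply to OSDs created after the change.

<pre>
[global]
cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false
# at least until Nautilus (avoids a big client-side slowdown):
debug objecter = 0/0

[osd]
# assumed option names; they only take effect for newly deployed OSDs:
bluestore_min_alloc_size_ssd = 4096
bluestore_prefer_deferred_size_ssd = 0
</pre>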