Ceph performance

# Ceph isn’t slow on HDDs: theoretical single-thread random write performance of Bluestore is 66 % (2/3) of your drive’s IOPS (currently it’s 33 % in practice, but if you push this handbrake down: https://github.com/ceph/ceph/pull/26909 it goes back to 66 %), and multi-threaded read/write performance is about 100 % of the raw drive speed.
'''However''', the naive expectation is that if you replace your HDDs with SSDs and use a fast network, Ceph should become almost as fast as the raw hardware. Everyone is used to the idea that I/O is slow and software is fast. And this is generally NOT true with Ceph.
Ceph is a Software-Defined Storage system, and its «software» is a significant overhead. The general rule currently is: with Ceph it’s hard to achieve random read latencies below 0.5ms and random write latencies below 1ms, '''no matter what drives or network you use'''. This stands for only 2000 iops of random read and 1000 iops of random write with one thread, and if you manage to achieve this result you’re already in a good shape. With BIS hardware and some tuning you may be able to improve it further, but only by a factor of two or so.
But does latency matter? Yes, it does, when it comes to single-threaded (synchronous) random reads or writes. Basically, everything that wants the data to be durable does fsync()'s, which serializes writes. For example, all DBMS’s do. So to understand the performance limit of these applications you should benchmark your cluster with iodepth=1.
The latency doesn’t scale with the number of servers or OSDs-per-SSD or with two-RBD-in-RAID0. When you’re benchmarking your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (a PG is a triplet or a pair of OSDs). The result is only affected by how fast 1 OSD responds to 1 request. In fact, with iodepth=1, IOPS = 1/latency (in seconds). There is Nick Fisk’s presentation titled «Low-latency Ceph». By «low-latency» he means 0.7ms, which is only ~1500 iops.
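For example, a single-thread write latency test of an RBD image could look like this (a sketch: «testpool» and «testimg» are placeholder names, and fio has to be built with the rbd engine):

 fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -pool=testpool -rbdname=testimg

With iodepth=1 the reported IOPS is simply 1/latency, so ~0.5ms of latency shows up as ~2000 iops.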
Another issue is that '''all writes in Ceph are transactional''', even ones that aren’t specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent [[#RAID WRITE HOLE]]-like situations.
To make it more clear: this means that Ceph '''does not use any drive write buffers'''. It does quite the opposite: it flushes all drive buffers after each write. It doesn’t mean that there’s no write buffering at all, there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.
This makes typical desktop SSDs perform absolutely terribly as Ceph journals in terms of write IOPS. The numbers you can expect are somewhere between 100 and 1000 (or 500–2000) iops, while you’d probably like to see at least 10000 (a number that even a Chinese noname SSD can deliver).
So your disks should also be benchmarked with '''-iodepth=1 -fsync=1''' (or '''-sync=1''', see [[#O_SYNC vs fsync vs hdparm -W 0]]).
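A minimal example of such a drive test (a sketch: /dev/sdX is a placeholder, and running it against a raw device destroys the data on it):

 fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX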
== CAPACITORS! ==
The thing that will really help us build our lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs:
[[File:Micron 5100 sata m2.jpg]]
Supercaps work for an SSD like a built-in UPS and allow it to flush the DRAM cache into the persistent (flash) memory when a power loss occurs. Thus the cache becomes effectively «non-volatile», and the SSD can safely ignore fsync (FLUSH CACHE) requests, because it is confident that its cache contents will always make their way to the persistent memory.
And this makes '''transactional write IOPS equal to non-transactional'''.
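A rough way to check this on a given SSD is to run the same random write test with and without fsync (again a sketch: /dev/sdX is a placeholder and the test overwrites data on it). On a drive with supercaps both numbers are close; on a desktop drive the fsync run collapses to hundreds of iops.

 fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX
 fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX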
Supercaps are usually called «enhanced/advanced power loss protection» in the datasheets. This feature is almost exclusively present on «server-grade» SSDs (and not even on all of them). For example, Intel DC S4600 has supercaps and Intel DC S3100 doesn’t.
{{Note}} This is the main difference between server and desktop SSDs. The average user doesn’t need transactions, but servers run DBMS’es, and DBMS’es really, really want them.
And… Ceph also does :) So you should '''only''' buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe: an NVMe without capacitors is WORSE than a SATA drive with them.
Another option is Intel Optane. Intel Optane is also an SSD, but it is based on different physics: Phase-Change Memory («3D XPoint») instead of Flash memory (neither NAND nor NOR). Specs say these drives are capable of 550000 iops with no need to erase blocks, and thus no need for a write cache and supercaps. But even if Optane’s latency is 0.005ms (it is), Ceph’s latency is still at least 0.5ms, so using Optanes with Ceph is close to pointless: for a lot more money ($1500 for 960 GB, $500 for 240 GB) you get roughly the same performance as with usual server SSDs/NVMes.
== Bluestore vs Filestore ==