Ceph performance

* Highly parallel random read and write of small blocks (4-8 KB, iodepth=32-128), measured in IOPS (input/output operations per second)
* Single-threaded transactional random write (4-8 KB, iodepth=1) and read (though single-threaded random reads are rarer), also in IOPS; see the sample fio invocation right after this list
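For reference, the parallel case maps to a fio invocation like the following (a sketch: the pool name «rbd» and image name «testimg» are placeholders for a scratch pool and image whose contents may be destroyed, and the rbd ioengine must be compiled into your fio build):
{{Cmd|1=fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rbd -rbdname=testimg -clientname=admin -runtime=60}}
The single-threaded case is the same command with -iodepth=1.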
 
Single-threaded random reads and writes are where latency matters, and latency doesn’t scale with the number of servers, with multiple OSDs per SSD, or with putting two RBD images in RAID0. Whenever you benchmark your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (a PG is a triplet or a pair of OSDs), so the result is only determined by how fast a single OSD responds to a single request. In fact, with only one parallel request IOPS = 1/latency.
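For example, if each request takes 0.5 ms on average, you get 1/0.0005 = 2000 iops; at 1 ms per request you get 1000 iops, no matter how large the cluster is.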
 
Latency really matters because few applications can do random writes with high parallelism/iodepth. Basically, everything that wants its data to be durable issues fsync() calls, which serialize writes; for example, every DBMS does, because it’s transactional and needs to serialize its writes to the journal. So to understand the performance limit of such applications, benchmark your cluster with iodepth=1.
 
The naive expectation is that Ceph should be almost as fast as the drives and the network, because everyone is used to the idea that I/O is slow and software is fast. With Ceph this only holds until you start using SSDs.
 
The general rule is: Ceph is an SDS (software-defined storage), and its software is a significant overhead. With Ceph it’s hard to achieve random read latencies below 0.5 ms and random write latencies below 1 ms, ''no matter what drives or network you use''. That amounts to only 2000 iops of random reads and 1000 iops of random writes, and even this result is good if you manage to achieve it. With BIS hardware and some tuning you may be able to improve it, but only by about a factor of two.
 
There is Nick Fisk’s presentation titled «Low-latency Ceph». By «low-latency» he means 0.7ms, which is only ~1500 iops.
=== Test your disks ===
* Single-threaded random write latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX}}
You probably want to ask why it’s so slow. See below.
[[File:Warning icon.svg|32px|link=]] A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.
* You can also run it from inside your VMs; the results are usually similar to the above. Just note that the result also depends on the storage driver being used: virtio is the fastest, virtio-scsi is slightly slower, and everything else (like LSI emulation) is terribly slow. Results are also considerably affected by whether the RBD cache is enabled or not (RBD cache turns on automatically with cache=writeback/none). For random reads or writes, disabling the RBD cache is faster.
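For example, the same single-threaded write latency test run from inside a VM could look like this (a sketch: /dev/vdb stands in for a spare virtio data disk, and the test destroys its contents):
{{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/vdb}}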
== Why is it so slow: the latency ==
First of all:
# Ceph isn’t slow for linear reads and writes.
# Ceph isn’t slow on HDDs: the theoretical single-threaded random write performance of Bluestore is 66 % (2/3) of your drive’s IOPS (currently it’s 33 % in practice, but if you push this handbrake down: https://github.com/ceph/ceph/pull/26909 it goes back to 66 %), and multi-threaded read/write performance is about 100 % of the raw drive speed.

[[File:Warning icon.svg|32px|link=]] The bad news starts when you replace your HDDs with SSDs and a fast network and expect Ceph to become almost as fast. As explained above, it doesn’t: Ceph is a Software-Defined Storage system, its «software» is a significant overhead, and the ~0.5 ms read / ~1 ms write latency floor stays in place no matter what hardware you use.

Another issue is that '''all writes in Ceph are transactional''', even ones that aren’t specifically requested to be. A write operation does not complete until it is written into all OSD journals and fsync()'ed to disks. This is done to prevent [[#RAID WRITE HOLE]]-like situations. To make it clear: Ceph '''does not use any drive write buffers'''. It does quite the opposite: it flushes all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But the internal disk write caches aren’t used. This makes typical desktop SSDs perform absolutely terribly as Ceph journals in terms of write IOPS: expect something between 100 and 1000 (or 500—2000) iops, while you’d probably like to see at least 10000, which even a noname Chinese SSD can do. So your disks should also be benchmarked with '''-iodepth=1 -fsync=1''' (or '''-sync=1''', see [[#O_SYNC vs fsync vs hdparm -W 0]]).
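For example, such a journal-style test of a raw drive could look like this (a sketch mirroring the commands above; /dev/sdX is the drive under test and its data will be destroyed):
{{Cmd|1=fio -ioengine=libaio -fsync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX}}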
== CAPACITORS! ==
The thing that will really help us build a lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs:

[[File:Micron 5100 sata m2.jpg]]

Supercaps work like a built-in UPS for the SSD and allow it to flush the DRAM cache into the persistent (flash) memory when a power loss occurs. Thus the cache becomes effectively "non-volatile", and the SSD can safely ignore fsync (FLUSH CACHE) requests, because it’s confident that the cache contents will always make their way to persistent memory. And this makes '''transactional write IOPS equal to non-transactional'''.

Supercaps are usually called "enhanced/advanced power loss protection" in the datasheets. This is a feature almost exclusively present on "server-grade" SSDs (and not even all of them). For example, Intel DC S4600 has supercaps and Intel DC S3100 doesn’t.

{{Note}} This is the main difference between server and desktop SSDs. The average user doesn’t need transactions, but servers run DBMSes, and DBMSes really, really want them. And… so does Ceph :) So you should '''only''' buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe: an NVMe drive without capacitors is WORSE than a SATA drive with them.

Another option is Intel Optane. Optanes are also SSDs, but they are based not on Flash memory (neither NAND nor NOR) but on "3D XPoint" Phase-Change Memory. The spec claims 550000 iops with no need for block erases, cache or capacitors at all. But even if the latency of such a drive is 0.01 ms, Ceph’s latency is still at least 50 times higher, so using Optanes with Ceph is close to pointless: for a lot of money ($1500 for 960 GB, $500 for 240 GB) you don’t get a much better result.

== Bluestore vs Filestore ==
== RAID WRITE HOLE ==
Even though RAID is much simpler than a distributed software-defined storage system like Ceph, it is still a distributed storage system: every system with multiple physical drives is distributed, because each drive behaves and commits the data (or fails to commit it) independently of the others.
Although flash memory allows fast random writes in small blocks (usually 512 to 4096 bytes), its distinctive feature is that every block must be erased before it can be written to. Erasing is slow compared to reading and writing, so manufacturers design memory chips so that they always erase a large group of blocks at once, which takes almost the same time as erasing a single block would. This group of blocks, called an «erase unit», is typically 2-4 megabytes in size. Another distinctive feature is that the total number of erase/program cycles is physically limited: after several thousand cycles (a typical number for MLC memory) the block becomes faulty and stops accepting new writes or even loses the data previously written to it. Denser and cheaper memory chips (MLC/TLC/QLC, 2/3/4 bits per cell) have smaller erase limits, while sparser and more expensive ones (SLC, 1 bit per cell) have bigger limits (up to 100000 rewrites). However, all limits are still finite, so naively overwriting the same block in place would be very slow and would wear the SSD out very rapidly.
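For a sense of scale: with 4 KB blocks and a 2 MB erase unit, a naive in-place overwrite of a single block would force the drive to read, erase and rewrite the whole erase unit, roughly 512 times more data than was actually written, burning one of the few thousand erase cycles of that unit every time.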
But that’s not the case with modern SSDs: even cheap models are very fast and usually quite durable. Why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which makes them as powerful as mobile phone processors. All that power is required to make the FTL firmware run smoothly. FTL stands for «Flash Translation Layer», the firmware responsible for translating addresses of small blocks into physical addresses on the flash memory chips. Every write request is always put into space freed in advance, and the FTL just remembers the new physical location of the data. This makes writes very fast. The FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells; this feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned; for example, Micron 5100 Max has 37.5 % of its physical memory reserved (an extra 60 % on top of the user-visible capacity).
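To check the arithmetic: if 37.5 % of the raw flash is reserved, the user-visible capacity is the remaining 62.5 %, and the reserved part amounts to 0.375 / 0.625 = 60 % of what the user sees.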
And it is also the FTL that makes power loss protection a problem: the mapping tables are metadata which must also be forced into non-volatile memory when the cache is flushed, and that is what makes desktop SSDs slow with fsync… In fact, as I was writing this I thought they could use RocksDB or a similar LSM-tree based system to store the mapping tables, which could make fsyncs fast even without capacitors. It would waste some journal space and add some extra write amplification (as every journal block would only contain one write), but it would still make writes fast. So… either they don’t know about LSM trees, or FTL metadata is not the only problem for fsync.
