Ceph performance

Revision as of 11:32, 24 July 2019

Contents

1 General benchmarking principles
- 1.1 Test your disks
- 1.2 Test your Ceph cluster
2 Why is it so slow
3 CAPACITORS!
4 Bluestore vs Filestore
5 Controllers
6 CPUs
7 Drive cache is slowing you down
8 RAID WRITE HOLE
9 Quick insight into SSD and flash memory organization
- 9.1 A bonus: USB thumb drives

General benchmarking principles

Main test cases for benchmarking are:

Linear read and write (big blocks, big queue) in MB/s
Highly parallel random read and write of small blocks (4-8 KB, iodepth=32-128) in IOPS (input/output operations per second)
Single-threaded transactional random write (4-8 KB, iodepth=1) and read (though single-threaded reads are rarer) in IOPS

Test your disks

Run `fio` on your drives before deploying Ceph:

WARNING! For those under a rock — fio write test is DESTRUCTIVE. Don’t dare to run it on disks which have important data… for example, OSD journals (I’ve seen such cases).

Try to disable the drive cache before testing: hdparm -W 0 /dev/sdX (SATA drives), sdparm --set WCE=0 /dev/sdX (SAS drives). This is usually ABSOLUTELY required for server SSDs like Micron 5100 or Seagate Nytro (see #Drive cache is slowing you down) as it increases random write iops by almost two orders of magnitude (from 288 iops to 18000 iops!). In some cases it may not improve anything, so try both options -W0 and -W1.
Linear read: fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX
Linear write: fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX
Peak parallel random read: fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randread -runtime=60 -filename=/dev/sdX
Single-threaded read latency: fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX
Peak parallel random write: fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX
Journal write latency: fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=write -runtime=60 -filename=/dev/sdX. Also try it with -fsync=1 instead of -sync=1 and write down the worst result, because sometimes one of sync or fsync is ignored by messy hardware.
Single-threaded random write latency: fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX

You wanna ask why it’s so slow? See below.

A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.

Test your Ceph cluster

Recommended benchmarking tools:

The first recommended tool is again `fio` with `-ioengine=rbd -pool=<your pool> -rbdname=<your image>`. All of the above tests valid for raw drives can be repeated for RBD and they mean the same things. Sync, direct and invalidate flags can be omitted, because RBD has no concept of «sync» — all operations are always «sync». And there’s no page cache involved either, so «direct» also doesn’t mean anything.
The second recommended tool, especially useful for hunting performance problems, comes in several improved varieties of «Mark’s bench» from the Russian Ceph chat: https://github.com/rumanzo/ceph-gobench or https://github.com/vitalif/ceph-bench. Both use a non-replicated Ceph pool (size=1), create several 4MB objects (16 by default) in each separate OSD and do random single-thread 4kb writes in randomly selected objects within one OSD. This mimics random writes to RBD and makes it possible to pinpoint problematic OSDs by benchmarking them separately. Original Mark’s bench (outdated) was here: https://github.com/socketpair/ceph-bench
To create the non-replicated benchmark pool use ceph osd pool create bench 128 replicated; ceph osd pool set bench size 1; ceph osd pool set bench min_size 1. Just note that 128 (PG count) should be enough for all OSDs to get at least one PG each.
Do not use `rados bench`. It creates a small number of objects (1-2 per thread), so all of them always reside in cache and inflate the results far beyond what they should be.
You can also use the simple `fio -ioengine=libaio` with a kernel-mounted RBD. However, that requires disabling some features of that RBD, because the kernel client still lacks support for them. Note that regardless of the overhead of moving data in and out of the kernel, the kernel client is actually faster.
And you can also use it from inside your VMs, the results are usually similar to the above. Just note that the result also depends on the storage driver being used. Virtio is the fastest, virtio-scsi is slightly slower and everything else (like LSI emulation) is terribly slow. Results are also considerably affected by whether the RBD cache is enabled or not (RBD cache turns on automatically with cache=writeback/none). For random reads or writes, disabling RBD cache is faster.

Why is it so slow

First of all:

Ceph isn’t slow for linear reads and writes.
Ceph isn’t slow on HDDs: theoretical single-thread random write performance of Bluestore is 66 % (2/3) of your drive’s IOPS (currently it’s 33 % in practice, but if you push this handbrake down: https://github.com/ceph/ceph/pull/26909 it goes back to 66 %), and multi-threaded read/write performance is about 100 % of the raw drive speed.

However, the naive expectation is that once you replace your HDDs with SSDs and use a fast network, Ceph should become proportionally faster. Everyone is used to the idea that I/O is slow and software is fast. And this is generally NOT true with Ceph.

Ceph is a Software-Defined Storage system, and its «software» is a significant overhead. The general rule currently is: with Ceph it’s hard to achieve random read latencies less than 0.5ms and random write latencies less than 1ms, no matter what drives or network you use. This translates to only 2000 iops random read and 1000 iops random write with one thread, and even if you manage to achieve this result you’re already in good shape. With BIS hardware and some tuning you may be able to improve it further, but only by a factor of two or so.

But does latency matter? Yes, it does, when it comes to single-threaded (synchronous) random reads or writes. Basically, everything that wants the data to be durable does fsync()'s which serializes writes. For example, all DBMS’s do. So to understand the performance limit of these apps you should benchmark your cluster with iodepth=1.

The latency doesn’t scale with the number of servers or OSDs-per-SSD or with two-RBD-in-RAID0. When you’re benchmarking your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (PG is a triplet or a pair of OSDs). The result is only affected by how fast 1 OSD is responding to 1 request. In fact, with iodepth=1 IOPS=1/latency. There is Nick Fisk’s presentation titled «Low-latency Ceph». By «low-latency» he means 0.7ms, which is only ~1500 iops.
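The relation IOPS = 1/latency at iodepth=1 can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and the latencies are the ones quoted in the text above.

```python
def iops_at_iodepth_1(latency_ms: float) -> float:
    """With one request in flight, each request must finish before the
    next one starts, so IOPS is simply 1 / latency."""
    return 1000.0 / latency_ms  # 1000 ms in a second


# Latencies quoted in the text (milliseconds).
examples = {
    "Ceph random read, ~0.5 ms": 0.5,
    "Ceph random write, ~1 ms": 1.0,
    "«Low-latency Ceph», 0.7 ms": 0.7,
}

for label, latency_ms in examples.items():
    print(f"{label}: about {iops_at_iodepth_1(latency_ms):.0f} iops")
```

Adding more OSDs or servers does not change these numbers, because the latency of a single request stays the same.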

Another issue is that all writes in Ceph are transactional, even ones that aren’t specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent #RAID WRITE HOLE-like situations.

To make it more clear this means that Ceph does not use any drive write buffers. It does quite the opposite — it clears all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.

This makes typical desktop SSDs perform absolutely terribly as Ceph journals in terms of write IOPS. The numbers you can expect are somewhere between 100 and 1000 (or 500-2000) iops, while you’d probably like to see at least 10000 (which even a no-name Chinese SSD can do).

So your disks should also be benchmarked with -iodepth=1 -fsync=1 (or -sync=1, see #O_SYNC vs fsync vs hdparm -W 0).

CAPACITORS!

The thing that will really help us build our lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs.

Supercaps work for an SSD like a built-in UPS and allow it to flush DRAM cache into the persistent (flash) memory when a power loss occurs. Thus the cache becomes «non-volatile» — and thus an SSD safely ignores fsync (FLUSH CACHE) requests, because it’s confident that cache contents will always make their way to the persistent memory.

And this increases transactional write IOPS, making it equal to non-transactional.

Supercaps are usually called «enhanced/advanced power loss protection» in the datasheets. This is a feature almost exclusively present only in «server-grade» SSDs (not even all of them). For example, Intel DC S4600 has supercaps and Intel DC S3100 doesn’t.

This is the main difference between server and desktop SSDs. The average user doesn’t need transactions, but the servers run DBMS’es, and DBMS’es want them really, really bad.

And… Ceph also does :) You should only buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe: NVMe without capacitors is WORSE than SATA with them. Desktop NVMes do 150000+ write iops without syncs, but only 600-1000 iops with them.

Another option is Intel Optane. Intel Optane is also an SSD, but based on the different physics — Phase-Change Memory instead of Flash memory. Specs say these drives are capable of 550000 iops without the need to erase blocks and thus no need for write cache and supercaps. But even if Optane’s latency is 0.005ms (it is), Ceph’s latency is still 0.5ms, so it’s pointless to use them with Ceph — you get the same performance for a lot more money compared to usual server SSDs/NVMes.

Bluestore vs Filestore

Bluestore is the «new» storage layer of Ceph. All presentations and documents say it’s better in all ways, which in fact seems reasonable for something «new».

Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes: big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.

Bluestore is also more feature-rich: it has checksums, compression, erasure-coded overwrites and virtual clones. Checksums allow 2x-replicated pools to self-heal better, erasure-coded overwrites make EC usable for RBD and CephFS, and virtual clones make VMs run faster after taking a snapshot.

Bluestore uses a lot more RAM, because it uses RocksDB for all metadata, additionally caches some of them by itself and also tries to cache some data blocks to compensate for the lack of page cache usage. The general rule of thumb is 1GB per 1TB of storage, but not less than 2GB in total.

And, surprisingly, there is one thing that may sometimes be worse with Bluestore: random write performance. The issue shows up in two popular setups.

TODO: This section lacks random read performance comparisons.

HDD for data + SSD for journal

Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percentage. This is very convenient because it makes the journal act as a «temporary buffer» that absorbs random write bursts.

Bluestore can’t do the same even when you put its WAL+DB on SSD. It also has a sort of «journal» called the «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush threads. So you can actually increase the maximum number of deferred requests, but after the queue fills up the performance will drop until the OSD is restarted.

So, Bluestore’s performance is very consistent, but it’s worse than peak performance of Filestore for the same hardware. In other words, Bluestore OSD refuses to do random writes faster than the HDD can do on average.

With Filestore you easily get 1000—2000 iops while the journal is not full. With Bluestore you only get 100—300 iops regardless of the SSD journal, but these are absolutely stable over time and never drop.

SSD-only (All-Flash)

In All-Flash clusters Bluestore’s own latency is usually 30-50 % greater than Filestore’s. However, this only refers to the latency of Bluestore itself, so the absolute number for these 30-50 % is something around 0.1ms, which is hard to notice against Ceph’s total latency. And even though latency is greater, peak parallel throughput is usually slightly better (+5..10 %) and peak CPU usage is slightly lower (-5..10 %).

But it’s still a shame that the increase is only 5-10 % for that amount of architectural effort.

HDD-only (or bad-SSD-only)

In these setups Bluestore is also 2x faster than Filestore, because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact it’s OK to say that Bluestore’s deferred write implementation is really optimal for transactional writes on HDDs.

About the sizing of block.db

As usual, there’s something wrong :)

Official documents say that you should allocate 4 % of the slow device space for block.db (Bluestore’s metadata partition). This is a lot; Bluestore rarely needs that amount of space.

But the main problem is that Bluestore uses RocksDB and RocksDB puts a file on the fast device only if it thinks that the whole layer will fit there (RocksDB is organized in files). So, default RocksDB settings in Ceph are:

1 GB WAL = 4x256 MB
max_bytes_for_level_base and max_bytes_for_level_multiplier are default, thus 256 MB and 10, respectively
so L1 = 256 MB
L2 = 2560 MB
L3 = 25600 MB

…so…

RocksDB puts L2 files on block.db only if it’s at least 2560+256+1024 MB (almost 4 GB). And it will put L3 there only if it’s at least 25600+2560+256+1024 MB (almost 30 GB). And L4 only if it’s at least 256000+25600+2560+256+1024 MB (roughly 286 GB).

In other words, all block.db sizes except 4, 30, 286 GB are pointless: Bluestore won’t use anything above the previous «round» size, at least if you don't change RocksDB settings. And of these, 4 is too small and 286 is too big.

So just stick with 30 GB for all Bluestore OSDs :)
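The «round» sizes above are just sums of the WAL and the RocksDB level sizes, so they can be recomputed if you change the settings. The sketch below assumes the defaults quoted in this section (1 GB WAL, 256 MB base, multiplier 10). The function names are made up for illustration, and the rule is simplified to «a level goes to block.db only if it fits there completely, together with the WAL and all smaller levels».

```python
WAL_MB = 4 * 256      # 1 GB WAL = 4 x 256 MB
LEVEL_BASE_MB = 256   # max_bytes_for_level_base
MULTIPLIER = 10       # max_bytes_for_level_multiplier


def level_size_mb(level: int) -> int:
    """Target size of RocksDB level L<level> (L1 = base, L2 = base * 10, ...)."""
    return LEVEL_BASE_MB * MULTIPLIER ** (level - 1)


def min_block_db_mb(deepest_level: int) -> int:
    """Smallest block.db that holds the WAL plus levels L1..L<deepest_level>."""
    return WAL_MB + sum(level_size_mb(n) for n in range(1, deepest_level + 1))


def useful_block_db_mb(partition_mb: int, max_level: int = 6) -> int:
    """Largest «round» size that fits into a partition of the given size.
    Returns 0 if even WAL + L1 does not fit."""
    useful = 0
    for level in range(1, max_level + 1):
        needed = min_block_db_mb(level)
        if needed > partition_mb:
            break
        useful = needed
    return useful


for level in (2, 3, 4):
    print(f"WAL + L1..L{level}: {min_block_db_mb(level)} MB")

# Under this simplified rule, a 100 GB partition is used like a ~30 GB one.
print(useful_block_db_mb(100_000), "MB of a 100000 MB partition are useful")
```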

Controllers

SATA is OK, you don't need SAS at all. SATA is simple and definitely faster than old RAID controllers.
Don't use RAID unless you're absolutely sure you need it. All drives should be connected using the pass-through (HBA) mode.
If your RAID controller can't do passthrough mode, reflash it. If you can't reflash it, throw it away and buy an HBA ("RAID without RAID"), for example, LSI 9300-8i.
If you still can't throw it away, disable all caches and pray :) The problem is that RAID controllers sometimes ignore fsync requests, so Ceph can become corrupted on a sudden power loss. Even some HBAs may do that (namely some Adaptecs).
In theory, you may try to leverage the battery- (or supercap-) backed controller cache in RAID0 mode to improve write latency. However, that can easily turn into shooting yourself in the foot. At least conduct some power-unplug tests if you do that.
IOPS difference between RAID and HBA/SATA may be very noticeable. A bad or old RAID controller can easily become a bottleneck.
HBAs also have IOPS limits. For example, it's ~280000 iops for the whole controller for the LSI 9211-8i.
Always turn on blk-mq for SAS and NVMe - or just use recent kernel versions, blk-mq is on by default since 4.18 or so. However, blk-mq does almost nothing for SATA.

CPUs

Drive cache is slowing you down

RAID WRITE HOLE

Even if we’re talking about RAID, which is much simpler than distributed software-defined storage like Ceph, we’re still talking about a distributed storage system: every system that has multiple physical drives is distributed, because each drive behaves and commits the data (or doesn’t commit it) independently of the others.

Write Hole is the name for several situations in RAID arrays when drives go out of sync. Suppose you have a simple RAID1 array of two disks. You write a sector. You send a write command to both drives. And then a power failure occurs before the commands finish. Now, after the system boots again, you don’t know if your replicas contain the same data, because you don’t know which drive succeeded in writing it and which didn’t.

You say OK, I don’t care. I’ll just read from both drives and if I encounter different data I’ll just pick one of the copies, and I’ll either get the old data or the new.

But then imagine that you have RAID 5. Now you have three drives: two for data and one for parity. Now suppose that you overwrite a sector again. Before writing your disks contain: (A1), (B1) and (A1 XOR B1). You overwrite (B1) with (B2). To do so you write (B2) to the second disk and (A1 XOR B2) to the third. A power failure occurs again… And then, on the next boot, you also find out that disk 1 (one that you didn’t write anything to) is dead. You might think that you can still reconstruct your data because you have RAID 5 and 2 disks out of 3 are still alive.

But imagine that disk 2 succeeded in writing the new data, while disk 3 failed. Now you have: (lost disk), (B2) and (A1 XOR B1). If you try to reconstruct A from these copies you’ll get (A1 XOR B1 XOR B2), which is obviously not equal to A1. Bang! Your RAID5 has corrupted data that you weren’t even writing at the time of the power loss.
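The RAID 5 scenario can be replayed with plain XOR on byte strings. This is a toy sketch: the 4-byte «sectors» and the variable names are made up for illustration, and no real RAID implementation is involved.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized sectors."""
    return bytes(x ^ y for x, y in zip(a, b))


# Initial consistent stripe: disk 1 = A1, disk 2 = B1, disk 3 = A1 XOR B1.
a1 = b"AAAA"
b1 = b"BBBB"
old_parity = xor_bytes(a1, b1)

# Overwriting B1 with B2 requires two writes: B2 to disk 2, new parity to disk 3.
b2 = b"bbbb"
new_parity = xor_bytes(a1, b2)

# If both writes land, a lost disk 1 can still be rebuilt correctly.
assert xor_bytes(b2, new_parity) == a1

# Write hole: disk 2 got B2, disk 3 still holds the old parity, disk 1 is dead.
reconstructed_a = xor_bytes(b2, old_parity)

print("original A1:     ", a1)
print("reconstructed A: ", reconstructed_a)
print("corrupted:       ", reconstructed_a != a1)
```

The reconstructed value equals A1 XOR B1 XOR B2, exactly as described above, even though A1 was never written to.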

Because of this problem, Linux `mdadm` flatly refuses to start an incomplete array after an unclean shutdown. There’s no solution to this problem except full data journaling at the level of each disk drive. And this is… exactly what Ceph does! So, Ceph is actually safer than RAID. :)

Quick insight into SSD and flash memory organization

Although flash memory allows fast random writes in small blocks (usually 512 to 4096 bytes), its distinctive feature is that every block must be erased before being written to. But erasing is slow compared to reading and writing, so manufacturers design memory chips so that they always erase a large group of blocks at once, as this takes almost the same time as erasing one block would. This group of blocks, called an «erase unit», is typically 2-4 megabytes in size. Another distinctive feature is that the total number of erase/program cycles is physically limited: after several thousand cycles (a usual number for MLC memory) the block becomes faulty and stops accepting new writes or even loses the data previously written to it. Denser and cheaper (MLC/TLC/QLC, 2/3/4 bits per cell) memory chips have smaller erase limits, while sparser and more expensive ones (SLC, 1 bit per cell) have bigger limits (up to 100000 rewrites). However, all limits are still finite, so naively overwriting the same block in place would be very slow and would wear out the SSD very rapidly.

But that’s not the case with modern SSDs: even cheap models are very fast and usually very durable. But why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which means they’re as powerful as mobile phones' processors. All that power is required to make FTL firmware run smoothly. FTL stands for «Flash Translation Layer» and it is the firmware responsible for translating addresses of small blocks into physical addresses on flash memory chips. Every write request is always put into a space freed in advance, and FTL just remembers the new physical location of the data. This makes writes very fast. FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells. This feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned, for example, Micron 5100 Max has 37.5 % of physical memory reserved (that is, an extra 60 % on top of the user-visible capacity).
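The «write into space freed in advance and remember the new location» idea can be sketched as a toy mapping table. This is only an illustration: `ToyFTL` and all its fields are hypothetical, and real FTL firmware also does garbage collection, wear leveling and metadata persistence, none of which is modelled here.

```python
class ToyFTL:
    """Toy flash translation layer: logical block -> physical page mapping."""

    def __init__(self, physical_pages: int):
        self.free_pages = list(range(physical_pages))  # pages erased in advance
        self.mapping = {}    # logical block address -> physical page
        self.flash = {}      # physical page -> stored data
        self.stale = set()   # pages holding outdated data, waiting for erase

    def write(self, lba: int, data: bytes) -> int:
        """Never overwrite in place: take a pre-erased page and remap."""
        if not self.free_pages:
            raise RuntimeError("no pre-erased pages left; a real FTL would garbage-collect here")
        page = self.free_pages.pop(0)
        old_page = self.mapping.get(lba)
        if old_page is not None:
            self.stale.add(old_page)  # old copy is now garbage
        self.flash[page] = data
        self.mapping[lba] = page
        return page

    def read(self, lba: int) -> bytes:
        return self.flash[self.mapping[lba]]


ftl = ToyFTL(physical_pages=8)
# Overwrite the same logical block three times.
pages_used = [ftl.write(0, f"v{i}".encode()) for i in range(3)]
print("physical pages used:", pages_used)
print("current data:", ftl.read(0))
print("stale pages:", sorted(ftl.stale))
```

Each overwrite of the same logical block lands on a different physical page, which is why the host sees fast writes while no page is erased on the write path.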

It is also the FTL that makes power loss protection a problem. Mapping tables are metadata which must also be forced into non-volatile memory when you flush the cache, and that’s what makes desktop SSDs slow with fsync… In fact, as I wrote this I thought that they could use RocksDB or a similar LSM-tree based system to store mapping tables, and that could make fsyncs fast even without the capacitors. It would lead to some waste of journal space and some extra write amplification (as every journal block would only contain 1 write), but still it would make writes fast. So… either they don’t know about LSM trees or the FTL metadata is not the only problem for fsync.

When I tried to lecture someone in the mailing list about «all SSDs doing fsyncs correctly» I got this as the reply: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf. Long story short, it says that in 2013 a common scenario was SSDs not syncing metadata on fsync calls at all which led to all kinds of funny things on a power loss, up to (!!!) total failures of some SSDs.

There also exist some very old SSDs without capacitors (OCZ Vector/Vertex) which are capable of very large sync iops numbers. How do they work? Nobody knows, but I suspect that they just don’t do safe writes :). The core principle of flash memory overwrites hasn’t changed in recent years, and SSDs were based on FTLs back then just as they are now.

So it seems there are two kinds of «power loss protection»: simple PLP means «we do fsyncs and don’t die or lose your data when a power loss occurs», and advanced PLP means that fsync’ed writes are just as fast as non-fsynced. It also seems that in the current years (2018—2019) simple PLP is already a standard and most SSDs don’t lose data on power failure.

A bonus: USB thumb drives

Why are USB flash drives so slow then? In terms of small random writes they usually only deliver 2-3 operations per second, while being powered by similar flash memory chips — maybe slightly cheaper and worse ones, but obviously not 1000 times worse.

The answer also lies in the FTL. Thumb drives also have an FTL and even some Wear Leveling, but their controller is very small and dumb compared to SSD controllers: it has a slow CPU and only a little memory. Thus it doesn’t have room to store a full mapping table for small blocks, and it translates the positions of big blocks (1-2 megabytes or even bigger) instead. Writes are buffered and then flushed one block at a time; there is a small limit on the number of blocks that can be buffered at once. The limit is usually only between 3 and 6 blocks.

This limit is always sufficient to copy big files to a flash drive formatted in any of common filesystems. One opened block receives metadata and another receives data, then it just moves on. But if you start doing random writes you stop hitting the opened blocks and this is where lags come in.
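The effect of the small open-block limit can be shown with a toy model that counts how many big blocks have to be flushed for sequential versus random 4 KB writes. Everything here is an assumption made for illustration: the 1 MB block size, the limit of 4 open blocks, the least-recently-used eviction and the 1 GB device size are not taken from any real thumb drive firmware.

```python
import random
from collections import OrderedDict


def count_block_flushes(offsets, block_size, open_limit):
    """Count full-block flushes for a stream of write offsets.
    A write to an already open block is cheap; opening a new block when
    the limit is reached forces the oldest open block to be flushed."""
    open_blocks = OrderedDict()
    flushes = 0
    for offset in offsets:
        block = offset // block_size
        if block in open_blocks:
            open_blocks.move_to_end(block)  # still open, just touch it
            continue
        if len(open_blocks) >= open_limit:
            open_blocks.popitem(last=False)  # evict the least recently used
            flushes += 1
        open_blocks[block] = True
    return flushes + len(open_blocks)  # remaining open blocks get flushed too


BLOCK_SIZE = 1024 * 1024      # assumed 1 MB translation block
DEVICE_SIZE = 1024 ** 3       # assumed 1 GB device
WRITE_SIZE = 4096
WRITES = 1024

sequential = [i * WRITE_SIZE for i in range(WRITES)]
rng = random.Random(0)        # fixed seed so the run is repeatable
scattered = [rng.randrange(DEVICE_SIZE // WRITE_SIZE) * WRITE_SIZE for _ in range(WRITES)]

seq_flushes = count_block_flushes(sequential, BLOCK_SIZE, open_limit=4)
rnd_flushes = count_block_flushes(scattered, BLOCK_SIZE, open_limit=4)
print("sequential 4 KB writes:", seq_flushes, "block flushes")
print("random 4 KB writes:    ", rnd_flushes, "block flushes")
```

In this model almost every random 4 KB write costs a whole-block flush, while the same amount of sequential data needs only a handful of them.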
