[[File:Ceph-funnel-en.svg|500px|right]]
[[ru:Производительность Ceph]]

Ceph is a Software-Defined Storage system. It is very feature-rich: it provides object storage, VM disk storage, a shared cluster filesystem and a lot of additional features. In some ways it is even unique.

It could be an excellent solution which you could take for free, immediately solve all your problems, become a cloud provider and earn piles of money. However, there is a subtle problem: PERFORMANCE. Rational people rarely want to lose 95 % of their performance in production. It seems cloud providers like AWS, GCP, Yandex don’t care — all of them run their clouds on top of their own crafted SDSes (not even Ceph) and all these SDSes are just as slow. :-) We don’t judge them, of course; that’s their own business.

This article describes which performance numbers you can achieve with Ceph and how. But I warn you: you won’t catch up with local SSDs. Local SSDs (especially NVMe) are REALLY fast right now, with latency around 0.05 ms. It’s very hard for an SDS to achieve the same result, and beating it is almost impossible. The network alone eats up those 0.05 ms...

'''UPDATE: It’s possible to achieve good latency with an SDS. I did it in my own project — Vitastor: https://vitastor.io :-) It’s a block SDS architecturally similar to Ceph, but FAST. It achieved 0.14 ms latency (both read and write) in a cluster with SATA SSDs, while Ceph only achieved 1 ms for writes and 0.57 ms for reads on the same hardware. See the [https://yourcmc.ru/git/vitalif/vitastor/src/branch/master/README.md README] for details.'''
== General benchmarking principles ==
Main test cases for benchmarking are:
* Linear read and write (big blocks, big long queue) in MB/s
* Highly parallel random read and write of small blocks (4-8kb, iodepth=32-128) in IOPS (Input/Output ops per second)
* Single-threaded transactional random write (4-8kb, iodepth=1) and read (though single-threaded reads are more rare) in IOPS
=== Test your disks ===
[https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc SSD Bench Google Docs]
Run `fio` on your drives before deploying Ceph:
{{Box|[[File:Warning icon.svg|32px|link=]] {{red|WARNING!}} For those living under a rock — the fio write test is DESTRUCTIVE. Don’t dare to run it on disks which contain important data… for example, on OSD journals (I’ve seen such cases).}}
* Try to disable drive cache before testing: {{Cmd|hdparm -W 0 /dev/sdX}} (SATA drives), {{Cmd|1=sdparm --set WCE=0 /dev/sdX}} (SAS drives). This is usually ABSOLUTELY required for server SSDs like Micron 5100 or Seagate Nytro (see [[#Drive cache is slowing you down]]), because it increases random write iops ''by more than two orders of magnitude'' (from 288 iops to 18000 iops!). In some cases it may not improve anything, so try both options: -W0 and -W1.
* Linear read: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX}}
* Linear write: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX}}
* Peak parallel random read: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randread -runtime=60 -filename=/dev/sdX}}
* Single-threaded read latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX}}
* Peak parallel random write: {{Cmd|1=fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/sdX}}
* Journal write latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=write -runtime=60 -filename=/dev/sdX}}. Also try it with <tt>-fsync=1</tt> instead of <tt>-sync=1</tt> and write down the worst result, because sometimes one of sync or fsync is ignored by messy hardware.
* Single-threaded random write latency: {{Cmd|1=fio -ioengine=libaio -sync=1 -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/sdX}}
[[File:Warning icon.svg|32px|link=]] A useful habit is to leave an empty partition for later benchmarking on each SSD you deploy Ceph OSDs on, because some SSDs tend to slow down when filled.
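For convenience, all of the above tests can be wrapped into a small script. This is only a minimal sketch, assuming bash and fio are installed and that the device passed as the first argument may be safely destroyed:
<pre>
#!/bin/bash
# Minimal sketch: run the fio tests listed above against one drive and print IOPS/bandwidth.
# WARNING: the write tests are DESTRUCTIVE, the device must not contain any needed data.
# Consider disabling the drive write cache first (hdparm -W 0 / sdparm --set WCE=0, see above).
DEV=${1:?usage: $0 /dev/sdX}

run() { echo "=== $1 ==="; shift; fio -ioengine=libaio -direct=1 -invalidate=1 -name=test \
        -runtime=60 -filename="$DEV" "$@" | grep -E 'IOPS|BW='; }

run "linear read"                 -bs=4M -iodepth=32  -rw=read
run "linear write"                -bs=4M -iodepth=32  -rw=write
run "peak random read"            -bs=4k -iodepth=128 -rw=randread
run "single-thread read latency"  -bs=4k -iodepth=1   -rw=randread
run "peak random write"           -bs=4k -iodepth=128 -rw=randwrite
run "journal write latency"       -bs=4k -iodepth=1 -fsync=1 -rw=write
run "single-thread write latency" -bs=4k -iodepth=1 -fsync=1 -rw=randwrite
</pre>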
==== Lyrical digression ====
Why use this approach in benchmarking? After all, disk performance depends on many parameters, such as:
* Block size;
* Mode — read, write, or various mixed read/write modes;
* Parallelism — queue depth and the number of threads, in other words, the number of parallel I/O requests;
* Test duration;
* Initial disk state — empty, filled linearly, filled randomly, randomly written over a specific period of time;
* Data distribution — for example, 10% of hot data and 90% of cold data or hot data located in a certain place (e.g., at the beginning of the disk);
* Other mixed test modes, e.g., benchmarking using different block sizes at the same time.
The results can also be presented with varying levels of detail — you can provide graphs, histograms, percentiles, and so on in addition to mere average operation count or megabytes per second. This, of course, can reveal more information about the behavior of the disk under test.
Benchmarking also contains a bit of philosophy. For example, some manufacturers of server SSDs argue that you must do preconditioning by randomly overwriting the disk at least twice to fill translation tables before testing. I rather believe that it puts the SSD in unrealistically bad conditions rarely seen in real life.
Others say you should plot a graph of latency against the number of operations per second, but my opinion is that it’s also a bit strange because it implies that you plot a graph of F1(q) against F2(q) instead of «q» itself.
In short, benchmarking can be a never-ending process. It can take quite a few days to get a complete view. This is usually what resources like 3dnews do in their SSD reviews. But we don’t want to waste several days. We need a test that allows us to estimate performance quickly.
Therefore we isolate a few «extreme» modes, check the disk in them and pretend that other results are somewhere between these «extreme points», forming some kind of a smooth function depending on the parameters. It’s also handy that each of these modes also corresponds to a valid use case:
* Applications that mainly use linear or large-block access. For such applications, the crucial characteristic is the linear I/O speed in megabytes per second. Therefore, the first test mode is linear read/write with 4 MB blocks and medium queue depth — 16-32 operations. Test results should be in MB/s.
* Applications that use random small-block access and support parallelism. This leads us to 4 KB random I/O modes with a large queue depth — at least 128 operations. 4 KB is the standard block size for most filesystems and DBMSs. Multiple (2-4-8) CPU threads should be used if a single thread can’t saturate the drive during the test. Test results should include iops (I/O operations per second), but not latency. Latency is meaningless in this test because it can be increased arbitrarily just by increasing the queue depth — latency is directly related to iops by the formula latency = queue depth / iops. For example, at queue depth 128 and 20000 iops the average latency will be 128/20000 s = 6.4 ms, regardless of how fast the drive actually is.
* Applications that use random small-block access and DO NOT support parallelism. There are more such applications than you might think; regarding writes, all transactional DBMSs are a notable example. This leads us to 4 KB random I/O test with queue depth of 1 and, for writes, with an fsync after each operation to prevent the disk or storage system from «cheating» by writing the data into a volatile cache. Results should include either iops or latency, but not both because, as already said, they directly relate to each other.
=== Test your Ceph cluster ===
Recommended benchmarking tools:
* The first recommended tool is again `fio`, this time with `-ioengine=rbd`. Run the following:
*# fio -ioengine=rbd -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -pool=rpool_hdd -runtime=60 -rbdname=testimg
*# fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg
*# fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -pool=rpool_hdd -runtime=60 -rbdname=testimg
*: (replace rpool_hdd and testimg with your pool and image names). Then repeat with rw=read/randread. The idea is to test a) the best possible latency, b) linear bandwidth, c) random access iops.
*: Reading from an empty RBD image is very fast :) so pre-fill it before testing.
*: Run the tests from the node(s) where your actual RBD users will reside. The results are usually slightly better when you run the tests from a separate physical server.
* The same from inside a VM or through the kernel RBD driver (krbd):
*# fio -ioengine=libaio -direct=1 -name=test -bs=4M -iodepth=16 -rw=write -runtime=60 -filename=/dev/rbdX
*# fio -ioengine=libaio -direct=1 -sync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -filename=/dev/rbdX
*# fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=128 -rw=randwrite -runtime=60 -filename=/dev/rbdX
*: Don’t miss the added -sync=1 option. It is added on purpose, to match the ioengine=rbd test: ioengine=rbd has no concept of «sync» — all operations are always «sync» with it. And there’s no page cache involved either, so «direct» doesn’t mean anything for it. Overall this write pattern — transactional single-threaded write — corresponds to a DBMS.
*: Note that regardless of the supposed overhead of moving data in and out of the kernel, the kernel client is actually faster.
* ceph-gobench
*: https://github.com/rumanzov/ceph-gobench or https://github.com/vitalif/ceph-bench. The original idea comes from «Mark’s bench» from the russian Ceph chat ([https://github.com/socketpair/ceph-bench the original, now outdated, tool was here]). Both create several 4 MB objects (16 by default) on each separate OSD in a non-replicated (size=1) Ceph pool and do random single-thread 4 kb writes into randomly selected objects within one OSD. This mimics random writes to RBD and allows to find problematic OSDs by benchmarking them separately.
*: To create the non-replicated benchmark pool use {{Cmd|ceph osd pool create bench 128 replicated; ceph osd pool set bench size 1; ceph osd pool set bench min_size 1}}. Just note that 128 (PG count) should be enough for all OSDs to get at least one PG each.
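As noted above, reading from an empty RBD image is unrealistically fast, so create and pre-fill the test image before running the read tests. A minimal sketch, assuming the placeholder pool and image names (rpool_hdd, testimg) from the examples above; the size is arbitrary and the syntax may differ slightly between Ceph releases:

{{Cmd|1=rbd create -s 10G rpool_hdd/testimg}}
{{Cmd|1=fio -ioengine=rbd -direct=1 -name=fill -bs=4M -iodepth=16 -rw=write -pool=rpool_hdd -rbdname=testimg}}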
* S3 (rgw):
** [https://github.com/intel-cloud/cosbench cosbench]
** [https://github.com/markhpc/hsbench hsbench]
** [https://github.com/minio/warp minio warp]

Notes:
* Never use dd to test disk performance.
* Don’t use `rados bench`. It creates a small number of objects (1-2 per thread), so all of them always reside in cache and the results come out far better than they should be.
* You can also use the simple `rbd bench`, but fio is better.

=== Test your network ===

* ping -f (flood ping).
* sockperf. On the first node, run <tt>sockperf sr -i IP --tcp</tt>. On the second, run <tt>sockperf pp -i SERVER_IP --tcp -m 4096</tt>. A decent average number is around 0.05-0.07 ms.
* <s>qperf. On the first node, just run <tt>qperf</tt>. On the second, <tt>qperf -vvs SERVER_IP tcp_lat -m 4096</tt>.</s> Don’t use qperf. It is super-stupid: it doesn’t disable Nagle (no TCP_NODELAY) and it doesn’t honor the <tt>-m 4096</tt> parameter — message size is always set to 1 BYTE in latency tests.

[[File:Warning icon.svg|32px|link=]] Warning: Ubuntu has AppArmor enabled by default and it affects network latency adversely. Disable it if you want good performance. The effect of AppArmor looks like the following (Intel X520-DA2):
* centos 3.10: rtt min/avg/max/mdev = 0.039/0.053/0.132/0.012 ms
* ubuntu 4.x + apparmor: rtt min/avg/max/mdev = 0.068/0.163/0.230/0.029 ms
* ubuntu 4.x: rtt min/avg/max/mdev = 0.037/0.071/0.157/0.018 ms
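A minimal sketch of disabling AppArmor on Ubuntu (an assumption on my part; check the exact procedure for your release, and note that a reboot or <tt>aa-teardown</tt> may be needed to unload already loaded profiles):

{{Cmd|systemctl stop apparmor}}
{{Cmd|systemctl disable apparmor}}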
== Why is it so slow ==
# Ceph isn’t slow for linear reads and writes.
# Ceph isn’t slow on HDDs: the theoretical single-thread random write performance of Bluestore is 66 % (2/3) of your drive’s IOPS (currently it’s 33 % in practice, but if you push this handbrake down: https://github.com/ceph/ceph/pull/26909 it goes back to 66 %), and multi-threaded read/write performance reaches almost 100 % of the raw drive speed.
'''However''', the naive expectation is that if you replace your HDDs with SSDs and use a fast network, Ceph should become almost as fast as the underlying hardware. Everyone is used to the idea that I/O is slow and software is fast. And this is generally NOT true with Ceph.
Ceph is a Software-Defined Storage system, and its «software» is a significant overhead. The general rule currently is: with Ceph it’s hard to achieve random read latencies below 0.5 ms and random write latencies below 1 ms, '''no matter what drives or network you use'''. With one thread, this stands for only 2000 random read iops and 1000 random write iops, and even if you manage to achieve this result you’re already in good shape. With best-in-slot hardware and some tuning you may be able to improve it further, but only by a factor of two or so.
But does latency matter? Yes, it does, when it comes to single-threaded (synchronous) random reads or writes. Basically, all software that wants the data to be durable does fsync() calls, which serialize writes. For example, all DBMSs do. So to understand the performance limit of these applications you should benchmark your cluster with iodepth=1.
Latency doesn’t scale with the number of servers, or with OSDs-per-SSD, or with two-RBD-in-RAID0. When you benchmark your cluster with iodepth=1 you’re benchmarking only ONE placement group at a time (a PG is a triplet or a pair of OSDs). The result is only affected by how fast a single OSD processes a single request. In fact, with iodepth=1, IOPS = 1/latency. There is Nick Fisk’s presentation titled «Low-latency Ceph»; by «low latency» he means 0.7 ms, which is only ~1500 iops.
== CAPACITORS! ==
One important thing to note is that '''all writes in Ceph are transactional''', even ones that aren’t specifically requested to be. It means that write operations do not complete until they are written into all OSD journals and fsync()'ed to disks. This is to prevent [[#RAID WRITE HOLE]]-like situations.
To make it clearer: this means that Ceph '''does not use any drive write buffers'''. It does quite the opposite — it clears all buffers after each write. It doesn’t mean that there’s no write buffering at all — there is some on the client side (RBD cache, Linux page cache inside VMs). But internal disk write buffers aren’t used.
This makes typical desktop SSDs perform absolutely terribly as a Ceph journal in terms of write IOPS. The numbers you can expect are somewhere between 100 and 1000 (or 500—2000) iops, while you’d probably like to see at least 10000 (even a noname Chinese SSD can do 10000 iops without fsync).
So your disks should also be benchmarked with '''-iodepth=1 -fsync=1''' (or '''-sync=1''', see [[#O_SYNC vs fsync vs hdparm -W 0]]).
The thing that will really help us build our lightning-fast Ceph cluster is an SSD with (super)capacitors, which are perfectly visible to the naked eye on M.2 SSDs:
[[File:Micron 5100 sata m2.jpg]]
Supercaps work like a built-in UPS for an SSD and allow it to flush the DRAM cache into persistent (flash) memory when a power loss occurs. Thus the cache becomes effectively «non-volatile», and thus the SSD can safely ignore fsync (FLUSH CACHE) requests, because it’s confident that the cache contents will always make their way to persistent memory.
And this increases '''transactional write IOPS, making it equal to non-transactional'''.
Supercaps are usually called «enhanced/advanced power loss protection» in datasheets. This feature is almost exclusively found in «server-grade» SSDs (and not even in all of them). For example, Intel DC S4600 has supercaps and Intel DC S3100 doesn’t.
{{Note}} This is the main difference between server and desktop SSDs. An average user doesn’t need transactions, but servers run DBMSs, and DBMSs want them really, really bad.
And… Ceph wants them too. :) So you should '''only''' buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe — an NVMe without capacitors is WORSE than a SATA drive with them. Desktop NVMes do 150000+ write iops without syncs, but only 600—1000 iops with them.
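You can check this behaviour on any drive yourself by comparing the same single-threaded random write test with and without fsync. A minimal sketch (destructive, /dev/sdX is a placeholder); if the two numbers are close, the cache is non-volatile (or the drive simply ignores flushes, which is dangerous):

{{Cmd|1=fio -ioengine=libaio -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=30 -filename=/dev/sdX}}
{{Cmd|1=fio -ioengine=libaio -direct=1 -fsync=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=30 -filename=/dev/sdX}}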
Another option is Intel Optane. Optane drives are also SSDs, but based on different physics — Phase-Change Memory instead of flash. Specs say these drives deliver up to 550000 iops; PCM has no need to erase blocks, and thus no need for a write cache and supercaps. But even if Optane’s latency is 0.005 ms (it is), Ceph’s latency is still 0.5 ms, so it’s pointless to use them with Ceph — you get the same performance for a lot more money compared to usual server SSDs/NVMes.
== Bluestore vs Filestore ==
Bluestore is the «new» storage layer of Ceph. All presentations and documents say it’s better in all ways, which indeed seems reasonable for something «new». Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes — big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.
Bluestore is also more feature-rich: it has checksums, compression, erasure-coded overwrites and virtual clones. Checksums allow 2x-replicated pools to self-heal better, erasure-coded overwrites make EC usable for RBD and CephFS, and virtual clones make VMs run faster after taking a snapshot.
In HDD-only (or bad-SSD-only) setups Bluestore is also 2x faster than Filestore for random writes. This is again because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact, it’s fair to say that Bluestore’s deferred write implementation is really optimal for transactional writes to slow drives.
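A minimal sketch of enabling that option, assuming a release where bluefs_preextend_wal_files exists and the monitor-based config store is available (Nautilus and later); on older versions put it into the [osd] section of ceph.conf instead:

{{Cmd|1=ceph config set osd bluefs_preextend_wal_files true}}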
=== HDD for data + SSD for journal ===
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percentage. This is very convenient because it makes the journal act as a «temporary buffer» that absorbs random write bursts.
Bluestore can’t do the same even when you put its WAL+DB on an SSD. It also has a sort of «journal» called the «deferred write queue», but it’s very small (only 64 requests) and it lacks any kind of background flush thread. You can increase the maximum number of deferred requests, but after the queue fills up the performance drops until the OSD is restarted.
So, Bluestore’s performance is very consistent, but it’s worse than peak performance of Filestore for the same hardware. In other words, Bluestore OSD refuses to do random writes faster than the HDD can do them on average.
With Filestore you easily get 1000—2000 iops while the journal is not full. With Bluestore you only get 100—300 iops regardless of the SSD journal, but these are absolutely stable over time and never drop.
In All-Flash clusters Bluestore’s own latency is usually 30-50 % greater than Filestore’s. However, this only refers to the latency of Bluestore itself, so in absolute terms these 30-50 % amount to something around 0.1 ms, which is hard to notice against Ceph’s total latency. And even though the latency is greater, peak parallel throughput is usually slightly better (+5..10 %) and peak CPU usage is slightly lower (-5..10 %).
But it’s still a shame that the increase is only 5-10 % for that amount of architectural effort.
=== RAM usage ===
Bluestore uses a lot more RAM than Filestore, because it stores all metadata in RocksDB, additionally caches some of it by itself, and also tries to cache some data blocks to compensate for the lack of page cache usage. The general rule of thumb is 1 GB of RAM per 1 TB of storage, but not less than 2 GB in total.
== RAID WRITE HOLE ==
Even if we’re talking about RAID, the thing that is much simpler than distributed software-defined storage like Ceph, we’re still talking about a distributed storage system — every system that has multiple physical drives is distributed, because each drive behaves and commits the data (or doesn’t commit it) independently of others.
Write Hole is the name for several situations in RAID arrays where drives go out of sync. Suppose you have a simple RAID1 array of two disks. You write a sector. You send the write command to both drives. And then a power failure occurs before the commands finish. Now, after the system boots again, you don’t know whether your replicas contain the same data, because you don’t know which drive succeeded in writing it and which didn’t.
You say OK, I don’t care. I’ll just read from both drives and if I encounter different data I’ll just pick one of the copies, and I’ll either get the old data or the new.
But then imagine that you have RAID 5. Now you have three drives: two for data and one for parity. Now suppose that you overwrite a sector again. Before the write, your disks contain: (A1), (B1) and (A1 XOR B1). You want to overwrite (B1) with (B2). To do so you write (B2) to the second disk and (A1 XOR B2) to the third. A power failure occurs again… and then, at the next boot, you also find out that disk 1 (the one you didn’t write anything to) is dead. You might think that you can still reconstruct your data because you have RAID 5 and 2 disks out of 3 are still alive.
But imagine that disk 2 succeeded in writing the new data while disk 3 failed (or vice versa). Now you have: (lost disk), (B2) and (A1 XOR B1). If you try to reconstruct A from these copies you’ll get (A1 XOR B1 XOR B2), which is obviously not equal to A1. Bang! Your RAID5 has corrupted data that you weren’t even writing at the time of the power loss.
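A quick numeric illustration of the above, as plain shell arithmetic with arbitrarily chosen single-byte «sectors»:
<pre>
# A1=0x05, B1=0x03, so the parity disk holds A1 XOR B1 = 0x06.
# Disk 2 gets the new B2=0x08, the parity update is lost, disk 1 dies.
A1=0x05; B1=0x03; B2=0x08
P=$(( A1 ^ B1 ))                                    # old parity, still on disk 3
printf 'reconstructed A = 0x%02x\n' $(( P ^ B2 ))   # prints 0x0e, not the original 0x05
</pre>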
Because of this problem, Linux `mdadm` refuses to start an incomplete array after an unclean shutdown at all. There’s no solution to this problem except full data journaling at the level of each disk drive. And this is… exactly what Ceph does! So Ceph is actually safer than RAID. Slower, but safer :)
== Quick insight into SSD and flash memory organization ==
Writing to flash memory in small random blocks ought to be slow, because flash can only be erased in large groups of blocks before being rewritten. But that’s not the case with modern SSDs — even cheap models are very fast and usually very durable. Why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which means they’re as powerful as mobile phones’ processors. All that power is required to make the FTL firmware run smoothly. FTL stands for «Flash Translation Layer»; it is the firmware responsible for translating addresses of small blocks into physical addresses on the flash memory chips. Every write request is always put into space freed in advance, and the FTL just remembers the new physical location of the data. This makes writes very fast. The FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells. This feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned; for example, the Micron 5100 Max has 37.5 % of its physical memory reserved (an extra 60 % on top of the user-visible capacity).
And it is also the FTL that makes power loss protection a problem. Mapping tables are metadata which must also be flushed to non-volatile memory when you flush the cache, and this is what makes desktop SSDs slow with fsync… In fact, as I was writing this I thought that they could use RocksDB or a similar LSM-tree based system to store the mapping tables, and that could make fsyncs fast even without capacitors. It would lead to some waste of journal space and some extra write amplification (as every journal block would only contain 1 write), but it would still make writes fast. So… either they don’t know about LSM trees, or FTL metadata is not the only problem for fsync.
When I tried to lecture someone on the mailing list about «all SSDs doing fsyncs correctly», I got this as the reply: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf. Long story short, it says that in 2013 a common scenario was SSDs not syncing metadata on fsync calls at all, which led to all kinds of funny things on a power loss, up to (!!!) total failures of some SSDs.
There also exist some very old SSDs without capacitors (OCZ Vector/Vertex) which are capable of very high sync iops numbers. How do they work? Nobody knows, but I suspect that they just don’t do safe writes :). The core principle of flash memory overwrites hasn’t changed in recent years: 5 years ago SSDs were also based on FTLs, just as they are now.
So it seems there are two kinds of «power loss protection»: simple PLP means «we do fsyncs and don’t die or lose your data when a power loss occurs», and advanced PLP means that fsync’ed writes are just as fast as non-fsynced ones. It also seems that nowadays (2018—2019) simple PLP is already the standard and most SSDs don’t lose data on power failure.
Why are USB flash drives so slow then? In terms of small random writes they usually only deliver 2-3 operations per second, while being powered by similar flash memory chips — maybe slightly cheaper and worse ones, but obviously not 1000 times worse.
The answer also lies in the FTL. Thumb drives also have an FTL and they even have some Wear Leveling, but it’s very small and dumb compared to SSD FTLs. It has a slow CPU and only a little memory. Thus it doesn’t have room to store a full mapping table for small blocks, so it translates the positions of big blocks (1-2 megabytes or even bigger) instead. Writes are buffered and then flushed one block at a time; there is a small limit on the number of blocks that can be open at once, usually only between 3 and 6.
This limit is always sufficient for copying big files to a flash drive formatted in any common filesystem. One opened block receives the metadata and another receives the data, then it just moves on. But if you start doing random writes, you stop hitting the opened blocks, and this is where the lags come in.
== Bonus: Micron vSAN reference architecture ==
[https://media-www.micron.com/-/media/client/global/documents/products/other-documents/micron_vsan_6,-d-,7_on_x86_smc_reference_architecture.pdf Micron Accelerated All-Flash SATA vSAN 6.7 Solution]
Node configuration:
* 384 GB RAM 2667 MHz
* 2X Micron 5100 MAX 960 GB (randread: 93k iops, randwrite: 74k iops)
* 8X Micron 5200 ECO 3.84TB (randread: 95k iops, randwrite: 17k iops)
* 2x Xeon Gold 6142 (16c 2.6GHz)
* Mellanox ConnectX-4 Lx
* Connected to 2x Mellanox SN2410 25GbE switches
«Aligns with VMWare AF-6, aims up to 50K read iops per node»
* 2 replicas (like Ceph size=2)
* 4 nodes
* 4 VMs on each node
* 8 vmdk per VM
* 4 threads per vmdk
Total I/O parallelism: 512
100%/70%/50%/30%/0% write
* «Baseline» (fits in cache): 121k/178k/249k/314k/486k iops
* «Capacity» (doesn’t): 51k/66k/90k/134k/363k
* Latency is 1000*512/IOPS ms in all tests (1000ms * parallelism / iops)
* '''No latency tests with low parallelism'''
* '''No linear read/write tests'''
Conclusion:
* ~3800 write iops per drive
* ~11343 read iops per drive
* ~1600 write iops per drive when not in cache
* Parallel workload doesn’t look better than Ceph. vSAN is hyperconverged, though.
== Good SSD models ==
* Micron 5100/5200, 9300. Maybe 5300, 7300 too
* Seagate Nytro 1351/1551
* HGST SN260
* Intel P4500
https://docs.google.com/spreadsheets/d/1E9-eXjzsKboiCCX-0u0r5fAjjufLKayaut_FOPxYZjc
== Conclusion ==
Quick guide for optimizing Ceph for random reads/writes:
* Only use SSDs and NVMe with supercaps. A hint: 99 % of desktop SSDs/NVMe don’t have supercaps.
* Disable their cache with hdparm -W 0.
* Disable powersave: governor=performance, cpupower idle-set -D 0
* Disable signatures:
*: <tt>cephx_require_signatures = false</tt>
*: <tt>cephx_cluster_require_signatures = false</tt>
*: <tt>cephx_sign_messages = false</tt>
*: (and use <tt>-o nocephx_require_signatures,nocephx_sign_messages</tt> for rbd map and cephfs kernel mounts)
* For good SSDs and NVMes: set min_alloc_size=4096, prefer_deferred_size_ssd=0 (BEFORE deploying OSDs)
* At least until Nautilus: <tt>[global] debug objecter = 0/0</tt> (there is a big client-side slowdown)
* Try to disable rbd cache in the userspace driver (QEMU options cache=none)
* <s>For HDD-only or Bad-SSD-Only and at least until it’s backported (it is) — remove the handbrake https://github.com/ceph/ceph/pull/26909</s>
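A consolidated ceph.conf sketch of the settings above. This is only a sketch: the bluestore_* names are my assumption about the full option names behind the short ones used above, availability and defaults vary between releases, and the allocation-size options only take effect for newly created OSDs:
<pre>
[global]
debug_objecter = 0/0
cephx_require_signatures = false
cephx_cluster_require_signatures = false
cephx_sign_messages = false

[osd]
# assumed full names of «min_alloc_size» and «prefer_deferred_size_ssd» mentioned above
bluestore_min_alloc_size_ssd = 4096
bluestore_prefer_deferred_size_ssd = 0
</pre>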