Ceph performance

* High CPU requirements are one of the arguments NOT to use Ceph in a «hyperconverged setup», i.e. a setup in which storage and compute nodes are combined.
* You can also disable all hardware vulnerability mitigations: <tt>noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier</tt> (or just <tt>mitigations=off</tt> for newer kernels)
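A minimal sketch of how these flags can be applied, assuming a Debian/Ubuntu-style GRUB setup (on other distros the generated config path and the regeneration command differ):

<pre>
# add the flags to the kernel command line in /etc/default/grub
GRUB_CMDLINE_LINUX="quiet mitigations=off"

# regenerate the bootloader config and reboot
update-grub    # on RHEL-like systems: grub2-mkconfig -o /boot/grub2/grub.cfg
reboot

# verify after reboot
cat /proc/cmdline
</pre>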
 
== VM setup and filesystem options ==
 
* The default qemu options for RBD are bad.
* Bad means that a) a slow emulated LSI controller is used, and b) a cache mode is used which caches reads, but not writes.
* The drive cache in qemu is controlled by the <tt>cache</tt> option (surprise). It can be missing (not set), writethrough, writeback, none, unsafe or directsync. With RBD this option also affects the rbd cache, which is the cache on the side of Ceph’s client library (librbd).
* But cache=unsafe doesn’t work with RBD: it still waits for write confirmations. And writethrough, missing and directsync are basically equivalent.
* The RBD cache helps a lot on HDDs, but on all-flash clusters it slows everything down. Parts of it are implemented with locks, parts are single-threaded; people are trying to optimize it, but the work isn’t finished yet.
* There are the following drive emulation options: lsi (slowest), virtio-scsi (fast), virtio (fastest, but it can’t do TRIM until QEMU 4.0). virtio-scsi can use multiple queues and thus should be the fastest with fast underlying storage (a local NVMe?), but with Ceph it doesn’t seem to matter.
* The filesystem also slows things down! Specifically, it updates the inode mtime on each small write if you don’t have lazytime enabled. mtime is part of the metadata, so the change is journaled, which makes the <tt>fio -sync=1 -iodepth=1 -direct=1</tt> test result 3-4 times worse when you run it over a file in a filesystem (see the sketch after this list).
* If you’re so unlucky that you run Oracle in your Ceph VMs, then it’s crucial to set <tt>FILESYSTEMIO_OPTIONS=SETALL</tt>. I/O will be terribly slow if you don’t.
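A quick way to see the effect, assuming an ext4 filesystem mounted at <tt>/mnt/test</tt> (the mount point and file name are placeholders, the fio options are the same as above):

<pre>
# single-threaded sync write test over a file in the filesystem
fio -name=test -ioengine=libaio -bs=4k -rw=randwrite \
    -sync=1 -direct=1 -iodepth=1 -size=1G -runtime=30 \
    -filename=/mnt/test/fio.tmp

# remount the same filesystem with lazytime and repeat the test:
# mtime updates are now kept in memory instead of hitting the journal
mount -o remount,lazytime /mnt/test
fio -name=test -ioengine=libaio -bs=4k -rw=randwrite \
    -sync=1 -direct=1 -iodepth=1 -size=1G -runtime=30 \
    -filename=/mnt/test/fio.tmp

# to make lazytime permanent, add it to the mount options in /etc/fstab
</pre>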
 
So…
* For HDD / SSD+HDD clusters it is recommended to use qemu cache=writeback. This mode is safe, because guest fsyncs make qemu flush the RBD cache. That is, guests don’t lose journaled data.
* For SSD-only clusters it’s best to disable the cache entirely (cache=none). It usually increases the maximum parallel iops roughly 2 times.
* The best emulation driver is virtio. Now go find the way to set it in your VM GUI (Proxmox, OpenNebula) :). OpenNebula, for example, has quite a perverted way of changing the emulation driver. See the command-line sketch after this list.
* Try to use lazytime everywhere. It requires a decently recent kernel: at least 4.0 for ext4 and at least 4.17 for XFS. For XFS, it also seems that a recent version of util-linux is required.
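A command-line sketch of these recommendations, assuming a raw RBD image <tt>rbd:rbd/vm-disk</tt> (the pool and image names are placeholders; VM GUIs ultimately generate something equivalent):

<pre>
# SSD-only cluster: virtio emulation, cache disabled
# (the rest of the VM options are omitted here)
qemu-system-x86_64 -m 4096 \
    -drive format=raw,file=rbd:rbd/vm-disk,cache=none,if=virtio

# HDD / SSD+HDD cluster: the same, but with the writeback cache
qemu-system-x86_64 -m 4096 \
    -drive format=raw,file=rbd:rbd/vm-disk,cache=writeback,if=virtio
</pre>

In libvirt the same settings correspond to the cache attribute of the disk driver element and bus='virtio' on the target device.
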
== Network, DPDK and SPDK ==
