Ceph performance

== Bluestore vs Filestore ==
Bluestore is the «new» storage layer of Ceph. All presentations and documents say it's better in all ways, which in fact seems reasonable for something «new». Is it correct?
Yes, Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes: big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.

Bluestore is also more feature-rich: it has checksums, compression, erasure-coded overwrites and virtual clones. Checksums allow 2x-replicated pools to self-heal better, erasure-coded overwrites make EC usable for RBD and CephFS, and virtual clones make VMs run faster after taking a snapshot.
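A simple way to check the linear-write claim on your own cluster is a big-block fio run. This is a minimal sketch, not an exact benchmarking recipe: the RBD device path is a placeholder, and the run destroys data on the target device.

 # linear write test with 4MB blocks; /dev/rbd0 is a placeholder for a TEST image
 fio -ioengine=libaio -direct=1 -invalidate=1 -name=lin-write -bs=4M -iodepth=16 -rw=write -runtime=60 -filename=/dev/rbd0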
Bluestore also uses a lot more RAM, because it uses RocksDB for all metadata, additionally caches some of them by itself and also tries to cache some data blocks to compensate for the lack of page cache usage. The general rule of thumb is 1GB of RAM per 1TB of storage, but not less than 2GB in total.
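If you want to pin the memory usage explicitly, newer Ceph releases have the osd_memory_target option. A hedged ceph.conf sketch applying the rule of thumb to a hypothetical 8TB HDD OSD:

 [osd]
 # rule of thumb: ~1GB of RAM per 1TB of storage, but not less than 2GB;
 # example value for an 8TB drive (8 GiB, specified in bytes)
 osd_memory_target = 8589934592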
And, surprisingly, there is one thing that may sometimes be worse with Bluestore: random write performance. The issue shows up in two popular setups.
=== HDD for data + SSD for journal ===
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient, because the journal acts as a «temporary buffer» that absorbs random write bursts.

Bluestore can't do the same even when you put its WAL+DB on an SSD. It also has a sort of «journal» called the «deferred write queue», but it's very small (only 64 requests) and it lacks any kind of background flush threads. You actually can increase the maximum number of deferred requests, but after the queue fills up the performance drops until the OSD is restarted.

So Bluestore's performance is very consistent, but it's worse than the peak performance of Filestore on the same hardware. In other words, a Bluestore OSD refuses to do random writes faster than the HDD can do on average. With Filestore you can easily get 1000-2000 iops (iodepth=1) while the journal is not full; with Bluestore you only get 100-300 iops for a single HDD regardless of the SSD journal, but these numbers are absolutely stable over time and never drop (see the fio sketch at the end of this chapter).

=== SSD-only (All-Flash) ===
In All-Flash clusters Bluestore's own latency is usually 30-50 % greater than Filestore's. However, this only refers to the latency of Bluestore itself, so the absolute value of these 30-50 % is something around 0.1 ms, which is hard to notice in front of the total latency of Ceph. And even though the latency is greater, the total peak parallel throughput is usually slightly better (+5..10 %) and the total peak CPU usage is slightly lower (-5..10 %). But it's still a shame that the improvement is only 5-10 % for that amount of architectural effort.

=== HDD-only (or bad-SSD-only) ===
In these setups Bluestore is also 2x faster than Filestore, because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact, it's fair to say that Bluestore's deferred write implementation is really optimal for transactional writes on HDDs.
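To make the options mentioned in this chapter concrete, here is a hedged ceph.conf sketch. bluefs_preextend_wal_files only does something when the patch above is applied, and bluestore_max_deferred_txc is my assumption for the name of the 64-request deferred queue limit; verify both against the config reference of your exact release:

 [osd]
 # only meaningful with https://github.com/ceph/ceph/pull/26909 applied
 bluefs_preextend_wal_files = true
 # writes up to this size are deferred (journaled) on HDD OSDs; default 32768
 bluestore_prefer_deferred_size_hdd = 32768
 # assumed name of the deferred queue depth (the «64 requests» above);
 # raising it doesn't really help: the queue still has no background flush
 bluestore_max_deferred_txc = 64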
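And the fio sketch promised in the «HDD for data + SSD for journal» part: a single-threaded random write test. Again, the device path is a placeholder and the run is destructive; keep it running for several minutes, because Filestore looks great only until its journal fills up.

 # 4KB random writes at iodepth=1; /dev/rbd0 is a placeholder for a TEST image
 fio -ioengine=libaio -direct=1 -invalidate=1 -name=rand-write -bs=4k -iodepth=1 -rw=randwrite -runtime=300 -filename=/dev/rbd0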
== RAID WRITE HOLE ==