Ceph performance

== Bluestore vs Filestore ==
Bluestore is the «new» storage layer of Ceph. All presentations and documents say it's better in all ways, which in fact seems reasonable for something «new». Is it correct?
Yes, Bluestore is really 2x faster than Filestore for linear write workloads, because it has no double-writes: big blocks are written only once, not twice as in Filestore. Filestore journals everything, so all writes first go to the journal and then get copied to the main device.

Bluestore is also more feature-rich: it has checksums, compression, erasure-coded overwrites and virtual clones. Checksums allow 2x-replicated pools to self-heal better, erasure-coded overwrites make EC usable for RBD and CephFS, and virtual clones make VMs run faster after taking a snapshot.
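A simple way to check the linear-write claim on your own cluster is a big-block fio run. This is a minimal sketch, not an exact benchmarking recipe: the RBD device path is a placeholder, and the run destroys data on the target device.

 # linear write test with 4MB blocks; /dev/rbd0 is a placeholder for a TEST image
 fio -ioengine=libaio -direct=1 -invalidate=1 -name=lin-write -bs=4M -iodepth=16 -rw=write -runtime=60 -filename=/dev/rbd0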
Bluestore also uses a lot more RAM, because it uses RocksDB for all metadata, additionally caches some of them by itself and also tries to cache some data blocks to compensate for the lack of page cache usage. The general rule of thumb is 1GB of RAM per 1TB of storage, but not less than 2GB in total.
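If you want to pin the memory usage explicitly, newer Ceph releases have the osd_memory_target option. A hedged ceph.conf sketch applying the rule of thumb to a hypothetical 8TB HDD OSD:

 [osd]
 # rule of thumb: ~1GB of RAM per 1TB of storage, but not less than 2GB;
 # example value for an 8TB drive (8 GiB, specified in bytes)
 osd_memory_target = 8589934592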
And, surprisingly, there is one thing that may sometimes be worse with Bluestore: random write performance. The issue shows up in two popular setups.
=== HDD for data + SSD for journal ===
Filestore writes everything to the journal and only starts to flush it to the data device when the journal fills up to the configured percent. This is very convenient, because the journal acts as a «temporary buffer» that absorbs random write bursts.

Bluestore can't do the same even when you put its WAL+DB on an SSD. It also has a sort of «journal» called the «deferred write queue», but it's very small (only 64 requests) and it lacks any kind of background flush threads. You actually can increase the maximum number of deferred requests, but after the queue fills up the performance drops until the OSD is restarted.

So Bluestore's performance is very consistent, but it's worse than the peak performance of Filestore on the same hardware. In other words, a Bluestore OSD refuses to do random writes faster than the HDD can do on average. With Filestore you can easily get 1000-2000 iops (iodepth=1) while the journal is not full; with Bluestore you only get 100-300 iops for a single HDD regardless of the SSD journal, but these numbers are absolutely stable over time and never drop (see the fio sketch at the end of this chapter).

=== SSD-only (All-Flash) ===
In All-Flash clusters Bluestore's own latency is usually 30-50 % greater than Filestore's. However, this only refers to the latency of Bluestore itself, so the absolute value of these 30-50 % is something around 0.1 ms, which is hard to notice in front of the total latency of Ceph. And even though the latency is greater, the total peak parallel throughput is usually slightly better (+5..10 %) and the total peak CPU usage is slightly lower (-5..10 %). But it's still a shame that the improvement is only 5-10 % for that amount of architectural effort.

=== HDD-only (or bad-SSD-only) ===
In these setups Bluestore is also 2x faster than Filestore, because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact, it's fair to say that Bluestore's deferred write implementation is really optimal for transactional writes on HDDs.
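To make the options mentioned in this chapter concrete, here is a hedged ceph.conf sketch. bluefs_preextend_wal_files only does something when the patch above is applied, and bluestore_max_deferred_txc is my assumption for the name of the 64-request deferred queue limit; verify both against the config reference of your exact release:

 [osd]
 # only meaningful with https://github.com/ceph/ceph/pull/26909 applied
 bluefs_preextend_wal_files = true
 # writes up to this size are deferred (journaled) on HDD OSDs; default 32768
 bluestore_prefer_deferred_size_hdd = 32768
 # assumed name of the deferred queue depth (the «64 requests» above);
 # raising it doesn't really help: the queue still has no background flush
 bluestore_max_deferred_txc = 64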
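And the fio sketch promised in the «HDD for data + SSD for journal» part: a single-threaded random write test. Again, the device path is a placeholder and the run is destructive; keep it running for several minutes, because Filestore looks great only until its journal fills up.

 # 4KB random writes at iodepth=1; /dev/rbd0 is a placeholder for a TEST image
 fio -ioengine=libaio -direct=1 -invalidate=1 -name=rand-write -bs=4k -iodepth=1 -rw=randwrite -runtime=300 -filename=/dev/rbd0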
== RAID WRITE HOLE ==