Ceph performance

{{Note}} This is the main difference between server and desktop SSDs. The average user doesn’t need transactions, but servers run DBMSes, and DBMSes want them really, really bad.
And… Ceph also does :) You should '''only''' buy SSDs with supercaps for Ceph clusters. Even if you consider NVMe — an NVMe without capacitors is WORSE than a SATA SSD with them. Desktop NVMes do 150000+ write iops without syncs, but only 600—1000 iops with them.
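If you want to check these numbers on your own drive, a test along the following lines works (a sketch, not from this article; /dev/sdX is a placeholder and the test overwrites the device, so point it at a disposable disk):

<pre>
# Peak random write iops, no cache flushes (what desktop spec sheets advertise):
fio --name=nosync --ioengine=libaio --direct=1 --bs=4k --iodepth=128 \
    --rw=randwrite --runtime=60 --filename=/dev/sdX

# Transactional ("journal-style") iops: queue depth 1, one flush per write.
# This is the number that matters for Ceph journals and databases:
fio --name=sync --ioengine=libaio --direct=1 --sync=1 --fsync=1 --bs=4k \
    --iodepth=1 --rw=randwrite --runtime=60 --filename=/dev/sdX
</pre>

A drive with supercaps should show roughly the same iops in both runs; a desktop drive will collapse in the second one.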
Another option is Intel Optane. Intel Optane is also an SSD, but based on different physics — Phase-Change Memory instead of Flash memory. The specs say these drives are capable of 550000 iops, without the need to erase blocks and thus without a write cache or supercaps. But even though Optane’s latency is 0.005ms (it really is), Ceph’s latency is still around 0.5ms, so it’s pointless to use them with Ceph — you get the same performance for a lot more money compared to usual server SSDs/NVMes.
And, surprisingly, there is one thing that may sometimes be worse with Bluestore: random write performance. The issue shows up in two popular setups.
 
TODO: This section lacks random read performance comparisons.
=== HDD for data + SSD for journal ===
In All-Flash clusters Bluestore’s own latency is usually 30-50 % greater than Filestore’s. However, this only refers to the latency of Bluestore itself, so the absolute difference behind those 30-50 % is around 0.1ms, which is hard to notice compared to Ceph’s total latency. And even though the latency is greater, peak parallel throughput is usually slightly better (+5..10 %) and peak CPU usage is slightly lower (-5..10 %).
But it’s still a shame that the increase is only 5-10 % for that amount of architectural effort.
=== HDD-only (or bad-SSD-only) ===
In these setups Bluestore is also 2x faster than Filestore, because it can do 1 commit per write, at least if you apply this patch: https://github.com/ceph/ceph/pull/26909 and turn bluefs_preextend_wal_files on. In fact it’s OK to say that Bluestore’s deferred write implementation is really optimal for transactional writes on HDDs.
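A sketch of how this could be enabled on a recent cluster, assuming your release already contains the patch (verify the option name and behaviour against your Ceph version first):

<pre>
# Enable WAL file preallocation for all OSDs (takes effect after an OSD restart):
ceph config set osd bluefs_preextend_wal_files true

# Check the resulting value on a specific OSD:
ceph config get osd.0 bluefs_preextend_wal_files
</pre>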
=== About the sizing of block.db ===

As usual, there’s something wrong : ) Official documents say that you should allocate 4 % of the slow device space for block.db (Bluestore’s metadata partition). This is a lot, Bluestore rarely needs that amount of space.

But the main problem is that Bluestore uses RocksDB, and RocksDB puts a file on the fast device only if it thinks that the whole layer will fit there (RocksDB is organized in files). The default RocksDB settings in Ceph are:

* 1 GB WAL = 4x256 MB
* max_bytes_for_level_base and max_bytes_for_level_multiplier are left at their defaults, thus 256 MB and 10, respectively
* so L1 = 256 MB
* L2 = 2560 MB
* L3 = 25600 MB

…so… RocksDB puts L2 files to block.db only if it’s at least 2560+256+1024 MB (almost 4 GB). It will put L3 there only if it’s at least 25600+2560+256+1024 MB (almost 30 GB). And L4 - only if it’s at least 256000+25600+2560+256+1024 MB (roughly 286 GB).

In other words, all block.db sizes except 4, 30 and 286 GB are pointless, Bluestore won’t use anything above the previous «round» size. At least if you don’t change the RocksDB settings. And 286 GB is a lot, Bluestore rarely needs that much. If you only use RBD, choose 30 GB. If you run something like RGW with a lot of small objects -
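A quick sanity check of those thresholds as a shell arithmetic sketch, assuming the default settings listed above:

<pre>
# Useful block.db sizes = WAL + all RocksDB levels that must fit, with
# WAL = 1024 MB, L1 = 256 MB and level multiplier = 10:
WAL=1024; L1=256
echo "L2 fits: $(( WAL + L1 + L1*10 )) MB"                     # 3840 MB, almost 4 GB
echo "L3 fits: $(( WAL + L1 + L1*10 + L1*100 )) MB"            # 29440 MB, almost 30 GB
echo "L4 fits: $(( WAL + L1 + L1*10 + L1*100 + L1*1000 )) MB"  # 285440 MB, roughly 286 GB
</pre>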
== RAID WRITE HOLE ==