=== Micron setup example ===
Here’s an example setup from Micron. They used 2x replication, very costly CPUs (2x Xeon Gold per server), a very fast network (2x100G, in fact 2x2x100G: 2 cards with 2 ports each) and 10x their best NVMes in each of 4 nodes: https://www.micron.com/resource-details/30c00464-e089-479c-8469-5ecb02cfe06f
They only got 350000 peak write iops with high parallelism and 100 % CPU load. It may seem a lot, but if you divide it by the number of NVMes (350000 / 40 drives) it’s only 8750 iops per NVMe. If we account for 2 replicas and the WAL double-write we get 8750*2*2 = 35000 raw write iops per drive. So, Ceph only squeezed 35000 iops out of an NVMe '''that can deliver 260000 iops alone''', roughly 13 % of its rated write performance. That’s what Ceph’s overhead is.
Also, there are no single-thread latency tests in that PDF, although such tests would be very interesting.
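For reference, a single-thread (iodepth=1) latency test can be run against an RBD image with fio like this; the pool name <tt>rbd</tt> and image name <tt>testimg</tt> are just placeholders:

 fio -ioengine=rbd -direct=1 -name=test -bs=4k -iodepth=1 -rw=randwrite -runtime=60 -pool=rbd -rbdname=testimg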
 
=== Update ===
 
https://www.micron.com/-/media/client/global/documents/products/other-documents/micron_9300_and_red_hat_ceph_reference_architecture.pdf
 
The new NVMes are Micron 9300 of the maximum possible capacity, 12.8 TB. Each of them delivers even more write iops: 310k instead of 260k. Everything else stays the same. The new write performance result for 100 RBD clients is 477029 iops. Remember that it’s still only about 4770 iops per client, though. With 10 RBD clients they get 294000 iops, i.e. 29400 iops per client, which is of course better.
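Applying the same per-drive arithmetic as in the previous test (4 nodes x 10 NVMes = 40 drives, 2 replicas, WAL+data write):

 477029 / 40 ≈ 11900 client write iops per NVMe
 11900 * 2 * 2 ≈ 47700 raw write iops per NVMe, out of the 310000 a single drive can deliver alone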
 
What helped the performance? I guess the configuration did. Compared to the previous test, they changed the following (a consolidated ceph.conf sketch is given after the list):
* disabled messenger checksums (ms_crc_data=false) and bluestore checksums (bluestore_csum_type=none)
* tuned rocksdb: <tt>bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB</tt>, which breaks down as follows:
** 64x32x4 MB memtables (max_write_buffer_number x min_write_buffer_number_to_merge x write_buffer_size) instead of the default 4x1x256 MB. The effect of this change isn’t entirely clear to me: it may slightly reduce CPU load, because sorting a big memtable is slower than sorting a small one, but merging and flushing 32x4 MB memtables is probably not much faster than flushing a single 256 MB one.
** max_bytes_for_level_base is changed dramatically — it’s raised to 6 GB from 256 MB!
** more compaction threads (compaction_threads=32, flusher_threads=8, max_background_compactions=64)
* allocated 14 GB of RAM per OSD
* set osd_max_pg_log_entries = osd_min_pg_log_entries = osd_pg_log_dups_tracked = osd_pg_log_trim_min = 10. However, I’m not sure this helps: it did nothing in my tests.
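For reference, here is a consolidated ceph.conf sketch that approximates the changes listed above. The option names are standard Ceph options; the osd_memory_target line is my assumption about how the 14 GB of RAM per OSD was configured, and none of this is copied verbatim from the PDF:

<pre>
[global]
ms_crc_data = false                # disable messenger checksums

[osd]
bluestore_csum_type = none         # disable bluestore checksums
osd_memory_target = 14G            # assumption: how "14 GB RAM per OSD" was likely set
osd_max_pg_log_entries = 10
osd_min_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=64,min_write_buffer_number_to_merge=32,recycle_log_file_num=64,compaction_style=kCompactionStyleLevel,write_buffer_size=4MB,target_file_size_base=4MB,max_background_compactions=64,level0_file_num_compaction_trigger=64,level0_slowdown_writes_trigger=128,level0_stop_writes_trigger=256,max_bytes_for_level_base=6GB,compaction_threads=32,flusher_threads=8,compaction_readahead_size=2MB
</pre>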
 
Other remarks:
* cephx was already disabled in the previous version of the test. This time they also disabled signatures. However, that seems pointless: disabled cephx doesn’t sign anything anyway (see the sketch after this list).
* they already had debug objecter = 0/0 and the rest of the debug options zeroed out.
* it seems they haven’t tried changing prefer_deferred_size and min_alloc_size.
* the new NVMes definitely didn’t change anything: 260000 iops per drive is already far beyond what Ceph can utilize.
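For illustration, a minimal sketch of the cephx-related options (standard Ceph option names, not copied from the PDF) showing why disabling signatures on top of disabled cephx is a no-op:

<pre>
# cephx authentication disabled (as in both versions of the test):
auth_cluster_required = none
auth_service_required = none
auth_client_required = none
# with cephx off there is nothing to sign, so these change nothing:
cephx_require_signatures = false
cephx_sign_messages = false
</pre>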
== CAPACITORS! ==
