= Ceph performance =

[[Category:VitaliPrivate]]
 
== Benchmarking ==
 
You should mainly test for the following use-cases:
* Linear read and write (big blocks, big queue)
* Highly parallel random read and write of small blocks (4-8kb, iodepth=32-128)
* Single-threaded transactional random write (4-8kb, iodepth=1) and read (though single-threaded reads are rarer); see the fio sketch right after this list
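
A minimal sketch of how these patterns can be scripted with fio (a hedged example, not from the original text: the /dev/rbd0 target, the 60-second runtime and the exact job parameters are assumptions you should adapt):

<pre>
#!/usr/bin/env python3
# Hypothetical helper, not from this article: run the patterns listed above
# with fio. TARGET is a placeholder -- point it at a disposable test volume,
# because the write tests will destroy its contents.
import subprocess

TARGET = "/dev/rbd0"   # assumption: a mapped RBD image used only for testing

JOBS = [
    # (job name, block size, iodepth, fio rw mode)
    ("linear-write",   "4M", 32,  "write"),      # linear write: big blocks, big queue
    ("linear-read",    "4M", 32,  "read"),       # linear read
    ("randwrite-q128", "4k", 128, "randwrite"),  # highly parallel random write
    ("randread-q128",  "4k", 128, "randread"),   # highly parallel random read
    ("randwrite-q1",   "4k", 1,   "randwrite"),  # single-threaded transactional write
    ("randread-q1",    "4k", 1,   "randread"),   # single-threaded read (rarer case)
]

for name, bs, iodepth, rw in JOBS:
    cmd = [
        "fio", "-ioengine=libaio", "-direct=1",
        f"-name={name}", f"-bs={bs}", f"-iodepth={iodepth}", f"-rw={rw}",
        "-runtime=60", f"-filename={TARGET}",
    ]
    if rw == "randwrite" and iodepth == 1:
        cmd.append("-fsync=1")   # flush after every write so the test measures stable-storage latency
    subprocess.run(cmd, check=True)
</pre>
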
 
Single-threaded random reads and writes are where latency matters, and latency doesn't scale with the number of servers. Whenever you benchmark your cluster with iodepth=1, you're benchmarking only ONE placement group at a time (a triplet or pair of OSDs). The result is only affected by how fast a single OSD responds to a single request.
 
The latency really matters because not many applications can do random writes with high parallelism/iodepth. For example, a DBMS can't, because it's transactional and it needs to serialize its writes to the journal.
 
The naive expectation is that Ceph should be almost as fast as the drives and the network, because everyone is used to the idea that I/O is slow and software is fast. With Ceph this holds only as long as you don't use SSDs.
 
The general rule is: with Ceph it's hard to achieve random read latencies below 0.5 ms and random write latencies below 1 ms, no matter what drives or network you use. That amounts to only 2000 iops for random reads and 1000 iops for random writes, and it's already good if you achieve it. With best-in-slot (BIS) hardware and some tuning you may be able to improve it by a factor of two, but not more.
 
There is Nick Fisk's presentation titled «Low-latency Ceph». By «low-latency» he means only 0.7 ms, which is just ~1500 iops.
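
The arithmetic behind these figures is just iops = iodepth / latency; a couple of lines reproduce them:

<pre>
# At a given queue depth, iops is bounded by iodepth / latency.
def max_iops(latency_ms, iodepth=1):
    return iodepth / (latency_ms / 1000.0)

print(max_iops(0.5))   # 2000.0 -- random read at 0.5 ms, iodepth=1
print(max_iops(1.0))   # 1000.0 -- random write at 1 ms, iodepth=1
print(max_iops(0.7))   # ~1428  -- the 0.7 ms from the presentation
</pre>
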
 
== Bluestore vs Filestore ==
== WHY SO SLOW ==
== Quick insight into SSD and flash memory organization ==
Although flash memory allows fast random writes in small blocks (usually 512 to 4096 bytes), its distinctive feature is that every block must be erased before being written to. Erasing is slow compared to reading and writing, so manufacturers design memory chips so that a large group of blocks is always erased at once, as this takes almost the same time as erasing a single block would. This group of blocks, called an «erase unit», is typically 2-4 megabytes in size. Another distinctive feature is that the total number of erase/program cycles is physically limited: after several thousand cycles (a usual number for MLC memory) the block becomes faulty and stops accepting new writes, or even loses the data previously written to it. Denser and cheaper memory (MLC/TLC/QLC, 2/3/4 bits per cell) has smaller erase limits, while sparser and more expensive memory (SLC, 1 bit per cell) has bigger ones (up to 100000 rewrites). However, all limits are still finite, so stupidly overwriting the same block in place would be very slow and would wear the SSD out very rapidly.
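
To see why «stupidly overwriting the same block» would be both slow and destructive, here is a back-of-the-envelope calculation using the typical figures from the paragraph above (4 KB writes, a 2 MB erase unit, a few thousand MLC cycles; the exact numbers are only illustrative):

<pre>
# Naive in-place overwrite: changing one small block would force the drive to
# read, erase and reprogram the whole erase unit containing it.
block = 4 * 1024            # one small logical write, bytes
erase_unit = 2 * 1024**2    # typical erase unit, bytes
cycles = 3000               # illustrative MLC erase/program limit

print(erase_unit // block)             # 512 -- 2 MB of flash work per 4 KB written
print(cycles * block / 1024**2, "MB")  # ~11.7 -- total 4 KB in-place writes one erase unit survives
</pre>
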
But that's not the case with modern SSDs: even cheap models are very fast and usually rather durable. Why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and 1-2 GHz clock frequency, which makes them as powerful as mobile phones' processors. All that power is required to make the FTL firmware run smoothly. FTL stands for «Flash Translation Layer»; it is the firmware responsible for translating addresses of small blocks into physical addresses on the flash memory chips. Every write request is always put into space freed in advance, and the FTL just remembers the new physical location of the data. This makes writes very fast. The FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells; this feature is called wear leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned: for example, the Micron 5100 MAX has 37.5% of its physical memory reserved (an extra 60% on top of the user-visible capacity).
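
A toy model of the write path described above (grossly simplified and purely illustrative; real firmware also does garbage collection, error correction and power-loss handling):

<pre>
# Toy FTL: small logical blocks are remapped to pre-erased physical pages, so
# an overwrite never erases anything in the hot path -- it just goes to a page
# freed in advance and the mapping table is updated.
class ToyFTL:
    def __init__(self, physical_pages):
        self.mapping = {}                        # logical page -> physical page
        self.free = list(range(physical_pages))  # pages erased in advance (refilled by GC)
        self.flash = {}                          # physical page -> data

    def write(self, lba, data):
        page = self.free.pop(0)      # take a pre-erased page
        self.flash[page] = data
        self.mapping[lba] = page     # remember the new physical location
        # the old page (if any) is now garbage; background GC erases whole
        # erase units later and also moves data around for wear leveling

    def read(self, lba):
        return self.flash[self.mapping[lba]]

# Overprovisioning arithmetic from the Micron 5100 MAX example:
reserved = 0.375                          # 37.5% of raw flash kept hidden
print(round(1 / (1 - reserved) - 1, 2))   # 0.6 -> 60% extra raw capacity over what the user sees
</pre>
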
It's also the FTL that makes power loss protection a problem. The mapping tables are metadata which must also be forced into non-volatile memory when you flush the cache, and that's what makes desktop SSDs slow with fsync… In fact, as I was writing this I thought that they could use RocksDB or a similar LSM-tree based system to store the mapping tables, and that could make fsyncs fast even without capacitors. It would waste some journal space and add some extra write amplification (as every journal block would contain only 1 write), but it would still make writes fast. So… either they don't know about LSM trees, or the FTL metadata is not the only problem for fsync.
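
A sketch of that speculation (illustrative only; this is not how any real FTL is known to work): mapping updates get appended to a small sequential journal and folded into the main table later, so an fsync only waits for the append. The cost is the wasted journal space and extra write amplification mentioned above, since each tiny record would occupy a whole flash page.

<pre>
# Illustrative only: LSM-style journaling of FTL mapping updates, the idea
# speculated about above.
PAGE = 4096        # each journal entry would be padded to a whole flash page

journal = []       # recent (logical block -> physical page) updates, appended sequentially
table = {}         # the big mapping table, rewritten only during compaction

def record_update(lba, phys):
    # An fsync only needs this sequential append to reach flash,
    # not a rewrite of the whole mapping table.
    journal.append((lba, phys))

def compact():
    # Fold the journal into the main table and reclaim its space.
    table.update(dict(journal))
    journal.clear()
</pre>
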
