Ceph performance
Quick insight into SSD and flash memory organization
Although flash memory allows fast random writes in small blocks (usually 512 to 4096 bytes), its distinctive feature is that every block must be erased before it can be written to. Erasing is slow compared to reading and writing, so manufacturers design memory chips to always erase a large group of blocks at once, which takes about as long as erasing a single block would. This group of blocks, called an «erase unit», is typically 2-4 megabytes in size. Another distinctive feature is that the total number of erase/program cycles is physically limited: after several thousand cycles (a typical number for MLC memory) a block becomes faulty and stops accepting new writes or even loses the data previously written to it. Denser and cheaper memory chips (MLC/TLC/QLC, 2/3/4 bits per cell) have lower erase limits, while less dense and more expensive ones (SLC, 1 bit per cell) have higher limits (up to 100000 rewrites). All limits are still finite, though, so naively overwriting the same block in place would be very slow and would wear the SSD out very quickly.
But that’s not the case with modern SSDs: they are very fast and usually last a very long time. Even cheap models are usually rather durable. Why? The credit goes to SSD controllers: SSDs contain very smart and powerful controllers, usually with at least 4 cores and a 1-2 GHz clock frequency, which makes them as powerful as mobile phone processors. All that power is needed to run the FTL firmware smoothly. FTL stands for «Flash Translation Layer»; it is the firmware responsible for translating the addresses of small blocks into physical addresses on the flash memory chips. Every write request is put into space freed in advance, and the FTL just remembers the new physical location of the data. This makes writes very fast. The FTL also defragments free space and moves blocks around to achieve uniform wear across all memory cells. This feature is called Wear Leveling. SSDs also usually have some extra physical space reserved to add even more endurance and to make wear leveling easier; this is called overprovisioning. Pricier server SSDs have a lot of space overprovisioned; for example, the Micron 5100 Max has 37.5% of its physical memory reserved (that is, an extra 60% on top of the user-visible capacity is hidden inside, since 0.375 / 0.625 = 0.6).
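To make the remapping idea concrete, here is a minimal sketch in Python. It is not how any real FTL firmware is written: the class names, the 4-page erase unit, the greedy cleanup policy and the "pick the least-worn unit" rule are all assumptions chosen for illustration. It compares a device that overwrites in place (erasing the whole unit every time) with one that appends each write to a pre-erased page and only updates a mapping table.

```python
PAGES_PER_UNIT = 4  # toy value (assumption); real erase units hold hundreds of pages


class NaiveFlash:
    """Overwrites in place, so every overwrite must first erase the whole unit."""

    def __init__(self, units):
        self.erase_counts = [0] * units
        self.data = {}

    def write(self, lba, value):
        if lba in self.data:  # page already programmed: erase its unit first
            self.erase_counts[lba // PAGES_PER_UNIT] += 1
        self.data[lba] = value


class ToyFTL:
    """Appends every write to a pre-erased page and remembers where it went."""

    def __init__(self, units, logical_pages):
        # Spare (overprovisioned) units guarantee that cleanup can make progress.
        assert units >= 3 and logical_pages <= (units - 2) * PAGES_PER_UNIT
        self.logical_pages = logical_pages
        self.l2p = {}                              # logical page -> (unit, page)
        self.valid = [{} for _ in range(units)]    # per unit: page -> (lba, value)
        self.erase_counts = [0] * units
        self.free_units = list(range(1, units))    # erased, ready for writes
        self.active, self.next_page = 0, 0

    def _append(self, lba, value):
        if self.next_page == PAGES_PER_UNIT:       # active unit is full
            self.free_units.sort(key=lambda u: self.erase_counts[u])
            self.active, self.next_page = self.free_units.pop(0), 0
        old = self.l2p.get(lba)
        if old is not None:                        # the old copy becomes garbage
            del self.valid[old[0]][old[1]]
        self.valid[self.active][self.next_page] = (lba, value)
        self.l2p[lba] = (self.active, self.next_page)
        self.next_page += 1

    def _collect_garbage(self):
        in_use = [u for u in range(len(self.valid)) if u not in self.free_units]
        # Greedy choice: fewest valid pages first, then the least-worn unit.
        victim = min(in_use, key=lambda u: (len(self.valid[u]), self.erase_counts[u]))
        for lba, value in list(self.valid[victim].values()):
            self._append(lba, value)               # relocate still-valid pages
        self.erase_counts[victim] += 1             # one erase frees the whole unit
        self.free_units.append(victim)

    def write(self, lba, value):
        assert 0 <= lba < self.logical_pages
        if self.next_page == PAGES_PER_UNIT and len(self.free_units) <= 1:
            self._collect_garbage()
        self._append(lba, value)

    def read(self, lba):
        unit, page = self.l2p[lba]
        return self.valid[unit][page][1]


def hammer_one_page(writes=1000, units=8):
    """Overwrite logical page 0 repeatedly; return the worst per-unit erase counts."""
    naive, ftl = NaiveFlash(units), ToyFTL(units, logical_pages=16)
    for i in range(writes):
        naive.write(0, i)
        ftl.write(0, i)
    assert ftl.read(0) == writes - 1
    return max(naive.erase_counts), max(ftl.erase_counts)
```

Calling `hammer_one_page()` returns a pair `(naive_worst, ftl_worst)`. With 1000 writes the naive device erases the same unit 999 times, while the toy FTL needs far fewer erases in total and spreads them across all units, which is the whole point of remapping plus wear leveling.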
It is also the FTL that makes power loss protection a problem. The mapping tables are metadata that must also be forced into non-volatile memory when you flush the cache, and this is what makes desktop SSDs slow with fsync... In fact, as I wrote this I thought that they could use RocksDB or a similar LSM-tree based system to store the mapping tables, and that could make fsyncs fast even without capacitors. It would lead to some waste of journal space and some extra write amplification (as every journal block would contain only 1 write), but it would still make writes fast. So… either they don’t know about LSM trees or the FTL metadata is not the only problem for fsync.
When I tried to lecture someone on the mailing list about «all SSDs doing fsyncs correctly», I got this as the reply: https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf. Long story short, it says that in 2013 a common scenario was SSDs not syncing metadata on fsync calls at all, which led to all kinds of funny things on power loss, up to (!!!) total failure of some SSDs.
There also exist some very old SSDs without capacitors (OCZ Vector/Vertex) which are capable of very high sync iops numbers. How do they work? Nobody knows, but I suspect that they just don’t do safe writes :). The core principle of flash memory overwrites hasn’t changed in recent years, and SSDs back then were based on FTLs just as they are now.
So it seems there are two kinds of «power loss protection»: simple PLP means «we do fsyncs and don’t die or lose your data when a power loss occurs», and advanced PLP means that fsync’ed writes are just as fast as non-fsync’ed ones. It also seems that in recent years (2018-2019) simple PLP has become standard and most SSDs don’t lose data on power failure.
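The gap between fsync’ed and non-fsync’ed writes can be probed with a few lines of Python. This is only a rough sketch, not a proper benchmark: the function name and parameters are made up for illustration, the non-synced number mostly reflects the OS page cache rather than the drive, and the default temporary directory may live on a RAM-backed filesystem where fsync costs nothing, so point it at a directory on the drive you actually care about.

```python
import os
import tempfile
import time


def writes_per_second(directory, count=200, block_size=4096, sync=False):
    """Write `count` blocks to a temporary file, optionally fsync-ing each one."""
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
            if sync:
                os.fsync(fd)  # ask the OS (and the drive) to make the data durable
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return count / elapsed


if __name__ == "__main__":
    target = tempfile.gettempdir()  # assumption: replace with a path on the tested drive
    print("buffered:", round(writes_per_second(target)))
    print("fsynced: ", round(writes_per_second(target, sync=True)))
```

On a drive with «advanced PLP» one would expect the fsynced figure to stay high, while on a typical desktop SSD it tends to drop sharply.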
A bonus: USB thumb drives
Why are USB flash drives so slow then? In terms of small random writes they usually deliver only 2-3 operations per second, while being built on the same flash memory chips: maybe slightly cheaper and worse ones, but obviously not 1000 times worse.
The answer also lies in the FTL. Thumb drives also have an FTL and even some Wear Leveling, but their controller is very small and dumb compared to SSD controllers. It has a slow CPU and only a little memory. Thus it has no room to store a full mapping table for small blocks, so it translates the positions of big blocks (1-2 megabytes or even bigger) instead. Writes are buffered and then flushed one block at a time, and there is a small limit on the number of blocks that can be buffered at once. The limit is usually only between 3 and 6 blocks.
This limit is always sufficient to copy big files to a flash drive formatted with any of the common filesystems. One open block receives metadata and another receives data, then it just moves on. But if you start doing random writes, you stop hitting the open blocks, and this is where the lags come in.
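The effect of that small open-block limit can be shown with a toy model. The numbers (1 MiB blocks of 4 KiB sectors, 4 open blocks) and the least-recently-used eviction rule are assumptions for illustration only; real thumb drive controllers are not documented at this level. The model just counts how many expensive whole-block flushes a stream of sector writes would cause.

```python
import random
from collections import OrderedDict

SECTORS_PER_BLOCK = 256  # assumed: 1 MiB blocks made of 4 KiB sectors
MAX_OPEN_BLOCKS = 4      # the text above says the limit is usually 3 to 6


def count_block_flushes(sector_numbers):
    """Return how many whole-block flushes a sequence of sector writes causes."""
    open_blocks = OrderedDict()  # block number -> None, ordered by recency of use
    flushes = 0
    for sector in sector_numbers:
        block = sector // SECTORS_PER_BLOCK
        if block in open_blocks:
            open_blocks.move_to_end(block)   # cheap: the write lands in a buffered block
            continue
        if len(open_blocks) == MAX_OPEN_BLOCKS:
            open_blocks.popitem(last=False)  # evict the oldest block: costly rewrite
            flushes += 1
        open_blocks[block] = None
    return flushes + len(open_blocks)        # blocks still open get flushed at the end


def compare(writes=4096, device_blocks=1000, seed=0):
    """Flush counts for the same number of sequential and random sector writes."""
    rng = random.Random(seed)
    sequential = range(writes)
    scattered = [rng.randrange(device_blocks * SECTORS_PER_BLOCK) for _ in range(writes)]
    return count_block_flushes(sequential), count_block_flushes(scattered)
```

In this model 4096 sequential sector writes touch 16 blocks and cost 16 flushes, while the same number of random writes across the device costs close to one flush per write, which matches the drop to a few operations per second described above.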