Cluster-Wide Disk Layout Parameters

These parameters apply to clients and OSDs, are fixed at the moment of OSD drive initialization and can't be changed after it without losing data.

block_size
bitmap_granularity
immediate_commit
client_dirty_limit

block_size

Type: integer
Default: 131072

Size of objects (data blocks) into which all physical and virtual drives are subdivided in Vitastor. One of current main settings in Vitastor, affects memory usage, write amplification and I/O load distribution effectiveness.

Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact, it's possible to use 4 MB for SSD too - it will lower memory usage, but may increase average WA and reduce linear performance.

OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can currently coexist in one etcd instance only within separate Vitastor clusters with different etcd_prefix'es.

Also block size can't be changed after OSD initialization without losing data.

You must always specify block_size in etcd in /vitastor/config/global if you change it so all clients can know about it.

OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly 544 MB per 1 TB of used disk space with the default 128 KB block size.

bitmap_granularity

Type: integer
Default: 4096

Required virtual disk write alignment ("sector size"). Must be a multiple of disk_alignment. It's called bitmap granularity because Vitastor tracks an allocation bitmap for each object containing 2 bits per each (bitmap_granularity) bytes.

This parameter can't be changed after OSD initialization without losing data. Also it's fixed for the whole Vitastor cluster i.e. two different values can't be used in a single Vitastor cluster.

Clients MUST be aware of this parameter value, so put it into etcd key /vitastor/config/global if you change it for any reason.

immediate_commit

Type: string
Default: false

Another parameter which is really important for performance.

Desktop SSDs are very fast (100000+ iops) for simple random writes without cache flush. However, they are really slow (only around 1000 iops) if you try to fsync() each write, that is, when you want to guarantee that each change gets immediately persisted to the physical media.

Server-grade SSDs with "Advanced/Enhanced Power Loss Protection" or with "Supercapacitor-based Power Loss Protection", on the other hand, are equally fast with and without fsync because their cache is protected from sudden power loss by a built-in supercapacitor-based "UPS".

Some software-defined storage systems always fsync each write and thus are really slow when used with desktop SSDs. Vitastor, however, can also efficiently utilize desktop SSDs by postponing fsync until the client calls it explicitly.

This is what this parameter regulates. When it's set to "all" the whole Vitastor cluster commits each change to disks immediately and clients just ignore fsyncs because they know for sure that they're unneeded. This reduces the amount of network roundtrips performed by clients and improves performance. So it's always better to use server grade SSDs with supercapacitors even with Vitastor, especially given that they cost only a bit more than desktop models.

There is also a common SATA SSD (and HDD too!) firmware bug (or feature) that makes server SSDs which have supercapacitors slow with fsync. To check if your SSDs are affected, compare benchmark results from fio -name=test -ioengine=libaio -direct=1 -bs=4k -rw=randwrite -iodepth=1 with and without -fsync=1. Results should be the same. If fsync=1 result is worse you can try to work around this bug by "disabling" drive write-back cache by running hdparm -W 0 /dev/sdXX or echo write through > /sys/block/sdXX/device/scsi_disk/*/cache_type (IMPORTANT: don't mistake it with /sys/block/sdXX/queue/write_cache - it's unsafe to change by hand). The same may apply to newer HDDs with internal SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have it (they have internal SSD cache even though it's not stated in datasheets).

This parameter must be set both in etcd in /vitastor/config/global and in OSD command line or configuration. Setting it to "all" or "small" requires enabling disable_journal_fsync and disable_meta_fsync, setting it to "all" also requires enabling disable_data_fsync.

TLDR: For optimal performance, set immediate_commit to "all" if you only use SSDs with supercapacitor-based power loss protection (nonvolatile write-through cache) for both data and journals in the whole Vitastor cluster. Set it to "small" if you only use such SSDs for journals. Leave empty if your drives have write-back cache.

client_dirty_limit

Type: integer
Default: 33554432

Without immediate_commit=all this parameter sets the limit of "dirty" (not committed by fsync) data allowed by the client before forcing an additional fsync and committing the data. Also note that the client always holds a copy of uncommitted data in memory so this setting also affects RAM usage of clients.

This parameter doesn't affect OSDs themselves.

5.3 KiB Raw Blame History

Cluster-Wide Disk Layout Parameters

block_size

bitmap_granularity

immediate_commit

client_dirty_limit

5.3 KiB

Raw Blame History