Simplified distributed block storage with strong consistency, like in Ceph
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

258 lines
6.7 KiB

8 months ago
[Documentation](../../ → [Configuration](../ → Pool configuration
[Читать на русском](
# Pool configuration
Pool configuration is set in etcd key `/vitastor/config/pools` in the following
JSON format:
"<Numeric ID>": {
"name": "<name>",
...other parameters...
Pool configuration is also affected by:
- [OSD Placement Tree](#placement-tree)
- [Separate OSD settings](#osd-settings)
8 months ago
- [name](#name)
- [scheme](#scheme)
- [pg_size](#pg_size)
- [parity_chunks](#parity_chunks)
- [pg_minsize](#pg_minsize)
- [pg_count](#pg_count)
- [failure_domain](#failure_domain)
- [max_osd_combinations](#max_osd_combinations)
- [pg_stripe_size](#pg_stripe_size)
- [root_node](#root_node)
- [osd_tags](#osd_tags)
- [primary_affinity_tags](#primary_affinity_tags)
- [Replicated Pool](#replicated-pool)
- [Erasure-coded Pool](#erasure-coded-pool)
# Placement Tree
OSD placement tree is set in a separate etcd key `/vitastor/config/node_placement`
in the following JSON format:
"<node name or OSD number>": {
"level": "<level>",
"parent": "<parent node name, if any>"
Here, if a node name is a number then it is assumed to refer to an OSD.
Level of the OSD is always "osd" and cannot be overriden. You may only
override parent node of the OSD which is its host by default.
Non-numeric node names refer to other placement tree nodes like hosts, racks,
datacenters and so on.
Hosts of all OSDs are auto-created in the tree with level "host" and name
equal to the host name reported by a corresponding OSD. You can refer to them
without adding them to this JSON tree manually.
Level may be "host", "osd" or refer to some other placement tree level
from [placement_levels](
Parent node reference is required for intermediate tree nodes.
# OSD settings
Separate OSD settings are set in etc keys `/vitastor/config/osd/<number>`
in JSON format `{"<key>":<value>}`.
As of now, there is only one setting:
## reweight
- Type: number, between 0 and 1
- Default: 1
Every OSD receives PGs proportional to its size. Reweight is a multiplier for
OSD size used during PG distribution.
This means an OSD configured with reweight lower than 1 receives less PGs than
it normally would. An OSD with reweight = 0 won't store any data. You can set
reweight to 0 to trigger rebalance and remove all data from an OSD.
# Pool parameters
8 months ago
## name
- Type: string
- Required
Pool name.
## scheme
- Type: string
- Required
- One of: "replicated", "xor", "ec" or "jerasure"
8 months ago
Redundancy scheme used for data in this pool. "jerasure" is an alias for "ec",
both use Reed-Solomon-Vandermonde codes based on ISA-L or jerasure libraries.
Fast ISA-L based implementation is used automatically when it's available,
slower jerasure version is used otherwise.
8 months ago
## pg_size
- Type: integer
- Required
Total number of disks for PGs of this pool - i.e., number of replicas for
replicated pools and number of data plus parity disks for EC/XOR pools.
## parity_chunks
- Type: integer
Number of parity chunks for EC/XOR pools. For such pools, data will be lost
if you lose more than parity_chunks disks at once, so this parameter can be
equally described as FTT (number of failures to tolerate).
Required for EC/XOR pools, ignored for replicated pools.
## pg_minsize
- Type: integer
- Required
Number of available live disks for PGs of this pool to remain active.
That is, if it becomes impossible to place PG data on at least (pg_minsize)
OSDs, PG is deactivated for both read and write. So you know that a fresh
write always goes to at least (pg_minsize) OSDs (disks).
FIXME: pg_minsize behaviour may be changed in the future to only make PGs
read-only instead of deactivating them.
## pg_count
- Type: integer
- Required
Number of PGs for this pool. The value should be big enough for the monitor /
LP solver to be able to optimize data placement.
"Enough" is usually around 64-128 PGs per OSD, i.e. you set pg_count for pool
to (total OSD count * 100 / pg_size). You can round it to the closest power of 2,
because it makes it easier to reduce or increase PG count later by dividing or
multiplying it by 2.
In Vitastor, PGs are ephemeral, so you can change pool PG count anytime just
by overwriting pool configuration in etcd. Amount of the data affected by
rebalance will be smaller if the new PG count is a multiple of the old PG count
or vice versa.
## failure_domain
- Type: string
- Default: host
Failure domain specification. Must be "host" or "osd" or refer to one of the
placement tree levels, defined in [placement_levels](
Two replicas, or two parts in case of EC/XOR, of the same block of data are
never put on OSDs in the same failure domain (for example, on the same host).
So failure domain specifies the unit which failure you are protecting yourself
## max_osd_combinations
- Type: integer
- Default: 10000
Vitastor data placement algorithm is based on the LP solver and OSD combinations
which are fed to it are generated ramdonly. This parameter specifies the maximum
number of combinations to generate when optimising PG placement.
This parameter usually doesn't require to be changed.
## pg_stripe_size
- Type: integer
- Default: 0
Specifies the stripe size for this pool according to which images are split into
different PGs. Stripe size can't be smaller than [block_size](
multiplied by (pg_size - parity_chunks) for EC/XOR pools, or 1 for replicated pools,
and the same value is used by default.
This means first `pg_stripe_size = (block_size * (pg_size-parity_chunks))` bytes
of an image go to one PG, next `pg_stripe_size` bytes go to another PG and so on.
Usually doesn't require to be changed separately from the block size.
## root_node
- Type: string
Specifies the root node of the OSD tree to restrict this pool OSDs to.
Referenced root node must exist in /vitastor/config/node_placement.
## osd_tags
- Type: string or array of strings
Specifies OSD tags to restrict this pool to. If multiple tags are specified,
only OSDs having all of these tags will be used for this pool.
## primary_affinity_tags
- Type: string or array of strings
Specifies OSD tags to prefer putting primary OSDs in this pool to.
Note that for EC/XOR pools Vitastor always prefers to put primary OSD on one
of the OSDs containing a data chunk for a PG.
# Examples
## Replicated pool
"1": {
## Erasure-coded pool
"2": {
8 months ago