Allow to configure block_size, bitmap_granularity and immediate_commit per-pool

rm-left-on-dead
Vitaliy Filippov 2022-08-09 02:27:02 +03:00
parent 4c9aaa8a86
commit 5a10d135f3
24 changed files with 535 additions and 340 deletions

View File

@ -9,34 +9,34 @@
These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
initialization and can't be changed after it without losing data.
OSDs with different values of these parameters (for example, SSD and SSD+HDD
OSDs) can coexist in one Vitastor cluster within different pools. Each pool can
only include OSDs with identical settings of these parameters.
These parameters, when set to a non-default value, must also be specified in
etcd for clients to be aware of their values, either in /vitastor/config/global
or in pool configuration. Pool configuration overrides the global setting.
If the value for a pool in etcd doesn't match on-disk OSD configuration, the
OSD will refuse to start PGs of that pool.
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [client_dirty_limit](#client_dirty_limit)
## block_size
- Type: integer
- Default: 131072
Size of objects (data blocks) into which all physical and virtual drives are
subdivided in Vitastor. One of current main settings in Vitastor, affects
memory usage, write amplification and I/O load distribution effectiveness.
Size of objects (data blocks) into which all physical and virtual drives
(within a pool) are subdivided in Vitastor. One of current main settings
in Vitastor, affects memory usage, write amplification and I/O load
distribution effectiveness.
Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
it's possible to use 4 MB for SSD too - it will lower memory usage, but
may increase average WA and reduce linear performance.
OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
currently coexist in one etcd instance only within separate Vitastor
clusters with different etcd_prefix'es.
Also block size can't be changed after OSD initialization without losing
data.
You must always specify block_size in etcd in /vitastor/config/global if
you change it so all clients can know about it.
OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
544 MB per 1 TB of used disk space with the default 128 KB block size.
@ -50,12 +50,7 @@ of disk_alignment. It's called bitmap granularity because Vitastor tracks
an allocation bitmap for each object containing 2 bits per each
(bitmap_granularity) bytes.
This parameter can't be changed after OSD initialization without losing
data. Also it's fixed for the whole Vitastor cluster i.e. two different
values can't be used in a single Vitastor cluster.
Clients MUST be aware of this parameter value, so put it into etcd key
/vitastor/config/global if you change it for any reason.
Can't be smaller than the OSD data device sector.
## immediate_commit
@ -99,26 +94,12 @@ unsafe to change by hand). The same may apply to newer HDDs with internal
SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
it (they have internal SSD cache even though it's not stated in datasheets).
This parameter must be set both in etcd in /vitastor/config/global and in
OSD command line or configuration. Setting it to "all" or "small" requires
enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
also requires enabling disable_data_fsync.
Setting this parameter to "all" or "small" in OSD parameters requires enabling
disable_journal_fsync and disable_meta_fsync, setting it to "all" also requires
enabling disable_data_fsync.
TLDR: For optimal performance, set immediate_commit to "all" if you only use
SSDs with supercapacitor-based power loss protection (nonvolatile
write-through cache) for both data and journals in the whole Vitastor
cluster. Set it to "small" if you only use such SSDs for journals. Leave
empty if your drives have write-back cache.
## client_dirty_limit
- Type: integer
- Default: 33554432
Without immediate_commit=all this parameter sets the limit of "dirty"
(not committed by fsync) data allowed by the client before forcing an
additional fsync and committing the data. Also note that the client always
holds a copy of uncommitted data in memory so this setting also affects
RAM usage of clients.
This parameter doesn't affect OSDs themselves.

View File

@ -9,10 +9,19 @@
Данные параметры используются клиентами и OSD, задаются в момент инициализации
диска OSD и не могут быть изменены после этого без потери данных.
OSD с разными значениями данных параметров (например, SSD и гибридные SSD+HDD
OSD) могут сосуществовать в одном кластере Vitastor в разных пулах. Один пул
может включать только OSD с одинаковыми настройками этих параметров.
Данные параметры, отличаясь от значения по умолчанию, должны также быть заданы
в etcd, чтобы клиенты могли узнать их значение, либо в глобальной конфигурации
/vitastor/config/global, либо в настройках пулов. Настройки пула переопределяют
глобальное значение. Если значение в настройках пула не будет соответствовать
конфигурации OSD, OSD откажется запускать PG данного пула.
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [client_dirty_limit](#client_dirty_limit)
## block_size
@ -20,24 +29,15 @@
- Значение по умолчанию: 131072
Размер объектов (блоков данных), на которые делятся физические и виртуальные
диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
потребление памяти, объём избыточной записи (write amplification) и
эффективность распределения нагрузки по OSD.
диски в Vitastor (в рамках каждого пула). Одна из ключевых на данный момент
настроек, влияет на потребление памяти, объём избыточной записи (write
amplification) и эффективность распределения нагрузки по OSD.
Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
это понизит использование памяти, но ухудшит распределение нагрузки и в
среднем увеличит WA.
OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
момент могут сосуществовать в рамках одного etcd только в виде двух независимых
кластеров Vitastor с разными etcd_prefix.
Также размер блока нельзя менять после инициализации OSD без потери данных.
Если вы меняете размер блока, обязательно прописывайте его в etcd в
/vitastor/config/global, дабы все клиенты его знали.
Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
стандартном 128 КБ блоке.
@ -52,13 +52,7 @@ OSD с разными размерами блока (например, SSD и SS
потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
по 2 бита на каждые (bitmap_granularity) байт.
Данный параметр нельзя менять после инициализации OSD без потери данных.
Также он фиксирован для всего кластера Vitastor, т.е. разные значения
не могут сосуществовать в одном кластере.
Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
его меняете, обязательно прописывайте изменённое значение в etcd в ключ
/vitastor/config/global.
Не может быть меньше размера сектора дисков данных OSD.
## immediate_commit
@ -108,8 +102,7 @@ HDD-дисках с внутренним SSD или "медиа" кэшем - н
многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
указано в спецификациях).
Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
командной строке или конфигурации OSD. Значения "all" и "small" требуют
Указание "all" или "small" в настройках / командной строке OSD требует
включения disable_journal_fsync и disable_meta_fsync, значение "all" также
требует включения disable_data_fsync.
@ -119,16 +112,3 @@ immediate_commit в значение "all", если вы используете
такие SSD для всех журналов, но не для данных - можете установить параметр
в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
оставьте параметр пустым.
## client_dirty_limit
- Тип: целое число
- Значение по умолчанию: 33554432
При работе без immediate_commit=all - это лимит объёма "грязных" (не
зафиксированных fsync-ом) данных, при достижении которого клиент будет
принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
что в этом случае до момента fsync клиент хранит копию незафиксированных
данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
Параметр не влияет на сами OSD.

View File

@ -29,6 +29,7 @@ between clients, OSDs and etcd.
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
- [client_dirty_limit](#client_dirty_limit)
## tcp_header_buffer_size
@ -212,3 +213,16 @@ etcd_report_interval to guarantee that keepalive actually works.
etcd websocket ping interval required to keep the connection alive and
detect disconnections quickly.
## client_dirty_limit
- Type: integer
- Default: 33554432
Without immediate_commit=all this parameter sets the limit of "dirty"
(not committed by fsync) data allowed by the client before forcing an
additional fsync and committing the data. Also note that the client always
holds a copy of uncommitted data in memory so this setting also affects
RAM usage of clients.
This parameter doesn't affect OSDs themselves.

View File

@ -29,6 +29,7 @@
- [etcd_slow_timeout](#etcd_slow_timeout)
- [etcd_keepalive_timeout](#etcd_keepalive_timeout)
- [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
- [client_dirty_limit](#client_dirty_limit)
## tcp_header_buffer_size
@ -222,3 +223,16 @@ etcd_report_interval, чтобы keepalive гарантированно рабо
- Значение по умолчанию: 30
Интервал проверки живости вебсокет-подключений к etcd.
## client_dirty_limit
- Тип: целое число
- Значение по умолчанию: 33554432
При работе без immediate_commit=all - это лимит объёма "грязных" (не
зафиксированных fsync-ом) данных, при достижении которого клиент будет
принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
что в этом случае до момента fsync клиент хранит копию незафиксированных
данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
Параметр не влияет на сами OSD.

View File

@ -33,6 +33,9 @@ Parameters:
- [pg_count](#pg_count)
- [failure_domain](#failure_domain)
- [max_osd_combinations](#max_osd_combinations)
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [pg_stripe_size](#pg_stripe_size)
- [root_node](#root_node)
- [osd_tags](#osd_tags)
@ -186,6 +189,43 @@ number of combinations to generate when optimising PG placement.
This parameter usually doesn't require to be changed.
## block_size
- Type: integer
- Default: 131072
Block size for this pool. The value from /vitastor/config/global is used when
unspecified. If your cluster has OSDs with different block sizes then pool must
be restricted by [osd_tags](#osd_tags) to only include OSDs with matching block
size.
Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#block_size).
## bitmap_granularity
- Type: integer
- Default: 4096
"Sector" size of virtual disks in this pool. The value from
/vitastor/config/global is used when unspecified. Similar to block_size, the
pool must be restricted by [osd_tags](#osd_tags) to only include OSDs with
matching bitmap_granularity.
Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#bitmap_granularity).
## immediate_commit
- Type: string, one of "all", "small" and "none"
- Default: none
Immediate commit setting for this pool. The value from /vitastor/config/global
is used when unspecified. Similar to block_size, the pool must be restricted by
[osd_tags](#osd_tags) to only include OSDs with compatible immediate_commit.
Compatible means that a pool with non-immediate commit will work with OSDs with
immediate commit enabled, but not vice versa.
Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#immediate_commit).
## pg_stripe_size
- Type: integer

View File

@ -32,6 +32,9 @@
- [pg_count](#pg_count)
- [failure_domain](#failure_domain)
- [max_osd_combinations](#max_osd_combinations)
- [block_size](#block_size)
- [bitmap_granularity](#bitmap_granularity)
- [immediate_commit](#immediate_commit)
- [pg_stripe_size](#pg_stripe_size)
- [root_node](#root_node)
- [osd_tags](#osd_tags)
@ -185,13 +188,51 @@ PG в Vitastor эферемерны, то есть вы можете менят
Обычно данный параметр не требует изменений.
## block_size
- Тип: целое число
- По умолчанию: 131072
Размер блока для данного пула. Если не задан, используется значение из
/vitastor/config/global. Если в вашем кластере есть OSD с разными размерами
блока, пул должен быть ограничен только OSD, блок которых равен блоку пула,
с помощью [osd_tags](#osd_tags).
О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#block_size).
## bitmap_granularity
- Тип: целое число
- По умолчанию: 4096
Размер "сектора" виртуальных дисков в данном пуле. Если не задан, используется
значение из /vitastor/config/global. Аналогично block_size, пул должен быть
ограничен OSD со значением bitmap_granularity, равным значению пула, с помощью
[osd_tags](#osd_tags).
О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#bitmap_granularity).
## immediate_commit
- Тип: строка "all", "small" или "none"
- По умолчанию: none
Настройка мгновенного коммита для данного пула. Если не задана, используется
значение из /vitastor/config/global. Аналогично block_size, пул должен быть
ограничен OSD со значением bitmap_granularity, совместимым со значением пула, с
помощью [osd_tags](#osd_tags). Совместимость означает, что пул с отключенным
мгновенным коммитом может работать на OSD с включённым мгновенным коммитом, но
не наоборот.
О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#immediate_commit).
## pg_stripe_size
- Тип: целое число
- По умолчанию: 0
Данный параметр задаёт размер полосы "нарезки" образов на PG. Размер полосы не может
быть меньше, чем [block_size](layout-cluster.ru.md#block_size), умноженный на
быть меньше, чем [block_size](#block_size), умноженный на
(pg_size - parity_chunks) для EC-пулов или 1 для реплицированных пулов. То же
значение используется по умолчанию.

View File

@ -2,3 +2,13 @@
These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
initialization and can't be changed after it without losing data.
OSDs with different values of these parameters (for example, SSD and SSD+HDD
OSDs) can coexist in one Vitastor cluster within different pools. Each pool can
only include OSDs with identical settings of these parameters.
These parameters, when set to a non-default value, must also be specified in
etcd for clients to be aware of their values, either in /vitastor/config/global
or in pool configuration. Pool configuration overrides the global setting.
If the value for a pool in etcd doesn't match on-disk OSD configuration, the
OSD will refuse to start PGs of that pool.

View File

@ -2,3 +2,13 @@
Данные параметры используются клиентами и OSD, задаются в момент инициализации
диска OSD и не могут быть изменены после этого без потери данных.
OSD с разными значениями данных параметров (например, SSD и гибридные SSD+HDD
OSD) могут сосуществовать в одном кластере Vitastor в разных пулах. Один пул
может включать только OSD с одинаковыми настройками этих параметров.
Данные параметры, отличаясь от значения по умолчанию, должны также быть заданы
в etcd, чтобы клиенты могли узнать их значение, либо в глобальной конфигурации
/vitastor/config/global, либо в настройках пулов. Настройки пула переопределяют
глобальное значение. Если значение в настройках пула не будет соответствовать
конфигурации OSD, OSD откажется запускать PG данного пула.

View File

@ -2,46 +2,28 @@
type: int
default: 131072
info: |
Size of objects (data blocks) into which all physical and virtual drives are
subdivided in Vitastor. One of current main settings in Vitastor, affects
memory usage, write amplification and I/O load distribution effectiveness.
Size of objects (data blocks) into which all physical and virtual drives
(within a pool) are subdivided in Vitastor. One of current main settings
in Vitastor, affects memory usage, write amplification and I/O load
distribution effectiveness.
Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
it's possible to use 4 MB for SSD too - it will lower memory usage, but
may increase average WA and reduce linear performance.
OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
currently coexist in one etcd instance only within separate Vitastor
clusters with different etcd_prefix'es.
Also block size can't be changed after OSD initialization without losing
data.
You must always specify block_size in etcd in /vitastor/config/global if
you change it so all clients can know about it.
OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
544 MB per 1 TB of used disk space with the default 128 KB block size.
info_ru: |
Размер объектов (блоков данных), на которые делятся физические и виртуальные
диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
потребление памяти, объём избыточной записи (write amplification) и
эффективность распределения нагрузки по OSD.
диски в Vitastor (в рамках каждого пула). Одна из ключевых на данный момент
настроек, влияет на потребление памяти, объём избыточной записи (write
amplification) и эффективность распределения нагрузки по OSD.
Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
это понизит использование памяти, но ухудшит распределение нагрузки и в
среднем увеличит WA.
OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
момент могут сосуществовать в рамках одного etcd только в виде двух независимых
кластеров Vitastor с разными etcd_prefix.
Также размер блока нельзя менять после инициализации OSD без потери данных.
Если вы меняете размер блока, обязательно прописывайте его в etcd в
/vitastor/config/global, дабы все клиенты его знали.
Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
стандартном 128 КБ блоке.
@ -54,25 +36,14 @@
an allocation bitmap for each object containing 2 bits per each
(bitmap_granularity) bytes.
This parameter can't be changed after OSD initialization without losing
data. Also it's fixed for the whole Vitastor cluster i.e. two different
values can't be used in a single Vitastor cluster.
Clients MUST be aware of this parameter value, so put it into etcd key
/vitastor/config/global if you change it for any reason.
Can't be smaller than the OSD data device sector.
info_ru: |
Требуемое выравнивание записи на виртуальные диски (размер их "сектора").
Должен быть кратен disk_alignment. Называется гранулярностью битовой карты
потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
по 2 бита на каждые (bitmap_granularity) байт.
Данный параметр нельзя менять после инициализации OSD без потери данных.
Также он фиксирован для всего кластера Vitastor, т.е. разные значения
не могут сосуществовать в одном кластере.
Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
его меняете, обязательно прописывайте изменённое значение в etcd в ключ
/vitastor/config/global.
Не может быть меньше размера сектора дисков данных OSD.
- name: immediate_commit
type: string
default: false
@ -114,10 +85,9 @@
SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
it (they have internal SSD cache even though it's not stated in datasheets).
This parameter must be set both in etcd in /vitastor/config/global and in
OSD command line or configuration. Setting it to "all" or "small" requires
enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
also requires enabling disable_data_fsync.
Setting this parameter to "all" or "small" in OSD parameters requires enabling
disable_journal_fsync and disable_meta_fsync, setting it to "all" also requires
enabling disable_data_fsync.
TLDR: For optimal performance, set immediate_commit to "all" if you only use
SSDs with supercapacitor-based power loss protection (nonvolatile
@ -168,8 +138,7 @@
многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
указано в спецификациях).
Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
командной строке или конфигурации OSD. Значения "all" и "small" требуют
Указание "all" или "small" в настройках / командной строке OSD требует
включения disable_journal_fsync и disable_meta_fsync, значение "all" также
требует включения disable_data_fsync.
@ -179,22 +148,3 @@
такие SSD для всех журналов, но не для данных - можете установить параметр
в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
оставьте параметр пустым.
- name: client_dirty_limit
type: int
default: 33554432
info: |
Without immediate_commit=all this parameter sets the limit of "dirty"
(not committed by fsync) data allowed by the client before forcing an
additional fsync and committing the data. Also note that the client always
holds a copy of uncommitted data in memory so this setting also affects
RAM usage of clients.
This parameter doesn't affect OSDs themselves.
info_ru: |
При работе без immediate_commit=all - это лимит объёма "грязных" (не
зафиксированных fsync-ом) данных, при достижении которого клиент будет
принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
что в этом случае до момента fsync клиент хранит копию незафиксированных
данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
Параметр не влияет на сами OSD.

View File

@ -223,3 +223,22 @@
detect disconnections quickly.
info_ru: |
Интервал проверки живости вебсокет-подключений к etcd.
- name: client_dirty_limit
type: int
default: 33554432
info: |
Without immediate_commit=all this parameter sets the limit of "dirty"
(not committed by fsync) data allowed by the client before forcing an
additional fsync and committing the data. Also note that the client always
holds a copy of uncommitted data in memory so this setting also affects
RAM usage of clients.
This parameter doesn't affect OSDs themselves.
info_ru: |
При работе без immediate_commit=all - это лимит объёма "грязных" (не
зафиксированных fsync-ом) данных, при достижении которого клиент будет
принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
что в этом случае до момента fsync клиент хранит копию незафиксированных
данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
Параметр не влияет на сами OSD.

View File

@ -127,7 +127,7 @@
запросы записи клиенты копируют в памяти и при потере соединения и повторном соединении
с OSD повторяют из памяти. Скопированные в память данные удаляются при успешном fsync,
а чтобы хранение этих данных не приводило к чрезмерному потреблению памяти, клиенты
автоматически выполняют fsync каждые [client_dirty_limit](../config/layout-cluster.ru.md#client_dirty_limit)
автоматически выполняют fsync каждые [client_dirty_limit](../config/network.ru.md#client_dirty_limit)
записанных байт.
## Схожесть с Ceph

View File

@ -157,7 +157,12 @@ const etcd_tree = {
pg_count: 100,
failure_domain: 'host',
max_osd_combinations: 10000,
pg_stripe_size: 4194304,
// block_size, bitmap_granularity, immediate_commit must match all OSDs used in that pool
data_block_size: 131072,
bitmap_granularity: 4096,
// 'all'/'small'/'none', same as in OSD options
immediate_commit: 'none',
pg_stripe_size: 0,
root_node?: 'rack1',
// restrict pool to OSDs having all of these tags
osd_tags?: 'nvme' | [ 'nvme', ... ],
@ -323,6 +328,13 @@ const etcd_tree = {
misplaced: uint64_t,
degraded: uint64_t,
incomplete: uint64_t,
},
object_bytes: {
total: uint64_t,
clean: uint64_t,
misplaced: uint64_t,
degraded: uint64_t,
incomplete: uint64_t,
}, */
},
history: {
@ -1438,8 +1450,23 @@ class Mon
sum_object_counts()
{
const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
const object_bytes = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
for (const pool_id in this.state.pg.stats)
{
let object_size = 0;
for (const osd_num of this.state.pg.stats[pool_id].write_osd_set||[])
{
if (osd_num && this.state.osd.stats[osd_num] && this.state.osd.stats[osd_num].block_size)
{
object_size = this.state.osd.stats[osd_num].block_size;
break;
}
}
if (!object_size)
{
object_size = this.config['block_size'];
}
object_size = BigInt(object_size);
for (const pg_num in this.state.pg.stats[pool_id])
{
const st = this.state.pg.stats[pool_id][pg_num];
@ -1450,12 +1477,13 @@ class Mon
if (st[k+'_count'])
{
object_counts[k] += BigInt(st[k+'_count']);
object_bytes[k] += BigInt(st[k+'_count']) * object_size;
}
}
}
}
}
return object_counts;
return { object_counts, object_bytes };
}
sum_inode_stats(prev_stats, timestamp, prev_timestamp)
@ -1568,7 +1596,7 @@ class Mon
{
const txn = [];
const timestamp = Date.now();
const object_counts = this.sum_object_counts();
const { object_counts, object_bytes } = this.sum_object_counts();
let stats = this.sum_op_stats(timestamp, this.prev_stats);
let inode_stats = this.sum_inode_stats(
this.prev_stats ? this.prev_stats.inode_stats : null,
@ -1576,6 +1604,7 @@ class Mon
);
this.prev_stats = { timestamp, ...stats, inode_stats };
stats.object_counts = object_counts;
stats.object_bytes = object_bytes;
stats = this.serialize_bigints(stats);
inode_stats = this.serialize_bigints(inode_stats);
txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });

View File

@ -47,6 +47,7 @@ struct snap_merger_t
int state = 0;
int lists_todo = 0;
uint64_t target_block_size = 0;
uint32_t target_bitmap_granularity = 0;
btree::safe_btree_set<uint64_t> merge_offsets;
btree::safe_btree_set<uint64_t>::iterator oit;
std::map<inode_t, std::vector<uint64_t>> layer_lists;
@ -101,7 +102,7 @@ struct snap_merger_t
std::vector<inode_t> chain_list;
inode_config_t *cur = to_cfg;
chain_list.push_back(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num, NULL);
while (cur->parent_id != from_cfg->num &&
cur->parent_id != to_cfg->num &&
cur->parent_id != 0)
@ -124,7 +125,7 @@ struct snap_merger_t
}
cur = &it->second;
chain_list.push_back(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num);
layer_block_size[cur->num] = get_block_size(cur->num, NULL);
}
if (cur->parent_id != from_cfg->num)
{
@ -133,7 +134,7 @@ struct snap_merger_t
return;
}
chain_list.push_back(from_cfg->num);
layer_block_size[from_cfg->num] = get_block_size(from_cfg->num);
layer_block_size[from_cfg->num] = get_block_size(from_cfg->num, NULL);
int i = chain_list.size()-1;
for (inode_t item: chain_list)
{
@ -204,14 +205,16 @@ struct snap_merger_t
use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
);
}
target_block_size = get_block_size(target);
target_block_size = get_block_size(target, &target_bitmap_granularity);
}
uint64_t get_block_size(inode_t inode)
uint64_t get_block_size(inode_t inode, uint32_t *bitmap_granularity)
{
auto & pool_cfg = parent->cli->st_cli.pool_config.at(INODE_POOL(inode));
uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
return parent->cli->get_bs_block_size() * pg_data_size;
if (bitmap_granularity)
*bitmap_granularity = pool_cfg.bitmap_granularity;
return pool_cfg.data_block_size * pg_data_size;
}
void continue_merge_reent()
@ -409,7 +412,7 @@ struct snap_merger_t
}
else
{
uint64_t bitmap_bytes = target_block_size/parent->cli->get_bs_bitmap_granularity()/8;
uint64_t bitmap_bytes = target_block_size/target_bitmap_granularity/8;
int i;
for (i = 0; i < bitmap_bytes; i++)
{
@ -469,7 +472,7 @@ struct snap_merger_t
{
// Write each non-empty range using an individual operation
// FIXME: Allow to use single write with "holes" (OSDs don't allow it yet)
uint32_t gran = parent->cli->get_bs_bitmap_granularity();
uint32_t gran = target_bitmap_granularity;
uint64_t bitmap_size = target_block_size / gran;
while (rwo->end < bitmap_size && !rwo->error_code)
{

View File

@ -7,6 +7,8 @@
#include "pg_states.h"
#include "http_client.h"
static const char *obj_states[] = { "clean", "misplaced", "degraded", "incomplete" };
// Print cluster status:
// etcd, mon, osd states
// raw/used space, object states, pool states, pg states
@ -196,21 +198,57 @@ resume_2:
}
pgs_by_state_str += std::to_string(kv.second)+" "+kv.first;
}
uint64_t object_size = parent->cli->get_bs_block_size();
std::string more_states;
uint64_t obj_n;
obj_n = agg_stats["object_counts"]["misplaced"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" misplaced";
obj_n = agg_stats["object_counts"]["degraded"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" degraded";
obj_n = agg_stats["object_counts"]["incomplete"].uint64_value();
if (obj_n > 0)
more_states += ", "+format_size(obj_n*object_size)+" incomplete";
bool readonly = json_is_true(parent->cli->merged_config["readonly"]);
bool no_recovery = json_is_true(parent->cli->merged_config["no_recovery"]);
bool no_rebalance = json_is_true(parent->cli->merged_config["no_rebalance"]);
if (parent->json_output)
{
// JSON output
auto json_status = json11::Json::object {
{ "etcd_alive", etcd_alive },
{ "etcd_count", (uint64_t)etcd_states.size() },
{ "etcd_db_size", etcd_db_size },
{ "mon_count", mon_count },
{ "mon_master", mon_master },
{ "osd_up", osd_up },
{ "osd_count", osd_count },
{ "total_raw", total_raw },
{ "free_raw", free_raw },
{ "down_raw", down_raw },
{ "free_down_raw", free_down_raw },
{ "readonly", readonly },
{ "no_recovery", no_recovery },
{ "no_rebalance", no_rebalance },
{ "pool_count", pool_count },
{ "active_pool_count", pools_active },
{ "pg_states", pgs_by_state },
{ "op_stats", agg_stats["op_stats"] },
{ "recovery_stats", agg_stats["recovery_stats"] },
{ "object_counts", agg_stats["object_counts"] },
};
for (int i = 0; i < sizeof(obj_states)/sizeof(obj_states[0]); i++)
{
std::string str(obj_states[i]);
uint64_t obj_n = agg_stats["object_bytes"][str].uint64_value();
if (!obj_n)
obj_n = agg_stats["object_counts"][str].uint64_value() * parent->cli->st_cli.global_block_size;
json_status[str+"_data"] = obj_n;
}
printf("%s\n", json11::Json(json_status).dump().c_str());
state = 100;
return;
}
std::string more_states;
for (int i = 0; i < sizeof(obj_states)/sizeof(obj_states[0]); i++)
{
std::string str(obj_states[i]);
uint64_t obj_n = agg_stats["object_bytes"][str].uint64_value();
if (!obj_n)
obj_n = agg_stats["object_counts"][str].uint64_value() * parent->cli->st_cli.global_block_size;
if (!i || obj_n > 0)
more_states += format_size(obj_n)+" "+str+", ";
}
more_states.resize(more_states.size()-2);
std::string recovery_io;
{
uint64_t deg_bps = agg_stats["recovery_stats"]["degraded"]["bps"].uint64_value();
@ -232,38 +270,6 @@ resume_2:
else if (no_rebalance)
recovery_io += " rebalance: disabled\n";
}
if (parent->json_output)
{
// JSON output
printf("%s\n", json11::Json(json11::Json::object {
{ "etcd_alive", etcd_alive },
{ "etcd_count", (uint64_t)etcd_states.size() },
{ "etcd_db_size", etcd_db_size },
{ "mon_count", mon_count },
{ "mon_master", mon_master },
{ "osd_up", osd_up },
{ "osd_count", osd_count },
{ "total_raw", total_raw },
{ "free_raw", free_raw },
{ "down_raw", down_raw },
{ "free_down_raw", free_down_raw },
{ "readonly", readonly },
{ "no_recovery", no_recovery },
{ "no_rebalance", no_rebalance },
{ "clean_data", agg_stats["object_counts"]["clean"].uint64_value() * object_size },
{ "misplaced_data", agg_stats["object_counts"]["misplaced"].uint64_value() * object_size },
{ "degraded_data", agg_stats["object_counts"]["degraded"].uint64_value() * object_size },
{ "incomplete_data", agg_stats["object_counts"]["incomplete"].uint64_value() * object_size },
{ "pool_count", pool_count },
{ "active_pool_count", pools_active },
{ "pg_states", pgs_by_state },
{ "op_stats", agg_stats["op_stats"] },
{ "recovery_stats", agg_stats["recovery_stats"] },
{ "object_counts", agg_stats["object_counts"] },
}).dump().c_str());
state = 100;
return;
}
printf(
" cluster:\n"
" etcd: %d / %ld up, %s database size\n"
@ -272,7 +278,7 @@ resume_2:
" \n"
" data:\n"
" raw: %s used, %s / %s available%s\n"
" state: %s clean%s\n"
" state: %s\n"
" pools: %d / %d active\n"
" pgs: %s\n"
" \n"
@ -286,7 +292,7 @@ resume_2:
format_size(free_raw-free_down_raw).c_str(),
format_size(total_raw-down_raw).c_str(),
(down_raw > 0 ? (", "+format_size(down_raw)+" down").c_str() : ""),
format_size(agg_stats["object_counts"]["clean"].uint64_value() * object_size).c_str(), more_states.c_str(),
more_states.c_str(),
pools_active, pool_count,
pgs_by_state_str.c_str(),
readonly ? " (read-only mode)" : "",

View File

@ -14,6 +14,7 @@
#define CACHE_FLUSHING 2
#define CACHE_REPEATING 3
#define OP_FLUSH_BUFFER 0x02
#define OP_IMMEDIATE_COMMIT 0x04
cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
{
@ -127,26 +128,26 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
op->prev_wait++;
}
}
if (!op->prev_wait && pgs_loaded)
if (!op->prev_wait)
continue_rw(op);
}
else if (op->opcode == OSD_OP_SYNC)
{
for (auto prev = op->prev; prev; prev = prev->prev)
{
if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE)
if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE && !(prev->flags & OP_IMMEDIATE_COMMIT))
{
op->prev_wait++;
}
}
if (!op->prev_wait && pgs_loaded)
if (!op->prev_wait)
continue_sync(op);
}
else /* if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP) */
{
for (auto prev = op_queue_head; prev && prev != op; prev = prev->next)
{
if (prev->opcode == OSD_OP_WRITE && prev->flags & OP_FLUSH_BUFFER)
if (prev->opcode == OSD_OP_WRITE && (prev->flags & OP_FLUSH_BUFFER))
{
op->prev_wait++;
}
@ -156,7 +157,7 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
break;
}
}
if (!op->prev_wait && pgs_loaded)
if (!op->prev_wait)
continue_rw(op);
}
}
@ -168,7 +169,7 @@ void cluster_client_t::inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *n
while (next)
{
auto n2 = next->next;
if (next->opcode == OSD_OP_SYNC ||
if (next->opcode == OSD_OP_SYNC && !(flags & OP_IMMEDIATE_COMMIT) ||
next->opcode == OSD_OP_WRITE && (flags & OP_FLUSH_BUFFER) && !(next->flags & OP_FLUSH_BUFFER) ||
(next->opcode == OSD_OP_READ || next->opcode == OSD_OP_READ_BITMAP) && (flags & OP_FLUSH_BUFFER))
{
@ -221,7 +222,7 @@ void cluster_client_t::erase_op(cluster_op_t *op)
op_queue_tail = op->prev;
op->next = op->prev = NULL;
std::function<void(cluster_op_t*)>(op->callback)(op);
if (!immediate_commit)
if (!(flags & OP_IMMEDIATE_COMMIT))
inc_wait(opcode, flags, next, -1);
}
@ -262,21 +263,6 @@ restart:
continuing_ops = 0;
}
static uint32_t is_power_of_two(uint64_t value)
{
uint32_t l = 0;
while (value > 1)
{
if (value & 1)
{
return 64;
}
value = value >> 1;
l++;
}
return l;
}
void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{
this->merged_config = config;
@ -284,24 +270,6 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
{
this->merged_config[kv.first] = kv.second;
}
bs_block_size = config["block_size"].uint64_value();
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_block_size)
{
bs_block_size = DEFAULT_BLOCK_SIZE;
}
if (!bs_bitmap_granularity)
{
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
bs_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
uint32_t block_order;
if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_DATA_BLOCK_SIZE || bs_block_size >= MAX_DATA_BLOCK_SIZE)
{
throw std::runtime_error("Bad block size");
}
// Cluster-wide immediate_commit mode
immediate_commit = (config["immediate_commit"] == "all");
if (config.find("client_max_dirty_bytes") != config.end())
{
client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value();
@ -379,9 +347,15 @@ void cluster_client_t::on_change_hook(std::map<std::string, etcd_kv_t> & changes
continue_ops();
}
bool cluster_client_t::get_immediate_commit()
bool cluster_client_t::get_immediate_commit(uint64_t inode)
{
return immediate_commit;
pool_id_t pool_id = INODE_POOL(inode);
if (!pool_id)
return true;
auto pool_it = st_cli.pool_config.find(pool_id);
if (pool_it == st_cli.pool_config.end())
return true;
return pool_it->second.immediate_commit == IMMEDIATE_ALL;
}
void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
@ -439,9 +413,45 @@ void cluster_client_t::execute(cluster_op_t *op)
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (!pgs_loaded)
{
offline_ops.push_back(op);
return;
}
op->cur_inode = op->inode;
op->retval = 0;
if (op->opcode == OSD_OP_WRITE && !immediate_commit)
op->flags = op->flags & OSD_OP_IGNORE_READONLY; // single allowed flag
if (op->opcode != OSD_OP_SYNC)
{
pool_id_t pool_id = INODE_POOL(op->cur_inode);
if (!pool_id)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
auto pool_it = st_cli.pool_config.find(pool_id);
if (pool_it == st_cli.pool_config.end() || pool_it->second.real_pg_count == 0)
{
// Pools are loaded, but this one is unknown
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
// Check alignment
if ((op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && !op->len ||
op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
{
op->retval = -EINVAL;
std::function<void(cluster_op_t*)>(op->callback)(op);
return;
}
if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
{
op->flags |= OP_IMMEDIATE_COMMIT;
}
}
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
{
if (dirty_bytes >= client_max_dirty_bytes || dirty_ops >= client_max_dirty_ops)
{
@ -480,9 +490,9 @@ void cluster_client_t::execute(cluster_op_t *op)
}
else
op_queue_tail = op_queue_head = op;
if (!immediate_commit)
if (!(op->flags & OP_IMMEDIATE_COMMIT))
calc_wait(op);
else if (pgs_loaded)
else
{
if (op->opcode == OSD_OP_SYNC)
continue_sync(op);
@ -610,28 +620,6 @@ int cluster_client_t::continue_rw(cluster_op_t *op)
else if (op->state == 3)
goto resume_3;
resume_0:
if ((op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && !op->len ||
op->offset % bs_bitmap_granularity || op->len % bs_bitmap_granularity)
{
op->retval = -EINVAL;
erase_op(op);
return 1;
}
{
pool_id_t pool_id = INODE_POOL(op->cur_inode);
if (!pool_id)
{
op->retval = -EINVAL;
erase_op(op);
return 1;
}
if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
st_cli.pool_config[pool_id].real_pg_count == 0)
{
// Postpone operations to unknown pools
return 0;
}
}
if (op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE)
{
if (!(op->flags & OSD_OP_IGNORE_READONLY))
@ -644,7 +632,7 @@ resume_0:
return 1;
}
}
if (op->opcode == OSD_OP_WRITE && !immediate_commit && !(op->flags & OP_FLUSH_BUFFER))
if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT) && !(op->flags & OP_FLUSH_BUFFER))
{
copy_write(op, dirty_buffers);
}
@ -814,7 +802,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
// Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
uint32_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
uint64_t pg_block_size = bs_block_size * pg_data_size;
uint64_t pg_block_size = pool_cfg.data_block_size * pg_data_size;
uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = op->len > 0 ? ((op->offset + op->len - 1) / pg_block_size) * pg_block_size : first_stripe;
op->retval = 0;
@ -822,9 +810,9 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP)
{
// Allocate memory for the bitmap
unsigned object_bitmap_size = (((op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : op->len) / bs_bitmap_granularity + 7) / 8);
unsigned object_bitmap_size = (((op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : op->len) / pool_cfg.bitmap_granularity + 7) / 8);
object_bitmap_size = (object_bitmap_size < 8 ? 8 : object_bitmap_size);
unsigned bitmap_mem = object_bitmap_size + (bs_bitmap_size * pg_data_size) * op->parts.size();
unsigned bitmap_mem = object_bitmap_size + (pool_cfg.data_block_size / pool_cfg.bitmap_granularity / 8 * pg_data_size) * op->parts.size();
if (op->bitmap_buf_size < bitmap_mem)
{
op->bitmap_buf = realloc_or_die(op->bitmap_buf, bitmap_mem);
@ -854,7 +842,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
bool skip_prev = true;
while (cur < end)
{
unsigned bmp_loc = (cur - op->offset)/bs_bitmap_granularity;
unsigned bmp_loc = (cur - op->offset)/pool_cfg.bitmap_granularity;
bool skip = (((*((uint8_t*)op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
if (skip_prev != skip)
{
@ -872,7 +860,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
skip_prev = skip;
prev = cur;
}
cur += bs_bitmap_granularity;
cur += pool_cfg.bitmap_granularity;
}
assert(cur > prev);
if (skip_prev)
@ -904,7 +892,7 @@ bool cluster_client_t::affects_osd(uint64_t inode, uint64_t offset, uint64_t len
{
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(inode));
uint32_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
uint64_t pg_block_size = bs_block_size * pg_data_size;
uint64_t pg_block_size = pool_cfg.data_block_size * pg_data_size;
uint64_t first_stripe = (offset / pg_block_size) * pg_block_size;
uint64_t last_stripe = len > 0 ? ((offset + len - 1) / pg_block_size) * pg_block_size : first_stripe;
for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
@ -935,7 +923,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
part->osd_num = primary_osd;
part->flags |= PART_SENT;
op->inflight_count++;
uint64_t pg_bitmap_size = bs_bitmap_size * (
uint64_t pg_bitmap_size = (pool_cfg.data_block_size / pool_cfg.bitmap_granularity / 8) * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint64_t meta_rev = 0;
@ -983,7 +971,7 @@ int cluster_client_t::continue_sync(cluster_op_t *op)
{
if (op->state == 1)
goto resume_1;
if (immediate_commit || !dirty_osds.size())
if (!dirty_osds.size())
{
// Sync is not required in the immediate_commit mode or if there are no dirty_osds
op->retval = 0;
@ -1140,7 +1128,8 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
else
{
// OK
dirty_osds.insert(part->osd_num);
if (!(op->flags & OP_IMMEDIATE_COMMIT))
dirty_osds.insert(part->osd_num);
part->flags |= PART_DONE;
op->done_count++;
if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP)
@ -1162,12 +1151,12 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
{
// Copy (OR) bitmap
auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
uint32_t pg_block_size = bs_block_size * (
uint32_t pg_block_size = pool_cfg.data_block_size * (
pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
);
uint32_t object_offset = (part->op.req.rw.offset - op->offset) / bs_bitmap_granularity;
uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / bs_bitmap_granularity;
uint32_t part_len = (op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : part->op.req.rw.len) / bs_bitmap_granularity;
uint32_t object_offset = (part->op.req.rw.offset - op->offset) / pool_cfg.bitmap_granularity;
uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / pool_cfg.bitmap_granularity;
uint32_t part_len = (op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : part->op.req.rw.len) / pool_cfg.bitmap_granularity;
if (!(object_offset & 0x7) && !(part_offset & 0x7) && (part_len >= 8))
{
// Copy bytes

View File

@ -6,8 +6,6 @@
#include "messenger.h"
#include "etcd_state_client.h"
#define MIN_DATA_BLOCK_SIZE 4*1024
#define MAX_DATA_BLOCK_SIZE 128*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
#define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
#define INODE_LIST_DONE 1
@ -79,11 +77,7 @@ class cluster_client_t
timerfd_manager_t *tfd;
ring_loop_t *ringloop;
uint64_t bs_block_size = 0;
uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
std::map<pool_id_t, uint64_t> pg_counts;
// WARNING: initially true so execute() doesn't create fake sync
bool immediate_commit = true;
// FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
uint64_t client_max_dirty_bytes = 0;
uint64_t client_max_dirty_ops = 0;
@ -119,7 +113,7 @@ public:
bool is_ready();
void on_ready(std::function<void(void)> fn);
bool get_immediate_commit();
bool get_immediate_commit(uint64_t inode);
static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
void continue_ops(bool up_retry = false);
@ -127,8 +121,8 @@ public:
std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback);
int list_pg_count(inode_list_t *lst);
void list_inode_next(inode_list_t *lst, int next_pgs);
inline uint32_t get_bs_bitmap_granularity() { return bs_bitmap_granularity; }
inline uint64_t get_bs_block_size() { return bs_block_size; }
//inline uint32_t get_bs_bitmap_granularity() { return st_cli.global_bitmap_granularity; }
//inline uint64_t get_bs_block_size() { return st_cli.global_block_size; }
uint64_t next_op_id();
protected:

View File

@ -534,11 +534,18 @@ void etcd_state_client_t::load_global_config()
global_config = kv.value.object_items();
}
}
bs_block_size = global_config["block_size"].uint64_value();
if (!bs_block_size)
global_block_size = global_config["block_size"].uint64_value();
if (!global_block_size)
{
bs_block_size = DEFAULT_BLOCK_SIZE;
global_block_size = DEFAULT_BLOCK_SIZE;
}
global_bitmap_granularity = global_config["bitmap_granularity"].uint64_value();
if (!global_bitmap_granularity)
{
global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
}
global_immediate_commit = global_config["immediate_commit"].string_value() == "all"
? IMMEDIATE_ALL : (global_config["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
on_load_config_hook(global_config);
});
}
@ -732,9 +739,35 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
continue;
}
// Data Block Size
pc.data_block_size = pool_item.second["block_size"].uint64_value();
if (!pc.data_block_size)
pc.data_block_size = global_block_size;
if ((pc.data_block_size & (pc.data_block_size-1)) ||
pc.data_block_size < MIN_DATA_BLOCK_SIZE || pc.data_block_size > MAX_DATA_BLOCK_SIZE)
{
fprintf(stderr, "Pool %u has invalid block_size (must be a power of two between %u and %u), skipping pool\n",
pool_id, MIN_DATA_BLOCK_SIZE, MAX_DATA_BLOCK_SIZE);
continue;
}
// Bitmap Granularity
pc.bitmap_granularity = pool_item.second["bitmap_granularity"].uint64_value();
if (!pc.bitmap_granularity)
pc.bitmap_granularity = global_bitmap_granularity;
if (!pc.bitmap_granularity || pc.data_block_size % pc.bitmap_granularity)
{
fprintf(stderr, "Pool %u has invalid bitmap_granularity (must divide block_size), skipping pool\n", pool_id);
continue;
}
// Immediate Commit Mode
pc.immediate_commit = pool_item.second["immediate_commit"].is_string()
? (pool_item.second["immediate_commit"].string_value() == "all"
? IMMEDIATE_ALL : (pool_item.second["immediate_commit"].string_value() == "small"
? IMMEDIATE_SMALL : IMMEDIATE_NONE))
: global_immediate_commit;
// PG Stripe Size
pc.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
uint64_t min_stripe_size = bs_block_size * (pc.scheme == POOL_SCHEME_REPLICATED ? 1 : (pc.pg_size-pc.parity_chunks));
uint64_t min_stripe_size = pc.data_block_size * (pc.scheme == POOL_SCHEME_REPLICATED ? 1 : (pc.pg_size-pc.parity_chunks));
if (pc.pg_stripe_size < min_stripe_size)
pc.pg_stripe_size = min_stripe_size;
// Save

View File

@ -13,6 +13,13 @@
#define ETCD_OSD_STATE_WATCH_ID 4
#define DEFAULT_BLOCK_SIZE 128*1024
#define MIN_DATA_BLOCK_SIZE 4*1024
#define MAX_DATA_BLOCK_SIZE 128*1024*1024
#define DEFAULT_BITMAP_GRANULARITY 4096
#define IMMEDIATE_NONE 0
#define IMMEDIATE_SMALL 1
#define IMMEDIATE_ALL 2
struct etcd_kv_t
{
@ -41,6 +48,7 @@ struct pool_config_t
std::string name;
uint64_t scheme;
uint64_t pg_size, pg_minsize, parity_chunks;
uint32_t data_block_size, bitmap_granularity, immediate_commit;
uint64_t pg_count;
uint64_t real_pg_count;
std::string failure_domain;
@ -83,7 +91,6 @@ protected:
int ws_keepalive_timer = -1;
int ws_alive = 0;
bool rand_initialized = false;
uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
void add_etcd_url(std::string);
void pick_next_etcd();
public:
@ -92,6 +99,9 @@ public:
int max_etcd_attempts = 5;
int etcd_quick_timeout = 1000;
int etcd_slow_timeout = 5000;
uint64_t global_block_size = DEFAULT_BLOCK_SIZE;
uint32_t global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
uint32_t global_immediate_commit = IMMEDIATE_NONE;
std::string etcd_prefix;
int log_level = 0;

View File

@ -427,6 +427,10 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
);
}
if (check_config_hook)
{
err = !check_config_hook(cl, config);
}
}
if (err)
{

View File

@ -33,7 +33,6 @@
#define PEER_RDMA 4
#define PEER_STOPPED 5
#define DEFAULT_BITMAP_GRANULARITY 4096
#define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"
#define MSGR_SENDP_HDR 1
@ -160,6 +159,7 @@ public:
void outbox_push(osd_op_t *cur_op);
std::function<void(osd_op_t*)> exec_op;
std::function<void(osd_num_t)> repeer_pgs;
std::function<bool(osd_client_t*, json11::Json)> check_config_hook;
void read_requests();
void send_replies();
void accept_connections(int listen_fd);

View File

@ -314,7 +314,12 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
rpc_queue_reply(rop);
return 0;
}
uint64_t alignment = self->parent->cli->get_bs_bitmap_granularity();
uint64_t alignment = self->parent->cli->st_cli.global_bitmap_granularity;
auto pool_cfg = self->parent->cli->st_cli.pool_config.find(INODE_POOL(ino_it->second));
if (pool_cfg != self->parent->cli->st_cli.pool_config.end())
{
alignment = pool_cfg->second.bitmap_granularity;
}
uint64_t aligned_offset = args->offset - (args->offset % alignment);
uint64_t aligned_count = args->offset + args->count;
if (aligned_count % alignment)
@ -375,7 +380,12 @@ static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
return 0;
}
uint64_t count = args->count > args->data.size ? args->data.size : args->count;
uint64_t alignment = self->parent->cli->get_bs_bitmap_granularity();
uint64_t alignment = self->parent->cli->st_cli.global_bitmap_granularity;
auto pool_cfg = self->parent->cli->st_cli.pool_config.find(INODE_POOL(ino_it->second));
if (pool_cfg != self->parent->cli->st_cli.pool_config.end())
{
alignment = pool_cfg->second.bitmap_granularity;
}
// Pre-fill reply
*reply = (WRITE3res){
.status = NFS3_OK,
@ -471,6 +481,7 @@ static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint
op->iov.push_back(buf, count);
op->callback = [self, rop](cluster_op_t *op)
{
uint64_t inode = op->inode;
WRITE3args *args = (WRITE3args*)rop->request;
WRITE3res *reply = (WRITE3res*)rop->reply;
if (op->retval != op->len)
@ -483,8 +494,8 @@ static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint
{
*(uint64_t*)reply->resok.verf = self->parent->server_id;
delete op;
if (!self->parent->cli->get_immediate_commit() &&
args->stable != UNSTABLE)
if (args->stable != UNSTABLE &&
!self->parent->cli->get_immediate_commit(inode))
{
// Client requested a stable write. Add an fsync
op = new cluster_op_t;
@ -1179,25 +1190,17 @@ static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
{
nfs_client_t *self = (nfs_client_t*)opaque;
//COMMIT3args *args = (COMMIT3args*)rop->request;
if (!self->parent->cli->get_immediate_commit())
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [rop](cluster_op_t *op)
{
cluster_op_t *op = new cluster_op_t;
// fsync. we don't know how to fsync a single inode, so just fsync everything
op->opcode = OSD_OP_SYNC;
op->callback = [rop](cluster_op_t *op)
{
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);
return 1;
}
// pretend we just did an fsync
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = NFS3_OK };
rpc_queue_reply(rop);
return 0;
COMMIT3res *reply = (COMMIT3res*)rop->reply;
*reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
rpc_queue_reply(rop);
};
self->parent->cli->execute(op);
return 1;
}
static int mount3_mnt_proc(void *opaque, rpc_op_t *rop)

View File

@ -38,13 +38,12 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
this->config = msgr.read_config(config).object_items();
if (this->config.find("log_level") == this->config.end())
this->config["log_level"] = 1;
parse_config(this->config);
parse_config(this->config, true);
epmgr = new epoll_manager_t(ringloop);
// FIXME: Use timerfd_interval based directly on io_uring
this->tfd = epmgr->tfd;
// FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
auto bs_cfg = json_to_bs(this->config);
this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
{
@ -81,6 +80,7 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
msgr.ringloop = this->ringloop;
msgr.exec_op = [this](osd_op_t *op) { exec_op(op); };
msgr.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
msgr.check_config_hook = [this](osd_client_t *cl, json11::Json conf) { return check_peer_config(cl, conf); };
msgr.init();
init_cluster();
@ -98,23 +98,33 @@ osd_t::~osd_t()
free(zero_buffer);
}
void osd_t::parse_config(const json11::Json & config)
void osd_t::parse_config(const json11::Json & config, bool allow_disk_params)
{
st_cli.parse_config(config);
msgr.parse_config(config);
// OSD number
osd_num = config["osd_num"].uint64_value();
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
msgr.osd_num = osd_num;
// Vital Blockstore parameters
bs_block_size = config["block_size"].uint64_value();
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
if (allow_disk_params)
{
// OSD number
osd_num = config["osd_num"].uint64_value();
if (!osd_num)
throw std::runtime_error("osd_num is required in the configuration");
msgr.osd_num = osd_num;
// Vital Blockstore parameters
bs_block_size = config["block_size"].uint64_value();
if (!bs_block_size)
bs_block_size = DEFAULT_BLOCK_SIZE;
bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
if (!bs_bitmap_granularity)
bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
// immediate_commit
if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL;
else
immediate_commit = IMMEDIATE_NONE;
}
// Bind address
bind_address = config["bind_address"].string_value();
if (bind_address == "")
@ -132,12 +142,6 @@ void osd_t::parse_config(const json11::Json & config)
no_rebalance = json_is_true(config["no_rebalance"]);
no_recovery = json_is_true(config["no_recovery"]);
allow_test_ops = json_is_true(config["allow_test_ops"]);
if (config["immediate_commit"] == "all")
immediate_commit = IMMEDIATE_ALL;
else if (config["immediate_commit"] == "small")
immediate_commit = IMMEDIATE_SMALL;
else
immediate_commit = IMMEDIATE_NONE;
if (!config["autosync_interval"].is_null())
{
// Allow to set it to 0

View File

@ -29,10 +29,6 @@
#define OSD_FLUSHING_PGS 0x08
#define OSD_RECOVERING 0x10
#define IMMEDIATE_NONE 0
#define IMMEDIATE_SMALL 1
#define IMMEDIATE_ALL 2
#define MAX_AUTOSYNC_INTERVAL 3600
#define DEFAULT_AUTOSYNC_INTERVAL 5
#define DEFAULT_AUTOSYNC_WRITES 128
@ -172,7 +168,7 @@ class osd_t
uint64_t recovery_stat_bytes[2][2] = {};
// cluster connection
void parse_config(const json11::Json & config);
void parse_config(const json11::Json & config, bool allow_disk_params);
void init_cluster();
void on_change_osd_state_hook(osd_num_t peer_osd);
void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
@ -201,6 +197,7 @@ class osd_t
// peer handling (primary OSD logic)
void parse_test_peer(std::string peer);
void handle_peers();
bool check_peer_config(osd_client_t *cl, json11::Json conf);
void repeer_pgs(osd_num_t osd_num);
void start_pg_peering(pg_t & pg);
void submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);

View File

@ -108,6 +108,52 @@ void osd_t::parse_test_peer(std::string peer)
msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
}
bool osd_t::check_peer_config(osd_client_t *cl, json11::Json conf)
{
// Check block_size, bitmap_granularity and immediate_commit of the peer
if (conf["block_size"].is_null() ||
conf["bitmap_granularity"].is_null() ||
conf["immediate_commit"].is_null())
{
printf(
"[OSD %lu] Warning: peer OSD %lu does not report block_size/bitmap_granularity/immediate_commit."
" Is it older than 0.6.3?\n", this->osd_num, cl->osd_num
);
}
else
{
int peer_immediate_commit = (conf["immediate_commit"].string_value() == "all"
? IMMEDIATE_ALL : (conf["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE));
if (immediate_commit == IMMEDIATE_ALL && peer_immediate_commit != IMMEDIATE_ALL ||
immediate_commit == IMMEDIATE_SMALL && peer_immediate_commit == IMMEDIATE_NONE)
{
printf(
"[OSD %lu] My immediate_commit is \"%s\", but peer OSD %lu has \"%s\". We can't work together\n",
this->osd_num, immediate_commit == IMMEDIATE_ALL ? "all" : "small",
cl->osd_num, conf["immediate_commit"].string_value().c_str()
);
return true;
}
else if (conf["block_size"].uint64_value() != (uint64_t)this->bs_block_size)
{
printf(
"[OSD %lu] My block_size is %u, but peer OSD %lu has %lu. We can't work together\n",
this->osd_num, this->bs_block_size, cl->osd_num, conf["block_size"].uint64_value()
);
return true;
}
else if (conf["bitmap_granularity"].uint64_value() != (uint64_t)this->bs_bitmap_granularity)
{
printf(
"[OSD %lu] My bitmap_granularity is %u, but peer OSD %lu has %lu. We can't work together\n",
this->osd_num, this->bs_bitmap_granularity, cl->osd_num, conf["bitmap_granularity"].uint64_value()
);
return true;
}
}
return true;
}
json11::Json osd_t::get_osd_state()
{
std::vector<char> hostname;
@ -137,6 +183,7 @@ json11::Json osd_t::get_statistics()
sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000);
st["time"] = time_str;
st["blockstore_ready"] = bs->is_started();
st["data_block_size"] = (uint64_t)bs->get_block_size();
if (bs)
{
st["size"] = bs->get_block_count() * bs->get_block_size();
@ -365,7 +412,7 @@ void osd_t::on_load_config_hook(json11::Json::object & global_config)
for (auto & kv: global_config)
if (osd_config.find(kv.first) == osd_config.end())
osd_config[kv.first] = kv.second;
parse_config(osd_config);
parse_config(osd_config, false);
bind_socket();
acquire_lease();
}
@ -598,6 +645,7 @@ void osd_t::apply_pg_config()
bool all_applied = true;
for (auto & pool_item: st_cli.pool_config)
{
bool warned_block_size = false;
auto pool_id = pool_item.first;
for (auto & kv: pool_item.second.pg_config)
{
@ -607,6 +655,22 @@ void osd_t::apply_pg_config()
!pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
// Check pool block size and bitmap granularity
if (this->bs_block_size != pool_item.second.data_block_size ||
this->bs_bitmap_granularity != pool_item.second.bitmap_granularity)
{
if (!warned_block_size)
{
printf(
"[OSD %lu] My block_size and bitmap_granularity are %u/%u"
", but pool has %u/%u. Refusing to start PGs of this pool\n",
this->osd_num, bs_block_size, bs_bitmap_granularity,
pool_item.second.data_block_size, pool_item.second.bitmap_granularity
);
}
warned_block_size = true;
take = false;
}
if (currently_taken && !take)
{
// Stop this PG