Allow configuring block_size, bitmap_granularity and immediate_commit per pool

2022-08-09 02:27:02 +03:00 · 2022-08-09 02:27:02 +03:00 · 5a10d135f3
parent 4c9aaa8a86
commit 5a10d135f3
24 changed files with 535 additions and 340 deletions
--- a/docs/config/layout-cluster.en.md
+++ b/docs/config/layout-cluster.en.md
@@ -9,34 +9,34 @@
 These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
 initialization and can't be changed after it without losing data.

+OSDs with different values of these parameters (for example, SSD and SSD+HDD
+OSDs) can coexist in one Vitastor cluster within different pools. Each pool can
+only include OSDs with identical settings of these parameters.
+
+These parameters, when set to a non-default value, must also be specified in
+etcd for clients to be aware of their values, either in /vitastor/config/global
+or in pool configuration. Pool configuration overrides the global setting.
+If the value for a pool in etcd doesn't match on-disk OSD configuration, the
+OSD will refuse to start PGs of that pool.
+
 - [block_size](#block_size)
 - [bitmap_granularity](#bitmap_granularity)
 - [immediate_commit](#immediate_commit)
-- [client_dirty_limit](#client_dirty_limit)

 ## block_size

 - Type: integer
 - Default: 131072

-Size of objects (data blocks) into which all physical and virtual drives are
-subdivided in Vitastor. One of current main settings in Vitastor, affects
-memory usage, write amplification and I/O load distribution effectiveness.
+Size of objects (data blocks) into which all physical and virtual drives
+(within a pool) are subdivided in Vitastor. It is currently one of the main
+settings in Vitastor and affects memory usage, write amplification and the
+efficiency of I/O load distribution.

 Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
 it's possible to use 4 MB for SSD too - it will lower memory usage, but
 may increase average WA and reduce linear performance.

-OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
-currently coexist in one etcd instance only within separate Vitastor
-clusters with different etcd_prefix'es.
-
-Also block size can't be changed after OSD initialization without losing
-data.
-
-You must always specify block_size in etcd in /vitastor/config/global if
-you change it so all clients can know about it.
-
 OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
 544 MB per 1 TB of used disk space with the default 128 KB block size.
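
The memory estimate quoted above can be checked with a few lines of arithmetic. This is only a sketch of the documented formula (SIZE / BLOCK * 68 bytes); `estimateOsdMemoryBytes` is a hypothetical helper name, not part of Vitastor.

```javascript
// Rough OSD memory estimate from the documented formula: SIZE / BLOCK * 68 bytes.
// estimateOsdMemoryBytes is a hypothetical helper, not a Vitastor API.
function estimateOsdMemoryBytes(usedBytes, blockSize = 131072) {
    const BYTES_PER_OBJECT = 68; // per-object overhead quoted in the documentation
    return Math.ceil(usedBytes / blockSize) * BYTES_PER_OBJECT;
}

const TiB = 2 ** 40;
const MiB = 2 ** 20;
console.log(estimateOsdMemoryBytes(TiB) / MiB);          // 544 with the default 128 KB block
console.log(estimateOsdMemoryBytes(TiB, 4194304) / MiB); // 17 with a 4 MB block
```

The second call illustrates why a 4 MB block lowers memory usage: the object count per terabyte drops by a factor of 32.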

@@ -50,12 +50,7 @@ of disk_alignment. It's called bitmap granularity because Vitastor tracks
 an allocation bitmap for each object containing 2 bits per each
 (bitmap_granularity) bytes.

-This parameter can't be changed after OSD initialization without losing
-data. Also it's fixed for the whole Vitastor cluster i.e. two different
-values can't be used in a single Vitastor cluster.
-
-Clients MUST be aware of this parameter value, so put it into etcd key
-/vitastor/config/global if you change it for any reason.
+Can't be smaller than the sector size of the OSD data device.

 ## immediate_commit

@@ -99,26 +94,12 @@ unsafe to change by hand). The same may apply to newer HDDs with internal
 SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
 it (they have internal SSD cache even though it's not stated in datasheets).

-This parameter must be set both in etcd in /vitastor/config/global and in
-OSD command line or configuration. Setting it to "all" or "small" requires
-enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
-also requires enabling disable_data_fsync.
+Setting this parameter to "all" or "small" in OSD parameters requires enabling
+disable_journal_fsync and disable_meta_fsync; setting it to "all" also requires
+enabling disable_data_fsync.

 TLDR: For optimal performance, set immediate_commit to "all" if you only use
 SSDs with supercapacitor-based power loss protection (nonvolatile
 write-through cache) for both data and journals in the whole Vitastor
 cluster. Set it to "small" if you only use such SSDs for journals. Leave
 empty if your drives have write-back cache.
-
-## client_dirty_limit
-
-- Type: integer
-- Default: 33554432
-
-Without immediate_commit=all this parameter sets the limit of "dirty"
-(not committed by fsync) data allowed by the client before forcing an
-additional fsync and committing the data. Also note that the client always
-holds a copy of uncommitted data in memory so this setting also affects
-RAM usage of clients.
-
-This parameter doesn't affect OSDs themselves.
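
The override rule introduced in this file (pool configuration takes precedence over /vitastor/config/global) could be sketched as follows. The helper names `effectiveLayout` and `osdMatchesPool` are hypothetical, and falling back to the documented defaults when neither level sets a value is an assumption of this sketch, not something taken from the Vitastor sources.

```javascript
// Documented defaults; used here when neither level sets a value (assumption).
const DEFAULTS = { block_size: 131072, bitmap_granularity: 4096, immediate_commit: 'none' };

// Hypothetical helper: pool configuration overrides /vitastor/config/global,
// which in turn overrides the defaults.
function effectiveLayout(globalConfig, poolConfig) {
    const result = {};
    for (const key of Object.keys(DEFAULTS)) {
        result[key] = poolConfig[key] ?? globalConfig[key] ?? DEFAULTS[key];
    }
    return result;
}

// Hypothetical check in the spirit of "the OSD will refuse to start PGs of that pool":
// block_size and bitmap_granularity have to be identical on both sides.
function osdMatchesPool(osdOnDisk, layout) {
    return osdOnDisk.block_size === layout.block_size &&
        osdOnDisk.bitmap_granularity === layout.bitmap_granularity;
}

const layout = effectiveLayout({ immediate_commit: 'small' }, { block_size: 4194304 });
console.log(layout);
console.log(osdMatchesPool({ block_size: 131072, bitmap_granularity: 4096 }, layout)); // false
```

Here an HDD pool sets only block_size, so it inherits immediate_commit from the global key, and an OSD initialized with 128 KB blocks does not match it.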
--- a/docs/config/layout-cluster.ru.md
+++ b/docs/config/layout-cluster.ru.md
@@ -9,10 +9,19 @@
 Данные параметры используются клиентами и OSD, задаются в момент инициализации
 диска OSD и не могут быть изменены после этого без потери данных.

+OSD с разными значениями данных параметров (например, SSD и гибридные SSD+HDD
+OSD) могут сосуществовать в одном кластере Vitastor в разных пулах. Один пул
+может включать только OSD с одинаковыми настройками этих параметров.
+
+Данные параметры, отличаясь от значения по умолчанию, должны также быть заданы
+в etcd, чтобы клиенты могли узнать их значение, либо в глобальной конфигурации
+/vitastor/config/global, либо в настройках пулов. Настройки пула переопределяют
+глобальное значение. Если значение в настройках пула не будет соответствовать
+конфигурации OSD, OSD откажется запускать PG данного пула.
+
 - [block_size](#block_size)
 - [bitmap_granularity](#bitmap_granularity)
 - [immediate_commit](#immediate_commit)
-- [client_dirty_limit](#client_dirty_limit)

 ## block_size

@@ -20,24 +29,15 @@
 - Значение по умолчанию: 131072

 Размер объектов (блоков данных), на которые делятся физические и виртуальные
-диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
-потребление памяти, объём избыточной записи (write amplification) и
-эффективность распределения нагрузки по OSD.
+диски в Vitastor (в рамках каждого пула). Одна из ключевых на данный момент
+настроек, влияет на потребление памяти, объём избыточной записи (write
+amplification) и эффективность распределения нагрузки по OSD.

 Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
 мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
 это понизит использование памяти, но ухудшит распределение нагрузки и в
 среднем увеличит WA.

-OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
-момент могут сосуществовать в рамках одного etcd только в виде двух независимых
-кластеров Vitastor с разными etcd_prefix.
-
-Также размер блока нельзя менять после инициализации OSD без потери данных.
-
-Если вы меняете размер блока, обязательно прописывайте его в etcd в
-/vitastor/config/global, дабы все клиенты его знали.
-
 Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
 т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
 стандартном 128 КБ блоке.
@@ -52,13 +52,7 @@ OSD с разными размерами блока (например, SSD и SS
 потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
 по 2 бита на каждые (bitmap_granularity) байт.

-Данный параметр нельзя менять после инициализации OSD без потери данных.
-Также он фиксирован для всего кластера Vitastor, т.е. разные значения
-не могут сосуществовать в одном кластере.
-
-Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
-его меняете, обязательно прописывайте изменённое значение в etcd в ключ
-/vitastor/config/global.
+Не может быть меньше размера сектора дисков данных OSD.

 ## immediate_commit

@@ -108,8 +102,7 @@ HDD-дисках с внутренним SSD или "медиа" кэшем - н
 многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
 указано в спецификациях).

-Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
-командной строке или конфигурации OSD. Значения "all" и "small" требуют
+Указание "all" или "small" в настройках / командной строке OSD требует
 включения disable_journal_fsync и disable_meta_fsync, значение "all" также
 требует включения disable_data_fsync.

@@ -119,16 +112,3 @@ immediate_commit в значение "all", если вы используете
 такие SSD для всех журналов, но не для данных - можете установить параметр
 в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
 оставьте параметр пустым.
-
-## client_dirty_limit
-
-- Тип: целое число
-- Значение по умолчанию: 33554432
-
-При работе без immediate_commit=all - это лимит объёма "грязных" (не
-зафиксированных fsync-ом) данных, при достижении которого клиент будет
-принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
-что в этом случае до момента fsync клиент хранит копию незафиксированных
-данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
-
-Параметр не влияет на сами OSD.
--- a/docs/config/network.en.md
+++ b/docs/config/network.en.md
@@ -29,6 +29,7 @@ between clients, OSDs and etcd.
 - [etcd_slow_timeout](#etcd_slow_timeout)
 - [etcd_keepalive_timeout](#etcd_keepalive_timeout)
 - [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
+- [client_dirty_limit](#client_dirty_limit)

 ## tcp_header_buffer_size

@@ -212,3 +213,16 @@ etcd_report_interval to guarantee that keepalive actually works.

 etcd websocket ping interval required to keep the connection alive and
 detect disconnections quickly.
+
+## client_dirty_limit
+
+- Type: integer
+- Default: 33554432
+
+Without immediate_commit=all this parameter sets the limit of "dirty"
+(not committed by fsync) data allowed by the client before forcing an
+additional fsync and committing the data. Also note that the client always
+holds a copy of uncommitted data in memory so this setting also affects
+RAM usage of clients.
+
+This parameter doesn't affect OSDs themselves.
--- a/docs/config/network.ru.md
+++ b/docs/config/network.ru.md
@@ -29,6 +29,7 @@
 - [etcd_slow_timeout](#etcd_slow_timeout)
 - [etcd_keepalive_timeout](#etcd_keepalive_timeout)
 - [etcd_ws_keepalive_timeout](#etcd_ws_keepalive_timeout)
+- [client_dirty_limit](#client_dirty_limit)

 ## tcp_header_buffer_size

@@ -222,3 +223,16 @@ etcd_report_interval, чтобы keepalive гарантированно рабо
 - Значение по умолчанию: 30

 Интервал проверки живости вебсокет-подключений к etcd.
+
+## client_dirty_limit
+
+- Тип: целое число
+- Значение по умолчанию: 33554432
+
+При работе без immediate_commit=all - это лимит объёма "грязных" (не
+зафиксированных fsync-ом) данных, при достижении которого клиент будет
+принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+что в этом случае до момента fsync клиент хранит копию незафиксированных
+данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+
+Параметр не влияет на сами OSD.
--- a/docs/config/pool.en.md
+++ b/docs/config/pool.en.md
@@ -33,6 +33,9 @@ Parameters:
 - [pg_count](#pg_count)
 - [failure_domain](#failure_domain)
 - [max_osd_combinations](#max_osd_combinations)
+- [block_size](#block_size)
+- [bitmap_granularity](#bitmap_granularity)
+- [immediate_commit](#immediate_commit)
 - [pg_stripe_size](#pg_stripe_size)
 - [root_node](#root_node)
 - [osd_tags](#osd_tags)
@@ -186,6 +189,43 @@ number of combinations to generate when optimising PG placement.

 This parameter usually doesn't require to be changed.

+## block_size
+
+- Type: integer
+- Default: 131072
+
+Block size for this pool. The value from /vitastor/config/global is used when
+unspecified. If your cluster has OSDs with different block sizes, the pool must
+be restricted by [osd_tags](#osd_tags) to only include OSDs with a matching
+block size.
+
+Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#block_size).
+
+## bitmap_granularity
+
+- Type: integer
+- Default: 4096
+
+"Sector" size of virtual disks in this pool. The value from
+/vitastor/config/global is used when unspecified. Similar to block_size, the
+pool must be restricted by [osd_tags](#osd_tags) to only include OSDs with
+matching bitmap_granularity.
+
+Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#bitmap_granularity).
+
+## immediate_commit
+
+- Type: string, one of "all", "small" and "none"
+- Default: none
+
+Immediate commit setting for this pool. The value from /vitastor/config/global
+is used when unspecified. Similar to block_size, the pool must be restricted by
+[osd_tags](#osd_tags) to only include OSDs with compatible immediate_commit.
+Compatible means that a pool with non-immediate commit will work with OSDs with
+immediate commit enabled, but not vice versa.
+
+Read more about this parameter in [Cluster-Wide Disk Layout Parameters](layout-cluster.en.md#immediate_commit).
+
 ## pg_stripe_size

 - Type: integer
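
The compatibility rule for immediate_commit described above ("a pool with non-immediate commit will work with OSDs with immediate commit enabled, but not vice versa") can be sketched as an ordering check. The ordering none < small < all and the helper name `immediateCommitCompatible` are assumptions of this sketch; the actual OSD check is not shown in this diff.

```javascript
// Assumed ordering of immediate_commit levels, weakest to strongest.
const COMMIT_LEVELS = { none: 0, small: 1, all: 2 };

// Hypothetical helper: an OSD is usable by a pool if it commits at least as
// eagerly as the pool expects.
function immediateCommitCompatible(poolValue, osdValue) {
    const pool = COMMIT_LEVELS[poolValue ?? 'none'];
    const osd = COMMIT_LEVELS[osdValue ?? 'none'];
    if (pool === undefined || osd === undefined) {
        throw new Error('immediate_commit must be "all", "small" or "none"');
    }
    return osd >= pool;
}

console.log(immediateCommitCompatible('none', 'all'));   // true
console.log(immediateCommitCompatible('all', 'small'));  // false
```

Under this reading, a pool left at "none" can be placed on any OSD, while a pool set to "all" needs OSDs that were also configured with "all".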
--- a/docs/config/pool.ru.md
+++ b/docs/config/pool.ru.md
@@ -32,6 +32,9 @@
 - [pg_count](#pg_count)
 - [failure_domain](#failure_domain)
 - [max_osd_combinations](#max_osd_combinations)
+- [block_size](#block_size)
+- [bitmap_granularity](#bitmap_granularity)
+- [immediate_commit](#immediate_commit)
 - [pg_stripe_size](#pg_stripe_size)
 - [root_node](#root_node)
 - [osd_tags](#osd_tags)
@@ -185,13 +188,51 @@ PG в Vitastor эфемерны, то есть вы можете менят

 Обычно данный параметр не требует изменений.

+## block_size
+
+- Тип: целое число
+- По умолчанию: 131072
+
+Размер блока для данного пула. Если не задан, используется значение из
+/vitastor/config/global. Если в вашем кластере есть OSD с разными размерами
+блока, пул должен быть ограничен только OSD, блок которых равен блоку пула,
+с помощью [osd_tags](#osd_tags).
+
+О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#block_size).
+
+## bitmap_granularity
+
+- Тип: целое число
+- По умолчанию: 4096
+
+Размер "сектора" виртуальных дисков в данном пуле. Если не задан, используется
+значение из /vitastor/config/global. Аналогично block_size, пул должен быть
+ограничен OSD со значением bitmap_granularity, равным значению пула, с помощью
+[osd_tags](#osd_tags).
+
+О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#bitmap_granularity).
+
+## immediate_commit
+
+- Тип: строка "all", "small" или "none"
+- По умолчанию: none
+
+Настройка мгновенного коммита для данного пула. Если не задана, используется
+значение из /vitastor/config/global. Аналогично block_size, пул должен быть
+ограничен OSD со значением immediate_commit, совместимым со значением пула, с
+помощью [osd_tags](#osd_tags). Совместимость означает, что пул с отключенным
+мгновенным коммитом может работать на OSD с включённым мгновенным коммитом, но
+не наоборот.
+
+О самом параметре читайте в разделе [Дисковые параметры уровня кластера](layout-cluster.ru.md#immediate_commit).
+
 ## pg_stripe_size

 - Тип: целое число
 - По умолчанию: 0

 Данный параметр задаёт размер полосы "нарезки" образов на PG. Размер полосы не может
-быть меньше, чем [block_size](layout-cluster.ru.md#block_size), умноженный на
+быть меньше, чем [block_size](#block_size), умноженный на
 (pg_size - parity_chunks) для EC-пулов или 1 для реплицированных пулов. То же
 значение используется по умолчанию.

--- a/docs/config/src/layout-cluster.en.md
+++ b/docs/config/src/layout-cluster.en.md
@@ -2,3 +2,13 @@

 These parameters apply to clients and OSDs, are fixed at the moment of OSD drive
 initialization and can't be changed after it without losing data.
+
+OSDs with different values of these parameters (for example, SSD and SSD+HDD
+OSDs) can coexist in one Vitastor cluster within different pools. Each pool can
+only include OSDs with identical settings of these parameters.
+
+These parameters, when set to a non-default value, must also be specified in
+etcd for clients to be aware of their values, either in /vitastor/config/global
+or in pool configuration. Pool configuration overrides the global setting.
+If the value for a pool in etcd doesn't match on-disk OSD configuration, the
+OSD will refuse to start PGs of that pool.
--- a/docs/config/src/layout-cluster.ru.md
+++ b/docs/config/src/layout-cluster.ru.md
@@ -2,3 +2,13 @@

 Данные параметры используются клиентами и OSD, задаются в момент инициализации
 диска OSD и не могут быть изменены после этого без потери данных.
+
+OSD с разными значениями данных параметров (например, SSD и гибридные SSD+HDD
+OSD) могут сосуществовать в одном кластере Vitastor в разных пулах. Один пул
+может включать только OSD с одинаковыми настройками этих параметров.
+
+Данные параметры, отличаясь от значения по умолчанию, должны также быть заданы
+в etcd, чтобы клиенты могли узнать их значение, либо в глобальной конфигурации
+/vitastor/config/global, либо в настройках пулов. Настройки пула переопределяют
+глобальное значение. Если значение в настройках пула не будет соответствовать
+конфигурации OSD, OSD откажется запускать PG данного пула.
--- a/docs/config/src/layout-cluster.yml
+++ b/docs/config/src/layout-cluster.yml
@@ -2,46 +2,28 @@
  type: int
  default: 131072
  info: |
-    Size of objects (data blocks) into which all physical and virtual drives are
-    subdivided in Vitastor. One of current main settings in Vitastor, affects
-    memory usage, write amplification and I/O load distribution effectiveness.
+    Size of objects (data blocks) into which all physical and virtual drives
+    (within a pool) are subdivided in Vitastor. It is currently one of the main
+    settings in Vitastor and affects memory usage, write amplification and the
+    efficiency of I/O load distribution.

    Recommended default block size is 128 KB for SSD and 4 MB for HDD. In fact,
    it's possible to use 4 MB for SSD too - it will lower memory usage, but
    may increase average WA and reduce linear performance.

-    OSDs with different block sizes (for example, SSD and SSD+HDD OSDs) can
-    currently coexist in one etcd instance only within separate Vitastor
-    clusters with different etcd_prefix'es.
-
-    Also block size can't be changed after OSD initialization without losing
-    data.
-
-    You must always specify block_size in etcd in /vitastor/config/global if
-    you change it so all clients can know about it.
-
    OSD memory usage is roughly (SIZE / BLOCK * 68 bytes) which is roughly
    544 MB per 1 TB of used disk space with the default 128 KB block size.
  info_ru: |
    Размер объектов (блоков данных), на которые делятся физические и виртуальные
-    диски в Vitastor. Одна из ключевых на данный момент настроек, влияет на
-    потребление памяти, объём избыточной записи (write amplification) и
-    эффективность распределения нагрузки по OSD.
+    диски в Vitastor (в рамках каждого пула). Одна из ключевых на данный момент
+    настроек, влияет на потребление памяти, объём избыточной записи (write
+    amplification) и эффективность распределения нагрузки по OSD.

    Рекомендуемые по умолчанию размеры блока - 128 килобайт для SSD и 4
    мегабайта для HDD. В принципе, для SSD можно тоже использовать 4 мегабайта,
    это понизит использование памяти, но ухудшит распределение нагрузки и в
    среднем увеличит WA.

-    OSD с разными размерами блока (например, SSD и SSD+HDD OSD) на данный
-    момент могут сосуществовать в рамках одного etcd только в виде двух независимых
-    кластеров Vitastor с разными etcd_prefix.
-
-    Также размер блока нельзя менять после инициализации OSD без потери данных.
-
-    Если вы меняете размер блока, обязательно прописывайте его в etcd в
-    /vitastor/config/global, дабы все клиенты его знали.
-
    Потребление памяти OSD составляет примерно (РАЗМЕР / БЛОК * 68 байт),
    т.е. примерно 544 МБ памяти на 1 ТБ занятого места на диске при
    стандартном 128 КБ блоке.
@@ -54,25 +36,14 @@
    an allocation bitmap for each object containing 2 bits per each
    (bitmap_granularity) bytes.

-    This parameter can't be changed after OSD initialization without losing
-    data. Also it's fixed for the whole Vitastor cluster i.e. two different
-    values can't be used in a single Vitastor cluster.
-
-    Clients MUST be aware of this parameter value, so put it into etcd key
-    /vitastor/config/global if you change it for any reason.
+    Can't be smaller than the sector size of the OSD data device.
  info_ru: |
    Требуемое выравнивание записи на виртуальные диски (размер их "сектора").
    Должен быть кратен disk_alignment. Называется гранулярностью битовой карты
    потому, что Vitastor хранит битовую карту для каждого объекта, содержащую
    по 2 бита на каждые (bitmap_granularity) байт.

-    Данный параметр нельзя менять после инициализации OSD без потери данных.
-    Также он фиксирован для всего кластера Vitastor, т.е. разные значения
-    не могут сосуществовать в одном кластере.
-
-    Клиенты ДОЛЖНЫ знать правильное значение этого параметра, так что если вы
-    его меняете, обязательно прописывайте изменённое значение в etcd в ключ
-    /vitastor/config/global.
+    Не может быть меньше размера сектора дисков данных OSD.
 - name: immediate_commit
  type: string
  default: false
@@ -114,10 +85,9 @@
    SSD cache or "media-cache" - for example, a lot of Seagate EXOS drives have
    it (they have internal SSD cache even though it's not stated in datasheets).

-    This parameter must be set both in etcd in /vitastor/config/global and in
-    OSD command line or configuration. Setting it to "all" or "small" requires
-    enabling disable_journal_fsync and disable_meta_fsync, setting it to "all"
-    also requires enabling disable_data_fsync.
+    Setting this parameter to "all" or "small" in OSD parameters requires enabling
+    disable_journal_fsync and disable_meta_fsync; setting it to "all" also requires
+    enabling disable_data_fsync.

    TLDR: For optimal performance, set immediate_commit to "all" if you only use
    SSDs with supercapacitor-based power loss protection (nonvolatile
@@ -168,8 +138,7 @@
    многих дисках Seagate EXOS (у них есть внутренний SSD-кэш, хотя это и не
    указано в спецификациях).

-    Данный параметр нужно указывать и в etcd в /vitastor/config/global, и в
-    командной строке или конфигурации OSD. Значения "all" и "small" требуют
+    Указание "all" или "small" в настройках / командной строке OSD требует
    включения disable_journal_fsync и disable_meta_fsync, значение "all" также
    требует включения disable_data_fsync.

@@ -179,22 +148,3 @@
    такие SSD для всех журналов, но не для данных - можете установить параметр
    в "small". Если и какие-то из дисков журналов имеют волатильный кэш записи -
    оставьте параметр пустым.
-- name: client_dirty_limit
-  type: int
-  default: 33554432
-  info: |
-    Without immediate_commit=all this parameter sets the limit of "dirty"
-    (not committed by fsync) data allowed by the client before forcing an
-    additional fsync and committing the data. Also note that the client always
-    holds a copy of uncommitted data in memory so this setting also affects
-    RAM usage of clients.
-
-    This parameter doesn't affect OSDs themselves.
-  info_ru: |
-    При работе без immediate_commit=all - это лимит объёма "грязных" (не
-    зафиксированных fsync-ом) данных, при достижении которого клиент будет
-    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
-    что в этом случае до момента fsync клиент хранит копию незафиксированных
-    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
-
-    Параметр не влияет на сами OSD.
--- a/docs/config/src/network.yml
+++ b/docs/config/src/network.yml
@@ -223,3 +223,22 @@
    detect disconnections quickly.
  info_ru: |
    Интервал проверки живости вебсокет-подключений к etcd.
+- name: client_dirty_limit
+  type: int
+  default: 33554432
+  info: |
+    Without immediate_commit=all this parameter sets the limit of "dirty"
+    (not committed by fsync) data allowed by the client before forcing an
+    additional fsync and committing the data. Also note that the client always
+    holds a copy of uncommitted data in memory so this setting also affects
+    RAM usage of clients.
+
+    This parameter doesn't affect OSDs themselves.
+  info_ru: |
+    При работе без immediate_commit=all - это лимит объёма "грязных" (не
+    зафиксированных fsync-ом) данных, при достижении которого клиент будет
+    принудительно вызывать fsync и фиксировать данные. Также стоит иметь в виду,
+    что в этом случае до момента fsync клиент хранит копию незафиксированных
+    данных в памяти, то есть, настройка влияет на потребление памяти клиентами.
+
+    Параметр не влияет на сами OSD.
--- a/docs/intro/architecture.ru.md
+++ b/docs/intro/architecture.ru.md
@@ -127,7 +127,7 @@
  запросы записи клиенты копируют в памяти и при потере соединения и повторном соединении
  с OSD повторяют из памяти. Скопированные в память данные удаляются при успешном fsync,
  а чтобы хранение этих данных не приводило к чрезмерному потреблению памяти, клиенты
-  автоматически выполняют fsync каждые [client_dirty_limit](../config/layout-cluster.ru.md#client_dirty_limit)
+  автоматически выполняют fsync каждые [client_dirty_limit](../config/network.ru.md#client_dirty_limit)
  записанных байт.

 ## Схожесть с Ceph
--- a/mon/mon.js
+++ b/mon/mon.js
@@ -157,7 +157,12 @@ const etcd_tree = {
                pg_count: 100,
                failure_domain: 'host',
                max_osd_combinations: 10000,
-                pg_stripe_size: 4194304,
+                // block_size, bitmap_granularity, immediate_commit must match all OSDs used in that pool
+                data_block_size: 131072,
+                bitmap_granularity: 4096,
+                // 'all'/'small'/'none', same as in OSD options
+                immediate_commit: 'none',
+                pg_stripe_size: 0,
                root_node?: 'rack1',
                // restrict pool to OSDs having all of these tags
                osd_tags?: 'nvme' | [ 'nvme', ... ],
@@ -323,6 +328,13 @@ const etcd_tree = {
            misplaced: uint64_t,
            degraded: uint64_t,
            incomplete: uint64_t,
+        },
+        object_bytes: {
+            total: uint64_t,
+            clean: uint64_t,
+            misplaced: uint64_t,
+            degraded: uint64_t,
+            incomplete: uint64_t,
        }, */
    },
    history: {
@@ -1438,8 +1450,23 @@ class Mon
    sum_object_counts()
    {
        const object_counts = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
+        const object_bytes = { object: 0n, clean: 0n, misplaced: 0n, degraded: 0n, incomplete: 0n };
        for (const pool_id in this.state.pg.stats)
        {
+            let object_size = 0;
+            for (const osd_num of this.state.pg.stats[pool_id].write_osd_set||[])
+            {
+                if (osd_num && this.state.osd.stats[osd_num] && this.state.osd.stats[osd_num].block_size)
+                {
+                    object_size = this.state.osd.stats[osd_num].block_size;
+                    break;
+                }
+            }
+            if (!object_size)
+            {
+                object_size = this.config['block_size'];
+            }
+            object_size = BigInt(object_size);
            for (const pg_num in this.state.pg.stats[pool_id])
            {
                const st = this.state.pg.stats[pool_id][pg_num];
@@ -1450,12 +1477,13 @@ class Mon
                        if (st[k+'_count'])
                        {
                            object_counts[k] += BigInt(st[k+'_count']);
+                            object_bytes[k] += BigInt(st[k+'_count']) * object_size;
                        }
                    }
                }
            }
        }
-        return object_counts;
+        return { object_counts, object_bytes };
    }

    sum_inode_stats(prev_stats, timestamp, prev_timestamp)
@@ -1568,7 +1596,7 @@ class Mon
    {
        const txn = [];
        const timestamp = Date.now();
-        const object_counts = this.sum_object_counts();
+        const { object_counts, object_bytes } = this.sum_object_counts();
        let stats = this.sum_op_stats(timestamp, this.prev_stats);
        let inode_stats = this.sum_inode_stats(
            this.prev_stats ? this.prev_stats.inode_stats : null,
@@ -1576,6 +1604,7 @@ class Mon
        );
        this.prev_stats = { timestamp, ...stats, inode_stats };
        stats.object_counts = object_counts;
+        stats.object_bytes = object_bytes;
        stats = this.serialize_bigints(stats);
        inode_stats = this.serialize_bigints(inode_stats);
        txn.push({ requestPut: { key: b64(this.etcd_prefix+'/stats'), value: b64(JSON.stringify(stats)) } });
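
The byte accounting added to the monitor above multiplies per-PG object counts by the block size of the pool they belong to. A simplified, self-contained sketch of that idea follows. It takes a ready-made map of pool block sizes instead of deriving them from OSD statistics, which is an assumption made to keep the example short; `sumObjectBytes` and its parameters are hypothetical names.

```javascript
// Simplified sketch: per-pool object counts multiplied by a per-pool block size.
// BigInt is used because byte totals can exceed Number.MAX_SAFE_INTEGER.
function sumObjectBytes(pgStats, poolBlockSizes, defaultBlockSize) {
    const states = ['object', 'clean', 'misplaced', 'degraded', 'incomplete'];
    const bytes = Object.fromEntries(states.map(k => [k, 0n]));
    for (const poolId in pgStats) {
        // Fall back to the global block size when the pool's size is unknown
        const objectSize = BigInt(poolBlockSizes[poolId] || defaultBlockSize);
        for (const pgNum in pgStats[poolId]) {
            const st = pgStats[poolId][pgNum];
            for (const k of states) {
                if (st[k + '_count']) {
                    bytes[k] += BigInt(st[k + '_count']) * objectSize;
                }
            }
        }
    }
    return bytes;
}

const stats = {
    1: { 1: { object_count: 10, clean_count: 8, degraded_count: 2 } },
    2: { 1: { object_count: 1, clean_count: 1 } },
};
console.log(sumObjectBytes(stats, { 1: 131072, 2: 4194304 }, 131072));
```

With mixed block sizes, summing raw object counts and multiplying by a single block size would misreport the totals, which is presumably why the byte sums are computed per pool.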
--- a/src/cli_merge.cpp
+++ b/src/cli_merge.cpp
@@ -47,6 +47,7 @@ struct snap_merger_t
    int state = 0;
    int lists_todo = 0;
    uint64_t target_block_size = 0;
+    uint32_t target_bitmap_granularity = 0;
    btree::safe_btree_set<uint64_t> merge_offsets;
    btree::safe_btree_set<uint64_t>::iterator oit;
    std::map<inode_t, std::vector<uint64_t>> layer_lists;
@@ -101,7 +102,7 @@ struct snap_merger_t
        std::vector<inode_t> chain_list;
        inode_config_t *cur = to_cfg;
        chain_list.push_back(cur->num);
-        layer_block_size[cur->num] = get_block_size(cur->num);
+        layer_block_size[cur->num] = get_block_size(cur->num, NULL);
        while (cur->parent_id != from_cfg->num &&
            cur->parent_id != to_cfg->num &&
            cur->parent_id != 0)
@@ -124,7 +125,7 @@ struct snap_merger_t
            }
            cur = &it->second;
            chain_list.push_back(cur->num);
-            layer_block_size[cur->num] = get_block_size(cur->num);
+            layer_block_size[cur->num] = get_block_size(cur->num, NULL);
        }
        if (cur->parent_id != from_cfg->num)
        {
@@ -133,7 +134,7 @@ struct snap_merger_t
            return;
        }
        chain_list.push_back(from_cfg->num);
-        layer_block_size[from_cfg->num] = get_block_size(from_cfg->num);
+        layer_block_size[from_cfg->num] = get_block_size(from_cfg->num, NULL);
        int i = chain_list.size()-1;
        for (inode_t item: chain_list)
        {
@@ -204,14 +205,16 @@ struct snap_merger_t
                use_cas ? " online (with CAS)" : "", INODE_NO_POOL(target), INODE_POOL(target)
            );
        }
-        target_block_size = get_block_size(target);
+        target_block_size = get_block_size(target, &target_bitmap_granularity);
    }

-    uint64_t get_block_size(inode_t inode)
+    uint64_t get_block_size(inode_t inode, uint32_t *bitmap_granularity)
    {
        auto & pool_cfg = parent->cli->st_cli.pool_config.at(INODE_POOL(inode));
        uint64_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
-        return parent->cli->get_bs_block_size() * pg_data_size;
+        if (bitmap_granularity)
+            *bitmap_granularity = pool_cfg.bitmap_granularity;
+        return pool_cfg.data_block_size * pg_data_size;
    }

    void continue_merge_reent()
@@ -409,7 +412,7 @@ struct snap_merger_t
            }
            else
            {
-                uint64_t bitmap_bytes = target_block_size/parent->cli->get_bs_bitmap_granularity()/8;
+                uint64_t bitmap_bytes = target_block_size/target_bitmap_granularity/8;
                int i;
                for (i = 0; i < bitmap_bytes; i++)
                {
@@ -469,7 +472,7 @@ struct snap_merger_t
    {
        // Write each non-empty range using an individual operation
        // FIXME: Allow to use single write with "holes" (OSDs don't allow it yet)
-        uint32_t gran = parent->cli->get_bs_bitmap_granularity();
+        uint32_t gran = target_bitmap_granularity;
        uint64_t bitmap_size = target_block_size / gran;
        while (rwo->end < bitmap_size && !rwo->error_code)
        {
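
The merge code above now takes both the block size and the bitmap granularity from the pool configuration. The arithmetic it relies on can be sketched as below; `mergeBlockLayout` is a hypothetical name, and the field names simply mirror the `pool_cfg` members visible in the diff.

```javascript
// Sketch of the per-pool arithmetic used by the snapshot merger:
// logical block = data_block_size * number of data chunks in a PG,
// bitmap = one bit per bitmap_granularity bytes of that block.
function mergeBlockLayout(poolCfg) {
    const pgDataSize = poolCfg.scheme === 'replicated'
        ? 1
        : poolCfg.pg_size - poolCfg.parity_chunks;
    const blockSize = poolCfg.data_block_size * pgDataSize;
    const granules = blockSize / poolCfg.bitmap_granularity;
    return { blockSize, granules, bitmapBytes: granules / 8 };
}

// EC 2+1 pool with the default 128 KB block and 4 KB granularity
console.log(mergeBlockLayout({
    scheme: 'ec', pg_size: 3, parity_chunks: 1,
    data_block_size: 131072, bitmap_granularity: 4096,
})); // { blockSize: 262144, granules: 64, bitmapBytes: 8 }
```

For a replicated pool with the same settings the block stays at 131072 bytes, giving 32 granules and a 4-byte bitmap.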
--- a/src/cli_status.cpp
+++ b/src/cli_status.cpp
@@ -7,6 +7,8 @@
 #include "pg_states.h"
 #include "http_client.h"

+static const char *obj_states[] = { "clean", "misplaced", "degraded", "incomplete" };
+
 // Print cluster status:
 // etcd, mon, osd states
 // raw/used space, object states, pool states, pg states
@ -196,21 +198,57 @@ resume_2:
            }
            pgs_by_state_str += std::to_string(kv.second)+" "+kv.first;
        }
-        uint64_t object_size = parent->cli->get_bs_block_size();
-        std::string more_states;
-        uint64_t obj_n;
-        obj_n = agg_stats["object_counts"]["misplaced"].uint64_value();
-        if (obj_n > 0)
-            more_states += ", "+format_size(obj_n*object_size)+" misplaced";
-        obj_n = agg_stats["object_counts"]["degraded"].uint64_value();
-        if (obj_n > 0)
-            more_states += ", "+format_size(obj_n*object_size)+" degraded";
-        obj_n = agg_stats["object_counts"]["incomplete"].uint64_value();
-        if (obj_n > 0)
-            more_states += ", "+format_size(obj_n*object_size)+" incomplete";
        bool readonly = json_is_true(parent->cli->merged_config["readonly"]);
        bool no_recovery = json_is_true(parent->cli->merged_config["no_recovery"]);
        bool no_rebalance = json_is_true(parent->cli->merged_config["no_rebalance"]);
+        if (parent->json_output)
+        {
+            // JSON output
+            auto json_status = json11::Json::object {
+                { "etcd_alive", etcd_alive },
+                { "etcd_count", (uint64_t)etcd_states.size() },
+                { "etcd_db_size", etcd_db_size },
+                { "mon_count", mon_count },
+                { "mon_master", mon_master },
+                { "osd_up", osd_up },
+                { "osd_count", osd_count },
+                { "total_raw", total_raw },
+                { "free_raw", free_raw },
+                { "down_raw", down_raw },
+                { "free_down_raw", free_down_raw },
+                { "readonly", readonly },
+                { "no_recovery", no_recovery },
+                { "no_rebalance", no_rebalance },
+                { "pool_count", pool_count },
+                { "active_pool_count", pools_active },
+                { "pg_states", pgs_by_state },
+                { "op_stats", agg_stats["op_stats"] },
+                { "recovery_stats", agg_stats["recovery_stats"] },
+                { "object_counts", agg_stats["object_counts"] },
+            };
+            for (int i = 0; i < sizeof(obj_states)/sizeof(obj_states[0]); i++)
+            {
+                std::string str(obj_states[i]);
+                uint64_t obj_n = agg_stats["object_bytes"][str].uint64_value();
+                if (!obj_n)
+                    obj_n = agg_stats["object_counts"][str].uint64_value() * parent->cli->st_cli.global_block_size;
+                json_status[str+"_data"] = obj_n;
+            }
+            printf("%s\n", json11::Json(json_status).dump().c_str());
+            state = 100;
+            return;
+        }
+        std::string more_states;
+        for (int i = 0; i < sizeof(obj_states)/sizeof(obj_states[0]); i++)
+        {
+            std::string str(obj_states[i]);
+            uint64_t obj_n = agg_stats["object_bytes"][str].uint64_value();
+            if (!obj_n)
+                obj_n = agg_stats["object_counts"][str].uint64_value() * parent->cli->st_cli.global_block_size;
+            if (!i || obj_n > 0)
+                more_states += format_size(obj_n)+" "+str+", ";
+        }
+        more_states.resize(more_states.size()-2);
        std::string recovery_io;
        {
            uint64_t deg_bps = agg_stats["recovery_stats"]["degraded"]["bps"].uint64_value();
@ -232,38 +270,6 @@ resume_2:
            else if (no_rebalance)
                recovery_io += "    rebalance: disabled\n";
        }
-        if (parent->json_output)
-        {
-            // JSON output
-            printf("%s\n", json11::Json(json11::Json::object {
-                { "etcd_alive", etcd_alive },
-                { "etcd_count", (uint64_t)etcd_states.size() },
-                { "etcd_db_size", etcd_db_size },
-                { "mon_count", mon_count },
-                { "mon_master", mon_master },
-                { "osd_up", osd_up },
-                { "osd_count", osd_count },
-                { "total_raw", total_raw },
-                { "free_raw", free_raw },
-                { "down_raw", down_raw },
-                { "free_down_raw", free_down_raw },
-                { "readonly", readonly },
-                { "no_recovery", no_recovery },
-                { "no_rebalance", no_rebalance },
-                { "clean_data", agg_stats["object_counts"]["clean"].uint64_value() * object_size },
-                { "misplaced_data", agg_stats["object_counts"]["misplaced"].uint64_value() * object_size },
-                { "degraded_data", agg_stats["object_counts"]["degraded"].uint64_value() * object_size },
-                { "incomplete_data", agg_stats["object_counts"]["incomplete"].uint64_value() * object_size },
-                { "pool_count", pool_count },
-                { "active_pool_count", pools_active },
-                { "pg_states", pgs_by_state },
-                { "op_stats", agg_stats["op_stats"] },
-                { "recovery_stats", agg_stats["recovery_stats"] },
-                { "object_counts", agg_stats["object_counts"] },
-            }).dump().c_str());
-            state = 100;
-            return;
-        }
        printf(
            "  cluster:\n"
            "    etcd: %d / %ld up, %s database size\n"
@ -272,7 +278,7 @@ resume_2:
            "  \n"
            "  data:\n"
            "    raw:   %s used, %s / %s available%s\n"
-            "    state: %s clean%s\n"
+            "    state: %s\n"
            "    pools: %d / %d active\n"
            "    pgs:   %s\n"
            "  \n"
@ -286,7 +292,7 @@ resume_2:
            format_size(free_raw-free_down_raw).c_str(),
            format_size(total_raw-down_raw).c_str(),
            (down_raw > 0 ? (", "+format_size(down_raw)+" down").c_str() : ""),
-            format_size(agg_stats["object_counts"]["clean"].uint64_value() * object_size).c_str(), more_states.c_str(),
+            more_states.c_str(),
            pools_active, pool_count,
            pgs_by_state_str.c_str(),
            readonly ? " (read-only mode)" : "",
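
Review note: the status hunks above prefer per-state byte totals from `object_bytes` and fall back to `object_counts` multiplied by the global block size only when the byte total is zero. A minimal sketch of that fallback follows, assuming plain `std::map` containers in place of the json11 objects and raw byte counts in place of `format_size()`. All names here are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

typedef std::map<std::string, uint64_t> stat_map_t;

// Look up a counter, treating a missing key as zero
static uint64_t stat_or_zero(const stat_map_t & m, const std::string & key)
{
    auto it = m.find(key);
    return it == m.end() ? 0 : it->second;
}

// Bytes in a given object state: use the byte total when it is non-zero,
// otherwise estimate it as object count * global block size
static uint64_t state_bytes(const stat_map_t & object_bytes, const stat_map_t & object_counts,
    const std::string & state, uint64_t global_block_size)
{
    assert(global_block_size > 0);
    uint64_t n = stat_or_zero(object_bytes, state);
    if (!n)
        n = stat_or_zero(object_counts, state) * global_block_size;
    return n;
}

// Build the "state:" line: "clean" is always printed, other states only when non-zero
static std::string summarize_states(const stat_map_t & object_bytes,
    const stat_map_t & object_counts, uint64_t global_block_size)
{
    static const char *states[] = { "clean", "misplaced", "degraded", "incomplete" };
    std::string out;
    for (size_t i = 0; i < sizeof(states)/sizeof(states[0]); i++)
    {
        uint64_t n = state_bytes(object_bytes, object_counts, states[i], global_block_size);
        if (!i || n > 0)
        {
            if (!out.empty())
                out += ", ";
            out += std::to_string(n)+" "+states[i];
        }
    }
    return out;
}
```

The fallback is only an estimate when pools use different block sizes, which is presumably why the byte totals take precedence.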
--- a/src/cluster_client.cpp
+++ b/src/cluster_client.cpp
@ -14,6 +14,7 @@
 #define CACHE_FLUSHING 2
 #define CACHE_REPEATING 3
 #define OP_FLUSH_BUFFER 0x02
+#define OP_IMMEDIATE_COMMIT 0x04

 cluster_client_t::cluster_client_t(ring_loop_t *ringloop, timerfd_manager_t *tfd, json11::Json & config)
 {
@ -127,26 +128,26 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
                op->prev_wait++;
            }
        }
-        if (!op->prev_wait && pgs_loaded)
+        if (!op->prev_wait)
            continue_rw(op);
    }
    else if (op->opcode == OSD_OP_SYNC)
    {
        for (auto prev = op->prev; prev; prev = prev->prev)
        {
-            if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE)
+            if (prev->opcode == OSD_OP_SYNC || prev->opcode == OSD_OP_WRITE && !(prev->flags & OP_IMMEDIATE_COMMIT))
            {
                op->prev_wait++;
            }
        }
-        if (!op->prev_wait && pgs_loaded)
+        if (!op->prev_wait)
            continue_sync(op);
    }
    else /* if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP) */
    {
        for (auto prev = op_queue_head; prev && prev != op; prev = prev->next)
        {
-            if (prev->opcode == OSD_OP_WRITE && prev->flags & OP_FLUSH_BUFFER)
+            if (prev->opcode == OSD_OP_WRITE && (prev->flags & OP_FLUSH_BUFFER))
            {
                op->prev_wait++;
            }
@ -156,7 +157,7 @@ void cluster_client_t::calc_wait(cluster_op_t *op)
                break;
            }
        }
-        if (!op->prev_wait && pgs_loaded)
+        if (!op->prev_wait)
            continue_rw(op);
    }
 }
@ -168,7 +169,7 @@ void cluster_client_t::inc_wait(uint64_t opcode, uint64_t flags, cluster_op_t *n
        while (next)
        {
            auto n2 = next->next;
-            if (next->opcode == OSD_OP_SYNC ||
+            if (next->opcode == OSD_OP_SYNC && !(flags & OP_IMMEDIATE_COMMIT) ||
                next->opcode == OSD_OP_WRITE && (flags & OP_FLUSH_BUFFER) && !(next->flags & OP_FLUSH_BUFFER) ||
                (next->opcode == OSD_OP_READ || next->opcode == OSD_OP_READ_BITMAP) && (flags & OP_FLUSH_BUFFER))
            {
@ -221,7 +222,7 @@ void cluster_client_t::erase_op(cluster_op_t *op)
        op_queue_tail = op->prev;
    op->next = op->prev = NULL;
    std::function<void(cluster_op_t*)>(op->callback)(op);
-    if (!immediate_commit)
+    if (!(flags & OP_IMMEDIATE_COMMIT))
        inc_wait(opcode, flags, next, -1);
 }

@ -262,21 +263,6 @@ restart:
    continuing_ops = 0;
 }

-static uint32_t is_power_of_two(uint64_t value)
-{
-    uint32_t l = 0;
-    while (value > 1)
-    {
-        if (value & 1)
-        {
-            return 64;
-        }
-        value = value >> 1;
-        l++;
-    }
-    return l;
-}
-
 void cluster_client_t::on_load_config_hook(json11::Json::object & config)
 {
    this->merged_config = config;
@ -284,24 +270,6 @@ void cluster_client_t::on_load_config_hook(json11::Json::object & config)
    {
        this->merged_config[kv.first] = kv.second;
    }
-    bs_block_size = config["block_size"].uint64_value();
-    bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
-    if (!bs_block_size)
-    {
-        bs_block_size = DEFAULT_BLOCK_SIZE;
-    }
-    if (!bs_bitmap_granularity)
-    {
-        bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
-    }
-    bs_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
-    uint32_t block_order;
-    if ((block_order = is_power_of_two(bs_block_size)) >= 64 || bs_block_size < MIN_DATA_BLOCK_SIZE || bs_block_size >= MAX_DATA_BLOCK_SIZE)
-    {
-        throw std::runtime_error("Bad block size");
-    }
-    // Cluster-wide immediate_commit mode
-    immediate_commit = (config["immediate_commit"] == "all");
    if (config.find("client_max_dirty_bytes") != config.end())
    {
        client_max_dirty_bytes = config["client_max_dirty_bytes"].uint64_value();
@ -379,9 +347,15 @@ void cluster_client_t::on_change_hook(std::map<std::string, etcd_kv_t> & changes
    continue_ops();
 }

-bool cluster_client_t::get_immediate_commit()
+bool cluster_client_t::get_immediate_commit(uint64_t inode)
 {
-    return immediate_commit;
+    pool_id_t pool_id = INODE_POOL(inode);
+    if (!pool_id)
+        return true;
+    auto pool_it = st_cli.pool_config.find(pool_id);
+    if (pool_it == st_cli.pool_config.end())
+        return true;
+    return pool_it->second.immediate_commit == IMMEDIATE_ALL;
 }

 void cluster_client_t::on_change_osd_state_hook(uint64_t peer_osd)
@ -439,9 +413,45 @@ void cluster_client_t::execute(cluster_op_t *op)
        std::function<void(cluster_op_t*)>(op->callback)(op);
        return;
    }
+    if (!pgs_loaded)
+    {
+        offline_ops.push_back(op);
+        return;
+    }
    op->cur_inode = op->inode;
    op->retval = 0;
-    if (op->opcode == OSD_OP_WRITE && !immediate_commit)
+    op->flags = op->flags & OSD_OP_IGNORE_READONLY; // single allowed flag
+    if (op->opcode != OSD_OP_SYNC)
+    {
+        pool_id_t pool_id = INODE_POOL(op->cur_inode);
+        if (!pool_id)
+        {
+            op->retval = -EINVAL;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            return;
+        }
+        auto pool_it = st_cli.pool_config.find(pool_id);
+        if (pool_it == st_cli.pool_config.end() || pool_it->second.real_pg_count == 0)
+        {
+            // Pools are loaded, but this one is unknown
+            op->retval = -EINVAL;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            return;
+        }
+        // Check alignment
+        if ((op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && !op->len ||
+            op->offset % pool_it->second.bitmap_granularity || op->len % pool_it->second.bitmap_granularity)
+        {
+            op->retval = -EINVAL;
+            std::function<void(cluster_op_t*)>(op->callback)(op);
+            return;
+        }
+        if (pool_it->second.immediate_commit == IMMEDIATE_ALL)
+        {
+            op->flags |= OP_IMMEDIATE_COMMIT;
+        }
+    }
+    if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT))
    {
        if (dirty_bytes >= client_max_dirty_bytes || dirty_ops >= client_max_dirty_ops)
        {
@ -480,9 +490,9 @@ void cluster_client_t::execute(cluster_op_t *op)
    }
    else
        op_queue_tail = op_queue_head = op;
-    if (!immediate_commit)
+    if (!(op->flags & OP_IMMEDIATE_COMMIT))
        calc_wait(op);
-    else if (pgs_loaded)
+    else
    {
        if (op->opcode == OSD_OP_SYNC)
            continue_sync(op);
@ -610,28 +620,6 @@ int cluster_client_t::continue_rw(cluster_op_t *op)
    else if (op->state == 3)
        goto resume_3;
 resume_0:
-    if ((op->opcode == OSD_OP_READ || op->opcode == OSD_OP_WRITE) && !op->len ||
-        op->offset % bs_bitmap_granularity || op->len % bs_bitmap_granularity)
-    {
-        op->retval = -EINVAL;
-        erase_op(op);
-        return 1;
-    }
-    {
-        pool_id_t pool_id = INODE_POOL(op->cur_inode);
-        if (!pool_id)
-        {
-            op->retval = -EINVAL;
-            erase_op(op);
-            return 1;
-        }
-        if (st_cli.pool_config.find(pool_id) == st_cli.pool_config.end() ||
-            st_cli.pool_config[pool_id].real_pg_count == 0)
-        {
-            // Postpone operations to unknown pools
-            return 0;
-        }
-    }
    if (op->opcode == OSD_OP_WRITE || op->opcode == OSD_OP_DELETE)
    {
        if (!(op->flags & OSD_OP_IGNORE_READONLY))
@ -644,7 +632,7 @@ resume_0:
                return 1;
            }
        }
-        if (op->opcode == OSD_OP_WRITE && !immediate_commit && !(op->flags & OP_FLUSH_BUFFER))
+        if (op->opcode == OSD_OP_WRITE && !(op->flags & OP_IMMEDIATE_COMMIT) && !(op->flags & OP_FLUSH_BUFFER))
        {
            copy_write(op, dirty_buffers);
        }
@ -814,7 +802,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
    // Primary OSDs still operate individual stripes, but their size is multiplied by PG minsize in case of EC
    auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
    uint32_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
-    uint64_t pg_block_size = bs_block_size * pg_data_size;
+    uint64_t pg_block_size = pool_cfg.data_block_size * pg_data_size;
    uint64_t first_stripe = (op->offset / pg_block_size) * pg_block_size;
    uint64_t last_stripe = op->len > 0 ? ((op->offset + op->len - 1) / pg_block_size) * pg_block_size : first_stripe;
    op->retval = 0;
@ -822,9 +810,9 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
    if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP)
    {
        // Allocate memory for the bitmap
-        unsigned object_bitmap_size = (((op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : op->len) / bs_bitmap_granularity + 7) / 8);
+        unsigned object_bitmap_size = (((op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : op->len) / pool_cfg.bitmap_granularity + 7) / 8);
        object_bitmap_size = (object_bitmap_size < 8 ? 8 : object_bitmap_size);
-        unsigned bitmap_mem = object_bitmap_size + (bs_bitmap_size * pg_data_size) * op->parts.size();
+        unsigned bitmap_mem = object_bitmap_size + (pool_cfg.data_block_size / pool_cfg.bitmap_granularity / 8 * pg_data_size) * op->parts.size();
        if (op->bitmap_buf_size < bitmap_mem)
        {
            op->bitmap_buf = realloc_or_die(op->bitmap_buf, bitmap_mem);
@ -854,7 +842,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
            bool skip_prev = true;
            while (cur < end)
            {
-                unsigned bmp_loc = (cur - op->offset)/bs_bitmap_granularity;
+                unsigned bmp_loc = (cur - op->offset)/pool_cfg.bitmap_granularity;
                bool skip = (((*((uint8_t*)op->bitmap_buf + bmp_loc/8)) >> (bmp_loc%8)) & 0x1);
                if (skip_prev != skip)
                {
@ -872,7 +860,7 @@ void cluster_client_t::slice_rw(cluster_op_t *op)
                    skip_prev = skip;
                    prev = cur;
                }
-                cur += bs_bitmap_granularity;
+                cur += pool_cfg.bitmap_granularity;
            }
            assert(cur > prev);
            if (skip_prev)
@ -904,7 +892,7 @@ bool cluster_client_t::affects_osd(uint64_t inode, uint64_t offset, uint64_t len
 {
    auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(inode));
    uint32_t pg_data_size = (pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks);
-    uint64_t pg_block_size = bs_block_size * pg_data_size;
+    uint64_t pg_block_size = pool_cfg.data_block_size * pg_data_size;
    uint64_t first_stripe = (offset / pg_block_size) * pg_block_size;
    uint64_t last_stripe = len > 0 ? ((offset + len - 1) / pg_block_size) * pg_block_size : first_stripe;
    for (uint64_t stripe = first_stripe; stripe <= last_stripe; stripe += pg_block_size)
@ -935,7 +923,7 @@ bool cluster_client_t::try_send(cluster_op_t *op, int i)
            part->osd_num = primary_osd;
            part->flags |= PART_SENT;
            op->inflight_count++;
-            uint64_t pg_bitmap_size = bs_bitmap_size * (
+            uint64_t pg_bitmap_size = (pool_cfg.data_block_size / pool_cfg.bitmap_granularity / 8) * (
                pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
            );
            uint64_t meta_rev = 0;
@ -983,7 +971,7 @@ int cluster_client_t::continue_sync(cluster_op_t *op)
 {
    if (op->state == 1)
        goto resume_1;
-    if (immediate_commit || !dirty_osds.size())
+    if (!dirty_osds.size())
    {
        // Sync is not required in the immediate_commit mode or if there are no dirty_osds
        op->retval = 0;
@ -1140,7 +1128,8 @@ void cluster_client_t::handle_op_part(cluster_op_part_t *part)
    else
    {
        // OK
-        dirty_osds.insert(part->osd_num);
+        if (!(op->flags & OP_IMMEDIATE_COMMIT))
+            dirty_osds.insert(part->osd_num);
        part->flags |= PART_DONE;
        op->done_count++;
        if (op->opcode == OSD_OP_READ || op->opcode == OSD_OP_READ_BITMAP)
@ -1162,12 +1151,12 @@ void cluster_client_t::copy_part_bitmap(cluster_op_t *op, cluster_op_part_t *par
 {
    // Copy (OR) bitmap
    auto & pool_cfg = st_cli.pool_config.at(INODE_POOL(op->cur_inode));
-    uint32_t pg_block_size = bs_block_size * (
+    uint32_t pg_block_size = pool_cfg.data_block_size * (
        pool_cfg.scheme == POOL_SCHEME_REPLICATED ? 1 : pool_cfg.pg_size-pool_cfg.parity_chunks
    );
-    uint32_t object_offset = (part->op.req.rw.offset - op->offset) / bs_bitmap_granularity;
-    uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / bs_bitmap_granularity;
-    uint32_t part_len = (op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : part->op.req.rw.len) / bs_bitmap_granularity;
+    uint32_t object_offset = (part->op.req.rw.offset - op->offset) / pool_cfg.bitmap_granularity;
+    uint32_t part_offset = (part->op.req.rw.offset % pg_block_size) / pool_cfg.bitmap_granularity;
+    uint32_t part_len = (op->opcode == OSD_OP_READ_BITMAP ? pg_block_size : part->op.req.rw.len) / pool_cfg.bitmap_granularity;
    if (!(object_offset & 0x7) && !(part_offset & 0x7) && (part_len >= 8))
    {
        // Copy bytes
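
Review note: in the `cluster_client.cpp` hunks above, request validation moves from `continue_rw()` into `execute()`, and both the alignment check and the immediate-commit decision now depend on the pool of the target inode. The sketch below isolates those two decisions. The flag value matches the `OP_IMMEDIATE_COMMIT` define added by the diff, but the function names and the reduced parameter lists are assumptions made for illustration.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>

// Same value as the OP_IMMEDIATE_COMMIT define added in the diff
static const uint64_t DEMO_OP_IMMEDIATE_COMMIT = 0x04;

// Hypothetical helper mirroring the alignment check in execute():
// reads and writes must be non-empty, and offset/len must be multiples
// of the pool's bitmap granularity. Returns 0 or -EINVAL.
static int check_alignment(bool is_read_or_write, uint64_t offset, uint64_t len, uint32_t pool_granularity)
{
    assert(pool_granularity > 0);
    if ((is_read_or_write && !len) ||
        offset % pool_granularity || len % pool_granularity)
    {
        return -EINVAL;
    }
    return 0;
}

// Hypothetical helper: an operation gets the immediate-commit flag only
// when its pool is configured with immediate_commit=all
static uint64_t apply_pool_commit_mode(uint64_t flags, bool pool_immediate_all)
{
    return pool_immediate_all ? (flags | DEMO_OP_IMMEDIATE_COMMIT) : flags;
}

// A later sync has to wait for a write only if that write was not
// already committed immediately by its pool
static bool sync_waits_for_write(uint64_t write_flags)
{
    return !(write_flags & DEMO_OP_IMMEDIATE_COMMIT);
}
```

One consequence of the per-pool flag is that a single client can mix pools: writes to an `immediate_commit=all` pool skip the dirty-buffer accounting, while writes to other pools still require a sync.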
--- a/src/cluster_client.h
+++ b/src/cluster_client.h
@ -6,8 +6,6 @@
 #include "messenger.h"
 #include "etcd_state_client.h"

-#define MIN_DATA_BLOCK_SIZE 4*1024
-#define MAX_DATA_BLOCK_SIZE 128*1024*1024
 #define DEFAULT_CLIENT_MAX_DIRTY_BYTES 32*1024*1024
 #define DEFAULT_CLIENT_MAX_DIRTY_OPS 1024
 #define INODE_LIST_DONE 1
@ -79,11 +77,7 @@ class cluster_client_t
    timerfd_manager_t *tfd;
    ring_loop_t *ringloop;

-    uint64_t bs_block_size = 0;
-    uint32_t bs_bitmap_granularity = 0, bs_bitmap_size = 0;
    std::map<pool_id_t, uint64_t> pg_counts;
-    // WARNING: initially true so execute() doesn't create fake sync
-    bool immediate_commit = true;
    // FIXME: Implement inmemory_commit mode. Note that it requires to return overlapping reads from memory.
    uint64_t client_max_dirty_bytes = 0;
    uint64_t client_max_dirty_ops = 0;
@ -119,7 +113,7 @@ public:
    bool is_ready();
    void on_ready(std::function<void(void)> fn);

-    bool get_immediate_commit();
+    bool get_immediate_commit(uint64_t inode);

    static void copy_write(cluster_op_t *op, std::map<object_id, cluster_buffer_t> & dirty_buffers);
    void continue_ops(bool up_retry = false);
@ -127,8 +121,8 @@ public:
        std::function<void(inode_list_t* lst, std::set<object_id>&& objects, pg_num_t pg_num, osd_num_t primary_osd, int status)> callback);
    int list_pg_count(inode_list_t *lst);
    void list_inode_next(inode_list_t *lst, int next_pgs);
-    inline uint32_t get_bs_bitmap_granularity() { return bs_bitmap_granularity; }
-    inline uint64_t get_bs_block_size() { return bs_block_size; }
+    //inline uint32_t get_bs_bitmap_granularity() { return st_cli.global_bitmap_granularity; }
+    //inline uint64_t get_bs_block_size() { return st_cli.global_block_size; }
    uint64_t next_op_id();

 protected:
--- a/src/etcd_state_client.cpp
+++ b/src/etcd_state_client.cpp
@ -534,11 +534,18 @@ void etcd_state_client_t::load_global_config()
                global_config = kv.value.object_items();
            }
        }
-        bs_block_size = global_config["block_size"].uint64_value();
-        if (!bs_block_size)
+        global_block_size = global_config["block_size"].uint64_value();
+        if (!global_block_size)
        {
-            bs_block_size = DEFAULT_BLOCK_SIZE;
+            global_block_size = DEFAULT_BLOCK_SIZE;
        }
+        global_bitmap_granularity = global_config["bitmap_granularity"].uint64_value();
+        if (!global_bitmap_granularity)
+        {
+            global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
+        }
+        global_immediate_commit = global_config["immediate_commit"].string_value() == "all"
+            ? IMMEDIATE_ALL : (global_config["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE);
        on_load_config_hook(global_config);
    });
 }
@ -732,9 +739,35 @@ void etcd_state_client_t::parse_state(const etcd_kv_t & kv)
                fprintf(stderr, "Pool %u has invalid max_osd_combinations (must be at least 100), skipping pool\n", pool_id);
                continue;
            }
+            // Data Block Size
+            pc.data_block_size = pool_item.second["block_size"].uint64_value();
+            if (!pc.data_block_size)
+                pc.data_block_size = global_block_size;
+            if ((pc.data_block_size & (pc.data_block_size-1)) ||
+                pc.data_block_size < MIN_DATA_BLOCK_SIZE || pc.data_block_size > MAX_DATA_BLOCK_SIZE)
+            {
+                fprintf(stderr, "Pool %u has invalid block_size (must be a power of two between %u and %u), skipping pool\n",
+                    pool_id, MIN_DATA_BLOCK_SIZE, MAX_DATA_BLOCK_SIZE);
+                continue;
+            }
+            // Bitmap Granularity
+            pc.bitmap_granularity = pool_item.second["bitmap_granularity"].uint64_value();
+            if (!pc.bitmap_granularity)
+                pc.bitmap_granularity = global_bitmap_granularity;
+            if (!pc.bitmap_granularity || pc.data_block_size % pc.bitmap_granularity)
+            {
+                fprintf(stderr, "Pool %u has invalid bitmap_granularity (must divide block_size), skipping pool\n", pool_id);
+                continue;
+            }
+            // Immediate Commit Mode
+            pc.immediate_commit = pool_item.second["immediate_commit"].is_string()
+                ? (pool_item.second["immediate_commit"].string_value() == "all"
+                    ? IMMEDIATE_ALL : (pool_item.second["immediate_commit"].string_value() == "small"
+                        ? IMMEDIATE_SMALL : IMMEDIATE_NONE))
+                : global_immediate_commit;
            // PG Stripe Size
            pc.pg_stripe_size = pool_item.second["pg_stripe_size"].uint64_value();
-            uint64_t min_stripe_size = bs_block_size * (pc.scheme == POOL_SCHEME_REPLICATED ? 1 : (pc.pg_size-pc.parity_chunks));
+            uint64_t min_stripe_size = pc.data_block_size * (pc.scheme == POOL_SCHEME_REPLICATED ? 1 : (pc.pg_size-pc.parity_chunks));
            if (pc.pg_stripe_size < min_stripe_size)
                pc.pg_stripe_size = min_stripe_size;
            // Save
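
Review note: the `parse_state()` hunk above resolves each layout parameter as "pool value if set, otherwise global value" and then validates the result. The sketch below restates those rules as standalone functions. The limits mirror `MIN_DATA_BLOCK_SIZE` and `MAX_DATA_BLOCK_SIZE` from the diff; the function names, the `demo_` prefixes and the use of a nullable string pointer in place of a json11 value are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

static const uint64_t DEMO_MIN_BLOCK_SIZE = 4*1024;
static const uint64_t DEMO_MAX_BLOCK_SIZE = 128*1024*1024;

enum demo_immediate_t { DEMO_IMMEDIATE_NONE = 0, DEMO_IMMEDIATE_SMALL = 1, DEMO_IMMEDIATE_ALL = 2 };

// A zero pool-level value means "not set": fall back to the global value
static uint64_t pool_or_global(uint64_t pool_value, uint64_t global_value)
{
    return pool_value ? pool_value : global_value;
}

// Block size must be a power of two within [MIN, MAX]
static bool valid_block_size(uint64_t bs)
{
    return !(bs & (bs-1)) && bs >= DEMO_MIN_BLOCK_SIZE && bs <= DEMO_MAX_BLOCK_SIZE;
}

// Granularity must be non-zero and divide the block size
static bool valid_granularity(uint64_t bs, uint64_t gran)
{
    return gran && !(bs % gran);
}

// Bitmap bytes per data block, as computed from pool settings in the client
static uint64_t pool_bitmap_bytes(uint64_t bs, uint64_t gran)
{
    assert(valid_granularity(bs, gran));
    return bs / gran / 8;
}

// pool_value == nullptr models "immediate_commit is not a string in pool config"
static int resolve_immediate_commit(const std::string *pool_value, int global_mode)
{
    if (!pool_value)
        return global_mode;
    if (*pool_value == "all")
        return DEMO_IMMEDIATE_ALL;
    return *pool_value == "small" ? DEMO_IMMEDIATE_SMALL : DEMO_IMMEDIATE_NONE;
}
```

Note that an unrecognized pool-level string resolves to "none" rather than to the global mode, which follows the ternary chain in the diff.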
--- a/src/etcd_state_client.h
+++ b/src/etcd_state_client.h
@ -13,6 +13,13 @@
 #define ETCD_OSD_STATE_WATCH_ID 4

 #define DEFAULT_BLOCK_SIZE 128*1024
+#define MIN_DATA_BLOCK_SIZE 4*1024
+#define MAX_DATA_BLOCK_SIZE 128*1024*1024
+#define DEFAULT_BITMAP_GRANULARITY 4096
+
+#define IMMEDIATE_NONE 0
+#define IMMEDIATE_SMALL 1
+#define IMMEDIATE_ALL 2

 struct etcd_kv_t
 {
@ -41,6 +48,7 @@ struct pool_config_t
    std::string name;
    uint64_t scheme;
    uint64_t pg_size, pg_minsize, parity_chunks;
+    uint32_t data_block_size, bitmap_granularity, immediate_commit;
    uint64_t pg_count;
    uint64_t real_pg_count;
    std::string failure_domain;
@ -83,7 +91,6 @@ protected:
    int ws_keepalive_timer = -1;
    int ws_alive = 0;
    bool rand_initialized = false;
-    uint64_t bs_block_size = DEFAULT_BLOCK_SIZE;
    void add_etcd_url(std::string);
    void pick_next_etcd();
 public:
@ -92,6 +99,9 @@ public:
    int max_etcd_attempts = 5;
    int etcd_quick_timeout = 1000;
    int etcd_slow_timeout = 5000;
+    uint64_t global_block_size = DEFAULT_BLOCK_SIZE;
+    uint32_t global_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
+    uint32_t global_immediate_commit = IMMEDIATE_NONE;

    std::string etcd_prefix;
    int log_level = 0;
--- a/src/messenger.cpp
+++ b/src/messenger.cpp
@ -427,6 +427,10 @@ void osd_messenger_t::check_peer_config(osd_client_t *cl)
                    cl->osd_num, config["protocol_version"].uint64_value(), OSD_PROTOCOL_VERSION
                );
            }
+            if (check_config_hook)
+            {
+                err = !check_config_hook(cl, config);
+            }
        }
        if (err)
        {
--- a/src/messenger.h
+++ b/src/messenger.h
@ -33,7 +33,6 @@
 #define PEER_RDMA 4
 #define PEER_STOPPED 5

-#define DEFAULT_BITMAP_GRANULARITY 4096
 #define VITASTOR_CONFIG_PATH "/etc/vitastor/vitastor.conf"

 #define MSGR_SENDP_HDR 1
@ -160,6 +159,7 @@ public:
    void outbox_push(osd_op_t *cur_op);
    std::function<void(osd_op_t*)> exec_op;
    std::function<void(osd_num_t)> repeer_pgs;
+    std::function<bool(osd_client_t*, json11::Json)> check_config_hook;
    void read_requests();
    void send_replies();
    void accept_connections(int listen_fd);
--- a/src/nfs_conn.cpp
+++ b/src/nfs_conn.cpp
@ -314,7 +314,12 @@ static int nfs3_read_proc(void *opaque, rpc_op_t *rop)
        rpc_queue_reply(rop);
        return 0;
    }
-    uint64_t alignment = self->parent->cli->get_bs_bitmap_granularity();
+    uint64_t alignment = self->parent->cli->st_cli.global_bitmap_granularity;
+    auto pool_cfg = self->parent->cli->st_cli.pool_config.find(INODE_POOL(ino_it->second));
+    if (pool_cfg != self->parent->cli->st_cli.pool_config.end())
+    {
+        alignment = pool_cfg->second.bitmap_granularity;
+    }
    uint64_t aligned_offset = args->offset - (args->offset % alignment);
    uint64_t aligned_count = args->offset + args->count;
    if (aligned_count % alignment)
@ -375,7 +380,12 @@ static int nfs3_write_proc(void *opaque, rpc_op_t *rop)
        return 0;
    }
    uint64_t count = args->count > args->data.size ? args->data.size : args->count;
-    uint64_t alignment = self->parent->cli->get_bs_bitmap_granularity();
+    uint64_t alignment = self->parent->cli->st_cli.global_bitmap_granularity;
+    auto pool_cfg = self->parent->cli->st_cli.pool_config.find(INODE_POOL(ino_it->second));
+    if (pool_cfg != self->parent->cli->st_cli.pool_config.end())
+    {
+        alignment = pool_cfg->second.bitmap_granularity;
+    }
    // Pre-fill reply
    *reply = (WRITE3res){
        .status = NFS3_OK,
@ -471,6 +481,7 @@ static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint
    op->iov.push_back(buf, count);
    op->callback = [self, rop](cluster_op_t *op)
    {
+        uint64_t inode = op->inode;
        WRITE3args *args = (WRITE3args*)rop->request;
        WRITE3res *reply = (WRITE3res*)rop->reply;
        if (op->retval != op->len)
@ -483,8 +494,8 @@ static void nfs_do_write(nfs_client_t *self, rpc_op_t *rop, uint64_t inode, uint
        {
            *(uint64_t*)reply->resok.verf = self->parent->server_id;
            delete op;
-            if (!self->parent->cli->get_immediate_commit() &&
-                args->stable != UNSTABLE)
+            if (args->stable != UNSTABLE &&
+                !self->parent->cli->get_immediate_commit(inode))
            {
                // Client requested a stable write. Add an fsync
                op = new cluster_op_t;
@ -1179,25 +1190,17 @@ static int nfs3_commit_proc(void *opaque, rpc_op_t *rop)
 {
    nfs_client_t *self = (nfs_client_t*)opaque;
    //COMMIT3args *args = (COMMIT3args*)rop->request;
-    if (!self->parent->cli->get_immediate_commit())
+    cluster_op_t *op = new cluster_op_t;
+    // fsync. we don't know how to fsync a single inode, so just fsync everything
+    op->opcode = OSD_OP_SYNC;
+    op->callback = [rop](cluster_op_t *op)
    {
-        cluster_op_t *op = new cluster_op_t;
-        // fsync. we don't know how to fsync a single inode, so just fsync everything
-        op->opcode = OSD_OP_SYNC;
-        op->callback = [rop](cluster_op_t *op)
-        {
-            COMMIT3res *reply = (COMMIT3res*)rop->reply;
-            *reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
-            rpc_queue_reply(rop);
-        };
-        self->parent->cli->execute(op);
-        return 1;
-    }
-    // pretend we just did an fsync
-    COMMIT3res *reply = (COMMIT3res*)rop->reply;
-    *reply = (COMMIT3res){ .status = NFS3_OK };
-    rpc_queue_reply(rop);
-    return 0;
+        COMMIT3res *reply = (COMMIT3res*)rop->reply;
+        *reply = (COMMIT3res){ .status = vitastor_nfs_map_err(op->retval) };
+        rpc_queue_reply(rop);
+    };
+    self->parent->cli->execute(op);
+    return 1;
 }

 static int mount3_mnt_proc(void *opaque, rpc_op_t *rop)
--- a/src/osd.cpp
+++ b/src/osd.cpp
@ -38,13 +38,12 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
    this->config = msgr.read_config(config).object_items();
    if (this->config.find("log_level") == this->config.end())
        this->config["log_level"] = 1;
-    parse_config(this->config);
+    parse_config(this->config, true);

    epmgr = new epoll_manager_t(ringloop);
    // FIXME: Use timerfd_interval based directly on io_uring
    this->tfd = epmgr->tfd;

-    // FIXME: Create Blockstore from on-disk superblock config and check it against the OSD cluster config
    auto bs_cfg = json_to_bs(this->config);
    this->bs = new blockstore_t(bs_cfg, ringloop, tfd);
    {
@ -81,6 +80,7 @@ osd_t::osd_t(const json11::Json & config, ring_loop_t *ringloop)
    msgr.ringloop = this->ringloop;
    msgr.exec_op = [this](osd_op_t *op) { exec_op(op); };
    msgr.repeer_pgs = [this](osd_num_t peer_osd) { repeer_pgs(peer_osd); };
+    msgr.check_config_hook = [this](osd_client_t *cl, json11::Json conf) { return check_peer_config(cl, conf); };
    msgr.init();

    init_cluster();
@ -98,23 +98,33 @@ osd_t::~osd_t()
    free(zero_buffer);
 }

-void osd_t::parse_config(const json11::Json & config)
+void osd_t::parse_config(const json11::Json & config, bool allow_disk_params)
 {
    st_cli.parse_config(config);
    msgr.parse_config(config);
-    // OSD number
-    osd_num = config["osd_num"].uint64_value();
-    if (!osd_num)
-        throw std::runtime_error("osd_num is required in the configuration");
-    msgr.osd_num = osd_num;
-    // Vital Blockstore parameters
-    bs_block_size = config["block_size"].uint64_value();
-    if (!bs_block_size)
-        bs_block_size = DEFAULT_BLOCK_SIZE;
-    bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
-    if (!bs_bitmap_granularity)
-        bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
-    clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
+    if (allow_disk_params)
+    {
+        // OSD number
+        osd_num = config["osd_num"].uint64_value();
+        if (!osd_num)
+            throw std::runtime_error("osd_num is required in the configuration");
+        msgr.osd_num = osd_num;
+        // Vital Blockstore parameters
+        bs_block_size = config["block_size"].uint64_value();
+        if (!bs_block_size)
+            bs_block_size = DEFAULT_BLOCK_SIZE;
+        bs_bitmap_granularity = config["bitmap_granularity"].uint64_value();
+        if (!bs_bitmap_granularity)
+            bs_bitmap_granularity = DEFAULT_BITMAP_GRANULARITY;
+        clean_entry_bitmap_size = bs_block_size / bs_bitmap_granularity / 8;
+        // immediate_commit
+        if (config["immediate_commit"] == "all")
+            immediate_commit = IMMEDIATE_ALL;
+        else if (config["immediate_commit"] == "small")
+            immediate_commit = IMMEDIATE_SMALL;
+        else
+            immediate_commit = IMMEDIATE_NONE;
+    }
    // Bind address
    bind_address = config["bind_address"].string_value();
    if (bind_address == "")
@ -132,12 +142,6 @@ void osd_t::parse_config(const json11::Json & config)
    no_rebalance = json_is_true(config["no_rebalance"]);
    no_recovery = json_is_true(config["no_recovery"]);
    allow_test_ops = json_is_true(config["allow_test_ops"]);
-    if (config["immediate_commit"] == "all")
-        immediate_commit = IMMEDIATE_ALL;
-    else if (config["immediate_commit"] == "small")
-        immediate_commit = IMMEDIATE_SMALL;
-    else
-        immediate_commit = IMMEDIATE_NONE;
    if (!config["autosync_interval"].is_null())
    {
        // Allow to set it to 0
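Note on the osd.cpp hunks above: the disk-layout parameters (`block_size`, `bitmap_granularity`, `immediate_commit`) are now only read when `allow_disk_params` is true, so a later config reload cannot change them. The standalone sketch below restates the two derivations from that branch. The helper names `parse_immediate_commit` and `clean_entry_bitmap_size_for` are hypothetical and not part of the commit. The constant values are copied from the `IMMEDIATE_*` macros removed from osd.h, and the granularity of 4096 mentioned in the comment is only an assumed example value.

```cpp
#include <cassert>  // for quick checks by the caller
#include <cstdint>
#include <string>

// Same numeric values as the IMMEDIATE_* macros in the diff
enum : int { IMMEDIATE_NONE = 0, IMMEDIATE_SMALL = 1, IMMEDIATE_ALL = 2 };

// Hypothetical helper: any value other than "all" or "small" falls back to NONE,
// matching the if / else if / else chain in parse_config()
inline int parse_immediate_commit(const std::string & value)
{
    if (value == "all")
        return IMMEDIATE_ALL;
    if (value == "small")
        return IMMEDIATE_SMALL;
    return IMMEDIATE_NONE;
}

// Hypothetical helper: one bit per bitmap_granularity bytes of a block,
// packed 8 bits per byte (same arithmetic as clean_entry_bitmap_size)
inline uint64_t clean_entry_bitmap_size_for(uint64_t block_size, uint64_t bitmap_granularity)
{
    return block_size / bitmap_granularity / 8;
}
```

With the documented default block size of 131072 and an assumed granularity of 4096, the bitmap takes 131072 / 4096 / 8 bytes per object.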
--- a/src/osd.h
+++ b/src/osd.h
@ -29,10 +29,6 @@
 #define OSD_FLUSHING_PGS 0x08
 #define OSD_RECOVERING 0x10

-#define IMMEDIATE_NONE 0
-#define IMMEDIATE_SMALL 1
-#define IMMEDIATE_ALL 2
-
 #define MAX_AUTOSYNC_INTERVAL 3600
 #define DEFAULT_AUTOSYNC_INTERVAL 5
 #define DEFAULT_AUTOSYNC_WRITES 128
@ -172,7 +168,7 @@ class osd_t
    uint64_t recovery_stat_bytes[2][2] = {};

    // cluster connection
-    void parse_config(const json11::Json & config);
+    void parse_config(const json11::Json & config, bool allow_disk_params);
    void init_cluster();
    void on_change_osd_state_hook(osd_num_t peer_osd);
    void on_change_pg_history_hook(pool_id_t pool_id, pg_num_t pg_num);
@ -201,6 +197,7 @@ class osd_t
    // peer handling (primary OSD logic)
    void parse_test_peer(std::string peer);
    void handle_peers();
+    bool check_peer_config(osd_client_t *cl, json11::Json conf);
    void repeer_pgs(osd_num_t osd_num);
    void start_pg_peering(pg_t & pg);
    void submit_sync_and_list_subop(osd_num_t role_osd, pg_peering_state_t *ps);
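Note on `check_peer_config`, declared above and defined in the osd_cluster.cpp hunk that follows: its `immediate_commit` condition is asymmetric. A peer is treated as a mismatch only when it guarantees less than the local OSD does. The sketch below isolates that condition as a pure function. The name `immediate_commit_compatible` is hypothetical and not part of the commit, and the constants repeat the values of the `IMMEDIATE_*` macros.

```cpp
#include <cassert>  // for quick checks by the caller

// Same numeric values as the IMMEDIATE_* macros in the diff
enum : int { IMMEDIATE_NONE = 0, IMMEDIATE_SMALL = 1, IMMEDIATE_ALL = 2 };

// Hypothetical helper mirroring the condition in check_peer_config():
// returns false when the peer's immediate_commit mode is weaker than ours
inline bool immediate_commit_compatible(int mine, int peer)
{
    if (mine == IMMEDIATE_ALL && peer != IMMEDIATE_ALL)
        return false;
    if (mine == IMMEDIATE_SMALL && peer == IMMEDIATE_NONE)
        return false;
    // A local mode of "none" accepts any peer
    return true;
}
```

Because the constants are ordered from weakest to strongest, the same rule could also be written as `peer >= mine`. The explicit form is kept here to stay close to the diff.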
--- a/src/osd_cluster.cpp
+++ b/src/osd_cluster.cpp
@ -108,6 +108,52 @@ void osd_t::parse_test_peer(std::string peer)
    msgr.connect_peer(peer_osd, st_cli.peer_states[peer_osd]);
 }

+bool osd_t::check_peer_config(osd_client_t *cl, json11::Json conf)
+{
+    // Check block_size, bitmap_granularity and immediate_commit of the peer
+    if (conf["block_size"].is_null() ||
+        conf["bitmap_granularity"].is_null() ||
+        conf["immediate_commit"].is_null())
+    {
+        printf(
+            "[OSD %lu] Warning: peer OSD %lu does not report block_size/bitmap_granularity/immediate_commit."
+            " Is it older than 0.6.3?\n", this->osd_num, cl->osd_num
+        );
+    }
+    else
+    {
+        int peer_immediate_commit = (conf["immediate_commit"].string_value() == "all"
+            ? IMMEDIATE_ALL : (conf["immediate_commit"].string_value() == "small" ? IMMEDIATE_SMALL : IMMEDIATE_NONE));
+        if (immediate_commit == IMMEDIATE_ALL && peer_immediate_commit != IMMEDIATE_ALL ||
+            immediate_commit == IMMEDIATE_SMALL && peer_immediate_commit == IMMEDIATE_NONE)
+        {
+            printf(
+                "[OSD %lu] My immediate_commit is \"%s\", but peer OSD %lu has \"%s\". We can't work together\n",
+                this->osd_num, immediate_commit == IMMEDIATE_ALL ? "all" : "small",
+                cl->osd_num, conf["immediate_commit"].string_value().c_str()
+            );
+            return false;
+        }
+        else if (conf["block_size"].uint64_value() != (uint64_t)this->bs_block_size)
+        {
+            printf(
+                "[OSD %lu] My block_size is %u, but peer OSD %lu has %lu. We can't work together\n",
+                this->osd_num, this->bs_block_size, cl->osd_num, conf["block_size"].uint64_value()
+            );
+            return false;
+        }
+        else if (conf["bitmap_granularity"].uint64_value() != (uint64_t)this->bs_bitmap_granularity)
+        {
+            printf(
+                "[OSD %lu] My bitmap_granularity is %u, but peer OSD %lu has %lu. We can't work together\n",
+                this->osd_num, this->bs_bitmap_granularity, cl->osd_num, conf["bitmap_granularity"].uint64_value()
+            );
+            return false;
+        }
+    }
+    return true;
+}
+
 json11::Json osd_t::get_osd_state()
 {
    std::vector<char> hostname;
@ -137,6 +183,7 @@ json11::Json osd_t::get_statistics()
    sprintf(time_str, "%ld.%03ld", ts.tv_sec, ts.tv_nsec/1000000);
    st["time"] = time_str;
    st["blockstore_ready"] = bs->is_started();
+    st["data_block_size"] = (uint64_t)bs->get_block_size();
    if (bs)
    {
        st["size"] = bs->get_block_count() * bs->get_block_size();
@ -365,7 +412,7 @@ void osd_t::on_load_config_hook(json11::Json::object & global_config)
    for (auto & kv: global_config)
        if (osd_config.find(kv.first) == osd_config.end())
            osd_config[kv.first] = kv.second;
-    parse_config(osd_config);
+    parse_config(osd_config, false);
    bind_socket();
    acquire_lease();
 }
@ -598,6 +645,7 @@ void osd_t::apply_pg_config()
    bool all_applied = true;
    for (auto & pool_item: st_cli.pool_config)
    {
+        bool warned_block_size = false;
        auto pool_id = pool_item.first;
        for (auto & kv: pool_item.second.pg_config)
        {
@ -607,6 +655,22 @@ void osd_t::apply_pg_config()
                !pg_cfg.pause && (!pg_cfg.cur_primary || pg_cfg.cur_primary == this->osd_num);
            auto pg_it = this->pgs.find({ .pool_id = pool_id, .pg_num = pg_num });
            bool currently_taken = pg_it != this->pgs.end() && pg_it->second.state != PG_OFFLINE;
+            // Check pool block size and bitmap granularity
+            if (this->bs_block_size != pool_item.second.data_block_size ||
+                this->bs_bitmap_granularity != pool_item.second.bitmap_granularity)
+            {
+                if (!warned_block_size)
+                {
+                    printf(
+                        "[OSD %lu] My block_size and bitmap_granularity are %u/%u"
+                        ", but pool has %u/%u. Refusing to start PGs of this pool\n",
+                        this->osd_num, bs_block_size, bs_bitmap_granularity,
+                        pool_item.second.data_block_size, pool_item.second.bitmap_granularity
+                    );
+                }
+                warned_block_size = true;
+                take = false;
+            }
            if (currently_taken && !take)
            {
                // Stop this PG