Simplified distributed block storage with strong consistency, like in Ceph

## Vitastor

[Читать на русском](

## The Idea

Make Software-Defined Block Storage Great Again.

Vitastor is a small, simple and fast clustered block storage (storage for VM drives),
architecturally similar to Ceph which means strong consistency, primary-replication, symmetric
clustering and automatic data distribution over any number of drives of any size
with configurable redundancy (replication or erasure codes/XOR).

## Features

Vitastor is currently a pre-release, a lot of features are missing and you can still expect
breaking changes in the future. However, the following is implemented:

- Basic part: highly-available block storage with symmetric clustering and no SPOF
- Performance ;-D
- Multiple redundancy schemes: Replication, XOR n+1, Reed-Solomon erasure codes
  based on jerasure library with any number of data and parity drives in a group
- Configuration via simple JSON data structures in etcd
- Automatic data distribution over OSDs, with support for:
  - Mathematical optimization for better uniformity and less data movement
  - Multiple pools
  - Placement tree, OSD selection by tags (device classes) and placement root
  - Configurable failure domains
- Recovery of degraded blocks
- Rebalancing (data movement between OSDs)
- Lazy fsync support
- I/O statistics reporting to etcd
- Generic user-space client library
- QEMU driver (built out-of-tree)
- Loadable fio engine for benchmarks (also built out-of-tree)
- NBD proxy for kernel mounts
- Inode removal tool (vitastor-rm)
- Packaging for Debian and CentOS
- Per-inode I/O and space usage statistics
- Inode metadata storage in etcd
- Snapshots and copy-on-write image clones
- Write throttling to smooth random write workloads in SSD+HDD configurations
- RDMA/RoCEv2 support via libibverbs
- CSI plugin for Kubernetes
- Basic OpenStack support: Cinder driver, Nova and libvirt patches

## Roadmap

- Snapshot deletion (layer merge) support
- Better OSD creation and auto-start tools
- Other administrative tools
- Plugins for OpenNebula, Proxmox and other cloud systems
- iSCSI proxy
- Faster failover
- Scrubbing without checksums (verification of replicas)
- Checksums
- Tiered storage
- NVDIMM support
- Web GUI
- Compression (possibly)
- Read caching using system page cache (possibly)

## Architecture

Similarities:

- Just like Ceph, Vitastor has Pools, PGs, OSDs, Monitors, Failure Domains, Placement Tree.
- Just like Ceph, Vitastor is transactional (even though there's a "lazy fsync mode" which
  doesn't implicitly flush every operation to disks).
- OSDs also have journal and metadata and they can also be put on separate drives.
- Just like in Ceph, the client library attempts to recover from any cluster failure so
  you can basically reboot the whole cluster and only pause, but not crash, your clients
  (I consider this a bug if the client crashes in that case).

Some basic terms for people not familiar with Ceph:

- OSD (Object Storage Daemon) is a process that stores data and serves read/write requests.
- PG (Placement Group) is a container for data that (normally) shares the same replicas.
- Pool is a container for data that has the same redundancy scheme and placement rules.
- Monitor is a separate daemon that watches cluster state and handles failures.
- Failure Domain is a group of OSDs that you allow to fail. It's "host" by default.
- Placement Tree groups OSDs in a hierarchy to later split them into Failure Domains.

Architectural differences from Ceph:

- Vitastor's primary focus is on SSDs. Proper SSD+HDD optimizations may be added in the future, though.
- Vitastor OSD is (and will always be) single-threaded. If you want to dedicate more than 1 core
  per drive you should run multiple OSDs each on a different partition of the drive.
  Vitastor isn't CPU-hungry though (as opposed to Ceph), so 1 core is sufficient in a lot of cases.
- Metadata and journal are always kept in memory. Metadata size depends linearly on drive capacity
  and data store block size which is 128 KB by default. With 128 KB blocks metadata should occupy
  around 512 MB per 1 TB (which is still less than Ceph wants). Journal doesn't have to be big,
  the example test below was conducted with only 16 MB journal. A big journal is probably even
  harmful as dirty write metadata also takes some memory.
- Vitastor storage layer doesn't have internal copy-on-write or redirect-write. I know that maybe
  it's possible to create a good copy-on-write storage, but it's much harder and makes performance
  less deterministic, so CoW isn't used in Vitastor.
- The basic layer of Vitastor is block storage with fixed-size blocks, not object storage with
  rich semantics like in Ceph (RADOS).
- There's a "lazy fsync" mode which batches writes before flushing them to the disk.
  This makes it possible to use Vitastor with desktop SSDs, but still lowers performance due to additional
  network roundtrips, so use server SSDs with capacitor-based power loss protection
  ("Advanced Power Loss Protection") for best performance.
- PGs are ephemeral. This means that they aren't stored on data disks and only exist in memory
  while OSDs are running.
- Recovery process is per-object (per-block), not per-PG. Also there are no PGLOGs.
- Monitors don't store data. Cluster configuration and state are stored in etcd in simple human-readable
  JSON structures. Monitors only watch cluster state and handle data movement.
  Thus Vitastor's Monitor isn't a critical component of the system and is more similar to Ceph's Manager.
  Vitastor's Monitor is implemented in node.js.
- PG distribution isn't based on consistent hashes. All PG mappings are stored in etcd.
  Rebalancing PGs between OSDs is done by mathematical optimization - data distribution problem
  is reduced to a linear programming problem and solved by lp_solve. This allows for almost
  perfect (96-99% uniformity compared to Ceph's 80-90%) data distribution in most cases, ability
  to map PGs by hand without breaking rebalancing logic, reduced OSD peer-to-peer communication
  (on average, OSDs have fewer peers) and less data movement. It also probably has a drawback -
  this method may fail in very large clusters, but up to several hundreds of OSDs it's perfectly fine.
  It's also easy to add consistent hashes in the future if something proves their necessity.
- There's no separate CRUSH layer. You select pool redundancy scheme, placement root, failure domain
  and so on directly in pool configuration.

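The metadata sizing rule from the list above can be turned into a quick back-of-the-envelope estimate. This is only a sketch: the 64 bytes per block figure is inferred from the quoted numbers (512 MB per 1 TB at 128 KB blocks) and is not taken from the Vitastor source, and `metadata_bytes` and `BYTES_PER_BLOCK` are hypothetical names.

```python
# Rough estimate of in-memory OSD metadata size.
# ASSUMPTION: about 64 bytes of metadata per data block, inferred from
# "512 MB per 1 TB with 128 KB blocks"; this is not an official constant.
BYTES_PER_BLOCK = 64

TIB = 1024 ** 4
MIB = 1024 ** 2


def metadata_bytes(drive_bytes: int, block_size: int = 128 * 1024) -> int:
    """Metadata grows linearly with the number of blocks on the drive."""
    blocks = drive_bytes // block_size
    return blocks * BYTES_PER_BLOCK


if __name__ == "__main__":
    print(metadata_bytes(1 * TIB) // MIB, "MiB of metadata per 1 TiB")
    # Halving the block size doubles the number of blocks and the metadata.
    print(metadata_bytes(1 * TIB, block_size=64 * 1024) // MIB, "MiB with 64 KiB blocks")
```

Under this assumption a smaller block size trades memory for finer allocation granularity, which is why the estimate takes the block size as a parameter.
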
## Understanding Storage Performance

The most important thing for fast storage is latency, not parallel iops.

The best possible latency is achieved with one thread and queue depth of 1 which basically means
"client load as low as possible". In this case IOPS = 1/latency, and this number doesn't
scale with number of servers, drives, server processes or threads and so on.
Single-threaded IOPS and latency numbers only depend on *how fast a single daemon is*.

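The IOPS = 1/latency relation is easy to apply when reading benchmark output. The helper names below are hypothetical and the input values are purely illustrative; the sketch only restates the arithmetic for a single thread at queue depth 1.

```python
def iops_from_latency(latency_ms: float) -> float:
    """With one request in flight at a time, IOPS is the inverse of latency."""
    return 1000.0 / latency_ms


def latency_ms_from_iops(iops: float) -> float:
    """Inverse conversion: average per-request latency in milliseconds."""
    return 1000.0 / iops


if __name__ == "__main__":
    # Illustrative inputs, not measurements.
    print(round(iops_from_latency(0.5)))         # a 0.5 ms round trip caps T1Q1 at 2000 iops
    print(round(latency_ms_from_iops(1000), 3))  # 1000 iops at T1Q1 means 1.0 ms per request
```
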
Why is it important? It's important because some of the applications *can't* use
queue depth greater than 1 because their task isn't parallelizable. A notable example
is any ACID DBMS because all of them write their WALs sequentially with fsync()s.

fsync, by the way, is another important thing often missing in benchmarks. The point is
that drives have cache buffers and don't guarantee that your data is actually persisted
until you call fsync() which is translated to a FLUSH CACHE command by the OS.

Desktop SSDs are very fast without fsync - NVMes, for example, can process ~80000 write
operations per second with queue depth of 1 without fsync - but they're really slow with
fsync because they have to actually write data to flash chips when you call fsync. Typical
number is around 1000-2000 iops with fsync.

Server SSDs often have supercapacitors that act as a built-in UPS and allow the drive
to flush its DRAM cache to the persistent flash storage when a power loss occurs.
This makes them perform equally well with and without fsync. This feature is called
"Advanced Power Loss Protection" by Intel; other vendors either call it similarly
or directly as "Full Capacitor-Based Power Loss Protection".

All software-defined storages that I currently know are slow in terms of latency.
Notable examples are Ceph and internal SDSes used by cloud providers like Amazon, Google,
Yandex and so on. They're all slow and can only reach ~0.3ms read and ~0.6ms 4 KB write latency
with best-in-slot hardware.

And that's in the SSD era when you can buy an SSD that has ~0.04ms latency for $100.

I use the following 6 commands with small variations to benchmark any storage:

- Linear write:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=write -runtime=60 -filename=/dev/sdX`
- Linear read:
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -iodepth=32 -rw=read -runtime=60 -filename=/dev/sdX`
- Random write latency (T1Q1, this hurts storages the most):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -fsync=1 -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Random read latency (T1Q1):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=1 -rw=randread -runtime=60 -filename=/dev/sdX`
- Parallel write iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randwrite -runtime=60 -filename=/dev/sdX`
- Parallel read iops (use numjobs if a single CPU core is insufficient to saturate the load):
  `fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4k -iodepth=128 [-numjobs=4 -group_reporting] -rw=randread -runtime=60 -filename=/dev/sdX`

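Since the six commands differ only in a few options, they can be generated from a small table. The sketch below just assembles the same command lines as strings; `fio_cmd` and `BENCHMARKS` are hypothetical names, and nothing is executed. Keep in mind that the write tests overwrite whatever is on the target device.

```python
def fio_cmd(rw, bs, iodepth, fsync=False, numjobs=None, filename="/dev/sdX"):
    """Assemble one of the benchmark command lines listed above."""
    args = ["fio", "-ioengine=libaio", "-direct=1", "-invalidate=1", "-name=test",
            f"-bs={bs}", f"-iodepth={iodepth}"]
    if fsync:
        args.append("-fsync=1")  # flush after every write (T1Q1 write latency test)
    if numjobs:
        # Optional: several jobs when one CPU core can't saturate the load.
        args += [f"-numjobs={numjobs}", "-group_reporting"]
    args += [f"-rw={rw}", "-runtime=60", f"-filename={filename}"]
    return " ".join(args)


# name -> keyword arguments for fio_cmd
BENCHMARKS = {
    "linear write": dict(rw="write", bs="4M", iodepth=32),
    "linear read": dict(rw="read", bs="4M", iodepth=32),
    "random write latency": dict(rw="randwrite", bs="4k", iodepth=1, fsync=True),
    "random read latency": dict(rw="randread", bs="4k", iodepth=1),
    "parallel write iops": dict(rw="randwrite", bs="4k", iodepth=128),
    "parallel read iops": dict(rw="randread", bs="4k", iodepth=128),
}

if __name__ == "__main__":
    for name, opts in BENCHMARKS.items():
        print(f"# {name}")
        print(fio_cmd(**opts))
```
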
## Vitastor's Theoretical Maximum Random Access Performance

Replicated setups:

- Single-threaded (T1Q1) read latency: 1 network roundtrip + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 2 network roundtrips + 1 disk write.
  - With lazy commit: 4 network roundtrips + 1 disk write + 1 disk flush.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops / number of replicas / write amplification)).

EC/XOR setups:

- Single-threaded (T1Q1) read latency: 1.5 network roundtrips + 1 disk read.
- Single-threaded write+fsync latency:
  - With immediate commit: 3.5 network roundtrips + 1 disk read + 2 disk writes.
  - With lazy commit: 5.5 network roundtrips + 1 disk read + 2 disk writes + 2 disk fsyncs.
  - 0.5 is actually (k-1)/k which means that an additional roundtrip doesn't happen when
    the read sub-operation can be served locally.
- Saturated parallel read iops: min(network bandwidth, sum(disk read iops)).
- Saturated parallel write iops: min(network bandwidth, sum(disk write iops * number of data drives / (number of data + parity drives) / write amplification)).
  In fact, you should put disk write iops under the condition of ~10% reads / ~90% writes in this formula.

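The saturated-throughput formulas above translate directly into code. This is a minimal sketch under stated assumptions: network bandwidth is converted to iops by dividing bytes per second by the block size, and all input numbers in the example are illustrative rather than measured. The function names are hypothetical.

```python
def network_iops(bandwidth_bits_per_s: float, block_size: int = 4096) -> float:
    """ASSUMPTION: the network limit in iops is bandwidth / block size."""
    return bandwidth_bits_per_s / 8 / block_size


def saturated_read_iops(net_iops, disk_read_iops):
    return min(net_iops, sum(disk_read_iops))


def saturated_write_iops_replicated(net_iops, disk_write_iops, replicas, write_amplification):
    return min(net_iops, sum(disk_write_iops) / replicas / write_amplification)


def saturated_write_iops_ec(net_iops, disk_write_iops, data_drives, parity_drives,
                            write_amplification):
    usable = sum(disk_write_iops) * data_drives / (data_drives + parity_drives)
    return min(net_iops, usable / write_amplification)


if __name__ == "__main__":
    # Illustrative cluster: 24 drives at 60000 write iops each, 100 Gbit/s aggregate network.
    drives = [60000] * 24
    net = network_iops(100e9)
    print(saturated_write_iops_replicated(net, drives, replicas=2, write_amplification=4))
    print(saturated_write_iops_ec(net, drives, data_drives=2, parity_drives=1,
                                  write_amplification=4))
```

With these inputs the drives, not the network, are the limiting term in both cases; real results also depend on the read/write mix noted above.
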
Write amplification for 4 KB blocks is usually 3-5 in Vitastor:

1. Journal block write
2. Journal data write
3. Metadata block write
4. Another journal block write for EC/XOR setups
5. Data block write

If you manage to get an SSD which handles 512 byte blocks well (Optane?) you may
lower 1, 3 and 4 to 512 bytes (1/8 of data size) and get WA as low as 2.375.

Lazy fsync also reduces WA for parallel workloads because journal blocks are only
written when they fill up or fsync is requested.

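The 2.375 figure follows from summing the five writes listed above. The sketch below assumes every step writes one full block of the given size with no batching, so it ignores the lazy fsync savings; `write_amplification` is a hypothetical helper name.

```python
def write_amplification(data_size=4096, journal_block=4096, meta_block=4096, ec=False):
    """Bytes written to disk per byte of user data for one small write.

    ASSUMPTION: each step writes exactly one block, with no batching.
    """
    writes = [
        journal_block,  # 1. journal block write
        data_size,      # 2. journal data write
        meta_block,     # 3. metadata block write
    ]
    if ec:
        writes.append(journal_block)  # 4. another journal block write (EC/XOR only)
    writes.append(data_size)          # 5. data block write
    return sum(writes) / data_size


if __name__ == "__main__":
    print(write_amplification())                                          # 4.0
    print(write_amplification(ec=True))                                   # 5.0
    print(write_amplification(journal_block=512, meta_block=512, ec=True))  # 2.375
```
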
## Example Comparison with Ceph

Hardware configuration: 4 nodes, each with:

- 6x SATA SSD Intel D3-4510 3.84 TB
- 2x Xeon Gold 6242 (16 cores @ 2.8 GHz)
- 384 GB RAM
- 1x 25 GbE network interface (Mellanox ConnectX-4 LX), connected to a Juniper QFX5200 switch

CPU powersaving was disabled. Both Vitastor and Ceph were configured with 2 OSDs per 1 SSD.

All of the results below apply to 4 KB blocks and random access (unless indicated otherwise).

Raw drive performance:

- T1Q1 write ~27000 iops (~0.037ms latency)
- T1Q1 read ~9800 iops (~0.101ms latency)
- T1Q32 write ~60000 iops
- T1Q32 read ~81700 iops

Ceph 15.2.4 (Bluestore):

- T1Q1 write ~1000 iops (~1ms latency)
- T1Q1 read ~1750 iops (~0.57ms latency)
- T8Q64 write ~100000 iops, total CPU usage by OSDs about 40 virtual cores on each node
- T8Q64 read ~480000 iops, total CPU usage by OSDs about 40 virtual cores on each node

T8Q64 tests were conducted over 8 400GB RBD images from all hosts (every host was running 2 instances of fio).
This is because Ceph has performance penalties related to running multiple clients over a single RBD image.

cephx_sign_messages was set to false during tests, RocksDB and Bluestore settings were left at defaults.

In fact, not that bad for Ceph. These servers are an example of well-balanced Ceph nodes.
However, CPU usage and I/O latency were through the roof, as usual.

Vitastor:

- T1Q1 write: 7087 iops (0.14ms latency)
- T1Q1 read: 6838 iops (0.145ms latency)
- T2Q64 write: 162000 iops, total CPU usage by OSDs about 3 virtual cores on each node
- T8Q64 read: 895000 iops, total CPU usage by OSDs about 4 virtual cores on each node
- Linear write (4M T1Q32): 2800 MB/s
- Linear read (4M T1Q32): 1500 MB/s

T8Q64 read test was conducted over 1 larger inode (3.2T) from all hosts (every host was running 2 instances of fio).
Vitastor has no performance penalties related to running multiple clients over a single inode.
If conducted from one node with all primary OSDs moved to other nodes the result was slightly lower (689000 iops),
this is because all operations resulted in network roundtrips between the client and the primary OSD.
When fio was colocated with OSDs (like in Ceph benchmarks above), 1/4 of the read workload actually
used the loopback network.

Vitastor was configured with: `--disable_data_fsync true --immediate_commit all --flusher_count 8
--disk_alignment 4096 --journal_block_size 4096 --meta_block_size 4096
--journal_no_same_sector_overwrites true --journal_sector_buffer_count 1024
--journal_size 16777216`.

### EC/XOR 2+1

Vitastor:

- T1Q1 write: 2808 iops (~0.355ms latency)
- T1Q1 read: 6190 iops (~0.16ms latency)
- T2Q64 write: 85500 iops, total CPU usage by OSDs about 3.4 virtual cores on each node
- T8Q64 read: 812000 iops, total CPU usage by OSDs about 4.7 virtual cores on each node
- Linear write (4M T1Q32): 3200 MB/s
- Linear read (4M T1Q32): 1800 MB/s

Ceph:

- T1Q1 write: 730 iops (~1.37ms latency)
- T1Q1 read: 1500 iops with cold cache (~0.66ms latency), 2300 iops after 2 minute metadata cache warmup (~0.435ms latency)
- T4Q128 write (4 RBD images): 45300 iops, total CPU usage by OSDs about 30 virtual cores on each node
- T8Q64 read (4 RBD images): 278600 iops, total CPU usage by OSDs about 40 virtual cores on each node
- Linear write (4M T1Q32): 1950 MB/s before preallocation, 2500 MB/s after preallocation
- Linear read (4M T1Q32): 2400 MB/s

### NBD

NBD is currently required to mount Vitastor via kernel, but it imposes additional overhead
due to additional copying between the kernel and userspace. This mostly hurts linear
bandwidth, not iops.

Vitastor with single-thread NBD on the same hardware:

- T1Q1 write: 6000 iops (0.166ms latency)
- T1Q1 read: 5518 iops (0.18ms latency)
- T1Q128 write: 94400 iops
- T1Q128 read: 103000 iops
- Linear write (4M T1Q128): 1266 MB/s (compared to 2800 MB/s via fio)
- Linear read (4M T1Q128): 975 MB/s (compared to 1500 MB/s via fio)

## Installation

### Debian

- Trust Vitastor package signing key:
  `wget -q -O - | sudo apt-key add -`
- Add Vitastor package repository to your /etc/apt/sources.list:
  - Debian 11 (Bullseye/Sid): `deb bullseye main`
  - Debian 10 (Buster): `deb buster main`
- For Debian 10 (Buster) also enable backports repository:
  `deb buster-backports main`
- Install packages: `apt update; apt install vitastor lp-solve etcd linux-image-amd64 qemu`

### CentOS

- Add Vitastor package repository:
  - CentOS 7: `yum install`
  - CentOS 8: `dnf install`
- Enable EPEL: `yum/dnf install epel-release`
- Enable additional CentOS repositories:
  - CentOS 7: `yum install centos-release-scl`
  - CentOS 8: `dnf install centos-release-advanced-virtualization`
- Enable elrepo-kernel:
  - CentOS 7: `yum install`
  - CentOS 8: `dnf install`
- Install packages: `yum/dnf install vitastor lpsolve etcd kernel-ml qemu-kvm`

### Building from Source

- Install Linux kernel 5.4 or newer, for io_uring support. 5.8 or later is highly recommended because
  there is at least one known io_uring hang with 5.4 and an HP SmartArray controller.
- Install liburing 0.4 or newer and its headers.
- Install lp_solve.
- Install etcd, at least version 3.4.15. Earlier versions won't work because of various bugs,
  for example [#12402]( You can also take 3.4.13
  with this specific fix from here:, branch release-3.4.
- Install node.js 10 or newer.
- Install gcc and g++ 8.x or newer.
- Clone with submodules.
- Install QEMU 3.0+, get its source, begin to build it, stop the build and copy headers:
  - `<qemu>/include` &rarr; `<vitastor>/qemu/include`
  - Debian:
    * Use qemu packages from the main repository
    * `<qemu>/b/qemu/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * `<qemu>/b/qemu/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
  - CentOS 8:
    * Use qemu packages from the Advanced-Virtualization repository. To enable it, run
      `yum install centos-release-advanced-virtualization.noarch` and then `yum install qemu`
    * `<qemu>/config-host.h` &rarr; `<vitastor>/qemu/b/qemu/config-host.h`
    * For QEMU 3.0+: `<qemu>/qapi` &rarr; `<vitastor>/qemu/b/qemu/qapi`
    * For QEMU 2.0+: `<qemu>/qapi-types.h` &rarr; `<vitastor>/qemu/b/qemu/qapi-types.h`
  - `config-host.h` and `qapi` are required because they contain generated headers
- You can also rebuild QEMU with a patch that makes LD_PRELOAD unnecessary to load vitastor driver.
  See `patches/qemu-*.*-vitastor.patch`.
- Install fio 3.7 or later, get its source and symlink it into `<vitastor>/fio`.
- Build & install Vitastor with `mkdir build && cd build && cmake .. && make -j8 && make install`.
  Pay attention to the `QEMU_PLUGINDIR` cmake option - it must be set to `qemu-kvm` on RHEL.

## Running

Please note that startup procedure isn't currently simple - you specify configuration
and calculate disk offsets almost by hand. This will be fixed in the near future.

- Get some SATA or NVMe SSDs with capacitors (server-grade drives). You can use desktop SSDs
  with lazy fsync, but prepare for inferior single-thread latency.
- Get a fast network (at least 10 Gbit/s).
- Disable CPU powersaving: `cpupower idle-set -D 0 && cpupower frequency-set -g performance`.
- Check `/usr/lib/vitastor/mon/` and `/usr/lib/vitastor/mon/` and
  put desired values into the variables at the top of these files.
- Create systemd units for the monitor and etcd: `/usr/lib/vitastor/mon/`
- Create systemd units for your OSDs: `/usr/lib/vitastor/mon/ /dev/disk/by-partuuid/XXX [/dev/disk/by-partuuid/YYY ...]`
- You can edit the units and change OSD configuration. Notable configuration variables:
  - `disable_data_fsync 1` - only safe with server-grade drives with capacitors.
  - `immediate_commit all` - use this if all your drives are server-grade.
  - `disable_device_lock 1` - only required if you run multiple OSDs on one block device.
  - `flusher_count 256` - flusher is a micro-thread that removes old data from the journal.
    You don't have to worry about this parameter anymore, 256 is enough.
  - `disk_alignment`, `journal_block_size`, `meta_block_size` should be set to the internal
    block size of your SSDs which is 4096 on most drives.
  - `journal_no_same_sector_overwrites true` prevents multiple overwrites of the same journal sector.
    Most (99%) SSDs don't need this option. But Intel D3-4510 does because it doesn't like when you
    overwrite the same sector twice in a short period of time. The setting forces Vitastor to never
    overwrite the same journal sector twice in a row which makes D3-4510 almost happy. Not totally
    happy, because overwrites of the same block can still happen in the metadata area... When this
    setting is set, it is also required to raise `journal_sector_buffer_count` setting, which is the
    number of dirty journal sectors that may be written to at the same time.
- `systemctl start` everywhere.
- Create global configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/global '{"immediate_commit":"all"}'`
  (if all your drives have capacitors).
- Create pool configuration in etcd: `etcdctl --endpoints=... put /vitastor/config/pools '{"1":{"name":"testpool","scheme":"replicated","pg_size":2,"pg_minsize":1,"pg_count":256,"failure_domain":"host"}}'`.
  For jerasure pools the entry should look like the following: `"2":{"name":"ecpool","scheme":"jerasure","pg_size":4,"parity_chunks":2,"pg_minsize":2,"pg_count":256,"failure_domain":"host"}`.
- At this point, one of the monitors will configure PGs and OSDs will start them.
- You can check PG states with `etcdctl --endpoints=... get --prefix /vitastor/pg/state`. All PGs should become 'active'.

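Because the pool configuration is a single JSON value, it can be convenient to build it programmatically rather than typing it by hand. The sketch below reproduces the two example pools from the steps above; the helper names are hypothetical, and setting `pg_minsize` to the number of data chunks for jerasure pools simply mirrors the example rather than a documented rule. It only prints the `etcdctl` command and does not contact etcd.

```python
import json


def replicated_pool(name, pg_size, pg_minsize, pg_count, failure_domain="host"):
    return {"name": name, "scheme": "replicated", "pg_size": pg_size,
            "pg_minsize": pg_minsize, "pg_count": pg_count,
            "failure_domain": failure_domain}


def jerasure_pool(name, data_chunks, parity_chunks, pg_count, failure_domain="host"):
    # pg_size counts data and parity chunks together.
    # ASSUMPTION: pg_minsize = data_chunks, as in the example above.
    return {"name": name, "scheme": "jerasure", "pg_size": data_chunks + parity_chunks,
            "parity_chunks": parity_chunks, "pg_minsize": data_chunks,
            "pg_count": pg_count, "failure_domain": failure_domain}


# Pool IDs are the JSON object keys, so they must be strings.
pools = {
    "1": replicated_pool("testpool", pg_size=2, pg_minsize=1, pg_count=256),
    "2": jerasure_pool("ecpool", data_chunks=2, parity_chunks=2, pg_count=256),
}

if __name__ == "__main__":
    value = json.dumps(pools, separators=(",", ":"))
    print(f"etcdctl --endpoints=... put /vitastor/config/pools '{value}'")
```
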
### Name an image

```
etcdctl --endpoints=<etcd> put /vitastor/config/inode/<pool>/<inode> '{"name":"<name>","size":<size>[,"parent_id":<parent_inode_number>][,"readonly":true]}'
```

For example:

```
etcdctl --endpoints= put /vitastor/config/inode/1/1 '{"name":"testimg","size":2147483648}'
```

If you specify parent_id the image becomes a CoW clone, i.e. all writes go to the new inode and reads first check it
and then its parent layers. You can then make the parent readonly by updating its entry with `"readonly":true` for safety and
basically treat it as a snapshot.

So to create a snapshot you basically rename the previous upper layer (for example from testimg to testimg@0), make it readonly
and create a new top layer with the original name (testimg) and the previous one as a parent.

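The snapshot procedure boils down to two metadata writes. The sketch below only computes the keys and JSON values for them, using the `testimg` example; it doesn't talk to etcd, the new inode number (2 here) is an assumption and must be any number not already in use, and `snapshot_entries` is a hypothetical helper name.

```python
import json


def snapshot_entries(pool, top_inode, top_meta, new_inode, suffix="@0"):
    """Return {etcd_key: metadata} for the two writes that make a snapshot."""
    # 1. The old top layer is renamed and frozen: it becomes the snapshot.
    snapshot = dict(top_meta, name=top_meta["name"] + suffix, readonly=True)
    # 2. A new, empty top layer takes over the original name and points to it.
    new_top = {"name": top_meta["name"], "size": top_meta["size"], "parent_id": top_inode}
    return {
        f"/vitastor/config/inode/{pool}/{top_inode}": snapshot,
        f"/vitastor/config/inode/{pool}/{new_inode}": new_top,
    }


if __name__ == "__main__":
    current = {"name": "testimg", "size": 2147483648}
    for key, meta in snapshot_entries(1, 1, current, new_inode=2).items():
        value = json.dumps(meta, separators=(",", ":"))
        print(f"etcdctl --endpoints=<etcd> put {key} '{value}'")
```
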
### Run fio benchmarks

fio command example:

```
fio -thread -name=test -bs=4M -direct=1 -iodepth=16 -rw=write -etcd= -image=testimg
```

If you don't want to access your image by name, you can specify pool number, inode number and size
(`-pool=1 -inode=1 -size=400G`) instead of the image name (`-image=testimg`).

### Upload VM image

Use qemu-img and `vitastor:etcd_host=<HOST>:image=<IMAGE>` disk filename. For example:

```
qemu-img convert -f qcow2 debian10.qcow2 -p -O raw 'vitastor:etcd_host=\:2379/v3:image=testimg'
```

Note that the command must be run with `LD_PRELOAD=/usr/lib/x86_64-linux-gnu/qemu/ qemu-img ...`
if you use unmodified QEMU.

You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`
if you don't want to use inode metadata.

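The disk filename format is easy to get wrong because colons separate its fields. A small helper like the one below can build it; this is a sketch only, the backslash escaping of colons inside the host is inferred from the example above, the etcd address used in the demo is made up, and `vitastor_filename` is a hypothetical name.

```python
def vitastor_filename(etcd_host, image=None, pool=None, inode=None, size=None):
    """Build a 'vitastor:...' disk filename for qemu-img or QEMU -drive."""
    # ASSUMPTION: ':' inside the host part is escaped as '\:', as in the example.
    host = etcd_host.replace(":", "\\:")
    if image is not None:
        return f"vitastor:etcd_host={host}:image={image}"
    if pool is None or inode is None or size is None:
        raise ValueError("specify either image, or all of pool, inode and size")
    return f"vitastor:etcd_host={host}:pool={pool}:inode={inode}:size={size}"


if __name__ == "__main__":
    # 10.0.0.1 is a placeholder address, not a real endpoint.
    print(vitastor_filename("10.0.0.1:2379/v3", image="testimg"))
    print(vitastor_filename("10.0.0.1:2379/v3", pool=1, inode=1, size=2147483648))
```
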
### Start a VM

Run QEMU with `-drive file=vitastor:etcd_host=<HOST>:image=<IMAGE>` and use 4 KB physical block size.

For example:

```
qemu-system-x86_64 -enable-kvm -m 1024 \
    -drive 'file=vitastor:etcd_host=\:2379/v3:image=testimg',format=raw,if=none,id=drive-virtio-disk0,cache=none \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1,write-cache=off,physical_block_size=4096,logical_block_size=512 \
    -vnc
```

You can also specify `:pool=<POOL>:inode=<INODE>:size=<SIZE>` instead of `:image=<IMAGE>`,
just like in qemu-img.

### Remove inode

Use vitastor-rm. For example:

```
vitastor-rm --etcd_address --pool 1 --inode 1 --parallel_osds 16 --iodepth 32
```

### NBD

To create a local block device for a Vitastor image, use NBD. For example:

```
vitastor-nbd map --etcd_address --image testimg
```

It will output the device name, like /dev/nbd0 which you can then format and mount as a normal block device.

Again, you can use `--pool <POOL> --inode <INODE> --size <SIZE>` instead of `--image <IMAGE>` if you want.

### Kubernetes

Vitastor has a CSI plugin for Kubernetes which supports RWO volumes.

To deploy it, take manifests from [csi/deploy/](csi/deploy/) directory, put your
Vitastor configuration in [csi/deploy/001-csi-config-map.yaml](csi/deploy/001-csi-config-map.yaml),
configure storage class in [csi/deploy/009-storage-class.yaml](csi/deploy/009-storage-class.yaml)
and apply all `NNN-*.yaml` manifests to your Kubernetes installation:

```
for i in ./???-*.yaml; do kubectl apply -f "$i"; done
```

After that you'll be able to create PersistentVolumes. See example in [csi/deploy/example-pvc.yaml](csi/deploy/example-pvc.yaml).

## Known Problems

- Object deletion requests may currently lead to 'incomplete' objects in EC pools
  if your OSDs crash during deletion because proper handling of object cleanup
  in a cluster should be "three-phase" and it's currently not implemented.
  Just repeat the removal request again in this case.

## Implementation Principles

- I like architecturally simple solutions. Vitastor is and will always be designed
  exactly like that.
- I also like reinventing the wheel to some extent, like writing my own HTTP client
  for etcd interaction instead of using prebuilt libraries, because in this case
  I'm confident about what my code does and what it doesn't do.
- I don't care about C++ "best practices" like RAII or proper inheritance or usage of
  smart pointers or whatever and I don't intend to change my mind, so if you're here
  looking for ideal reference C++ code, this probably isn't the right place.
- I like node.js better than any other dynamically-typed language interpreter
  because it's faster than any other interpreter in the world, has neutral C-like
  syntax and built-in event loop. That's why Monitor is implemented in node.js.

## Author and License

Copyright (c) Vitaliy Filippov (vitalif [at], 2019+

Join Vitastor Telegram Chat:

All server-side code (OSD, Monitor and so on) is licensed under the terms of
Vitastor Network Public License 1.1 (VNPL 1.1), a copyleft license based on
GNU GPLv3.0 with the additional "Network Interaction" clause which requires
opensourcing all programs directly or indirectly interacting with Vitastor
through a computer network and expressly designed to be used in conjunction
with it ("Proxy Programs"). Proxy Programs may be made public not only under
the terms of the same license, but also under the terms of any GPL-Compatible
Free Software License, as listed by the Free Software Foundation.
This is a stricter copyleft license than the Affero GPL.

Please note that VNPL doesn't require you to open the code of proprietary
software running inside a VM if it's not specially designed to be used with
Vitastor.

Basically, you can't use the software in a proprietary environment to provide
its functionality to users without opensourcing all intermediary components
standing between the user and Vitastor or purchasing a commercial license
from the author 😀.

Client libraries (cluster_client and so on) are dual-licensed under the same
VNPL 1.1 and also GNU GPL 2.0 or later to allow for compatibility with GPLed
software like QEMU and fio.

You can find the full text of VNPL-1.1 in the file [VNPL-1.1.txt](VNPL-1.1.txt).
GPL 2.0 is also included in this repository as [GPL-2.0.txt](GPL-2.0.txt).