Commit Graph

203 Commits (nbd-vmsplice)

Author SHA1 Message Date
Vitaliy Filippov 28bd94d2c2 Make diagnostics slightly better 2021-07-18 01:24:38 +03:00
Vitaliy Filippov 148ff04aa8 Do not lose flusher queue entries when an "older object rescan" happens in parallel with flushing of an older version of another object 2021-07-18 01:20:54 +03:00
JiangYu e86df4a2a2 fix BLOCKSTORE_DEBUG, error: ‘dirty_it’ was not declared in this scope
Signed-off-by: JiangYu <lnsyyj@hotmail.com>
2021-07-18 00:46:05 +08:00
Vitaliy Filippov e74af9745e Print journal flusher diagnostics on slow ops 2021-07-17 16:13:41 +03:00
Vitaliy Filippov 0e0509e3da Dump op states in slow operation log 2021-07-16 01:58:50 +03:00
Vitaliy Filippov cb282d25e0 Release 0.6.5
- Basic support for OpenStack: Cinder driver, patches for Nova and libvirt
- Add missing "image" and "config_path" QEMU options
- Calculate aggregate per-pool statistics in monitor
- Implement writes with Check-And-Set semantics
- Add a C wrapper library with public header
2021-07-10 11:01:21 +03:00
Vitaliy Filippov b52dd6843a Rename qemu_rbd_unescape and qemu_rbd_next_tok to *_vitastor_* 2021-07-03 23:14:44 +03:00
Vitaliy Filippov b66160a7ad Aggregate per-pool statistics in mon 2021-07-03 23:14:44 +03:00
Vitaliy Filippov aad7792d3f Check for loops in parent inode chains 2021-06-20 00:23:03 +03:00
Vitaliy Filippov 6ca8afffe5 Add CAS version parameter to the C wrapper 2021-06-19 01:00:52 +03:00
Vitaliy Filippov 511a89948b Rework qemu_proxy into a C wrapper library with public header 2021-06-19 00:39:11 +03:00
Vitaliy Filippov 3de553ecd7 Add a test for CAS write operation 2021-06-15 00:12:35 +03:00
Vitaliy Filippov 891250d355 Implement CAS writes
From now on, reads will return the server-side object version numbers
and writes and deletes will have an additional "version" parameter
which, if set to a non-zero value, will be atomically compared with
the current version of the object plus 1 and the modification will
fail if it doesn't match.

This feature opens the road to correct online flattening of snapshot
layers and other interesting things.
2021-06-15 00:12:35 +03:00
Vitaliy Filippov f9fe72d40a Release 0.6.4
- Implement a basic Kubernetes CSI driver
- Minor fixes for vitastor-nbd
- Fix build without RDMA broken in 0.6.3
2021-05-16 01:38:01 +03:00
Vitaliy Filippov eaac1fc5d1 Log to stderr in etcd_state_client, too 2021-05-16 01:09:25 +03:00
Vitaliy Filippov 57be1923d3 Daemonize NBD_DO_IT process, correctly cleanup unmounted NBD clients 2021-05-16 01:09:25 +03:00
Vitaliy Filippov c467acc388 Fix /v3 appendage to etcd URLs without /v3 2021-05-15 19:22:24 +03:00
Vitaliy Filippov bf591ba3ee Fix nbd module load check 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 699a0fbbc7 Log to stderr instead of stdout in client 2021-05-15 19:22:24 +03:00
Vitaliy Filippov 6b2dd50f27 Fix build without RDMA 2021-05-08 18:20:43 +03:00
Vitaliy Filippov caf2f3c56f Release 0.6.3
- RDMA support
- Client performance optimisations (4k randread ~120k -> ~180k on 1 core)
- JSON configuration file (/etc/vitastor/vitastor.conf) support
- Bug fixes
2021-05-02 17:47:43 +03:00
Vitaliy Filippov 9174f188b1 Build packages with libibverbs
For CentOS 7 it also requires newer rdma-core as CentOS 7's native version doesn't have
implicit ODP support. The updated version is already uploaded into the vitastor repo.
2021-05-02 17:47:16 +03:00
Vitaliy Filippov d3978c6d0e Do not print RDMA connection messages when log_level=0
By the way, it's 1 by default in the OSD, so these messages will still be there in OSD logs
2021-05-01 00:26:09 +03:00
Vitaliy Filippov 4a7365660d Do not wait for down OSDs during sync
Fixes a hang introduced in 0.5.11 in the non-immediate_commit mode
2021-05-01 00:26:07 +03:00
Vitaliy Filippov 818ae5d61d Some config parsing fixes 2021-05-01 00:20:01 +03:00
Vitaliy Filippov f6f35f4127 Pass options correctly to not override /etc/vitastor/vitastor.conf 2021-04-30 01:17:44 +03:00
Vitaliy Filippov 72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
Vitaliy Filippov 5010b0dd75 Use json11 instead of blockstore_config_t 2021-04-30 00:52:46 +03:00
Vitaliy Filippov 483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
Vitaliy Filippov 6a6fd6544d Add RDMA options to the QEMU driver 2021-04-29 11:02:49 +03:00
Vitaliy Filippov 971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires to create A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags and the server may simply run out of queues.
Hardware limit is 262144 on Mellanox ConnectX-4 which amounts to only 2048
'connections' per host. And even with that amount of queues it's still less optimal
than the non-zerocopy one.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
Vitaliy Filippov 9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
Vitaliy Filippov ce777319c3 WIP RDMA support
Basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have place for incoming WRs...
2021-04-29 02:03:54 +03:00
Vitaliy Filippov f8ff39b0ab Rework continue_ops() to remove a CPU hot spot
This rework increases fio -rw=randread -iodepth=128 result from ~120k to ~180k iops :)
2021-04-29 01:50:13 +03:00
Vitaliy Filippov d749159585 Linked list experiment
Rework client operation queue from a vector to a linked list.
This is required to rework continue_ops() as its current implementation
consumes ~25% of client process CPU.
2021-04-29 01:47:33 +03:00
Vitaliy Filippov 9703773a63 Fix has_flushes setting 2021-04-28 23:40:44 +03:00
Vitaliy Filippov 5d8d486f7c Add SOVERSION 2021-04-20 01:01:32 +03:00
Vitaliy Filippov 2b546cdd55 Link vitastor_blk with vitastor_common for timerfd_manager_t
Not really required to operate, but fixes a verify-elf error
2021-04-20 00:51:53 +03:00
Vitaliy Filippov bd7b177707 Report sensitive configuration values instead of the configuration source 2021-04-17 23:11:16 +03:00
Vitaliy Filippov 82e6aff17b Support mapping NBD by the image name 2021-04-17 17:39:55 +03:00
Vitaliy Filippov 57e2c503f7 Rename osd_t::c_cli to msgr 2021-04-17 16:32:09 +03:00
Vitaliy Filippov 715bc8d53d Release 0.6.2
- Fix a possible crash during SYNC when journal fsyncs are enabled
- Fix a memory leak in the chained read implementation
2021-04-15 23:40:06 +03:00
Vitaliy Filippov 0af077701c Fix a possible crash during SYNC when journal fsyncs are enabled 2021-04-15 02:01:50 +03:00
Vitaliy Filippov cac976ce25 Fix a memory leak in the chained read implementation 2021-04-15 01:42:18 +03:00
Vitaliy Filippov acf0646542 Build common sources once 2021-04-15 01:13:34 +03:00
Vitaliy Filippov ede1c1d667 Release 0.6.1
A bugfix for the new "chained read from snapshot" feature
2021-04-14 22:32:23 +03:00
Vitaliy Filippov 38bd51c97f Remove aio_context assertion, it seems it is unneeded 2021-04-14 22:32:15 +03:00
Vitaliy Filippov 966fb763ca Oooops, fix chained reads 2021-04-13 16:19:21 +03:00
Vitaliy Filippov 0b41ffc08d Release 0.6.0
Warning: upgrading from 0.5.x is currently not supported!
Please create an issue if you really need upgrade capability.

New features:
- Snapshots and Copy-on-Write clones
- Inode (image) names
- Inode I/O and space statistics
- Write throttling for smoothing random write workloads in SSD+HDD configurations
2021-04-11 00:49:18 +03:00
Vitaliy Filippov 64eeb79051 Prevent 0.6.x OSDs from talking to 0.5.x
The new protocol is almost compatible - it has bitmaps, but also it has
a "bitmap_length" field. It's not hard to make 0.5-0.6 OSDs and clients
compatible, but for now I just assume nobody needs it.

If I'm wrong and anybody requests to upgrade their production 0.5.x system
to 0.6.x I'll fix it.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov 2a02f3c4c7 Add metadata superblock and check it on start
Refuse to start if the superblock is missing or bad version;
zero out the metadata area when initializing superblock.
2021-04-10 22:26:17 +03:00
Vitaliy Filippov f684d9101a Refuse to start with old journal version 2021-04-10 17:44:12 +03:00
Vitaliy Filippov a1f2f19489 Do not increment inode statistics if the object already exists 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 82c1a7ec67 Fix statistics reporting, split inode number into pool & inode 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2ab423d4ef Implement journaled write throttling for the SSD+HDD case 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4694811eab Add microsecond accuracy to set_timer 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6b988de17d Remove timerfd_interval 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 37efdc2a83 Fix bitmap_set for replicated pools 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 591cad09c9 Fix bitmaps for objects larger than 128K 2021-04-10 17:44:12 +03:00
Vitaliy Filippov b907ad50aa Oops, forgot to add external bitmaps to blockstore in some places 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 5f5b6ef150 Enable chained reads in the client 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 38a3df4a0e Implement chained (optimized) read in the primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6950b8e3a0 Watch inode metadata revisions 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0cea3576fb Add "read bitmaps" operation to secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov f01eea07d3 Add simplified interface to read blockstore bitmaps synchronously 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2c2f08aca2 Shorten some structure names 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d6524670e1 Introduce data distribution locality 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 7aeb2cbac7 Capture all by value in qemu_proxy 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 2612d3198a Introduce image names and metadata storage in etcd
Each inode has: image name, parent inode number & pool, size and readonly flag

Snapshots are created by switching image name to a different inode number
while using the older inode as parent.
2021-04-10 17:44:12 +03:00
Vitaliy Filippov ab39ce2bbb Use clean_entry_bitmap_size instead of entry_attr_size back because of changed bitmap handling 2021-04-10 17:44:12 +03:00
Vitaliy Filippov d0c2e31312 Add a test for snapshots, fix bugs. Now the test passes 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 9038d42327 Fix several snapshot I/O bugs 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 691f066055 Actual snapshot support (untested) 2021-04-10 17:44:12 +03:00
Vitaliy Filippov ffe1cd4c79 Report inode I/O statistics, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 4ae1b84c67 Report inode space usage statistics to etcd, aggregate it in the monitor 2021-04-10 17:44:12 +03:00
Vitaliy Filippov c35963967f Add inode space usage statistics tracking to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 0aa2dd2890 Send bitmaps with primary-reads, actually read bitmaps for READ ops 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6bf88883ac Allocate bitmaps along with stripes to avoid memory fragmentation 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 004f265393 Remove cryptic bitmap inlining from bs_op_t and osd_op_t, use bitmap in primary OSD code 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6107a4d07b Add "external" bitmap support to blockstore 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 95c29b9dc3 Add "external" bitmap support to osd_rmw 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 6909807068 Allow to start the OSD just to flush the journal completely 2021-04-10 17:44:12 +03:00
Vitaliy Filippov 18c72f4835 Correct reenterability fix (now verified with a test)
It's rather funny but 0.5.12 has to be re-published again
2021-04-09 12:10:16 +03:00
Vitaliy Filippov 40b7c21fb1 Followup to 307c1731c1 - fix mark_stable 2021-04-08 15:47:18 +03:00
Vitaliy Filippov efb3678606 Fix qemu-img broken in 0.5.11
Caused by the lack of reenterability of the main cluster_client function
2021-04-08 14:59:20 +03:00
Vitaliy Filippov 8d87e32175 Fix msgr_op.h includes 2021-04-08 01:18:46 +03:00
Vitaliy Filippov b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug reproduced if fio was temporarily stopped with SIGSTOP
during write test and then resumed after 10 seconds. In this case
"pings" were failed for all clients and fio process crashed with
'use-after-free' in keepalive_timer. It happened because it called
stop_client while having a live iterator to the map.
2021-04-07 11:06:31 +03:00
Vitaliy Filippov 97efb9e299 Do not crash on PG re-peering events when operations are in progress 2021-04-07 11:06:31 +03:00
Vitaliy Filippov f6d705383a Fix client connection recovery bugs, add dirty_ops limit 2021-04-07 11:06:31 +03:00
Vitaliy Filippov 68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 04b00003e9 Log ping failures 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 307c1731c1 Forget all dirty_entries before stable big_write or delete during initialisation
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
Vitaliy Filippov 688821665a Remove stoull_full() from etcd_state_client.cpp 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 3e162d95a0 Remove http_client.h include from etcd_state_client.h 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 54f2353f24 Use bitmap granularity for alignment checks 2021-04-03 14:36:04 +03:00
Vitaliy Filippov e47f6fba60 Remove cluster_client_t::stop() 2021-04-03 14:35:42 +03:00
Vitaliy Filippov 883bf84a16 Fix build 2021-04-03 01:47:15 +03:00