Commit Graph

113 Commits (57e2c503f7152cc5be5ebb3a977b7a151faad8a1)

Author SHA1 Message Date
Vitaliy Filippov 68567c0e1f Fix messenger possibly trying to connect to the same OSD twice 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 04b00003e9 Log ping failures 2021-04-07 01:30:38 +03:00
Vitaliy Filippov 307c1731c1 Forget all dirty_entries before stable big_write or delete during initialisation
This fixes a 'double_alloc' assertion in the following case:
- big_write object #1 v1 to block #100
- big_write object #1 v2 to block #101
- big_write object #2 v1 to block #100
2021-04-07 01:30:38 +03:00
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
Vitaliy Filippov 688821665a Remove stoull_full() from etcd_state_client.cpp 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 3e162d95a0 Remove http_client.h include from etcd_state_client.h 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
Vitaliy Filippov 54f2353f24 Use bitmap granularity for alignment checks 2021-04-03 14:36:04 +03:00
Vitaliy Filippov e47f6fba60 Remove cluster_client_t::stop() 2021-04-03 14:35:42 +03:00
Vitaliy Filippov 883bf84a16 Fix build 2021-04-03 01:47:15 +03:00
Vitaliy Filippov 52097c4856 Stop flushing when less than min_flusher_count operations are available (unless a trim is forced) 2021-04-03 00:53:28 +03:00
Vitaliy Filippov e1355cbc74 Report failed operation name in cluster_client 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 8f8b90be7a Add min_flusher_count configuration 2021-04-03 00:53:28 +03:00
Vitaliy Filippov ad9f619370 Skip double allocs when reading journal 2021-04-03 00:53:28 +03:00
Vitaliy Filippov f4769ba7c7 Collapse create+delete journal entry pairs if they're already flushed
Old journal replay mechanism could lead to a double allocation of the same
block and a "Fatal error: tried to overwrite non-zero metadata entry"
2021-04-03 00:53:28 +03:00
Vitaliy Filippov 843b7052d2 Add an assertion when clearing deleted metadata entries, add debug details when freeing blocks 2021-04-03 00:53:28 +03:00
Vitaliy Filippov 4095bcc558 Do not ignore object deletion journal entries when they are preceded by a big write 2021-03-25 11:00:10 +03:00
Vitaliy Filippov 564d64e271 Add some details for debug prints 2021-03-25 11:00:10 +03:00
Vitaliy Filippov cf54741c95 Followup to 05db1308aa
Don't do anything with the object state after errors because
it's freed by PG re-peer in this case
2021-03-25 11:00:10 +03:00
Vitaliy Filippov 18a5fafa2a Fix rollback 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 06f4978085 Fix fsync check in blockstore_flush (data fsyncs were disabled instead of journal fsyncs) 2021-03-25 02:41:58 +03:00
Vitaliy Filippov 7ebf1588c5 Check for immediate_commit==small in the OSD code 2021-03-25 02:41:58 +03:00
Vitaliy Filippov b0ad1e1e6d Remember writes as "unsynced" only after completing them
Previously BS_OP_SYNC could take unfinished writes and add them into the journal before
they were actually completed. This was leading to crashes with the message
"BUG: Unexpected dirty_entry 2000000000001:9f2a0000 v3 unstable state during flush: 338"
2021-03-25 02:41:58 +03:00
Vitaliy Filippov 0949f08407 Extract osd_primary write and sync code into separate files 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 04a1f18fa5 Assign .req as a whole to always zero out the remaining part
Also clear .reply before processing the operation
2021-03-24 14:20:56 +03:00
Vitaliy Filippov cf9a641d66 Skip disconnected OSDs during sync 2021-03-24 14:20:56 +03:00
Vitaliy Filippov 05db1308aa Fix two potential read/write ordering problems (even though not yet seen in tests)
- Write operations could be 'stabilized' and previous versions could be
  purged from OSDs before the removal of version_override and following
  reads could potentially hit different version in EC pools
- Object was marked clean after completing the delete during recovery, so
  reads could in theory hit a deleted version and return nothing
2021-03-24 14:20:56 +03:00
Vitaliy Filippov 98b54ca948 Don't try to "recover" misplaced objects if it would make them degraded 2021-03-21 01:37:23 +03:00
Vitaliy Filippov 23225c5e62 Do not run ping on clients that are not yet connected 2021-03-21 01:37:23 +03:00
Vitaliy Filippov 435045751d Delete objects only after a SYNC during rebalance in the non-immediate_commit mode
Previously OSDs could commit deletes before writes during recovery or rebalance
in the "lazy fsync" (immediate_commit=off) mode which could result in lost objects
2021-03-16 12:48:26 +03:00
Vitaliy Filippov c5fb1d5987 Do not duplicate blockstore operations when io_uring fills up
This bug was leading to OSDs dying with "Assertion `fulfilled == read_op->len' failed"
when testing fio -rw=randread -numjobs=8 -iodepth=128
2021-03-16 12:48:26 +03:00
Vitaliy Filippov 9ac7e75178 Allow to specify etcd URLs for OSDs with http://, do not die with a strange error if -etcd option is missing for fio 2021-03-16 12:48:26 +03:00
Vitaliy Filippov 88671cf745 Fix a bug causing all flushers to wait for an fsync without actually trying to do it
This happened because flusher_count became dynamic and fsync_batch() was comparing the number
of flushers currently ready to do an fsync with the maximum number of flushers. Also the number
wasn't rechecked on every loop which was also incorrect.

Now the interrupted_rebalance test passes even without IMMEDIATE_COMMIT=1.
2021-03-13 17:27:29 +03:00
Vitaliy Filippov ceb9c28de7 Set default log_level before passing config to etcd_state_client 2021-03-13 17:19:45 +03:00
Vitaliy Filippov 299d7d7c95 Use common macro for get_sqe 2021-03-13 17:19:45 +03:00
Vitaliy Filippov d1526b415f Correctly resume writes when OSD is full to return an error 2021-03-13 17:19:45 +03:00
Vitaliy Filippov f49fd53d55 Fix a bug where allocator was unable to allocate up to last (n%64) blocks, add tests for it 2021-03-13 02:19:02 +03:00
Vitaliy Filippov b44f49aab2 Ignore zero OSDs in history osd_sets 2021-03-12 12:40:15 +03:00
Vitaliy Filippov af5155fcd9 Implement "no_recovery" and "no_rebalance" flags 2021-03-11 00:36:31 +03:00
Vitaliy Filippov c4ba24c305 Do not print ping op latency 2021-03-10 02:01:44 +03:00
Vitaliy Filippov bd178ac20f Fix history osd_set check - local OSD is always available! 2021-03-09 02:18:18 +03:00
Vitaliy Filippov ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
Vitaliy Filippov e91ff2a9ec Only forget offline PGs if their state is not changed during reporting 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 086667f568 Do not check PG state key ownership if it doesn't exist yet
This fixes the bug where OSDs were sometimes trying to report updated PG states
infinitely without luck when PGs transitioned from 'starting' to 'peering' too fast
2021-03-08 17:04:10 +03:00
Vitaliy Filippov 1be94da437 Check & remove extra chunks for degraded / incomplete objects, too 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 80e12358a2 Use pg_data_size instead of pg_minsize for object state calculation 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 36c935ace6 Use std::vector for the blockstore submission queue 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 0d8b5e2ef9 Remove unused enqueue_op_first() 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 98f1e2c277 Rework write/sync ordering
Make syncs wait for all previous writes because it's the only way
to make sure that OSDs do not receive incomplete writes in LIST results
during peering when some writes are still in progress.

Also simplify blockstore submission queue logic.
2021-03-08 17:04:10 +03:00
Vitaliy Filippov 21e7686037 Fix possible "assertion failed: pg.inflight >= 0" error during PG stop 2021-03-08 17:04:10 +03:00
Vitaliy Filippov ab21a1908b Check for the dirty PG flag when trying to continue to stop it after sync 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 30d1ccd43e Fix an infinite loop when discarding list operations during stop_pg() 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 8bdd6d8d78 Reset PG state when stopping them 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 09b3e4e789 Fix OSDs being unable to stop PGs that are 'peering', not 'active'
This was sometimes leading to incorrect misplaced and degraded object count statistics
2021-03-08 17:04:10 +03:00
Vitaliy Filippov bc742ccf8c Fix a small memory leak in etcd_state_client 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 314b20437b Do not break subsequent small writes badly when a big write is canceled 2021-03-08 17:04:10 +03:00
Vitaliy Filippov 29d8ac8b1b Do not report statistics for the empty operation 2021-03-01 16:20:57 +03:00
Vitaliy Filippov 6155b23a7e Replace pgs[id] with pgs.at(id) to prevent accidental auto-vivification 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 46e79f3306 Wait for PGs to become clean before stopping them 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 41fd14e024 Fix deletes not increasing write_iodepth 2021-02-28 19:36:59 +03:00
Vitaliy Filippov 2d73b19a6c Fix online PG count change bugs 2021-02-25 23:59:33 +03:00
Vitaliy Filippov c974cb539c Make flusher_count adaptive and limit write iodepth 2021-02-25 23:59:33 +03:00
Vitaliy Filippov bf9a175efc Move C/C++ sources to src subdirectory 2021-02-25 23:59:03 +03:00