Commit Graph

23 Commits (master)

Author SHA1 Message Date
Vitaliy Filippov 5ca7cde612 Experiment/WIP: Try to track "secondary" recovery ops separately 2023-12-31 01:23:17 +03:00
Vitaliy Filippov 225eb2fe3d Support RDMA without ODP by stupidly copying memory. Disable ODP by default
ODP is slower than regular RDMA even with memory copy overhead

Example numbers:
- 3950000 random read iops without ODP vs 240000 iops with ODP
- 1447000 random write iops without ODP vs 101000 iops with ODP

Reference: https://tkygtr6.github.io/pub/ISPASS21_slides.pdf
2023-11-12 15:03:47 +03:00
Vitaliy Filippov 65487da4b1 Do not include msgr_rdma.h into messenger.h 2023-08-24 01:55:35 +03:00
Vitaliy Filippov 32f2c4dd27 Measure scrub statistics 2023-06-17 20:56:26 +03:00
Vitaliy Filippov cdfc74665b Close client FDs only when destroying the client, after handling all async reads/writes
Test / test_change_pg_count (push) Successful in 50s Details
Test / test_change_pg_count_ec (push) Successful in 58s Details
Test / test_change_pg_size (push) Successful in 17s Details
Test / test_create_nomaxid (push) Successful in 19s Details
Test / test_etcd_fail (push) Successful in 58s Details
Test / test_failure_domain (push) Successful in 14s Details
Test / test_interrupted_rebalance (push) Successful in 1m31s Details
Test / test_interrupted_rebalance_imm (push) Successful in 1m0s Details
Test / test_interrupted_rebalance_ec (push) Successful in 1m23s Details
Test / test_interrupted_rebalance_ec_imm (push) Successful in 1m16s Details
Test / test_minsize_1 (push) Successful in 38s Details
Test / test_move_reappear (push) Successful in 54s Details
Test / test_rebalance_verify (push) Successful in 2m26s Details
Test / test_rebalance_verify_imm (push) Successful in 2m7s Details
Test / test_rebalance_verify_ec (push) Successful in 2m51s Details
Test / test_rebalance_verify_ec_imm (push) Successful in 2m15s Details
Test / test_rm (push) Successful in 22s Details
Test / test_snapshot (push) Successful in 40s Details
Test / test_snapshot_ec (push) Successful in 34s Details
Test / test_splitbrain (push) Successful in 23s Details
Test / test_write (push) Successful in 1m7s Details
Test / test_write_xor (push) Successful in 2m13s Details
Test / test_write_no_same (push) Successful in 18s Details
Test / test_heal_pg_size_2 (push) Successful in 5m14s Details
Test / test_scrub (push) Successful in 36s Details
Test / test_scrub_zero_osd_2 (push) Successful in 40s Details
Test / test_scrub_xor (push) Successful in 54s Details
Test / test_scrub_pg_size_3 (push) Successful in 54s Details
Test / test_scrub_pg_size_6_pg_minsize_4_osd_count_6_ec (push) Successful in 1m12s Details
Test / test_scrub_ec (push) Successful in 1m10s Details
Fixes "Client XX command out of sync" sometimes happening on reconnections
2023-05-25 00:52:43 +03:00
Vitaliy Filippov d06ed2b0e7 Implement online config update 2023-03-26 19:21:50 +03:00
Vitaliy Filippov 1a1ba0d1e7 Add set_immediate to ringloop and use it for bs/osd ops to prevent reenterability issues 2023-02-09 17:37:26 +03:00
Vitaliy Filippov 5a10d135f3 Allow to configure block_size, bitmap_granularity and immediate_commit per-pool 2022-08-11 01:56:33 +03:00
Vitaliy Filippov 7d79c58095 Use the larger sockaddr_storage structure 2022-02-12 11:22:56 +03:00
Vitaliy Filippov 20a4406acc Support IPv6 OSD addresses 2021-12-19 10:42:17 +03:00
Vitaliy Filippov 660c3f7b0d Change default RDMA settings to 128x 129K buffers
129K to leave extra space for the header

The problem with 8x 1M buffers is that the following happens with,
for example, 2 OSDs and 4M T1Q1 write:
- Server posts 8 receives
- Client posts 8 sends
- WRs are processed by the RDMA stack, but the OSD doesn't have the time
  to handle them and doesn't refill buffers
- Client posts 1 more send
- RNR retransmission happens and performance drops to zero

Overall it seems that RDMA support should be reworked to use real 'RDMA'
operations i.e. operations writing into remote memory. This has an
additional advantage of avoiding a copy at the receive side of the OSD.
2021-11-21 12:05:52 +03:00
Vitaliy Filippov fc3a1e076a Fix minor bugs in snapshot removal, check it in tests 2021-09-25 19:30:29 +03:00
Vitaliy Filippov 72aa2fd819 Make OSD and client read common configuration from /etc/vitastor/vitastor.conf 2021-04-30 01:11:27 +03:00
Vitaliy Filippov 483c5ab380 Negotiate max_msg instead of max_sge, make buffer settings more conservative :-) 2021-04-29 11:10:35 +03:00
Vitaliy Filippov 971aa4ae4f Implement RDMA receive with memory copying (send remains zero-copy)
This is the simplest and, as usual, the best implementation :)

100% zero-copy implementation is also possible (see rdma-zerocopy branch),
but it requires to create A LOT of queues (~128 per client) to use QPN as a 'tag'
because of the lack of receive tags and the server may simply run out of queues.
Hardware limit is 262144 on Mellanox ConnectX-4 which amounts to only 2048
'connections' per host. And even with that amount of queues it's still less optimal
than the non-zerocopy one.

In fact, newest hardware like Mellanox ConnectX-5 does have Tag Matching
support, but it's still unsuitable for us because it doesn't support scatter/gather
(tm_caps.max_sge=1).
2021-04-29 02:34:45 +03:00
Vitaliy Filippov 9e6cbc6ebc Negotiate max_sge between RDMA client & server 2021-04-29 02:15:20 +03:00
Vitaliy Filippov ce777319c3 WIP RDMA support
Basic naive implementation works, but it's highly non-optimal as
RNR retransmissions occur all the time. RDMA expects the receiver
to always have place for incoming WRs...
2021-04-29 02:03:54 +03:00
Vitaliy Filippov 860ac24762 Add "external" bitmap support to the secondary OSD protocol 2021-04-10 17:44:12 +03:00
Vitaliy Filippov b0b2e7df3c Fix use-after-free in keepalive_timer and rework stop_client()
The bug reproduced if fio was temporarily stopped with SIGSTOP
during write test and then resumed after 10 seconds. In this case
"pings" were failed for all clients and fio process crashed with
'use-after-free' in keepalive_timer. It happened because it called
stop_client while having a live iterator to the map.
2021-04-07 11:06:31 +03:00
Vitaliy Filippov a48e2bbf18 Fix write replay ordering when immediate_commit != all
Previous implementation didn't respect write ordering and could lead
to corrupted data when restarting writes after an OSD outage

Also rework cluster_client queueing logic and add tests for it to verify the correct behaviour
2021-04-03 14:51:52 +03:00
Vitaliy Filippov 829381b335 Extract some definitions to msgr_op.{cpp,h} 2021-04-03 14:36:04 +03:00
Vitaliy Filippov ad577c4aac Add PING operation and timeouts to detect OSD failures when a host goes down 2021-03-09 02:15:38 +03:00
Vitaliy Filippov bf9a175efc Move C/C++ sources to src subdirectory 2021-02-25 23:59:03 +03:00