Non-capacitor SSD flush journal error (with "immediate_commit" set to none) #3

Open
opened 6 months ago by DongliSi · 5 comments

Hi, I think the bug is in journal_flusher_co. It can be reproduced with one node and one OSD; the Vitastor version is 0.5.9 :-).

run osd:
osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
    --immediate_commit none \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 260964352 \
    --data_size 858993459200 \
    --flusher_count 256 \
    --data_device /dev/sdb
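
For reference, these offsets imply roughly a 16 MiB journal (journal_offset 0 up to meta_offset 16777216), about 233 MiB of metadata (up to data_offset 260964352), and an 800 GiB data area. A quick arithmetic check of that layout, assuming each region ends where the next one begins:

echo $(( 16777216 - 0 ))               # journal region: 16777216 bytes = 16 MiB
echo $(( 260964352 - 16777216 ))       # metadata region: 244187136 bytes ≈ 233 MiB
echo $(( 858993459200 / 1073741824 ))  # data region: 800 GiB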

run qemu-img:
qemu-img convert -p a.raw -O raw 'vitastor:etcd_host=127.0.0.1\:2379/v3:pool=1:inode=1:size=85899345920'
The conversion stops here: (4.03/100%)

osd log:
Still waiting to flush journal offset 00001000
[OSD 1] Slow op from client 9: primary_write id=6558 inode=1000000000001 offset=36463000 len=1d000
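
If the same offset (here 00001000) keeps being reported, the flusher looks stalled rather than just slow. One way to check, assuming the OSD's output was redirected to a file such as osd.log (the file name is just an example):

grep 'Still waiting to flush journal offset' osd.log | sort | uniq -c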

The larger the flusher_count setting, the more likely this problem is to occur; setting flusher_count to 1 works fine.
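
To narrow this down, the same OSD command can be rerun varying only --flusher_count (the value 16 below is just an arbitrary intermediate value for bisection; all other options are unchanged):

osd --etcd_address 127.0.0.1:2379/v3 --bind_address 127.0.0.1 --osd_num 1 \
    --immediate_commit none \
    --journal_offset 0 \
    --meta_offset 16777216 \
    --data_offset 260964352 \
    --data_size 858993459200 \
    --flusher_count 16 \
    --data_device /dev/sdb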

Owner

Try v0.5.10 - I fixed exactly this problem during the last few days :)

vitalif closed this issue 6 months ago
Poster

Hi, I can still reproduce this problem with v0.5.10.

This problem appears easily under very high I/O load.

E.g., run the following in a Linux virtual machine: "fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting"

Owner

I.e. does "Still waiting to flush journal offset 00001000" still happen?

Poster

Hi, this problem still exists in v0.5.13, but now the message is "Still waiting to flush journal offset 00418000".

Interestingly, this problem never appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=1G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=60 --group_reporting

But this problem always appears when I run the following command:
fio --name benchmark --filename=/root/fio-tempfile.dat --rw=write --size=10G -bs=4M --ioengine=libaio --fsync=10000 --iodepth=32 --direct=1 --numjobs=16 --runtime=600 --group_reporting
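
The only differences between the two commands are the per-job size (1G vs 10G) and the runtime (60 vs 600 seconds); with numjobs=16 that is roughly 16 GiB versus 160 GiB of total write load, which presumably keeps the journal under pressure for much longer. As a quick check of those totals:

echo $(( 16 * 1 ))   # total data with size=1G and numjobs=16, in GiB
echo $(( 16 * 10 ))  # total data with size=10G and numjobs=16, in GiB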

Owner

Hm... it's strange, I thought I caught most "flush stall" bugs. I'll try to reproduce it, OK.

vitalif reopened this issue 5 months ago