Cephadm provides an upgrade mechanism, but it didn’t work for me right out of the box, so I had to help it along. Let’s first see what exactly needs to be upgraded:
ceph orch upgrade check ceph/ceph 15.2.4
A JSON blob comes back with a list of daemons that have already been upgraded and those that still differ from the requested version, i.e., all of them at this point. Time to get started.
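The start command itself isn’t quoted here; assuming a standard cephadm setup on Octopus, it would presumably be something along these lines (the version is carried over from the check above):

ceph orch upgrade start --ceph-version 15.2.4
ceph orch upgrade status   # reports the target and whether an upgrade is in progress

Once the procedure has been launched, there are a couple of commands worth running to get a summary of what is going on: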
ceph status: general information about the cluster’s state. It’s convenient to keep it running in a separate terminal: watch -n 10 ceph status
ceph status -W cephadm: definitely keep this one running in a terminal to follow the orchestrator’s log.
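Another quick summary worth knowing about is the stock ceph versions command, which counts how many daemons of each type are running each version, so it’s easy to see how far the upgrade has progressed:

ceph versions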
Now we can see that the crash daemons are done too. Good, but then everything seems to get stuck: the status is HEALTH_OK, yet ceph orch ps shows that all the OSDs are still on the old version. The log shows the following.
Upgrade: It is NOT safe to stop osd.0
Aha, the orchestrator is afraid to stop OSD daemons that hold data. Let’s help it out. Here is my situation:
# ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME     STATUS  REWEIGHT  PRI-AFF
-1         14.00000  root default
-3          5.00000      host 01
 0    hdd   1.00000          osd.0     up   1.00000  1.00000
 1    hdd   1.00000          osd.1     up   1.00000  1.00000
 2    hdd   1.00000          osd.2     up   1.00000  1.00000
 3    hdd   1.00000          osd.3     up   1.00000  1.00000
 4    hdd   1.00000          osd.4     up   1.00000  1.00000
-5          5.00000      host 02
 5    hdd   1.00000          osd.5     up   1.00000  1.00000
 6    hdd   1.00000          osd.6     up   1.00000  1.00000
 7    hdd   1.00000          osd.7     up   1.00000  1.00000
 8    hdd   1.00000          osd.8     up   1.00000  1.00000
 9    hdd   1.00000          osd.9     up   1.00000  1.00000
-7          4.00000      host 03
10    hdd   1.00000          osd.10    up   1.00000  1.00000
11    hdd   1.00000          osd.11    up   1.00000  1.00000
12    hdd   1.00000          osd.12    up   1.00000  1.00000
13    hdd   1.00000          osd.13    up   1.00000  1.00000
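Incidentally, the orchestrator’s “NOT safe to stop” verdict can be reproduced by hand: ceph osd ok-to-stop reports whether shutting down the given OSDs would leave any placement groups without enough active replicas, which, as far as I can tell, is essentially the check cephadm relies on (osd.0 here is just an example):

ceph osd ok-to-stop osd.0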
At first, I took the OSDs out one at a time:
ceph osd out osd.0
Then I relaxed and started taking out a whole server’s worth of OSDs at a time. When osd.0 is taken out, after a while, once rebalancing finishes, ceph status -W cephadm starts complaining about osd.1, and ceph orch ps shows that osd.0 has been upgraded. The idea is clear; just don’t forget to bring it back in.
ceph osd in osd.0
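Before bringing an OSD back in, it doesn’t hurt to confirm that its daemon really was restarted on the new version; I’d simply look at its row in ceph orch ps, for example:

ceph orch ps | grep '^osd\.0 '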
In short: take out the OSDs of the first server, wait until cephadm starts complaining about the next ones, then bring them back in. Wait for HEALTH_OK before taking out the next server’s OSDs, or you could lose data… Based on the OSD tree above, the plan, if done from scratch, looks like this (a rough scripted version follows the list):
out 0, 1, 2, 3, 4
wait for complaints about 5
in 0, 1, 2, 3, 4
wait for HEALTH_OK
out 5, 6, 7, 8, 9
wait for complaints about 10
in 5, 6, 7, 8, 9
wait for HEALTH_OK
out 10, 11, 12, 13
check that everything is upgraded with ceph orch ps
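For the record, here is a scripted version of that plan. It is only a sketch under my assumptions: the per-host OSD ids and the target version are hard-coded from the tree above, and instead of eyeballing the cephadm log it polls ceph orch ps for the new version and ceph health for HEALTH_OK.

#!/bin/bash
# Sketch only: OSD ids per host and the target version are assumptions
# taken from the tree above; adapt before running anywhere else.
TARGET=15.2.4

wait_health_ok() {
    # wait for rebalancing to finish and the cluster to settle
    until ceph health | grep -q HEALTH_OK; do sleep 30; done
}

wait_upgraded() {
    # wait until every listed OSD shows the target version in `ceph orch ps`
    for id in "$@"; do
        until ceph orch ps | grep "^osd\.$id " | grep -q "$TARGET"; do
            sleep 30
        done
    done
}

for host_osds in "0 1 2 3 4" "5 6 7 8 9" "10 11 12 13"; do
    for id in $host_osds; do ceph osd out "osd.$id"; done  # let the orchestrator restart them
    wait_upgraded $host_osds
    for id in $host_osds; do ceph osd in "osd.$id"; done   # reintegrate the host's OSDs
    wait_health_ok                                         # do not touch the next host until HEALTH_OK
done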