Recover ceph backed file system in OpenStack instance

I have had problems with one of my instances today. The instance does not reboot properly and when I checked it, I have been able to see the root of the problem, the file system was corrupted. To fix the instance disk file system I followed these steps:

First go to compute node and see what instance is what you are looking for. You can get the instance-id number from nova show <uuid> command output. In my case the instance was 63.

# virsh list
Id Name State
----------------------------------------------------
15 instance-00000111 running
34 instance-00000174 running
61 instance-0000017d running
63 instance-00000177 running

Now, because you want to change the file system, it is a good idea to stop or suspend the instance

# virsh suspend 63
Domain 63 suspended

The instance ephemeral disk is in a CEPH pool so you need to know the ceph image name used by the instance. This can be check in the instance xml definition file:

/var/lib/nova/instances/7d6e2893-7151-4ce0-b706-6bab230ad586/libvirt.xml

You are going to see some line like:

<source protocol="rbd" name="vmtier-10/7d6e2893-7151-4ce0-b706-6bab230ad586_disk">

Now, you need map the image with the rbd kernel module. You will want to do that in some server (it may be different from the compute node) with recent kernel version running and access to the ceph cluster.

# rbd -p vmtier-10 map 7d6e2893-7151-4ce0-b706-6bab230ad586_disk

You can see your mapped devices:

# rbd showmapped
id pool image snap device
1 vmtier-10 7d6e2893-7151-4ce0-b706-6bab230ad586_disk - /dev/rbd1

Assuming that the corrupted file system is in first partition of the disk you can fix it with:

# fsck.ext4 -f -y /dev/rbd1p1

Once the FS is fixed you can unmap the device

rbd unmap /dev/rbd1

And resume the instance or start it if you have stopped it before.

# virsh resume 63
Domain 63 resumed

Leave a Reply Cancel reply