I had problems with one of my instances today. It would not reboot properly, and when I checked it I found the root of the problem: the file system was corrupted. To fix the file system on the instance disk I followed these steps:
First, go to the compute node and identify the instance you are looking for. You can get the libvirt instance name from the nova show <uuid> command output and match it against the virsh list output. In my case the domain Id was 63.
# virsh list
 Id    Name                           State
----------------------------------------------------
 15    instance-00000111              running
 34    instance-00000174              running
 61    instance-0000017d              running
 63    instance-00000177              running
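On a busy compute node it is easy to misread the list, so a small filter can do the lookup. This is only a minimal sketch: the function name domid_from_list is made up for this post, and it assumes the three columns shown above (Id, Name, State).

```shell
#!/bin/sh
# Hypothetical helper (not part of virsh): print the domain Id for a
# given instance name, reading `virsh list` style output from stdin.
domid_from_list() {
    # $1 = instance name, e.g. instance-00000177
    # Header and separator lines never match because their second
    # column is not an instance name.
    awk -v name="$1" '$2 == name { print $1 }'
}

# Usage on the compute node:
#   virsh list | domid_from_list instance-00000177
```

virsh domid instance-00000177 should give the same answer without any parsing; the filter is mostly useful when you already have the list saved somewhere.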
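Now, because you are going to modify the file system, it is a good idea to stop or suspend the instance so that nothing writes to the disk during the repair. Stopping it is the safer option, since a suspended guest still has the file system mounted.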
# virsh suspend 63
Domain 63 suspended
The instance's ephemeral disk is in a Ceph pool, so you need to know the Ceph image name used by the instance. You can check this in the instance's XML definition, for example in the output of virsh dumpxml 63.
You should see a line like:
<source protocol="rbd" name="vmtier-10/7d6e2893-7151-4ce0-b706-6bab230ad586_disk">
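The pool and image names can be pulled out of that line mechanically. The sketch below assumes the attribute order shown above (protocol first, then name) and accepts either quote style; the helper names rbd_source_from_xml, rbd_pool and rbd_image are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read domain XML on stdin and print the
# "pool/image" value of the first rbd <source> element.
rbd_source_from_xml() {
    q="[\"']"   # matches a double or a single quote
    sed -n "s/.*<source protocol=${q}rbd${q} name=${q}\([^\"']*\)${q}.*/\1/p" | head -n 1
}

# Split "pool/image" with plain parameter expansion.
rbd_pool()  { printf '%s\n' "${1%%/*}"; }   # text before the first slash
rbd_image() { printf '%s\n' "${1#*/}"; }    # text after the first slash

# Usage on the compute node:
#   spec=$(virsh dumpxml 63 | rbd_source_from_xml)
#   echo "pool=$(rbd_pool "$spec") image=$(rbd_image "$spec")"
```

If the XML on your system orders the attributes differently, the sed pattern will print nothing and needs adjusting.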
Now you need to map the image with the rbd kernel module. Do this on a server that runs a recent kernel and has access to the Ceph cluster; it does not have to be the compute node.
# rbd -p vmtier-10 map 7d6e2893-7151-4ce0-b706-6bab230ad586_disk
You can see your mapped devices:
# rbd showmapped
id pool      image                                     snap device
1  vmtier-10 7d6e2893-7151-4ce0-b706-6bab230ad586_disk -    /dev/rbd1
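When several images are mapped on the same server, it is worth confirming which device belongs to your image before running fsck on it. A minimal sketch, assuming the column layout shown above with the device in the last column; rbd_device_for is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print the device that `rbd showmapped` reports
# for a given pool and image. The showmapped output is read from stdin.
rbd_device_for() {
    # $1 = pool, $2 = image
    # $NF is the last column, which holds the device path.
    awk -v pool="$1" -v image="$2" '$2 == pool && $3 == image { print $NF }'
}

# Usage on the server where the image is mapped:
#   rbd showmapped | rbd_device_for vmtier-10 7d6e2893-7151-4ce0-b706-6bab230ad586_disk
```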
Assuming that the corrupted file system is on the first partition of the disk, you can fix it with:
# fsck.ext4 -f -y /dev/rbd1p1
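It helps to look at the exit status after the repair, because e2fsck reports its result as a bitmask rather than a simple pass or fail. The helper below is a sketch that decodes the values listed in the e2fsck man page; the function name and the wording of the messages are my own.

```shell
#!/bin/sh
# Hypothetical helper: turn an e2fsck exit status into readable text.
# The status is a bitmask, so several conditions can be set at once.
describe_fsck_status() {
    rc=$1
    if [ "$rc" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    msg=""
    for pair in "1:errors-corrected" "2:reboot-suggested" \
                "4:errors-left-uncorrected" "8:operational-error" \
                "16:usage-error" "32:cancelled" "128:shared-library-error"; do
        bit=${pair%%:*}     # numeric bit value
        text=${pair#*:}     # label for that bit
        if [ $((rc & bit)) -ne 0 ]; then
            msg="$msg $text"
        fi
    done
    echo "${msg# }"         # drop the leading space
}

# Usage:
#   fsck.ext4 -f -y /dev/rbd1p1; describe_fsck_status $?
# For a look before changing anything, fsck.ext4 -n opens the file
# system read-only and answers "no" to every question.
```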
Once the file system is fixed you can unmap the device:
# rbd unmap /dev/rbd1
Finally, resume the instance, or start it if you stopped it earlier:
# virsh resume 63
Domain 63 resumed