pg X.Y is stuck stale for , current state stale+active+clean, last acting [N]

I got these states when I removed the last OSD assigned to a pool with size 1 in the crushmap. Of course, I didn’t have any precious data in it, but to avoid removing the pool I tried reassigning it to a new root and new OSDs through the crushmap rule.
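
For reference, that kind of reassignment can be done with commands roughly like these (newroot, newroot-rule and mypool are placeholders; on newer Ceph releases the pool property is crush_rule instead of crush_ruleset):

# ceph osd crush rule create-simple newroot-rule newroot host
# ceph osd crush rule dump newroot-rule
# ceph osd pool set mypool crush_ruleset <ruleset-id>

The ruleset id for the last command comes from the rule dump output.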

ceph health detail
HEALTH_WARN 9 pgs stale; 9 pgs stuck stale
pg 18.6 is stuck stale for 8422.941233, current state stale+active+clean, last acting [13]
pg 18.1 is stuck stale for 8422.941247, current state stale+active+clean, last acting [13]
pg 19.0 is stuck stale for 8422.941251, current state stale+active+clean, last acting [13]
pg 18.0 is stuck stale for 8422.941252, current state stale+active+clean, last acting [13]
pg 19.1 is stuck stale for 8422.941255, current state stale+active+clean, last acting [13]
pg 18.3 is stuck stale for 8422.941254, current state stale+active+clean, last acting [13]
pg 19.2 is stuck stale for 8422.941258, current state stale+active+clean, last acting [13]
pg 18.2 is stuck stale for 8422.941259, current state stale+active+clean, last acting [13]
pg 19.3 is stuck stale for 8422.941263, current state stale+active+clean, last acting [13]

The PGs show that their last acting (and now removed) OSD was number 13, and indeed, this OSD no longer exists in the cluster.

If I try querying the pg:

# ceph pg 18.6 query
Error ENOENT: i don't have pgid 18.6

The data inside those PGs is not valid, so I tried recreating the PG:

# ceph pg force_create_pg 18.6
pg 18.6 now creating, ok

Remember that I had reassigned the pool to a new root in the crushmap, so there are plenty of OSDs available for the pool. But now the PG is stuck in the “creating” state forever:

pg 18.6 is stuck inactive since forever, current state creating, last acting []

I suspected that the problem was the pool’s PG number: the pool couldn’t create more PGs because of its pg_num setting.
I tried increasing the pool’s pg_num and finally the PGs were created correctly.
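
For completeness, the commands I mean are along these lines (the pool name and the target value are placeholders; pgp_num usually needs to be raised to match pg_num as well):

# ceph osd pool set mypool pg_num 64
# ceph osd pool set mypool pgp_num 64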

I followed these steps for documentation purposes, but if you don’t mind losing the data inside the pool, the best option is to remove the pool and create it again.
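
If you go that way, the commands are something like this (pool name and PG counts are placeholders; the pool name is given twice on purpose, and depending on the Ceph version the monitors may have to allow pool deletion):

# ceph osd pool delete mypool mypool --yes-i-really-really-mean-it
# ceph osd pool create mypool 64 64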

 

Installing mod_pagespeed on a cPanel server

PageSpeed is a very popular module for Apache and Nginx, developed by Google, that can optimize many aspects of your web server.

 

Installation

You can follow these steps to install it on a server with cPanel:

$> /usr/local/cpanel/3rdparty/bin/git clone https://github.com/pagespeed/cpanel.git /tmp/pagespeed/
$> cd /tmp/pagespeed/Easy
$> tar -zcvf Speed.pm.tar.gz pagespeed
$> mkdir -p /var/cpanel/easy/apache/custom_opt_mods/Cpanel/Easy
$> mv Speed.pm Speed.pm.tar.gz -t /var/cpanel/easy/apache/custom_opt_mods/Cpanel/Easy/
$> cd && rm -rf /tmp/pagespeed

Now you can go to EasyApache in your WHM and compile the module; you have to check Mod PageSpeed on the Apache modules configuration page.

PageSpeed Configuration

The module configuration file is:

/usr/local/apache/conf/pagespeed.conf

To activate/deactivate the module you can set this option on/off:

ModPagespeed on

Tmpfs for the pagespeed cache

Due to the high IOPS load that the cache may generate, it is a good idea to put it in memory if your server has enough RAM. This ensures that cache accesses will be very fast.

To configure a 1.5 GB tmpfs cache you have to add this line to your /etc/fstab file:

tmpfs			/var/mod_pagespeed/cache tmpfs  rw,uid=99,gid=99,size=1500m,mode=0777 0 0

After that, simply run the mount /var/mod_pagespeed/cache command.
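
You can check that the tmpfs is really in place with something like:

# df -h /var/mod_pagespeed/cache
# mount | grep mod_pagespeed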

The uid and gid must match the user that the web server runs as.
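
On a typical cPanel server Apache runs as the nobody user, which is where the uid/gid 99 in the example above come from; you can confirm the values on your own server with:

# id nobody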

Some parameters to configure the cache:

 
# Cache max size
ModPagespeedFileCacheSizeKb          768000 
 
# Cache inode limit
ModPagespeedFileCacheInodeLimit      500000 

The cache expires old entries every 10 minutes when the ModPagespeedFileCacheSizeKb or ModPagespeedFileCacheInodeLimit limit is reached:

  ModPagespeedFileCacheCleanIntervalMs 600000 
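
Putting it all together, the cache part of pagespeed.conf ends up looking roughly like this (the path assumes the tmpfs mount described above):

ModPagespeedFileCachePath            "/var/mod_pagespeed/cache/"
ModPagespeedFileCacheSizeKb          768000
ModPagespeedFileCacheInodeLimit      500000
ModPagespeedFileCacheCleanIntervalMs 600000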

If you want to disable pagespeed for some accounts you can do it through the .htaccess file:

<IfModule pagespeed_module>
	ModPagespeed off
</IfModule>

Side effects after installing PageSpeed on cPanel

I got an unwanted side effect after installing PageSpeed on cPanel shared servers. The pagespeed module makes its own HTTP requests through its internal proxy to fetch resources when they cannot be found in the cache. These requests originate from the server itself and the user-agent is something like “Serf/1.1.0 mod_pagespeed/1.9.32.4-7251”.

These requests are logged in the domain access logs in /usr/local/apache/domlogs and in the logs used to audit the bandwidth used by each domain, which inevitably leads to an increase in the traffic accounted to each domain.

192.168.1.10 - - [01/Sep/2015:14:52:26 +0200] "GET /style.css HTTP/1.1" 200 1861 "http://yourdomain.com/" "Serf/1.1.0 mod_pagespeed/1.9.32.4-7251"

Solution: Don’t log these internal requests made by mod_pagespeed in the apache logs!

Setting up SetEnvIF module to detect pagespeed requests

We are going to use the SetEnvIf Apache module to detect the specific traffic made by pagespeed and set an environment variable. We will use this variable later to avoid logging these requests.

Edit pre_main_global.conf in cpanel and add the following configuration:

/usr/local/apache/conf/includes/pre_main_global.conf
<IfModule mod_setenvif.c>
    SetEnvIfNoCase User-Agent "^Serf/.* mod_pagespeed/.*$" dontlog
    #SetEnvIf Remote_Addr "IP_SERVER" dontlog
</IfModule>

Now, if the User-Agent used by pagespeed is detected, the dontlog variable is set. You can also set the variable when the request is made by the server itself (the commented second line).

Set virtual host logging depending on the dontlog variable

Because cPanel rebuilds the Apache configuration file (httpd.conf) on every compilation, you don’t want to make the changes directly in the httpd.conf file. Instead, you should change the templates provided by cPanel to be sure the changes are kept.

Depending on the Apache version you are running, the templates are located in:

/var/cpanel/templates/apache2_2/
/var/cpanel/templates/apache2_4/

To make custom changes to the templates you have to make a local copy of the template you want to modify and make your changes in that copy:

# cp /var/cpanel/templates/apache2_4/vhost.default /var/cpanel/templates/apache2_4/vhost.local

Edit the vhost.local template and add env=!dontlog to the lines beginning with the CustomLog directive:

[% IF logstyle == 'combined' -%]
    [%- IF !enable_piped_logs || !supported.mod_log_config -%]
    CustomLog [% paths.dir_domlogs %]/[% wildcard_safe(vhost.log_servername) %] combined env=!dontlog
    [%- END %]
[% ELSE %]
    TransferLog [% paths.dir_domlogs %]/[% wildcard_safe(vhost.log_servername) %]
[% END -%]
[% IF supported.mod_log_config && supported.mod_logio && !enable_piped_logs -%]
    CustomLog [% paths.dir_domlogs %]/[% wildcard_safe(vhost.log_servername) %]-bytes_log "%{%s}t %I .\n%{%s}t %O ." env=!dontlog
[% END -%]

This setting tells Apache not to log anything when the dontlog variable is set.
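
Once the configuration is rebuilt, the rendered vhost entry for a domain should look something like this (yourdomain.com is just an example):

CustomLog /usr/local/apache/domlogs/yourdomain.com combined env=!dontlog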

The last step is to rebuild the Apache configuration file:

/scripts/rebuildhttpdconf

Now you can verify in your domain access logs that the pagespeed internal requests are no longer being logged.
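
A quick way to check it is to grep a domain log for the pagespeed User-Agent; after the change it should not show any new entries:

# grep Serf /usr/local/apache/domlogs/yourdomain.com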

Recover ceph backed file system in OpenStack instance

I had problems with one of my instances today. The instance did not reboot properly, and when I checked it I was able to see the root of the problem: the file system was corrupted. To fix the instance disk file system I followed these steps:

First, go to the compute node and find which instance is the one you are looking for. You can match it against the nova show <uuid> command output. In my case the instance was number 63.
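
If you only have the instance UUID, the libvirt domain name is also shown in the OS-EXT-SRV-ATTR:instance_name field of nova show, for example:

# nova show 7d6e2893-7151-4ce0-b706-6bab230ad586 | grep instance_name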

# virsh list
 Id    Name                 State
----------------------------------------------------
 15    instance-00000111    running
 34    instance-00000174    running
 61    instance-0000017d    running
 63    instance-00000177    running

Now, because you are going to change the file system, it is a good idea to stop or suspend the instance:

# virsh suspend 63
Domain 63 suspended

The instance’s ephemeral disk is in a Ceph pool, so you need to know the Ceph image name used by the instance. This can be checked in the instance XML definition file:

/var/lib/nova/instances/7d6e2893-7151-4ce0-b706-6bab230ad586/libvirt.xml

You will see a line like this:

<source protocol="rbd" name="vmtier-10/7d6e2893-7151-4ce0-b706-6bab230ad586_disk">
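
A quick way to pull that line out of the XML without opening the whole file:

# grep 'protocol="rbd"' /var/lib/nova/instances/7d6e2893-7151-4ce0-b706-6bab230ad586/libvirt.xml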

Now you need to map the image with the rbd kernel module. You will want to do that on a server (it may be different from the compute node) running a recent kernel version and with access to the Ceph cluster.

# rbd -p vmtier-10 map 7d6e2893-7151-4ce0-b706-6bab230ad586_disk

You can see your mapped devices:

# rbd showmapped
id pool image snap device
1 vmtier-10 7d6e2893-7151-4ce0-b706-6bab230ad586_disk - /dev/rbd1

Assuming that the corrupted file system is in the first partition of the disk, you can fix it with:

# fsck.ext4 -f -y /dev/rbd1p1
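
If you are not sure which partition holds the file system, you can list the partitions of the mapped device first (assuming it was mapped as /dev/rbd1, as in the showmapped output above):

# fdisk -l /dev/rbd1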

Once the FS is fixed you can unmap the device:

# rbd unmap /dev/rbd1

And resume the instance, or start it if you stopped it before.

# virsh resume 63
Domain 63 resumed