venet-like access in Linux Containers (LXC)

I’d been using OpenVZ containers in Proxmox for a while, and after upgrading to Proxmox 4, OpenVZ was removed in favor of LXC containers. Although LXC containers have a lot of great features, the way they access the network is not ideal if untrusted users are using them, because the network device in the container is attached directly to the bridge.

I really liked the venet devices in OpenVZ because, with them, the container only has access to layer 3 of the network. In this post I try to get venet-like access with LXC containers, using ebtables rules to limit that access.

In my case I created a new bridge (vmbr02) on a separate VLAN (550) for the containers; the resulting VLAN-tagged bridge device is vmbr02v550. The idea is that the Proxmox node is going to be the gateway at layer 2 and layer 3 for the containers, and the MAC of each container is going to be masqueraded with the MAC of the bridge on the Proxmox node. Because I already have a gateway configured on the Proxmox node, I am going to create a new route table for this network.
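For reference, if you define the VLAN bridge yourself instead of letting Proxmox create it, the definition might look something like this minimal sketch in /etc/network/interfaces (bond0.550 as the physical port is taken from the brctl output later in the post; adjust to your setup):

/etc/network/interfaces

# Sketch: VLAN-tagged bridge for the containers (assumption: bond0.550 is the physical port)
auto vmbr02v550
iface vmbr02v550 inet manual
        bridge_ports bond0.550
        bridge_stp off
        bridge_fd 0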

First we add the IP (10.10.10.10) to the new bridge on the Proxmox node. This IP will be the gateway for the containers.

sudo ip addr add 10.10.10.10/32 dev vmbr02v550 
sudo ip route add 10.10.10.0/24 dev vmbr02v550 src 10.10.10.10

Create the new route table for the new network in VLAN 550:

sudo echo "550 vlan550" >> /etc/iproute2/rt_tables

Populate the new table with the real gateway of this network (10.10.10.1)

sudo ip route add throw 10.10.10.0/24 table 550
sudo ip route add default via 10.10.10.1 table 550
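To double-check, you can list the new table; it should show the throw route and the default via 10.10.10.1:

sudo ip route show table vlan550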

Add a rule so that traffic coming in from the bridge looks up table 550:

sudo ip rule add from 10.10.10.0/24 iif vmbr02v550 lookup vlan550

Enable IP forwarding, because the Proxmox node will be the router for the LXC containers. Since we only want IP forwarding on the bridge interface, we can enable it for that interface only:

sudo sysctl -w net.ipv4.conf.vmbr02v550.forwarding=1

or

echo 1 > /proc/sys/net/ipv4/conf/vmbr02v550/forwarding

and to make it permanent across reboots:

echo "net.ipv4.conf.vmbr02v550.forwarding= 1" >> /etc/sysctl.conf

I added this iptables rule because I don’t want the server to be reachable on this IP address; it is only going to be used to provide networking to the containers.

iptables -A INPUT -d 10.10.10.10 -j REJECT

At this point you should already have the LXC container created. In the following example we have an LXC container with ID 150:

$ sudo pct list
VMID       Status     Lock         Name                
150        stopped                 pruebas-lxc         
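For reference, the container’s network interface on this bridge (net1, which shows up on the host as veth150i1) could be configured along these lines; the IP and gateway match the example addresses used later in this post, and the interface name and prefix length are assumptions:

# Assumption: eth1 inside the container and a /24 prefix; adjust to your setup
pct set 150 -net1 name=eth1,bridge=vmbr02v550,ip=10.10.10.11/24,gw=10.10.10.10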

The bridge vmbr02v550 has one port connected to the physical network (bond0.550) and one virtual port connected to the container (veth150i1). Note that the container must be running to see the virtual port up:

$ sudo brctl show vmbr02v550
bridge name	bridge id		STP enabled	interfaces
vmbr02v550		8000.002590911cee	no		bond0.550
							veth150i1

Now it’s time to add some ebtables rules to limit the traffic forwarded on the bridge. These are general rules used for all the containers:

# ebtables rules in the table filter and chain forward
ebtables -A FORWARD -o veth+ --pkttype-type multicast -j DROP #1
ebtables -A FORWARD -i veth+ --pkttype-type multicast -j DROP #2
ebtables -A FORWARD -o veth+ --pkttype-type broadcast -j DROP #3
ebtables -A FORWARD -i veth+ --pkttype-type broadcast -j DROP #4
ebtables -A FORWARD -p IPv4 -i veth+ -o bond0.550 --ip-dst 10.10.10.0/24 -j ACCEPT #5
ebtables -A FORWARD -p IPv4 -i bond0.550 -o veth+ --ip-dst 10.10.10.0/24 -j ACCEPT #6
ebtables -A FORWARD -o veth+ -j DROP #7
ebtables -A FORWARD -i veth+ -j DROP #8

In these rules we use the expression veth+ to match all the LXC virtual ports that can be connected to the bridge. A short explanation of each rule:

  • #1 and #2 => Drop all multicast packets delivered to and from the LXC virtual ports
  • #3 and #4 => Drop all broadcast packets delivered to and from the LXC virtual ports
  • #5 and #6 => Only allow IPv4 forwarding between the containers and the physical port when the destination IP is in the LXC containers’ network
  • #7 and #8 => Drop all packets that don’t match the above rules. We don’t want any other layer 2 traffic.

We also need some ebtables rules in the nat table, but in this case the rules are set per container. For this example we assume the following:

  • Proxmox node MAC address: 0:25:90:91:1c:ee
  • LXC real MAC address: 66:36:61:62:32:31
  • LXC IP address: 10.10.10.11
#Ebtables rule to translate packets addressed to the LXC container IP so they are delivered to its real MAC
ebtables -t nat -A PREROUTING -p IPv4 -d 0:25:90:91:1c:ee -i bond0.550 --ip-dst 10.10.10.11 -j dnat --to-dst 66:36:61:62:32:31 --dnat-target ACCEPT
#Ebtables rule to reply to ARP requests looking for the MAC of the LXC container with the MAC of the host:
ebtables -t nat -A PREROUTING -i bond0.550 -p ARP --arp-op Request --arp-ip-dst 10.10.10.11 -j arpreply --arpreply-mac 0:25:90:91:1c:ee
# I prefer the ebtables rule, but the above can also be achieved with the following arp command:
# arp -i vmbr02v550 -Ds 10.10.10.11 vmbr02v550 pub
#Ebtables rule to masquerade the LXC container MAC with the MAC of the host
ebtables -t nat -A POSTROUTING -s 66:36:61:62:32:31 -o bond0.550 -j snat --to-src 0:25:90:91:1c:ee --snat-arp --snat-target ACCEPT
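You can double-check that everything is in place by listing both ebtables tables:

ebtables -L
ebtables -t nat -L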

And we are done! With this configuration the LXC container, although connected directly to the Linux bridge, has only limited access to the network.

High Availability RabbitMQ cluster in OpenStack

The RabbitMQ service is the heart of inter-process communication in OpenStack, and in a production deployment you want to configure a RabbitMQ cluster in order to achieve high availability for the message queues.

There are two types of RabbitMQ nodes: disk nodes and RAM nodes. RAM nodes require fewer IOPS because their resource management data is not written to disk, but every cluster requires at least one disk node.

In this post I’m going to configure a three-node cluster: one disk node and two RAM nodes.

Installing the RabbitMQ server in all your nodes is as simple as running these commands:

$ sudo echo "deb http://www.rabbitmq.com/debian/ testing main" > /etc/apt/sources.list.d/rabbitmq.list
$ sudo wget -O - https://www.rabbitmq.com/rabbitmq-signing-key-public.asc | sudo apt-key add - 
$ sudo apt-get update
$ sudo apt-get install rabbitmq-server

To start from scratch, stop the RabbitMQ application on your nodes and reset the queues and configuration:

$ sudo rabbitmqctl stop_app
$ sudo rabbitmqctl reset
$ sudo rabbitmqctl start_app

Note: After the cluster is created, the RAM nodes can be reset without problems, but the disk node cannot be reset while it is the only disk node in the cluster. To reset it, remove its data from disk:

rabbitmqctl stop_app
/etc/init.d/rabbitmq-server stop
rm -rf /var/lib/rabbitmq/mnesia/rabbit@node1*
/etc/init.d/rabbitmq-server start
rabbitmqctl start_app

Now, if you run the cluster status command, you will see the cluster running with only your disk node:

rabbitmqctl cluster_status
Cluster status of node 'rabbit@node1' ...
[{nodes,[{disc,['rabbit@node1']},
         {ram,[]}]},
 {running_nodes,['rabbit@node1']},
 {cluster_name,<<"rabbit@node1">>},
 {partitions,[]}]

On the disk node, create the user for the OpenStack services, set its permissions and set the cluster name:

$ sudo rabbitmqctl add_user openstack openstack_pass
$ sudo rabbitmqctl set_permissions -p / openstack ".*" ".*" ".*"
$ sudo rabbitmqctl set_cluster_name openstack

Set the HA policy for the queues to ensure that all queues except those with auto-generated names are mirrored across all running nodes:

rabbitmqctl set_policy ha-all '^(?!amq\.).*' '{"ha-mode": "all"}'
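You can verify that the policy is in place with:

rabbitmqctl list_policies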

To form a RabbitMQ cluster, all the members have to share the same Erlang cookie. Find the cookie on the first node and copy it to the other nodes; it is located at:

$ cat /var/lib/rabbitmq/.erlang.cookie
SRITXWMZBCBIRFZMQOAQ
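For example, you could copy it over from node1 and fix its ownership and permissions, a sketch assuming root SSH access from node2 and node3 to node1:

# Run on node2 and node3 (assumption: root SSH access to node1)
$ sudo scp node1:/var/lib/rabbitmq/.erlang.cookie /var/lib/rabbitmq/.erlang.cookie
$ sudo chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie
$ sudo chmod 400 /var/lib/rabbitmq/.erlang.cookie
$ sudo service rabbitmq-server restart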

Join the other two nodes to the cluster as RAM nodes. To do that, run the following commands on node2 and node3:

$ sudo rabbitmqctl stop_app
$ sudo rabbitmqctl join_cluster --ram rabbit@node1
$ sudo rabbitmqctl start_app

The cluster is now complete:

$ sudo rabbitmqctl cluster_status
Cluster status of node 'rabbit@node1' ...
[{nodes,[{disc,['rabbit@node1']},
         {ram,['rabbit@node2','rabbit@node3']}]},
 {running_nodes,['rabbit@node1','rabbit@node2',
                 'rabbit@node3']},
 {cluster_name,<<"openstack">>},
 {partitions,[]}]

As an additional step you can enable the rabbitmq management plugin in one or all of your nodes:

$ sudo rabbitmq-plugins enable rabbitmq_management

Create a new user for the management interface:

$ sudo rabbitmqctl add_user admin admin_pass
$ sudo rabbitmqctl set_permissions -p / admin ".*" ".*" ".*"
$ sudo rabbitmqctl set_user_tags  admin administrator

And finally go to your browser and type:

http://server-name:15672
user: admin
pass: admin_pass

Once you have configured your RabbitMQ cluster, you can configure the OpenStack services to use the cluster and mirrored queues. In any case, check the OpenStack documentation for each service:

[oslo_messaging_rabbit]
rabbit_hosts=node1:5672,node2:5672,node3:5672
rabbit_retry_interval=1
rabbit_retry_backoff=2
rabbit_max_retries=0
rabbit_ha_queues=true
rabbit_userid = openstack
rabbit_password = openstack_pass
amqp_auto_delete = true
amqp_durable_queues=True

Changing the IP address of an OpenStack instance

Sometimes you accidentally break an instance, and when you launch a new one to replace it you want the new instance to have the IP address that the old instance had.

The steps are very simple. First you have to remove the port that is associated with the old instance; you can identify it by its current IP address:

$ sudo neutron port-list --fixed-ips ip_address=172.16.2.128
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| id                                   | name | mac_address       | fixed_ips                                                                           |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+
| aa2e65ec-bf1c-44c7-b38b-0b27fcc41d8f |      | fa:16:3e:05:f2:fe | {"subnet_id": "6be599d7-702f-4e54-b18d-3dfca1441617", "ip_address": "172.16.2.128"} |
+--------------------------------------+------+-------------------+-------------------------------------------------------------------------------------+

With the instance stopped, remove the port:

$ sudo neutron port-delete aa2e65ec-bf1c-44c7-b38b-0b27fcc41d8f
Deleted port: aa2e65ec-bf1c-44c7-b38b-0b27fcc41d8f

And now attach a new one with the desired IP address to your instance (c842228b-71e3-49d6-a5b5-33e6416e2669):

$ sudo nova interface-attach --fixed-ip 172.16.2.106 --net-id 26f6d6f9-0ff6-4825-99e8-35c3821f855f  c842228b-71e3-49d6-a5b5-33e6416e2669
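If you want to double-check, you can list the new port by its address, just like before:

$ sudo neutron port-list --fixed-ips ip_address=172.16.2.106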

That’s all. Now you can start your instance with the new IP address.

Live migration with OpenStack on Ubuntu 14.04

In this post I’m going to configure the compute nodes to enable live migration of KVM instances backed by Ceph. In my setup, Cinder volumes and the Nova ephemeral disks are backed by Ceph, so all the compute nodes can see all the storage.

Assuming that Cinder and Nova are correctly integrated with Ceph, we have to follow these steps to set up live migration:

In the libvirt-bin service configuration file we have to add the -l flag to the libvirt-bin service arguments so it listens on a TCP socket.
/etc/default/libvirt-bin

# Defaults for libvirt-bin initscript (/etc/init.d/libvirt-bin)
# This is a POSIX shell fragment

# Start libvirtd to handle qemu/kvm:
start_libvirtd="yes"

# options passed to libvirtd, add "-l" to listen on tcp
libvirtd_opts="-d -l"

In libvirtd configuration, set the options needed to listen on tcp:
/etc/libvirt/libvirtd.conf

# Flag listening for secure TLS connections on the public TCP/IP port.
listen_tls = 0
# Listen for unencrypted TCP connections on the public TCP/IP port.
listen_tcp = 1
tcp_port = "16509"
# Override the default configuration which binds to all network
# interfaces. This can be a numeric IPv4/6 address, or hostname
listen_addr = "172.17.16.117"
# Authentication.
#
#  - none: do not perform auth checks. If you can connect to the
#          socket you are allowed. This is suitable if there are
#          restrictions on connecting to the socket (eg, UNIX
#          socket permissions), or if there is a lower layer in
#          the network providing auth (eg, TLS/x509 certificates)
auth_unix_ro = "none"
auth_unix_rw = "none"
auth_tcp = "none"

Because we are setting no authentication for TCP connections, in a production environment you should take other measures to ensure that only certain servers are allowed to connect to this port, for example using iptables.
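A minimal sketch of such a restriction, assuming the compute nodes live in 172.17.16.0/24:

# Sketch (assumption: compute nodes are in 172.17.16.0/24); adjust to your network
iptables -A INPUT -p tcp --dport 16509 -s 172.17.16.0/24 -j ACCEPT
iptables -A INPUT -p tcp --dport 16509 -j DROP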

Configure the QEMU processes to run as the root user and group.
/etc/libvirt/qemu.conf

# The user for QEMU processes run by the system instance. It can be
# specified as a user name or as a user id. The qemu driver will try to
# parse this value first as a name and then, if the name doesn't exist,
# as a user id.
user = "root"
# The group for QEMU processes run by the system instance. It can be
# specified in a similar way to user.
group = "root"
# Whether libvirt should dynamically change file ownership
# to match the configured user/group above. Defaults to 1.
# Set to 0 to disable file ownership changes.
dynamic_ownership = 0

Once the changes are made restart the libvirt-bin service:

$ sudo service libvirt-bin restart
libvirt-bin stop/waiting
libvirt-bin start/running, process 21411

Check that libvirt-bin is listening on TCP port 16509:

$ sudo netstat -npta | grep 16509  
tcp        0      0 172.17.16.117:16509     0.0.0.0:*               LISTEN    21411/libvirtd  

Set the needed live migration flags in the [libvirt] section of nova.conf:
/etc/nova/nova.conf

[libvirt]
[..]
live_migration_flag=VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_TUNNELLED
live_migration_uri=qemu+tcp://%s/system

Assuming that the compute nodes have different hardware, you have to set a common CPU model in the nova.conf configuration file. You can set kvm64, the most compatible model across Intel and AMD platforms, or, if you have Intel CPUs like me, you can set SandyBridge. In any case, the model you select must be supported on all compute nodes.

/etc/nova/nova.conf

[libvirt]
[..]
virt_type=kvm
cpu_mode=custom
cpu_model=kvm64

or, if all your compute nodes have Intel CPUs:

[libvirt]
[..]
virt_type=kvm
cpu_mode=custom
cpu_model=SandyBridge

You can see all the CPU models that KVM supports with:

$ /usr/bin/qemu-system-x86_64 -cpu help
x86           qemu64  QEMU Virtual CPU version 2.0.0                  
x86           phenom  AMD Phenom(tm) 9550 Quad-Core Processor         
x86         core2duo  Intel(R) Core(TM)2 Duo CPU     T7700  @ 2.40GHz 
x86            kvm64  Common KVM processor                            
x86           qemu32  QEMU Virtual CPU version 2.0.0                  
x86            kvm32  Common 32-bit KVM processor                     
x86          coreduo  Genuine Intel(R) CPU           T2600  @ 2.16GHz 
x86              486                                                  
x86          pentium                                                  
x86         pentium2                                                  
x86         pentium3                                                  
x86           athlon  QEMU Virtual CPU version 2.0.0                  
x86             n270  Intel(R) Atom(TM) CPU N270   @ 1.60GHz          
x86           Conroe  Intel Celeron_4x0 (Conroe/Merom Class Core 2)   
x86           Penryn  Intel Core 2 Duo P9xxx (Penryn Class Core 2)    
x86          Nehalem  Intel Core i7 9xx (Nehalem Class Core i7)       
x86         Westmere  Westmere E56xx/L56xx/X56xx (Nehalem-C)          
x86      SandyBridge  Intel Xeon E312xx (Sandy Bridge)                
x86          Haswell  Intel Core Processor (Haswell)                  
x86       Opteron_G1  AMD Opteron 240 (Gen 1 Class Opteron)           
x86       Opteron_G2  AMD Opteron 22xx (Gen 2 Class Opteron)          
x86       Opteron_G3  AMD Opteron 23xx (Gen 3 Class Opteron)          
x86       Opteron_G4  AMD Opteron 62xx class CPU                      
x86       Opteron_G5  AMD Opteron 63xx class CPU                      
x86             host  KVM processor with all supported host features (only available in KVM mode)

After these changes, if you see a message like this:

$ sudo nova live-migration 6fba9cbe-66e2-484d-ba90-18ad519865ff host3
ERROR (BadRequest): Unacceptable CPU info: CPU doesn't have compatibility.

It could be caused by bug #1082414. In Juno, as a workaround, you can comment out line 5010, “self._compare_cpu(source_cpu_info)”, in the libvirt driver:
/usr/lib/python2.7/dist-packages/nova/virt/libvirt/driver.py

# Compare CPU
source_cpu_info = src_compute_info['cpu_info']
#self._compare_cpu(source_cpu_info)

In Kilo this bug should be fixed, so no changes are needed in driver.py.

I’m not sure the following is a strict requirement for live migration, but it definitely is for the migration process and instance resize, because some commands are run over an SSH connection.

Enable SSH access between compute nodes for the nova user. First edit each /etc/passwd file and enable shell access for the nova user:
/etc/passwd

nova:x:108:113::/var/lib/nova:/bin/sh

Put this SSH configuration file in the nova home directory to avoid host key checking between the compute nodes.
/var/lib/nova/.ssh/config

Host *
    StrictHostKeyChecking no

On each compute node, create an RSA key pair as the nova user:

$ ssh-keygen 
Generating public/private rsa key pair.
Enter file in which to save the key (/var/lib/nova/.ssh/id_rsa): 
Created directory '/var/lib/nova/.ssh'.
Enter passphrase (empty for no passphrase): 
Enter same passphrase again: 
Your identification has been saved in /var/lib/nova/.ssh/id_rsa.
Your public key has been saved in /var/lib/nova/.ssh/id_rsa.pub.
The key fingerprint is:
e1:97:a7:f5:10:71:bb:1f:9a:91:dd:c8:66:22:be:49 nova@host
The key's randomart image is:
+--[ RSA 2048]----+
|            . .  |
|             o . |
|        .   . .  |
|       . . . ooo.|
|        S + =o*o.|
|         o = *+..|
|          E  o. .|
|         . o     |
|          o      |
+-----------------+

Copy the content of all the created public keys into an authorized_keys file and share it on all the compute nodes for the nova user:
/var/lib/nova/.ssh/authorized_keys

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDl+XPbYlzlDm3F+5N2SCiZlCRL/wZ9WAD3xwC5uNeza7NbQwy9jL5t2jHQn+bLMHP27GJO5Afl0cx9aPMe+mUvXDf0kk1yhND/eqRauNjQ/NONhUT9VDMiQBL7F28xWD+d0XTSr/G1/ddYxt/ouoZF94nPXCLmzqY4JdwWCq2VV/ChJRAXqs0tzPpOxmAGWNm7+mOxL4SFiFRCHR4LxxveV5rf10EzrOJFOEewUQ51yTqn8tuIs59nPuVzwNezYVJ4iZM3gcdm+rnE/40I/sodePDhiuIVkcT0Zl1stGVxVJrpsUtzE8+YsZLe+aH/IlsHXMPdpCIbinyv0vmzIG1H nova@host1
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUTvfP4RmRdRXIlWn72X+y+DKnwiDlz9iWqB+0zVhMmy3T4bYY4Okw5qXCZ6xOA2BLzsuY07QLNdFCHDs6FjPjEtT+A8U4w3x4aZDwS+jgl6eC3vpTU/rkEpCDF/KOvkvoP+U8zuKS4r1r5+UAoFAKvDCM8RGGwY6mC2+uEqv23at9OIrWrbkdHVlVnxhSYk4prg2PnePMFchs3Sh9yEaLw/3F2wGBJGjYbVkfAu87UbQy6mRqWepJx8qSP2XYvIuVKleYpHS41Vk3H/+L4tTR0ibYBD+eDR80IRN4qGE6vzdf7hJW1Gl0Ozx9fzSzO0u6f/8254PqrNxya0PMmCbb nova@host2
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDapvnExGGOKVx0XVqTPNWTwXR0kXLfzb2se1slb7oAL7clZShhUKDwFHOVRO16tV7k/VD3mEf0Z+VBmU2MyxXa5nOIwbBCIIy9E/01fXh9QcP5dn1Qs8GzsoNh4j3AHSDbmYgsaG0d+BrBxmF/HpU+qZvBOMudT8reXT++5VQFNMP5cXkd6b8gyeYlrRH2SAaa7kIy44z3ZqQHzmFA+TJwYSrMoawgpdDE75HWQMAgiECXFK2Nb71+gd9sHOttzNPGmSx6TmbkHAi1W9rGYSZ88n1+19tHbnyZi+Qn8HYvKmLMyQFhje71DMwzK3FzbSpZuTaMfiEslRS9skYD6OTd nova@host3
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCrBNLab4QNjAIwGm7Ajc0CGHrtSlLnbV447vAdc/QWRoU+yiBlv4NxWq3aOogczuq6ar3hufXAnUX7ClMTon6f2Fcq/cv2D5V8YkXG7NtZQUKj0F6R27dEOUMPX64w2PGZen2QpcJNxLJXokbdTnDRc2odJ+0kw8rGKWDPioeLDjw5Qrb6EfddxWBJLbk3+gravyc2zHWMCzLUhRU4JMxBMutk3AXV2XBUflnOBoUMFixv8Mrm4wWQE3w29dZGL6wYtl2dAt9YENo9UIko/jVreuAc5gTIr4v1iywzaDivLT2HR2BjqTkABOd9cuWw6o7ZS0lTTPf8skGxAGNSOoQT nova@host4
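sshd is picky about permissions on these files, so it is worth making sure ownership and modes are strict (standard OpenSSH requirements, nothing specific to Nova):

$ sudo chown -R nova:nova /var/lib/nova/.ssh
$ sudo chmod 700 /var/lib/nova/.ssh
$ sudo chmod 600 /var/lib/nova/.ssh/authorized_keys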

Check that you can run a command on a remote host from your compute nodes:

$ ssh nova@host1 ls -l /etc/nova/nova.conf 
-rw-r----- 1 nova nova 3329 sep 21 11:17 /etc/nova/nova.conf

Now you should be able to do a live migration of an instance between your compute nodes, and instance resize/migration should also work without problems.

Recover a Ceph-backed file system in an OpenStack instance

I have had problems with one of my instances today. The instance did not reboot properly, and when I checked it I was able to see the root of the problem: the file system was corrupted. To fix the instance disk file system I followed these steps:

First, go to the compute node and find the instance you are looking for. You can get the instance name from the nova show <uuid> command output; in my case it was instance-00000177, running as libvirt domain 63.

# virsh list
 Id    Name                           State
----------------------------------------------------
 15    instance-00000111              running
 34    instance-00000174              running
 61    instance-0000017d              running
 63    instance-00000177              running

Now, because you are going to change the file system, it is a good idea to stop or suspend the instance:

# virsh suspend 63
Domain 63 suspended

The instance ephemeral disk is in a Ceph pool, so you need to know the Ceph image name used by the instance. This can be checked in the instance XML definition file:

/var/lib/nova/instances/7d6e2893-7151-4ce0-b706-6bab230ad586/libvirt.xml

You will see a line like this:

<source protocol="rbd" name="vmtier-10/7d6e2893-7151-4ce0-b706-6bab230ad586_disk">

Now you need to map the image with the rbd kernel module. You will want to do that on a server (it may be different from the compute node) running a recent kernel and with access to the Ceph cluster.

# rbd -p vmtier-10 map 7d6e2893-7151-4ce0-b706-6bab230ad586_disk

You can see your mapped devices:

# rbd showmapped
id pool      image                                     snap device
1  vmtier-10 7d6e2893-7151-4ce0-b706-6bab230ad586_disk -    /dev/rbd1

Assuming that the corrupted file system is in the first partition of the disk, you can fix it with:

# fsck.ext4 -f -y /dev/rbd1p1
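If the partition device /dev/rbd1p1 does not show up automatically, re-reading the partition table usually helps (assuming partprobe is available on that server):

# partprobe /dev/rbd1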

Once the file system is fixed you can unmap the device:

# rbd unmap /dev/rbd1

And resume the instance, or start it if you stopped it before:

# virsh resume 63
Domain 63 resumed