Running OESS In A Highly-Available Environment

Using standard Linux HA tools like Pacemaker, Corosync, and DRBD, OESS can operate in an active/passive failover configuration, allowing quick automatic failover in the event of a problem on the primary server. Note that the specifics of these technologies are beyond the scope of this document; if you are looking to operate an HA OESS instance, start with the documentation for the relevant technologies.

This document specifically targets Pacemaker 1.0, Corosync 1.2.7, and DRBD 8.4.1 running on Red Hat Enterprise Linux 6. As RPMs for these packages are not typically provided by Red Hat, you will need to build your own binary RPMs from source.

At a minimum, you will need two systems, each with an available disk partition to use for DRBD, plus a shared IP address for Pacemaker to assign to the active host.

Pacemaker and Corosync - Used for cluster management:

http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/index.html

DRBD - Used for real-time block-device replication:

http://www.drbd.org/users-guide-8.4/

Instructions:

  1. Install DRBD per the DRBD documentation on both hosts.
    1. Use your available partition as the DRBD volume.
    2. Sync and choose one host to be primary.
    3. Once you have verified DRBD is working, unmount the filesystem, stop the service, and remove it from startup (Pacemaker will manage this instead).
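    As an illustration, a minimal DRBD 8.4 resource definition, identical on both hosts, might look like this (the device, backing partition, and peer addresses are assumptions for this example; the resource name matches the cluster configuration later in this document):

      # /etc/drbd.d/usagedata.res
      resource usagedata {
        device    /dev/drbd0;
        disk      /dev/sdb1;         # your available partition
        meta-disk internal;
        on srv1.domain.com {
          address 10.0.0.1:7789;
        }
        on srv2.domain.com {
          address 10.0.0.2:7789;
        }
      }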
  2. Install MySQL on both hosts.
    1. MySQL should be configured with multi-master replication, communicating over SSL. Please see the MySQL documentation as needed.
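    As an illustration, a minimal my.cnf sketch for one of the two masters (the peer would use server-id=2 and auto_increment_offset=2; the certificate paths are assumptions for this example):

      [mysqld]
      server-id                = 1
      log-bin                  = mysql-bin
      # stagger auto-increment values so the two masters never collide
      auto_increment_increment = 2
      auto_increment_offset    = 1
      # SSL material used when the peer connects with CHANGE MASTER TO ... MASTER_SSL=1
      ssl-ca   = /etc/mysql/certs/ca.pem
      ssl-cert = /etc/mysql/certs/server-cert.pem
      ssl-key  = /etc/mysql/certs/server-key.pem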
  3. Install Apache on both hosts; no special configuration is needed.
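    For example, on RHEL 6 (httpd will be started by Pacemaker's websiteClone resource below, so it should stay out of the normal startup sequence):

      yum -y install httpd
      chkconfig httpd off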
  4. Install Pacemaker/Corosync per the Cluster Labs documentation. Create a cluster with your hosts and verify they can all join.
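    A minimal /etc/corosync/corosync.conf sketch for Corosync 1.x running Pacemaker as a service plugin (the bind address is an assumption; use your cluster network):

      totem {
        version: 2
        interface {
          ringnumber: 0
          bindnetaddr: 10.0.0.0      # network address, not a host address
          mcastaddr: 226.94.1.1
          mcastport: 5405
        }
      }
      service {
        # load Pacemaker from Corosync (the plugin model used by Pacemaker 1.0)
        name: pacemaker
        ver: 0
      }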
  5. Stop MessageBus on each host and remove it from startup. It will need to be managed by Pacemaker, as OESS requires it.
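    On RHEL 6 the D-Bus daemon runs as the messagebus init service, so this amounts to:

      service messagebus stop
      chkconfig messagebus off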
  6. Install OESS on both hosts.
    1. Be sure to run oess_setup.pl only on the primary, and copy database.xml to the secondary.
    2. Ensure the DB users are created manually on the secondary.
    3. The /SNMP directory created by the setup should be changed to a symlink into the DRBD volume on both hosts.
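    For example, assuming the DRBD filesystem is mounted at /drbd as in the configuration below:

      # on the primary, with the DRBD filesystem mounted
      mv /SNMP /drbd/SNMP
      ln -s /drbd/SNMP /SNMP
      # on the secondary, remove the local copy and link to the same location
      rm -rf /SNMP
      ln -s /drbd/SNMP /SNMP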
  7. Configure the cluster for the non-OESS services.
    1. For a three-node cluster with a primary, a backup, and a quorum node for breaking ties, the crm configuration will look like the following:
            node srv1.domain.com
            node srv2.domain.com
            node srv3.domain.com
            primitive ClusterIP ocf:heartbeat:IPaddr2 \
               params ip="SHARED_IP_HERE" cidr_netmask="32" \
               op monitor interval="10s" \
               meta target-role="Started"
            primitive UsageData ocf:linbit:drbd \
               params drbd_resource="usagedata" \
               op monitor interval="10s"
            primitive messagebus lsb:messagebus \
               op monitor interval="10s"
            primitive mysqld lsb:mysqld \
               op monitor interval="10s"
            primitive usageFS ocf:heartbeat:Filesystem \
               params device="/dev/drbd/by-res/usagedata" directory="/drbd" fstype="ext4" \
               meta target-role="Started"
            primitive website ocf:heartbeat:apache \
               params configfile="/etc/httpd/conf/httpd.conf" \
               op monitor interval="60s"
            ms UsageDataClone UsageData \
               meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" target-role="Master"
            clone messagebusClone messagebus \
               meta target-role="Started"
            clone mysqldClone mysqld \
               meta clone-max="2" target-role="Started" is-managed="true"
            clone websiteClone website \
               meta clone-max="2" target-role="Started"
            location ip_prefer_srv1 ClusterIP 100: srv1.domain.com
            location ip_prefer_srv2 ClusterIP 100: srv2.domain.com
            location messagebus_prefer_srv1 messagebusClone 100: srv1.domain.com
            location messagebus_prefer_srv2 messagebusClone 100: srv2.domain.com
            location messagebus_prefer_srv3 messagebusClone 100: srv3.domain.com
            location mysqld_prefer_srv1 mysqldClone 100: srv1.domain.com
            location mysqld_prefer_srv2 mysqldClone 100: srv2.domain.com
            location usagedataclone_prefer_srv1 UsageDataClone 100: srv1.domain.com
            location usagedataclone_prefer_srv2 UsageDataClone 100: srv2.domain.com
            location usagefs_prefer_srv1 usageFS 100: srv1.domain.com
            location usagefs_prefer_srv2 usageFS 100: srv2.domain.com
            location website_prefer_srv1 websiteClone 100: srv1.domain.com
            location website_prefer_srv2 websiteClone 100: srv2.domain.com
            colocation fs_on_drbd inf: usageFS UsageDataClone:Master
            colocation ip_on_fs inf: ClusterIP usageFS
            order drbd_before_fs inf: UsageDataClone:promote usageFS:start
            order ip_before_website inf: ClusterIP websiteClone
            order mysqld_before_website inf: mysqldClone websiteClone
            property $id="cib-bootstrap-options" \
                  dc-version="1.0.11-XXXXX" \
                  cluster-infrastructure="openais" \
                  expected-quorum-votes="3" \
                  stonith-enabled="false" \
                  symmetric-cluster="false" \
                  last-lrm-refresh="XXXXXXXXX"
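    One way to load the above, assuming the crm shell that ships with Pacemaker 1.0, is to enter configure mode, paste it, and commit:

      crm configure
      # paste the configuration above at the crm(live)configure# prompt, then:
      commit
      quit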
  8. Verify the cluster is operating and running these services correctly. Per the above config, MessageBus should be running on all three nodes, MySQL and Apache on the primary and secondary, and the DRBD volume/filesystem and shared IP on the primary only.
  9. Finally, add the OESS services to the cluster.
    1. For the cluster described above, add the following:
      primitive fwdctl lsb:oess-fwdctl \
            op monitor interval="10s" \
            meta target-role="Started" is-managed="true"
      primitive notify lsb:oess-notification \
            meta target-role="Started" \
            op monitor interval="10s"
      primitive nox-controller lsb:nox_cored \
            op monitor interval="10s" \
            meta target-role="Started" is-managed="true"
      primitive topo lsb:oess-topo \
            op monitor interval="10s" \
            meta target-role="Started" is-managed="true"
      primitive vlan-stats lsb:oess-vlan_stats \
            op monitor interval="20s" \
            meta is-managed="true" target-role="Started"
      location fwctl_prefer_srv1 fwdctl 100: srv1.domain.com
      location fwctl_prefer_srv2 fwdctl 100: srv2.domain.com
      location notify_prefer_srv1 notify 100: srv1.domain.com
      location notify_prefer_srv2 notify 100: srv2.domain.com
      location nox_prefer_srv1 nox-controller 100: srv1.domain.com
      location nox_prefer_srv2 nox-controller 100: srv2.domain.com
      location topo_prefer_srv1 topo 100: srv1.domain.com
      location topo_prefer_srv2 topo 100: srv2.domain.com
      location vlan_stats_prefer_srv1 vlan-stats 100: srv1.domain.com
      location vlan_stats_prefer_srv2 vlan-stats 100: srv2.domain.com
      colocation fwdctl_on_ip inf: fwdctl ClusterIP
      colocation notify_on_ip inf: notify ClusterIP
      colocation nox_on_ip inf: nox-controller ClusterIP
      colocation topo_on_ip inf: topo ClusterIP
      colocation vlan-stats_on_ip inf: vlan-stats ClusterIP
      order fs_before_vlan-stats inf: usageFS vlan-stats
      order fwdctl_before_notify inf: fwdctl notify
      order fwdctl_before_vlan-stats inf: fwdctl vlan-stats
      order ip_before_topo inf: ClusterIP topo
      order messagebus_before_fwdctl inf: messagebusClone fwdctl
      order messagebus_before_notify inf: messagebusClone notify
      order messagebus_before_nox inf: messagebusClone nox-controller
      order messagebus_before_topo inf: messagebusClone topo
      order messagebus_before_vlan-stats inf: messagebusClone vlan-stats
      order mysqld_before_fwdctl inf: mysqldClone fwdctl
      order mysqld_before_notify inf: mysqldClone notify
      order mysqld_before_topo inf: mysqldClone topo
      order mysqld_before_vlan-stats inf: mysqldClone vlan-stats
      order mysqld_before_website inf: mysqldClone websiteClone
      order nox_before_fwdctl inf: nox-controller fwdctl
      order topo_before_nox inf: topo nox-controller
  10. Commit and verify the services start.
  11. Your output of “crm resource status” should look something like this:
    ClusterIP   (ocf::heartbeat:IPaddr2) Started
    fwdctl      (lsb:oess-fwdctl) Started
    nox-controller    (lsb:nox_cored) Started
    topo  (lsb:oess-topo) Started
    usageFS     (ocf::heartbeat:Filesystem) Started
    vlan-stats  (lsb:oess-vlan_stats) Started
    Master/Slave Set: UsageDataClone
                Masters: [ srv1.domain.com ]
                Slaves: [ srv2.domain.com ]
    Clone Set: messagebusClone
                Started: [ srv1.domain.com srv2.domain.com srv3.domain.com ]
    Clone Set: mysqldClone
                Started: [ srv1.domain.com srv2.domain.com ]
    Clone Set: websiteClone
                Started: [ srv1.domain.com srv2.domain.com ]
    notify      (lsb:oess-notification) Started
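
To exercise failover, one option is to put the active node into standby and watch the resources migrate to the backup (node names match the example above):

  crm node standby srv1.domain.com   # resources should move to srv2.domain.com
  crm resource status                # confirm all services restarted there
  crm node online srv1.domain.com    # rejoin the node when done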

The diagram below is a helpful visualization of the OESS dependencies and start order as managed by the cluster.