Setting up an EDG-style HA administration server with CoroSync and PaceMaker

Most of the Enterprise Deployment Guides (EDGs) for Fusion Middleware products cover setting up WebLogic’s Administration Server for HA, but all of them describe a manual failover, and none of my clients find that a satisfactory solution. Usually I advise my clients to use Oracle’s clustering software (Grid Infrastructure / Cluster Ready Services) to automate failover. This works fine if your local DBA is managing the WebLogic layer, although the overhead is large for something as “simple” as an HA administration server, and it requires the failover node to run in the same network subnet as the primary node. All this led me to investigate other options. One of the viable options I POC’d is Linux clustering with CoroSync and PaceMaker, which I considered because it seems to be the Red Hat standard for clustering nowadays.

This example was configured on OEL 5.8. It is not production ready, so please don’t put this in production without thorough testing (and some more of the usual disclaimers 🙂 ). I will assume basic knowledge of clustering and Linux for this post; not all details are covered in great depth.

First you need to understand a little bit about my topology. I have a small Linux server running a software load balancer (Oracle Traffic Director) which also functions as an NFS server. When configuring this for an enterprise, these components will most likely be provided for you (F5 or Cisco load balancers with NetApp storage or the like). In this specific configuration the VIP for the administration server runs on the load balancer. The NFS server on the load balancer host provides the shared storage that holds the domain home, and this NFS share is mounted on both servers that will run my administration server.
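
For reference, this is roughly how that NFS share is mounted on both WebLogic nodes. This is just a sketch: the NFS server name otd.area51.local and the export path /exports/domains are assumptions, substitute your own; the mount point /domains matches the domain home used later in this post.

# /etc/fstab entry on wls1 and wls2
otd.area51.local:/exports/domains  /domains  nfs  rw,hard,intr,nfsvers=3  0 0

mkdir -p /domains
mount /domains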

Back to the cluster. To install CoroSync and PaceMaker, first install the EPEL repository for packages that don’t exist in vanilla Red Hat/CentOS, and add the Cluster Labs repository.

rpm -ivh http://mirror.iprimus.com.au/epel/5/x86_64/epel-release-5-4.noarch.rpm
wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo

Then install Pacemaker 1.0+ and CoroSync 1.2+ via yum

yum install -y pacemaker.$(uname -i) corosync.$(uname -i)

When all software and dependencies are installed, you can configure CoroSync. My configuration file is rather straightforward: I run the cluster over the 10.0.0.0 network.

cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
	version: 2
	secauth: on
	threads: 0
	interface {
		ringnumber: 0
		bindnetaddr: 10.0.0.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}

logging {
	fileline: off
	to_stderr: no
	to_logfile: yes
	to_syslog: yes
	logfile: /var/log/cluster/corosync.log
	debug: off
	timestamp: on
	logger_subsys {
		subsys: AMF
		debug: off
	}
}

amf {
	mode: disabled
}

quorum {
           provider: corosync_votequorum
           expected_votes: 2
}

aisexec {
        # Run as root - this is necessary to be able to manage resources with Pacemaker
        user:        root
        group:       root
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}
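
One thing to keep in mind: with secauth set to on, CoroSync expects a shared authentication key in /etc/corosync/authkey on both nodes. A minimal sketch of generating it on the first node and copying it, together with the configuration file, to the second node (hostnames taken from my example):

corosync-keygen
scp -p /etc/corosync/authkey /etc/corosync/corosync.conf wls2.area51.local:/etc/corosync/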

Now you can start CoroSync on both nodes and check the ring and cluster status.

service corosync start
corosync-cfgtool -s
Printing ring status.
Local node ID 335544330
RING ID 0
	id	= 10.0.0.20
	status	= ring 0 active with no faults


crm status
============
Last updated: Sun Mar  3 21:30:42 2013
Stack: openais
Current DC: wls1.area51.local - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ wls2.area51.local wls1.area51.local ]

For production usage you should configure stonith, which is beyond the scope of this example. For testing purposes I disabled stonith and told the cluster to ignore loss of quorum (this is a two-node cluster):

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore

Also, I configured resources not to fail back when the node that was running the resource comes back online.

crm configure rsc_defaults resource-stickiness=100

Now your cluster is ready, although it doesn’t run WebLogic yet. There is no WebLogic cluster resource agent, so I wrote one myself. To keep it separated from other cluster resources I set up my own OCF resource provider tree (just mkdir and you are done). An OCF resource agent requires certain functions to be present in the script; the OCF Resource Agent Developer’s Guide can help you with that one.
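
In my case that boils down to creating the provider directory on both nodes; the provider name area51 is what the crm configuration refers to later on:

mkdir -p /usr/lib/ocf/resource.d/area51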

Here’s my example WebLogic cluster resource:

cat /usr/lib/ocf/resource.d/area51/weblogic 
#!/bin/bash
#
# Description:  Manages a WebLogic Administration Server as an OCF High-Availability
#               resource under Heartbeat/LinuxHA control
# Author:	Jacco H. Landlust <jacco.landlust@idba.nl>
# 		Inspired on the heartbeat/tomcat OCF resource
# Version:	1.0
#

OCF_ROOT=/usr/lib/ocf

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
#RESOURCE_STATUSURL="http://127.0.0.1:7001/console"

usage()
{
	echo "$0 [start|stop|status|monitor|migrate_to|migrate_from]"
	return ${OCF_NOT_RUNNING}
}

isrunning_weblogic()
{
        if ! have_binary wget; then
		ocf_log err "Monitoring not supported by ${OCF_RESOURCE_INSTANCE}"
		ocf_log info "Please make sure that wget is available"
		return ${OCF_ERR_CONFIGURED}
        fi
        wget -O /dev/null ${RESOURCE_STATUSURL} >/dev/null 2>&1
}

isalive_weblogic()
{
        if ! have_binary pgrep; then
                ocf_log err "Monitoring not supported by ${OCF_RESOURCE_INSTANCE}"
                ocf_log info "Please make sure that pgrep is available"
                return ${OCF_ERR_CONFIGURED}
        fi
        pgrep -f weblogic.Name > /dev/null
}

monitor_weblogic()
{
        isalive_weblogic || return ${OCF_NOT_RUNNING}
        isrunning_weblogic || return ${OCF_NOT_RUNNING}
        return ${OCF_SUCCESS}
}

start_weblogic()
{
	# rotate the old AdminServer.out by suffixing it with a timestamp (year-month-day-hourminute)
	if [ -f ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out ]; then
		su - ${WEBLOGIC_USER} --command "mv ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out.`date +%Y-%m-%d-%H%M`"
	fi
	monitor_weblogic
	if [ $? = ${OCF_NOT_RUNNING} ]; then
		ocf_log debug "start_weblogic"
		su - ${WEBLOGIC_USER} --command "nohup ${DOMAIN_HOME}/bin/startWebLogic.sh > ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out 2>&1 &"
		sleep 60
		touch ${OCF_RESKEY_state}
	fi
	monitor_weblogic
	if [ $? =  ${OCF_SUCCESS} ]; then
		return ${OCF_SUCCESS}
	fi
	# if the server is still not up after the sleep, report the start as failed
	return ${OCF_ERR_GENERIC}
}

stop_weblogic()
{
#	monitor_weblogic
#	if [ $? =  $OCF_SUCCESS ]; then
		ocf_log debug "stop_weblogic"
		pkill -KILL -f startWebLogic.sh
		pkill -KILL -f weblogic.Name
		rm ${OCF_RESKEY_state}
#	fi
	return $OCF_SUCCESS
}

meta_data() {
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="weblogic" version="0.9">
	<version>1.0</version>
	<longdesc lang="en"> This is a WebLogic Resource Agent </longdesc>
	<shortdesc lang="en">WebLogic resource agent</shortdesc>
        
	<parameters>
		<parameter name="state" unique="1">
			<longdesc lang="en">Location to store the resource state in.</longdesc>
			<shortdesc lang="en">State file</shortdesc>
			<content type="string" default="${HA_VARRUN}${OCF_RESOURCE_INSTANCE}.state" />
		</parameter>
		<parameter name="statusurl" unique="1">
			<longdesc lang="en">URL for state confirmation.</longdesc>
			<shortdesc>URL for state confirmation</shortdesc>
			<content type="string" default="" />
		</parameter>
		<parameter name="domain_home" unique="1">
			<longdesc lang="en">PATH to the domain_home. Should be a full path</longdesc>
			<shortdesc lang="en">PATH to the domain.</shortdesc>
			<content type="string" default="" required="1" />
		</parameter>
		<parameter name="weblogic_user" unique="1">
			<longdesc lang="en">The user that starts WebLogic</longdesc>
			<shortdesc lang="en">The user that starts WebLogic</shortdesc>
			<content type="string" default="oracle" />
		</parameter>
	</parameters>   
        
	<actions>
		<action name="start"        timeout="90" />
		<action name="stop"         timeout="90" />
		<action name="monitor"      timeout="20" interval="10" depth="0" start-delay="0" />
		<action name="migrate_to"   timeout="90" />
		<action name="migrate_from" timeout="90" />
		<action name="meta-data"    timeout="5" />
	</actions>      
</resource-agent>
END
}

# Make the resource globally unique
: ${OCF_RESKEY_CRM_meta_interval=0}
: ${OCF_RESKEY_CRM_meta_globally_unique:="true"}

if [ "x${OCF_RESKEY_state}" = "x" ]; then
        if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
                state="${HA_VARRUN}${OCF_RESOURCE_INSTANCE}.state"
                
                # Strip off the trailing clone marker
                OCF_RESKEY_state=`echo $state | sed s/:[0-9][0-9]*\.state/.state/`
        else
                OCF_RESKEY_state="${HA_VARRUN}${OCF_RESOURCE_INSTANCE}.state"
        fi
fi

# Set some defaults
RESOURCE_STATUSURL="${OCF_RESKEY_statusurl-http://127.0.0.1:7001/console}"
DOMAIN_HOME="${OCF_RESKEY_domain_home}"
WEBLOGIC_USER="${OCF_RESKEY_weblogic_user-oracle}"

# MAIN
case $__OCF_ACTION in
	meta-data)      meta_data
       	         exit ${OCF_SUCCESS}
	                ;;
	start)          start_weblogic;;
	stop)           stop_weblogic;;
	status)		monitor_weblogic;;
	monitor)        monitor_weblogic;;
	migrate_to)     ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_to}."
			stop_weblogic
			;;
	migrate_from)   ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrated_from}."
			start_weblogic
			;;
	usage|help)     usage
			exit ${OCF_SUCCESS}
	       	         ;;
	*)		usage
			exit ${OCF_ERR_UNIMPLEMENTED}
			;;
esac
rc=$?

# Finish the script
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"

exit $rc

Please mind that you have to copy the script to both nodes of the cluster.
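
Something along these lines will do; the hostnames are the ones from my example, and the ocf-tester check at the end is optional (it ships with the resource-agents package and actually runs the start, monitor and stop actions, so only use it when it is safe to start the administration server on that node):

chmod +x /usr/lib/ocf/resource.d/area51/weblogic
scp -p /usr/lib/ocf/resource.d/area51/weblogic wls2.area51.local:/usr/lib/ocf/resource.d/area51/
ocf-tester -n weblogic -o domain_home=/domains/ha-adminserver /usr/lib/ocf/resource.d/area51/weblogic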

Next up is configuring the WebLogic resource in the cluster. In my example the NFS share with the domain homes is mounted on /domains and the domain is called ha-adminserver. WebLogic runs as the oracle user and the administration server listens on all addresses on port 7001. Therefore the weblogic_user parameter is left at its default (oracle) and the statusurl used to check whether the administration server is running is left at its default too (http://127.0.0.1:7001/console). Only the domain home is passed to the cluster.

crm configure primitive weblogic ocf:area51:weblogic params domain_home="/domains/ha-adminserver" op start interval="0" timeout="90s" op monitor interval="30s"

When the resource is added, the cluster starts it automatically. Please keep in mind that this takes some time, so you might not see results instantly. Also, the script has a sleep configured; if your administration server takes longer to boot you might want to fiddle with the values. For an example like the one in this blog post it works.
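
If the timeouts turn out to be too tight for your environment, you can adjust the start operation afterwards instead of recreating the resource. A sketch, assuming the crm shell’s configure edit command is available in your version; 210s is just an example of a more generous value:

# opens the weblogic resource definition in an editor; raise the "op start" timeout, e.g. to 210s
crm configure edit weblogic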

Next you can check the status of your resource:

crm status
============
Last updated: Sun Mar  3 21:35:47 2013
Stack: openais
Current DC: wls1.area51.local - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ wls2.area51.local wls1.area51.local ]

 weblogic	(ocf::area51:weblogic):	Started wls2.area51.local

To check the configuration of your resource, run the configure show command:

crm configure show
node wls1.area51.local
node wls2.area51.local
primitive weblogic ocf:area51:weblogic \
	params domain_home="/domains/ha-adminserver" \
	op start interval="0" timeout="90s" \
	op monitor interval="30s" \
	meta target-role="Started"
property $id="cib-bootstrap-options" \
	dc-version="1.0.12-unknown" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore" \
	last-lrm-refresh="1362338797"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100"

You can test failover by stopping CoroSync, killing the Linux node, and so on.
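
For example, on the node that currently hosts the resource (the kernel crash is for test boxes only and assumes sysrq is enabled):

# graceful: stop the cluster stack, the resource should move to the surviving node
service corosync stop

# resource failure: kill the administration server and let the monitor action recover it
pkill -KILL -f weblogic.Name

# node failure (test environments only): crash the kernel
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

You can watch the result from the other node with crm_mon or crm status.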

Other useful commands:

# start resource
crm resource start weblogic

# stop resource
crm resource stop weblogic

# Cleanup errors
crm_resource --resource weblogic -C

# Move the resource to another node. Mind you: this pins the resource to that node with a location constraint, taking control away from the cluster and effectively re-introducing fail-back
crm resource move weblogic wls1.area51.local
# Give authority over the resource back to the cluster
crm resource unmove weblogic

# delete cluster resource
crm configure delete weblogic

If you want your WebLogic administration server resource to be bound to a VIP, just google for setting up an HA Apache on PaceMaker. There is plenty of information about that on the web, e.g. this site, which also helped me set up the cluster.
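
As a starting point, here is a minimal sketch using the standard ocf:heartbeat:IPaddr2 resource agent. The address 10.0.0.100 and the netmask are assumptions you should replace with your own; putting both resources in a group makes sure the VIP and the administration server always run together on the same node:

crm configure primitive adminvip ocf:heartbeat:IPaddr2 params ip="10.0.0.100" cidr_netmask="24" op monitor interval="10s"
crm configure group adminserver adminvip weblogic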

Well, I hope this helps anyone who is trying to set up an HA administration server.

Written by Jacco H. Landlust

March 3, 2013 at 11:39 pm

5 Responses

  1. Dear Jacco:

    First, I tried your script with WL 8.1 and it runs fine. Then I tried WL 12.1, but I had to change the corosync timeout value from 90s to 210s and finally it runs.

    I have another issue:
    When WL 12 is starting up, the CPU load rises too much. Top shows the java process on top (the machines are SUSE on ESXi).

    If I run the area51/weblogic script manually, WL 12 starts up normally and the machine’s CPU usage is normal.

    Any clue?

    Thank you very much for your post it helped me a lot.

    Sincerely,

    Rodrigo Avila.

    Rodrigo Avila

    July 11, 2014 at 3:21 pm

    • After a lengthy email conversation with Rodrigo, the outcome was that for some reason this is an issue on SUSE; an excerpt from the email:

      Last Saturday, using top, I saw that WebLogic’s java process had “system priority value = -2”, apparently inherited from corosync.

      Then I tried to insert “nice -20” in your script, with no success.

      So I made a hack, invoking startWebLogic.sh via ssh to localhost.

      Now I have WebLogic’s java process showing priority=0 and nice=20.

      Jacco H. Landlust

      July 31, 2014 at 10:47 pm

  2. Good shot!
    Is it possible to use it with OHS instances?

    Regards!

    Marcello Morettoni

    May 9, 2016 at 8:52 am

    • Sure it is. Obviously you would need to create your own OCF resource agent, but that’s not rocket science. However, why would you? You could run two OHS instances just as easily.

      Jacco H. Landlust

      May 9, 2016 at 12:13 pm

      • Hi, thanks for responding. Two instances is what I always do, but I had a customer who refused to use any load balancer in front of the HTTP server layer. At that time I used ldirectord; I had no chance to convince them. Reading this I remembered that case.
        Cheers!
        Marcello

        Marcello Morettoni

        May 11, 2016 at 12:38 am

