Oracle MVA

Tales from a Jack of all trades

Archive for March 2013

What happens when you change WebLogic configuration


This post is a bit of a brain dump. It is based on some experiences while changing configuration for WebLogic. Most of the post is a copy of an email I sent to a friend who was struggling with unexpected configuration changes (changes that got reversed all of a sudden, etc.). All statements are based on investigation; I did not decompile and read WLS source code. If I am wrong about some statement, please let me know by commenting.

First off, let me state this: you are running in production mode, right? Please always do. Fix setDomainEnv.sh and set PRODUCTION_MODE=true (instead of leaving it blank); that changes some memory settings for your JVM and a few other defaults. Most relevant in my opinion: if you are running a Sun JVM, -client is used in development mode and -server in production mode. Auto deployment is also disabled in production mode.
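
A quick way to check this on an existing domain (assuming the standard domain layout, with DOMAIN_HOME pointing at your domain directory):

# check how the domain is configured
grep -i production_mode $DOMAIN_HOME/bin/setDomainEnv.sh
# expect something like: PRODUCTION_MODE="true"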

If you are running in production mode you have to lock the configuration before editing it. A lock file (edit.lock) is created in the DOMAIN_HOME; it contains the (encrypted) username of the user holding the lock, the time the lock was acquired, etc. Whenever you make a change to the configuration, the new configuration files can be found in the pending subdirectory of your DOMAIN_HOME: all configuration files changed by the edit are temporarily staged there. The current configuration is only overwritten when you click activate.
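
You can see this for yourself by looking at the files involved (paths as mentioned above):

# who holds the lock, and since when
ls -l $DOMAIN_HOME/edit.lock
cat $DOMAIN_HOME/edit.lock
# configuration files staged by the current edit session, waiting for activation
ls -lR $DOMAIN_HOME/pending/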

When you activate the configuration, the console application sends a request to all running servers in the domain to overwrite the configuration in DOMAIN_HOME/config (config.xml, plus possibly some extra configuration files, e.g. an XML file describing a JDBC datasource). Overwriting the configuration files on the managed servers happens sequentially; the order used for this config push is the order of the servers in your config.xml. I think this is why the admin server is always first (it is always the first server listed in config.xml).
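
A quick-and-dirty way to see that order is to list the server names in the order they appear in config.xml (this assumes the usual pretty-printed config.xml with one element per line):

# print the name of each <server> element, in document order
awk '/<server>/ {s=1} s && /<name>/ {print; s=0}' $DOMAIN_HOME/config/config.xml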

Whenever a change (say, a new datasource) is not applied successfully on one of the running servers of the domain, the complete configuration change is rolled back. One of the reasons an update in the domain can fail is file locking. This can happen if more than one Java server shares a domain home, e.g. the admin server and managed server 1, or because you set up the domain home on shared storage (so all managed servers share the same directory of config files). If for some reason the config change did succeed on the administration server but not on a running managed server (possibly because you followed some enterprise deployment guide that had you set up a separate administration server), the console will not give an error. You can find errors in the log files though, usually standard out.
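
When I suspect a silently failed activation, I simply grep the standard out files of the servers. Something along these lines (the exact messages differ per product and version, so take the pattern as an illustration only):

# scan the stdout files of all servers for activation problems
grep -iE 'activat|rolled back|exception' $DOMAIN_HOME/servers/*/logs/*.out | tail -50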

If your managed server is running in Managed Server Independence (MSI) mode it will not receive any configuration changes; for the configuration update process a managed server in MSI mode is considered down. Typically a managed server ends up in MSI mode when the administration server was down while the managed server started.

When one or more managed servers are not running when the configuration is changed, the configuration is pushed at startup time of those managed servers. Typically you start a managed server from the console. The console application sends a request to the node manager running on the machine where you want to start the managed server. If you start a managed server from the console, a file called startup.properties is created in DOMAIN_HOME/servers/SERVER_NAME/data/nodemanager (if it does not exist already; otherwise it is overwritten, unless you made manual changes to the file before). One of the entries in that properties file is the reference to the administration server.
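
On my systems that file looks roughly like this (the server name and admin host below are just placeholders):

cat $DOMAIN_HOME/servers/managed_server1/data/nodemanager/startup.properties
# the entry that matters here is the admin server reference, e.g.:
# AdminURL=http\://adminhost\:7001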

Whenever you start a managed server from the console, the admin URL setting is checked and/or set. If you start a managed server from node manager directly (e.g. with WLST), the value in that file is considered the “truth”. This assumes you have configured StartScriptEnabled=true in nodemanager.properties. That setting, in combination with the name of the managed server, causes the node manager to call the shell script startWebLogic.sh with the machine name and admin server URL as parameters (again: the URL is taken from the startup.properties file I just mentioned).
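
You can verify the node manager settings with a quick grep (the node manager home below is the 10.3 default; yours may differ):

grep -i startscript $WL_HOME/common/nodemanager/nodemanager.properties
# expect: StartScriptEnabled=true and StartScriptName=startWebLogic.sh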

As soon as a managed server boots, it contacts the admin server URL and checks for configuration changes. If configuration changes exist, the managed server fetches the new config.xml (and the other changed configuration files) and uses those to start up. If you start multiple managed servers at once and they share the same disk for configuration (shared storage), this can also cause unsuccessful changes (again: file locking). When using shared storage you should always start one managed server first, then start the rest as a group (all together). The first server updates the configuration; the servers started later will find that they already have access to the “latest” configuration files.

This also explains why you shouldn't hand-edit configuration files on your managed servers: they get overwritten when you reboot. But, and this is the tricky part, some files on the managed servers only get updated when you start the managed server for the first time. Typical examples are system-jazn-data.xml and cwallet.sso. You can prevent copying those buggers around by sticking them in LDAP (OID) or a database (reassociateSecurityStore). Which of these options (OID or RDBMS) is valid depends on the middleware product and version you are using.
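
For reference, a rough WLST sketch of such a reassociation to OID. All values below are placeholders and the exact parameters depend on your product and version, so check the OPSS documentation before running anything like this:

# reassociate the security store to OID (illustrative values only)
$MW_HOME/oracle_common/common/bin/wlst.sh <<'EOF'
connect('weblogic', 'welcome1', 't3://adminhost:7001')
reassociateSecurityStore(domain='my_domain', admin='cn=orcladmin', password='oid_password',
    ldapurl='ldap://oidhost:3060', servertype='OID', jpsroot='cn=jpsRoot')
EOF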

So much for my brain dump. Hope this helps. Should I put this in a presentation for some conference?


Written by Jacco H. Landlust

March 18, 2013 at 10:27 pm

Posted in Weblogic

mandatory columns


UPDATE: Joaquín González pointed out to me that Carlos Sierra has blogged about this too (see the link at the bottom of this post). It was his presentation where the topic came up.

While attending Hotsos 2013 I learned about a new 11.2.0.3 feature, explained to me by Toon Koppelaars. This new feature has to do with adding a mandatory column to an existing table.

Whenever I wanted to add a mandatory column to a table, I added the column, set the value to whatever I preferred, and then set the column to NOT NULL.

JACCO@DEMODB > describe test;

Name                                    Null?    Type
--------------------------------------- -------- ---------------------------------
ID                                      NOT NULL NUMBER

JACCO@DEMODB > select * from test;

ID
----------
1
2

2 rows selected.

JACCO@DEMODB > alter table test add string_value varchar2(100);

Table altered.

JACCO@DEMODB > alter table test modify string_value  not null;

alter table test modify string_value  not null

ERROR at line 1:
ORA-02296: cannot enable (JACCO.) - null values found

JACCO@DEMODB > update test set string_value = 'not null';

2 rows updated.

JACCO@DEMODB > commit;

Commit complete.

JACCO@DEMODB > alter table test modify string_value  not null;

Table altered.

So far so good. As of 11.2.0.3 you can also add a column with a default value and make it NOT NULL directly, without updating the table first:

JACCO@DEMODB > set timing on
JACCO@DEMODB > alter table test add string_value2 varchar2(100) default 'my value' not null;

Table altered.

Elapsed: 00:00:00.02

Very, very fast (and the statement stays fast, even when the table holds a lot of rows), and it seems a very convenient statement. Especially with very big tables this seems useful.

If I select from the table, I can actually see the value in there:

JACCO@DEMODB > column string_value format a10
JACCO@DEMODB > column string_value2 format a10
JACCO@DEMODB > select * from test;

ID  STRING_VAL STRING_VAL
--- ---------- ----------
1  not null   my value
2  not null   my value

2 rows selected.

Now this all seems very cool, but I think there's a real caveat. Let's say you have a somewhat larger table:

JACCO@DEMODB > create table my_objects as
               (select rownum as id, t.object_name from all_objects t);

Table created.

JACCO@DEMODB > alter table my_objects add constraint pk_my_objects primary key (id);

Table altered.

JACCO@DEMODB > select count(*) from my_objects;

COUNT(*)
----------
66883
1 row selected.

Let's say you add a column to this table and make it mandatory:

JACCO@DEMODB > alter table my_objects add null_test_column char(1) default 'N' not null;

Table altered.

JACCO@DEMODB > select null_test_column, count(*) from my_objects group by null_test_column;

N   COUNT(*)
- ----------
N      66883

1 row selected.

Now let's add another row (to create some skew), add an index and gather stats:

JACCO@DEMODB > insert into my_objects (id,object_name,null_test_column) values (66884,'test','Y');

1 row created.

JACCO@DEMODB > commit;

Commit complete.

JACCO@DEMODB > create index idx_null_test on my_objects (null_test_column);

Index created.

JACCO@DEMODB > exec dbms_stats.gather_table_stats(ownname => user, tabname => 'MY_OBJECTS', cascade => true, method_opt => 'for all columns size auto');

PL/SQL procedure successfully completed.

When I analyze the results, the following appears:

JACCO@DEMODB > l
1  select num_distinct, num_nulls, num_buckets, histogram,
2         utl_raw.cast_to_varchar2(low_value) as low_value,
3         utl_raw.cast_to_varchar2(high_value) as high_value
4*   from user_tab_columns where table_name = 'MY_OBJECTS' and column_name = 'NULL_TEST_COLUMN'
JACCO@DEMODB > /

NUM_DISTINCT  NUM_NULLS NUM_BUCKETS HISTOGRAM       LOW_VALUE       HIGH_VALUE
------------ ---------- ----------- --------------- --------------- ---------------
           2          0           1 NONE            N               Y

1 row selected.

The number of distinct values is correct, the number of nulls sounds correct, but given the skew in the data I would have expected a histogram. So something seems wrong. Also, when I run a query on the table, I get some funny results:

JACCO@DEMODB > explain plan for
2  select * from my_objects where null_test_column = 'Y';

Explained.

JACCO@DEMODB > select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------

Plan hash value: 880823944

--------------------------------------------------------------------------------
| Id  | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |            | 33442 |  1045K|   106   (1)| 00:00:02 |
|*  1 |  TABLE ACCESS FULL| MY_OBJECTS | 33442 |  1045K|   106   (1)| 00:00:02 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("NULL_TEST_COLUMN"='Y')

13 rows selected.

There must be something wrong; that is the wrong access path. Why is this query running a full table scan when it should use the index? The clue is in the predicate information: it's a filter. Now where does this filter come from? Or is dbms_xplan lying to me again?

To find out I ran a 10053 trace:

JACCO@DEMODB > alter session set events '10053 trace name context forever, level 1';

Session altered.

JACCO@DEMODB > select * from my_objects where null_test_column = 'Y';

ID         OBJECT_NAME                    N
---------- ------------------------------ -
66884      test                           Y

1 row selected.

When I read the trace file I noticed this:

----- Plan Table -----
============
Plan Table
============
---------------------------------------+-----------------------------------+
| Id  | Operation          | Name      | Rows  | Bytes | Cost  | Time      |
---------------------------------------+-----------------------------------+
| 0   | SELECT STATEMENT   |           |       |       |   106 |           |
| 1   |  TABLE ACCESS FULL | MY_OBJECTS|   33K | 1045K |   106 |  00:00:02 |
---------------------------------------+-----------------------------------+
Predicate Information:
----------------------
1 - filter(NVL("NULL_TEST_COLUMN",'N')='Y')

Where did that NVL come from? It appeared because of the default value that was set on the column without updating the existing rows first. This is rather unexpected, and it can bite you when you don't expect it (especially when you run into this issue six months after the column was added to the table). A very unpleasant side effect of a new feature that seems very promising at first use. It shows again that you really have to be careful when using new features.
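
One thing I would try in this situation (not verified here) is a function-based index on the same NVL() expression, so the optimizer at least has an indexed access path that matches the rewritten predicate. A minimal sketch, with a placeholder connect string; verify the resulting plan on your own system:

# sketch only: index the expression the optimizer actually filters on
sqlplus -s jacco/secret@demodb <<'EOF'
create index idx_null_test_fbi on my_objects (nvl(null_test_column, 'N'));
exec dbms_stats.gather_table_stats(ownname => user, tabname => 'MY_OBJECTS', cascade => true)
explain plan for select * from my_objects where null_test_column = 'Y';
select * from table(dbms_xplan.display);
EOF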

Hope this helps.

http://carlos-sierra.net/2013/03/12/interesting-case-where-a-full-table-scan-is-chosen-instead-of-an-index

Written by Jacco H. Landlust

March 13, 2013 at 11:15 pm

Posted in Uncategorized

setting up EDG style HA administration server with corosync and pacemaker


Most of the Enterprise Deployment Guides (EDGs) for Fusion Middleware products cover setting up WebLogic's Administration Server for HA. All of these EDGs describe a manual failover. None of my clients find that a satisfactory solution. Usually I advise my clients to use Oracle's clustering software (Grid Infrastructure / Cluster Ready Services) to automate failover. This works fine if your local DBA is managing the WebLogic layer, although the overhead is large for something as “simple” as an HA administration server. It also requires the failover node to run in the same subnet (network-wise) as the primary node. All this led me to investigate other options. One of the viable options I POC'd is Linux clustering with CoroSync and PaceMaker. I considered CoroSync and PaceMaker because this seems to be the Red Hat standard for clustering nowadays.

This example is configured on OEL 5.8. The example is not production ready; please don't put this in production without thorough testing (and some more of the usual disclaimers 🙂 ). I will assume basic knowledge of clustering and Linux for this post; not every detail will be covered in great depth.

First you need to understand a little bit about my topology. I have a small Linux server running a software load balancer (Oracle Traffic Director) which also functions as an NFS server. When configuring this for an enterprise these components will most likely be provided for you (F5s or Cisco with some NetApp or alike). In this specific configuration the VIP for the administration server runs on the load balancer. The NFS server on the load balancer machine provides the shared storage that hosts the domain home. This NFS share is mounted on both servers that will run my administration server.
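
For completeness, this is roughly how the share is mounted on both WebLogic nodes (the NFS server name and export path are from my lab and purely illustrative; pick mount options that suit your environment):

mkdir -p /domains
mount -t nfs otd.area51.local:/exports/domains /domains
# or make it persistent in /etc/fstab:
# otd.area51.local:/exports/domains  /domains  nfs  rw,hard,intr  0 0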

Back to the cluster. To install CoroSync and PaceMaker, first install the EPEL repository (for packages that don't exist in vanilla RedHat/CentOS) and add the Cluster Labs repository.

rpm -ivh http://mirror.iprimus.com.au/epel/5/x86_64/epel-release-5-4.noarch.rpm
wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo

Then install Pacemaker 1.0+ and CoroSync 1.2+ via yum

yum install -y pacemaker.$(uname -i) corosync.$(uname -i)

When all software and dependencies are installed, you can configure CoroSync. My configuration file is rather straightforward; I run the cluster over the 10.0.0.0 network.

cat /etc/corosync/corosync.conf
# Please read the corosync.conf.5 manual page
compatibility: whitetank

totem {
	version: 2
	secauth: on
	threads: 0
	interface {
		ringnumber: 0
		bindnetaddr: 10.0.0.0
		mcastaddr: 226.94.1.1
		mcastport: 5405
	}
}

logging {
	fileline: off
	to_stderr: no
	to_logfile: yes
	to_syslog: yes
	logfile: /var/log/cluster/corosync.log
	debug: off
	timestamp: on
	logger_subsys {
		subsys: AMF
		debug: off
	}
}

amf {
	mode: disabled
}

quorum {
           provider: corosync_votequorum
           expected_votes: 2
}

aisexec {
        # Run as root - this is necessary to be able to manage resources with Pacemaker
        user:        root
        group:       root
}

service {
    # Load the Pacemaker Cluster Resource Manager
    name: pacemaker
    ver: 0
}

Now, you can start CoroSync and check the configuration of the cluster.

service corosync start
corosync-cfgtool -s
Printing ring status.
Local node ID 335544330
RING ID 0
	id	= 10.0.0.20
	status	= ring 0 active with no faults


crm status
============
Last updated: Sun Mar  3 21:30:42 2013
Stack: openais
Current DC: wls1.area51.local - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ wls2.area51.local wls1.area51.local ]

For production usage you should configure STONITH, which is beyond the scope of this example, so for testing purposes I disabled it:

crm configure property stonith-enabled=false
crm configure property no-quorum-policy=ignore

I also configure resources not to fail back when the node that was running the resource comes back online.

crm configure rsc_defaults resource-stickiness=100

Now your cluster is ready, although it doesn't run WebLogic yet. There is no WebLogic cluster resource agent, so I wrote one myself. To keep it separated from other cluster resources I set up my own OCF resource tree (just mkdir and you are done). An OCF resource agent requires certain functions to be present in the script; the OCF Resource Agent Developer's Guide can help you with that one.

Here’s my example WebLogic cluster resource:

cat /usr/lib/ocf/resource.d/area51/weblogic 
#!/bin/bash
#
# Description:  Manages a WebLogic Administration Server as an OCF High-Availability
#               resource under Heartbeat/LinuxHA control
# Author:	Jacco H. Landlust <jacco.landlust@idba.nl>
# 		Inspired on the heartbeat/tomcat OCF resource
# Version:	1.0
#

OCF_ROOT=/usr/lib/ocf

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs
#RESOURCE_STATUSURL="http://127.0.0.1:7001/console"

usage()
{
	echo "$0 [start|stop|status|monitor|migrate_to|migrate_from]"
	return ${OCF_NOT_RUNNING}
}

isrunning_weblogic()
{
        if ! have_binary wget; then
		ocf_log err "Monitoring not supported by ${OCF_RESOURCE_INSTANCE}"
		ocf_log info "Please make sure that wget is available"
		return ${OCF_ERR_CONFIGURED}
        fi
        wget -O /dev/null ${RESOURCE_STATUSURL} >/dev/null 2>&1
}

isalive_weblogic()
{
        if ! have_binary pgrep; then
                ocf_log err "Monitoring not supported by ${OCF_RESOURCE_INSTANCE}"
                ocf_log info "Please make sure that pgrep is available"
                return ${OCF_ERR_CONFIGURED}
        fi
        pgrep -f weblogic.Name > /dev/null
}

monitor_weblogic()
{
        isalive_weblogic || return ${OCF_NOT_RUNNING}
        isrunning_weblogic || return ${OCF_NOT_RUNNING}
        return ${OCF_SUCCESS}
}

start_weblogic()
{
	if [ -f ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out ]; then
		su - ${WEBLOGIC_USER} --command "mv ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out.`date +%Y-%m-%d-%H%M`"
	fi
	monitor_weblogic
	if [ $? = ${OCF_NOT_RUNNING} ]; then
		ocf_log debug "start_weblogic"
		su - ${WEBLOGIC_USER} --command "nohup ${DOMAIN_HOME}/bin/startWebLogic.sh > ${DOMAIN_HOME}/servers/AdminServer/logs/AdminServer.out 2>&1 &"
		sleep 60
		touch ${OCF_RESKEY_state}
	fi
	monitor_weblogic
	if [ $? =  ${OCF_SUCCESS} ]; then
		return ${OCF_SUCCESS}
	fi
}

stop_weblogic()
{
#	monitor_weblogic
#	if [ $? =  $OCF_SUCCESS ]; then
		ocf_log debug "stop_weblogic"
		pkill -KILL -f startWebLogic.sh
		pkill -KILL -f weblogic.Name
		rm ${OCF_RESKEY_state}
#	fi
	return $OCF_SUCCESS
}

meta_data() {
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="weblogic" version="0.9">
	<version>1.0</version>
	<longdesc lang="en"> This is a WebLogic Resource Agent </longdesc>
	<shortdesc lang="en">WebLogic resource agent</shortdesc>
        
	<parameters>
		<parameter name="state" unique="1">
			<longdesc lang="en">Location to store the resource state in.</longdesc>
			<shortdesc lang="en">State file</shortdesc>
			<content type="string" default="${HA_VARRUN}{OCF_RESOURCE_INSTANCE}.state" />
		</parameter>
		<parameter name="statusurl" unique="1">
			<longdesc lang="en">URL for state confirmation.</longdesc>
			<shortdesc>URL for state confirmation</shortdesc>
			<content type="string" default="" />
		</parameter>
		<parameter name="domain_home" unique="1">
			<longdesc lang="en">PATH to the domain_home. Should be a full path</longdesc>
			<shortdesc lang="en">PATH to the domain.</shortdesc>
			<content type="string" default="" required="1" />
		</parameter>
		<parameter name="weblogic_user" unique="1">
			<longdesc lang="en">The user that starts WebLogic</longdesc>
			<shortdesc lang="en">The user that starts WebLogic</shortdesc>
			<content type="string" default="oracle" />
		</parameter>
	</parameters>   
        
	<actions>
		<action name="start"        timeout="90" />
		<action name="stop"         timeout="90" />
		<action name="monitor"      timeout="20" interval="10" depth="0" start-delay="0" />
		<action name="migrate_to"   timeout="90" />
		<action name="migrate_from" timeout="90" />
		<action name="meta-data"    timeout="5" />
	</actions>      
</resource-agent>
END
}

# Make the resource globally unique
: ${OCF_RESKEY_CRM_meta_interval=0}
: ${OCF_RESKEY_CRM_meta_globally_unique:="true"}

if [ "x${OCF_RESKEY_state}" = "x" ]; then
        if [ ${OCF_RESKEY_CRM_meta_globally_unique} = "false" ]; then
                state="${HA_VARRUN}${OCF_RESOURCE_INSTANCE}.state"
                
                # Strip off the trailing clone marker
                OCF_RESKEY_state=`echo $state | sed s/:[0-9][0-9]*\.state/.state/`
        else
                OCF_RESKEY_state="${HA_VARRUN}${OCF_RESOURCE_INSTANCE}.state"
        fi
fi

# Set some defaults
RESOURCE_STATUSURL="${OCF_RESKEY_statusurl-http://127.0.0.1:7001/console}"
DOMAIN_HOME="${OCF_RESKEY_domain_home}"
WEBLOGIC_USER="${OCF_RESKEY_weblogic_user-oracle}"

# MAIN
case $__OCF_ACTION in
	meta-data)      meta_data
       	         exit ${OCF_SUCCESS}
	                ;;
	start)          start_weblogic;;
	stop)           stop_weblogic;;
	status)		monitor_weblogic;;
	monitor)        monitor_weblogic;;
	migrate_to)     ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrate_to}."
			stop_weblogic
			;;
	migrate_from)   ocf_log info "Migrating ${OCF_RESOURCE_INSTANCE} to ${OCF_RESKEY_CRM_meta_migrated_from}."
			start_weblogic
			;;
	usage|help)     usage
			exit ${OCF_SUCCESS}
	       	         ;;
	*)		usage
			exit ${OCF_ERR_UNIMPLEMENTED}
			;;
esac
rc=$?

# Finish the script
ocf_log debug "${OCF_RESOURCE_INSTANCE} $__OCF_ACTION : $rc"

exit $rc

Please mind that you would have to copy the script to both nodes of the cluster.
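
Something like this does the trick (node names taken from the crm status output above; I assume root ssh access between the nodes):

for node in wls1.area51.local wls2.area51.local; do
    ssh root@${node} 'mkdir -p /usr/lib/ocf/resource.d/area51'
    scp /usr/lib/ocf/resource.d/area51/weblogic root@${node}:/usr/lib/ocf/resource.d/area51/
    ssh root@${node} 'chmod 755 /usr/lib/ocf/resource.d/area51/weblogic'
done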

Next up is configuring the WebLogic resource in the cluster. In my example the NFS share with the domain homes is mounted on /domains and the domain is called ha-adminserver. WebLogic runs as the oracle user and the administration server listens on all addresses on port 7001. Therefore the weblogic_user parameter is left at its default (oracle) and the statusurl used to check whether the administration server is running is left at its default too (http://127.0.0.1:7001/console). Only the domain home is passed to the cluster.

crm configure primitive weblogic ocf:area51:weblogic params domain_home="/domains/ha-adminserver" op start interval="0" timeout="90s" op monitor interval="30s"

When the resource is added, the cluster starts it automatically. Please keep in mind this takes some time, so you might not see results instantly. Also, the script has a sleep configured; if your administration server takes longer to boot you might want to fiddle with that value. For an example like the one in this blog post it works.

Next you can check the status of your resource:

crm status
============
Last updated: Sun Mar  3 21:35:47 2013
Stack: openais
Current DC: wls1.area51.local - partition with quorum
Version: 1.0.12-unknown
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ wls2.area51.local wls1.area51.local ]

 weblogic	(ocf::area51:weblogic):	Started wls2.area51.local

To check the configuration of your resource, run the configure show command

crm configure show
node wls1.area51.local
node wls2.area51.local
primitive weblogic ocf:area51:weblogic \
	params domain_home="/domains/ha-adminserver" \
	op start interval="0" timeout="90s" \
	op monitor interval="30s" \
	meta target-role="Started"
property $id="cib-bootstrap-options" \
	dc-version="1.0.12-unknown" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	no-quorum-policy="ignore" \
	last-lrm-refresh="1362338797"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100"

You can test failover by stopping CoroSync on the active node, killing the Linux node, etc.
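
For example (run the stop on the node currently hosting the resource, then watch the other node pick it up):

service corosync stop
# on the surviving node:
crm status
crm_mon -1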

Other useful commands:

# start resource
crm resource start weblogic

# stop resource
crm resource stop weblogic

# Cleanup errors
crm_resource --resource weblogic -C

# Move resource to the other node. Mind you: this pins the resource to that node with a location constraint, taking control away from the cluster
crm resource move weblogic wls1.area51.local
# give authority over resource back to cluster
crm resource unmove weblogic

# delete cluster resource
crm configure delete weblogic

If you want your WebLogic administration server resource to be bound to a VIP, just google for setting up an HA Apache on PaceMaker. There is plenty of information about that on the web, e.g. this site, which helped me set up the cluster too.
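
I did not need this myself (the VIP lives on the load balancer in my topology), but a minimal sketch would look something like this, using the standard IPaddr2 resource agent and a group so the VIP and WebLogic always run on the same node (the IP address is just an example):

crm configure primitive admin_vip ocf:heartbeat:IPaddr2 params ip="10.0.0.100" cidr_netmask="24" op monitor interval="30s"
crm configure group adminserver_group admin_vip weblogic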

Well, I hope this helps anyone who is trying to set up an HA administration server.

Written by Jacco H. Landlust

March 3, 2013 at 11:39 pm