Contention problems with AQ
While doing a routine checkup I found four processes eating up CPUs on one node of a two (2) node RAC cluster: ora_q000_SID2, ora_q001_SID2, ora_q002_SID2 and ora_q003_SID2. This suggests there is some extra AQ activity on the SID2 instance. (This is a 10.2.0.4 RDBMS RAC running on 64-bit OEL 4.7, with CRS and ASM on 11.1.0.7.)
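For reference, these q00n slaves can be mapped back to database sessions with something along these lines (a sketch; it assumes the slave name shows up in v$session.program, as it does in the contention output further down):
SELECT p.spid, s.sid, s.serial#, s.program
FROM v$process p, v$session s
WHERE s.paddr = p.addr
AND UPPER(s.program) LIKE '%(Q00%)'
/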
So the first thing I investigated was the queues:
SELECT q.name, s.*
FROM dba_queues q, v$aq s
WHERE q.qid = s.qid
/
NAME QID WAITING READY EXPIRED TOTAL_WAIT AVERAGE_WAIT
------------------------------ ---------- ---------- ---------- ---------- ---------- ------------
<SNIP>
AQ$_ALERT_QT_E 8877 0 0 0 0 0
ALERT_QUE 8878 0 0 0 0 0
AQ$_SYS$SERVICE_METRICS_TAB_E 9838 0 0 0 0 0
SYS$SERVICE_METRICS 9839 0 478713 0 0 0
AQ$_WM$EVENT_QUEUE_TABLE_E 10268 0 0 0 0 0
WM$EVENT_QUEUE 10269 0 0 0 0 0
AQ$_WF_DEFERRED_TABLE_M_E 50976 0 0 0 0 0
<SNIP>

98 rows selected.
The number of ready messages on the SYS$SERVICE_METRICS queue is much larger than expected. So the next thing is to investigate what's happening on this queue. First I needed to find the queue table:
SELECT queue_table
FROM dba_queues
WHERE name = 'SYS$SERVICE_METRICS'
/

QUEUE_TABLE
------------------------------
SYS$SERVICE_METRICS_TAB
Next I checked how the data is spread over time:
SELECT TRUNC(enq_time), COUNT(*)
FROM sys$service_metrics_tab
GROUP BY TRUNC(enq_time)
ORDER BY 1
/

TRUNC(ENQ_TIME)       COUNT(*)
------------------- ----------
11-11-2009:00:00:00 123
12-11-2009:00:00:00 8628
13-11-2009:00:00:00 8631
14-11-2009:00:00:00 8631
15-11-2009:00:00:00 8628
16-11-2009:00:00:00 8631
17-11-2009:00:00:00 8628
18-11-2009:00:00:00 8628
19-11-2009:00:00:00 8625
20-11-2009:00:00:00 8628
21-11-2009:00:00:00 8628
22-11-2009:00:00:00 8631
23-11-2009:00:00:00 8628
24-11-2009:00:00:00 8634
25-11-2009:00:00:00 8625
26-11-2009:00:00:00 8631
27-11-2009:00:00:00 8625
28-11-2009:00:00:00 8628
29-11-2009:00:00:00 8634
30-11-2009:00:00:00 8631
01-12-2009:00:00:00 8631
02-12-2009:00:00:00 8628
03-12-2009:00:00:00 8628
04-12-2009:00:00:00 8628
05-12-2009:00:00:00 8631
06-12-2009:00:00:00 8628
07-12-2009:00:00:00 8631
08-12-2009:00:00:00 8628
09-12-2009:00:00:00 8628
10-12-2009:00:00:00 8631
11-12-2009:00:00:00 8628
12-12-2009:00:00:00 8631
13-12-2009:00:00:00 8652
14-12-2009:00:00:00 8631
15-12-2009:00:00:00 8628
16-12-2009:00:00:00 8628
17-12-2009:00:00:00 8628
18-12-2009:00:00:00 8625
19-12-2009:00:00:00 8625
20-12-2009:00:00:00 8628
21-12-2009:00:00:00 8628
22-12-2009:00:00:00 8628
23-12-2009:00:00:00 8628
24-12-2009:00:00:00 8628
25-12-2009:00:00:00 8628
26-12-2009:00:00:00 8625
27-12-2009:00:00:00 8628
28-12-2009:00:00:00 8628
29-12-2009:00:00:00 8631
30-12-2009:00:00:00 8631
31-12-2009:00:00:00 8628
01-01-2010:00:00:00 8628
02-01-2010:00:00:00 8631
03-01-2010:00:00:00 8628
04-01-2010:00:00:00 8628
05-01-2010:00:00:00 8628
06-01-2010:00:00:00       4116

57 rows selected.
The spread is remarkably even: roughly 8,630 rows per day, or about one new message every ten seconds. Next I checked the oldest enqueue time:
SELECT MIN(enq_time) enq_time
FROM sys$service_metrics_tab
/

ENQ_TIME
-------------------
11-11-2009:00:00:00
That's interesting! Apparently, from 11-11-2009 onwards the messages are not (or not completely) removed from the queue anymore. Since I don't know which process is supposed to get messages from this queue, I checked the v$access view (WARNING: performance penalty!) for activity:
SELECT s.username, s.program, a.*
FROM v$access a, v$session s
WHERE a.object='SYS$SERVICE_METRICS_TAB'
AND a.sid = s.sid
/
USERNAME PROGRAM SID OWNER OBJECT TYPE
---------- -------------------------------------------------- ---------- ---------- ------------------------------ ---------------
SYS racgimon@someserver.someclient.nl (TNS V1-V3) 117 SYS SYS$SERVICE_METRICS_TAB TABLE
oracle@someserver.someclient.nl (MMON) 266 SYS SYS$SERVICE_METRICS_TAB TABLE
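As an aside, the registered subscribers of the queue can also be listed without the v$access penalty; a sketch, assuming the dba_queue_subscribers view (and its consumer_name column) behaves on this version as documented:
SELECT owner, queue_name, consumer_name, address
FROM dba_queue_subscribers
WHERE queue_name = 'SYS$SERVICE_METRICS'
/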
It seems that racgimon or MMON is supposed to get messages from this queue, but racgimon (or MMON) is not able to clear it.
Since this queue is normally (almost) empty, this sounds like an accident waiting to happen. More and more blocks are used in the queue table (which has no indexes), turning the quick full table scan needed for a dequeue into a slow full table scan over a couple of hundred megabytes of data. The lack of indexes also causes locking issues. If this is the case, and if it is causing the CPU problems, I should see contention on the SYS$SERVICE_METRICS_TAB table.
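A quick sanity check on the segment size first (a sketch; it assumes the queue table lives in the SYS schema and that dba_segments is accessible):
SELECT segment_name, blocks, ROUND(bytes/1024/1024) mb
FROM dba_segments
WHERE owner = 'SYS'
AND segment_name = 'SYS$SERVICE_METRICS_TAB'
/
Back to the contention check: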
SELECT s.username, s.program, w.event, w.p1, w.p2, o.object_type, o.object_name
FROM v$session_wait w, v$session s, dba_objects o
WHERE w.event LIKE '%contention%'
AND s.sid = w.sid
AND w.p2 = o.object_id
/

USERNAME   PROGRAM                                  EVENT                                  P1         P2 OBJECT_TYPE          OBJECT_NAME
---------- ---------------------------------------- ------------------------------ ---------- ---------- -------------------- -----------
           oracle@someserver.someclient.nl (q003)   enq: TQ - TM contention        1414594566       8856 TABLE                ALERT_QT
           oracle@someserver.someclient.nl (q002)   enq: TQ - TM contention        1414594566       8856 TABLE                ALERT_QT
           oracle@someserver.someclient.nl (q001)   enq: TQ - TM contention        1414594566       8856 TABLE                ALERT_QT
           oracle@someserver.someclient.nl (q000)   enq: TQ - TM contention        1414594566       8856 TABLE                ALERT_QT
This was almost what I expected to find: the contention is on ALERT_QT instead of on the SYS$SERVICE_METRICS queue table!? Unless there is some direct relation between the data in the SYS$SERVICE_METRICS queue and the ALERT_QUE, I am actually looking at two different problems.
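As a side note, the P1 value of an enqueue wait encodes the lock type and the requested mode; the usual decode is sketched below (generic, nothing in it is specific to this case):
SELECT chr(bitand(p1,-16777216)/16777215) || chr(bitand(p1,16711680)/65535) lock_type,
       bitand(p1,65535) requested_mode
FROM v$session_wait
WHERE event LIKE 'enq:%'
/
For 1414594566 this decodes to lock type TQ, mode 6 (exclusive), consistent with the event name.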
Time to go MOS-ing. On MOS I found one bug that just might be the problem I am experiencing: Bug 8585521 (RACGIMON NOT PROCESSING MESSAGES FROM THE SYS$SERVICE_METRICS QUEUE), but this bug is for 11.1 on AIX. So I filed two SRs, since the problems might not be related.
It took some time to get to the right support engineer, and some more time to explain the problem. The first advice I received was to set aq_tm_processes to 1 (it was at 4); this should clear the contention. Well… NOT! It just lowered CPU usage, but the contention remained. Next I was asked to run oradebug ON A PRODUCTION SYSTEM for at least 20 minutes (oradebug Event 10046 trace name context forever, level 12). Obviously I was reluctant to comply, but since there was nothing else I could do I had oradebug running for a minute or two (the commands are sketched after the AWR excerpt below). The oradebug trace basically told me the same as the AWR report: one query killing the system:
SQL ordered by Executions  DB/Inst: DBNAME2/DBNAME22  Snaps: 6527-6528
-> Total Executions: 11,070,161
-> Captured SQL account for 99.9% of Total

                                              CPU per    Elap per
 Executions   Rows Processed  Rows per Exec   Exec (s)   Exec (s)  SQL Id
------------ --------------- -------------- ---------- ----------- -------------
   2,744,039       2,744,039            1.0       0.00        0.00 cgb6kp2umq52a
select t.schema, t.name, t.flags, q.name, t.timezone from system.aq$_queue_tabl
es t, system.aq$_queues q where t.objno = :1 and q.table_objno = t.objno and q
.usage = 0 and NOT ( t.name in ('DEF$_AQCALL' , 'DEF$_AQERROR')
and t.schema = 'SYSTEM')
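For reference, the oradebug session mentioned above looked roughly like this (a sketch; 12345 is a hypothetical OS pid, look up the pid of the q00n slave you want to trace first):
REM as SYSDBA on the instance running the busy q00n slave
oradebug setospid 12345
oradebug unlimit
oradebug Event 10046 trace name context forever, level 12
REM wait a minute or two, then switch the trace off again
oradebug Event 10046 trace name context off
oradebug tracefile_name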
After some more bouncing around EM agents, and the advice to drop the alert_qt and recreate it (on a freaking production system, what are these guys thinking!? I have no clue at all what the consequences of re-creating this queue might be), I finally decided to set aq_tm_processes to 0 and then set it back to 4. All of a sudden the contention disappeared.
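For completeness, that workaround boils down to something like this (a sketch; the scope and sid clauses are assumptions, adjust them to how the parameter is managed on your system):
ALTER SYSTEM SET aq_tm_processes = 0 SCOPE=MEMORY SID='*';
REM give the QMON slaves a moment to shut down, then re-enable them
ALTER SYSTEM SET aq_tm_processes = 4 SCOPE=MEMORY SID='*';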
Now that problem one seemed to be resolved, it was time to check the sys$service_metrics queue. It turns out that on 322 out of close to 400 databases at this customer there are rows remaining in the sys$service_metrics queue. The messages in the queues have different enqueue times, which shows me this was not caused by some CPU or PSU patch. Filing another SR at MOS only gave me the advice to manually dequeue the messages, since the messages are not relevant anymore. The bug mentioned before (8585521) was closed because of lack of information (hello!?), but the issue will be solved in the next release (quite interesting, since there supposedly is not enough information for the bug!?).
Anyway, there is now an EM job running for all databases with this code:
declare
  po dbms_aqadm.aq$_purge_options_t;
begin
  -- block := true takes an exclusive lock on the queue table for the duration of the purge
  po.block := true;
  dbms_aqadm.purge_queue_table(queue_table     => 'SYS$SERVICE_METRICS_TAB',
                               purge_condition => 'qtview.msg_state=''PROCESSED''',
                               purge_options   => po);
end;
/
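(For what it's worth: qtview is the alias the purge condition uses to reference the queue-table view columns such as msg_state, so the block above only removes messages that have already been processed.)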
If anyone can tell me more about either the contention issue or the service_metrics queue, just drop a reply.
Hope this helps.
Hi J,
Did you see bug 6057422 on MOS?
This bug is for Oracle 10.2.0.3 and it should be fixed in 10.2.0.4.
As you've said, you are working on 10.2.0.4, but it would probably be a good idea to look in the racgimon session trace file to see whether there are any errors on clearing SYS.SYS$SERVICE_METRICS.
There is also a note (785689.1) on upgrading from 10.2.0.1 to 10.2.0.5 where the problem is described as "Upgrade process appears to hang at dbms_aqadm.drop_queue('SYS$SERVICE_METRICS')" due to excessive messages in the AQ tables.
So Note 785689.1 does acknowledge the problem's existence.
V.kolmer
March 26, 2010 at 4:03 pm
We are not hitting bug 6057422; the mentioned errors are not in the trace files.
Oracle already acknowledged the problem; there is simply no other fix than the workaround to manually clear the messages from the queue (well… not using TAF for the RAC services might also help, especially since most applications can't handle TAF anyway 😉 ).
Jacco H. Landlust
March 27, 2010 at 3:14 pm
Hello,
I'm seeing a similar issue with the cjq0 process hogging CPU in 11.2.0.2 RAC. The cjq0 session is waiting on gc cr request while executing this SQL:
SELECT COUNT (*)
FROM SYS.AQ$_SYS$SERVICE_METRICS_TAB_S
WHERE name = :1
Anyone else seen this and fixed it?
Dave M
November 10, 2011 at 10:30 pm