Oracle MVA

Tales from a Jack of all trades

when ACFS crashes…

Today three out of five nodes of a cluster crashed while a load test was running on two of its nodes. The cluster is an ACFS cluster with OSB and SOA productions on top of it. It uses the ACFS disk for logging and configuration; all binaries are on local disk. The GI version is 11.2.0.1, running on 64-bit OEL 5.5.

This blogpost is mostly a note to myself, but the content might help some other people too.

While looking through the CRS logfiles for the cause of the node failures I found this error:


view /u01/app/grid/log/some_server/agent/crsd/oraagent_oracle/oraagent_oracle.l01


2010-10-12 10:21:05.060: [ora.DGGRID.dg][1536899392] [check] InstConnection::connectInt (2) Exception OCIException
2010-10-12 10:21:05.060: [ora.DGGRID.dg][1536899392] [check] Exception type=2 string=ORA-01034: ORACLE not available
ORA-27102: out of memory
Linux-x86_64 Error: 12: Cannot allocate memory
Additional information: 1
Additional information: 491521
Additional information: 8
Process ID: 0
Session ID: 0 Serial number: 0
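
When several nodes fail at once it can help to tally the ORA- codes across the agent logs instead of reading them by eye. The following is only a minimal sketch: the function name `count_ora_errors` and the sample text are illustrative assumptions, and the pattern simply looks for the `ORA-nnnnn` form seen in the log above.

```python
import re
from collections import Counter

# Matches Oracle error codes of the form seen above, e.g. ORA-01034.
ORA_CODE = re.compile(r"\bORA-\d{5}\b")


def count_ora_errors(log_text):
    """Count how often each ORA-nnnnn code appears in the given log text."""
    return Counter(ORA_CODE.findall(log_text))


# Sample lines copied from the oraagent log excerpt above.
sample = (
    "2010-10-12 10:21:05.060: [ora.DGGRID.dg][1536899392] [check] "
    "Exception type=2 string=ORA-01034: ORACLE not available\n"
    "ORA-27102: out of memory\n"
    "Linux-x86_64 Error: 12: Cannot allocate memory\n"
)

for code, hits in count_ora_errors(sample).most_common():
    print(code, hits)
```

To use it on a real log, read the file's contents into a string and pass that to the function in place of `sample`.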

The ASM instance had 1 GB set for both memory_target and memory_max_target. So somehow ACFS uses more memory under heavy load. I am not aware of any formulas or best practices for calculating memory_target for an ASM instance that is just running ACFS. The 1 GB was a guesstimate based on 11.1 knowledge. If anyone has pointers for me regarding memory settings for ASM with just ACFS, please comment on this blogpost.

Further checking, this time on the Linux side (OEL 5), turned up more:


dmesg


[Oracle ACFS] FSCK-NEEDED set for volume /dev/asm/v_disk-170 . Internal ACFS Location: 916 .
[Oracle ACFS] A problem has been detected with
[Oracle ACFS] the file system metadata in /dev/asm/v_disk-170 .
[Oracle ACFS] Normal operation can continue, but it is advisable
[Oracle ACFS] to run fsck on the file system as soon as it is
[Oracle ACFS] feasible to do so.  See the Storage Admin
[Oracle ACFS] Guide for more information about FSCK-NEEDED.
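
On a cluster with several ACFS volumes it may be handy to pull the flagged devices out of the kernel log automatically. This is a rough sketch, assuming the message wording is exactly as quoted above (other ACFS versions may phrase it differently); the helper name `volumes_needing_fsck` is made up for illustration.

```python
import re
from typing import List

# Pattern taken from the kernel message quoted above; this wording is an
# assumption for any other ACFS version.
FSCK_NEEDED = re.compile(r"\[Oracle ACFS\] FSCK-NEEDED set for volume (\S+)")


def volumes_needing_fsck(dmesg_text: str) -> List[str]:
    """Return the distinct volumes flagged FSCK-NEEDED, in order of appearance."""
    seen: List[str] = []
    for match in FSCK_NEEDED.finditer(dmesg_text):
        volume = match.group(1)
        if volume not in seen:
            seen.append(volume)
    return seen


# Sample lines copied from the dmesg output above.
sample = (
    "[Oracle ACFS] FSCK-NEEDED set for volume /dev/asm/v_disk-170 . "
    "Internal ACFS Location: 916 .\n"
    "[Oracle ACFS] A problem has been detected with\n"
)

print(volumes_needing_fsck(sample))
```

The text to scan could come from the output of `dmesg` captured into a string, for example via `subprocess.run(["dmesg"], capture_output=True, text=True).stdout`.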

Now this seems like trouble, so I stopped all nodes of the cluster (*yikes*) and started an fsck. This ran for ages, doing nothing but seeks and 4 KB reads:


lseek(4, 9781714944, SEEK_SET) = 9781714944
read(4, "\202\1\6P\17dG\26o\315\324_\3262\363 \tG\2"..., 4096) = 4096
lseek(4, 9781706752, SEEK_SET) = 9781706752
read(4, "\202\1\6P\17dG\26o\315\324(\6\363\23\tG\2"..., 4096) = 4096
lseek(4, 9781649408, SEEK_SET) = 9781649408
read(4, "\202\1\6P\17dG\26o\315\324\262\226\352\20 \10G\2"..., 4096) = 4096
lseek(4, 9781739520, SEEK_SET) = 9781739520
read(4, "\202\1\5P\17dG\26o\315\324\257\27=\233\200\tG\2"..., 4096) = 4096
lseek(4, 2315993088, SEEK_SET) = 2315993088
read(4, "\202\1\5P\17dG\26o\315\324\340O\10\310@\v\212"..., 4096) = 4096
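
One way to judge whether a process like this is making progress or circling over the same blocks is to collect the seek offsets from the trace and compare how many are distinct. The sketch below assumes trace lines in the `lseek(fd, offset, SEEK_SET)` form shown above; the function names are illustrative only.

```python
import re

# Matches trace lines of the form: lseek(4, 9781714944, SEEK_SET) = 9781714944
LSEEK = re.compile(r"lseek\(\d+, (\d+), SEEK_SET\)")

GIB = 1024 ** 3


def seek_offsets(trace_text):
    """Return every SEEK_SET offset found in the trace, in order."""
    return [int(m.group(1)) for m in LSEEK.finditer(trace_text)]


def summarize(offsets):
    """Summarize how many seeks were seen and which byte range they cover."""
    return {
        "seeks": len(offsets),
        "distinct": len(set(offsets)),
        "min_gib": min(offsets) / GIB,
        "max_gib": max(offsets) / GIB,
    }


# The lseek lines from the trace above.
sample = """\
lseek(4, 9781714944, SEEK_SET) = 9781714944
lseek(4, 9781706752, SEEK_SET) = 9781706752
lseek(4, 9781649408, SEEK_SET) = 9781649408
lseek(4, 9781739520, SEEK_SET) = 9781739520
lseek(4, 2315993088, SEEK_SET) = 2315993088
"""

print(summarize(seek_offsets(sample)))
```

If the number of distinct offsets stays far below the number of seeks over a long capture, the same blocks are being re-read; five lines are of course too few to conclude anything.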

Now I’m no C programmer, nor a filesystem specialist, so I don’t exactly know what’s going on (yet). After 4 hours I decided that waiting longer was futile; it’s just a freaking 10 GB disk!

I started fsck again, only this time with some extra parameters:


$ fsck -a -v -y -t acfs /dev/asm/v_disk-170


OfsCheckOnDiskGBM entered
fsck.acfs: OfsReadMeta at offset: 67112960 (0x4001000)    size: 327680 (0x50000)
OfsCheckFileEntry entered for:
ACFS Internal File: [ACFS Snap Map]
fenum: 19 (0x13)   disk offset: 79360 (0x13600)


fsck.acfs: OfsReadMeta at offset: 79360 (0x13600)    size: 512 (0x200)
OfsCheckFileExtents entered for:
ACFS Internal File: [ACFS Snap Map]
fenum: 19 (0x13)   disk offset: 79360 (0x13600)


fsck.acfs: OfsReadMeta at offset: 67440640 (0x4051000)    size: 512 (0x200)


Checking if any files are orphaned...


Phase 1 Orphan check...


fsck.acfs: OfsReadMeta at offset: 81920 (0x14000)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 82432 (0x14200)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 82944 (0x14400)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 83456 (0x14600)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 83968 (0x14800)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 84480 (0x14a00)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 84992 (0x14c00)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 85504 (0x14e00)    size: 512 (0x200)


Phase 2 Orphan check...


fsck.acfs: OfsReadMeta at offset: 81920 (0x14000)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 82432 (0x14200)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 82944 (0x14400)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 83456 (0x14600)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 83968 (0x14800)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 84480 (0x14a00)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 84992 (0x14c00)    size: 512 (0x200)
fsck.acfs: OfsReadMeta at offset: 85504 (0x14e00)    size: 512 (0x200)


0 orphans found


fsck.acfs: fsck.acfs: Checker completed with the following results:
File System Errors:   2
Fixed:            2
Not Fixed:        0

With these parameters fsck finished in a couple of minutes, after which I could mount the ACFS disk on the cluster again.
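
If the fsck run is scripted, the closing summary can be checked before remounting. The following is a minimal sketch that assumes the three summary labels appear exactly as in the output above; `parse_fsck_summary` and `all_fixed` are hypothetical helper names, not part of any Oracle tooling.

```python
import re

# Labels as printed in the fsck.acfs summary above.
LABELS = ("File System Errors", "Fixed", "Not Fixed")


def parse_fsck_summary(output):
    """Extract the counters from the fsck summary into a dict."""
    fields = {}
    for label in LABELS:
        # Anchor at line start so "Fixed" does not match the "Not Fixed" line.
        pattern = r"^\s*" + re.escape(label) + r":\s*(\d+)\s*$"
        match = re.search(pattern, output, re.MULTILINE)
        if match:
            fields[label] = int(match.group(1))
    return fields


def all_fixed(summary):
    """True only if the summary was found and reports nothing left unfixed."""
    return summary.get("Not Fixed") == 0


# Summary lines copied from the fsck output above.
sample = """\
fsck.acfs: fsck.acfs: Checker completed with the following results:
File System Errors:   2
Fixed:            2
Not Fixed:        0
"""

summary = parse_fsck_summary(sample)
print(summary, all_fixed(summary))
```

Treating a missing summary as "not fixed" is a deliberate choice here: if fsck was killed or its output format differs, the check fails rather than passing silently.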

Written by Jacco H. Landlust

October 12, 2010 at 4:49 pm