Wednesday, 29 June 2011

Bug 9572787 Long waits for ‘enq: AM – disk offline’ following cell or disk crash (can cause further instance crashes)

Anyone implementing Oracle 11g R2 (11.2.0.2) with an extended cluster might be affected by this bug.
In my case, the bug came out when we tested offlining the whole second disk array (Failgroup 2); we are using ASM normal redundancy.
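
For context, this is roughly how I watch the disk/failgroup status and the stuck sessions during such a test. This is only a sketch using the standard V$ASM and GV$SESSION views; the wait-event string is taken from the bug title, not something I have re-verified.

-- Disk status per failgroup (run on an ASM instance as SYSASM)
SELECT g.name AS diskgroup, d.failgroup, d.name AS disk, d.mode_status, d.state
FROM   v$asm_disk d JOIN v$asm_diskgroup g ON g.group_number = d.group_number
ORDER  BY g.name, d.failgroup, d.name;

-- Sessions waiting on the enqueue described in the bug (event name as quoted in the bug title)
SELECT inst_id, COUNT(*)
FROM   gv$session
WHERE  event = 'enq: AM - disk offline'
GROUP  BY inst_id;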

Until now we are still waiting for the merge patch for this issue; I hope they release it soon. For more details about this bug, please check the Oracle Support portal.

Update 8 July 2011:
Oracle released the merge patch. The test result is much better, but the instances still rebooted. Only after we set the hidden parameter recommended by Oracle Support did the instances survive. The parameter is
_lm_rcvr_hang_kill=FALSE
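
For reference, this is roughly how we applied it (a sketch only; we set it in the spfile and restarted the instances rather than assuming it can be changed dynamically):

-- Hidden parameter, set only on Oracle Support's advice; underscore parameters need double quotes
ALTER SYSTEM SET "_lm_rcvr_hang_kill" = FALSE SCOPE=SPFILE SID='*';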

Do not use this parameter unless Oracle Support tells you to do so.

Tuesday, 28 June 2011

ORA-12514 during Switchover using DG broker with multiple IP segment

Our new system, a 2-node RAC on Oracle 11.2.0.2, is set up with a single-instance standby database. Yesterday we tested a switchover to the standby database using 'dgmgrl'. The switchover was OK, except that my primary node (Node1) was not automatically restarted; instead it threw ORA-12514 - TNS:listener does not currently know of service requested in connect descriptor.
I checked that all my listeners for the DG broker had been set up with static registrations, since that is usually the cause of this error. Everything looked properly set up until I checked each instance's broker "StaticConnectIdentifier" property (new in 11g):
dgmgrl > show instance verbose "PRD1"
...
StaticConnectIdentifier='(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.22.2.29)(PORT=1532))(CONNECT_DATA=(SERVICE_NAME=PRD_DGMGRL)(INSTANCE_NAME=PRD1)(SERVER=DEDICATED)))';

dgmgrl > show instance verbose "PRD2"
StaticConnectIdentifier='(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.22.2.30)(PORT=1532))(CONNECT_DATA=(SERVICE_NAME=PRD_DGMGRL)(INSTANCE_NAME=PRD2)(SERVER=DEDICATED)))';

Well, we configured all the static listener registrations on 172.22.12.7 (VIP1 - segment 12) and 172.22.12.9 (VIP2 - segment 12). I remembered that IP segment 172.22.2.xx was the initial VIP segment used during the installation.
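
For reference, a static registration in listener.ora looks roughly like this on node 1. The listener name and Oracle home path here are illustrative, not my actual values; the point is that GLOBAL_DBNAME matches the SERVICE_NAME used in StaticConnectIdentifier (PRD_DGMGRL) and that the listener itself runs on the correct VIP and port.

SID_LIST_LISTENER_DG =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = PRD_DGMGRL)
      (ORACLE_HOME = /u01/app/oracle/product/11.2.0.2/dbhome_1)   # illustrative path
      (SID_NAME = PRD1)
    )
  )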

Then I just changed it to the correct IP addresses and port number.
dgmgrl> edit instance 'PRD1' set property StaticConnectIdentifier='(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.22.12.7)(PORT=1523))(CONNECT_DATA=(SERVICE_NAME=PRD_DGMGRL)(INSTANCE_NAME=PRD1)(SERVER=DEDICATED)))';
dgmgrl> edit instance 'PRD2' set property StaticConnectIdentifier='(DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=172.22.12.9)(PORT=1523))(CONNECT_DATA=(SERVICE_NAME=PRD_DGMGRL)(INSTANCE_NAME=PRD2)(SERVER=DEDICATED)))';

After these changes my switchover ran smoothly without any error. Problem solved!

Thursday, 23 June 2011

Running catalog/catproc.sql on the Primary DB with Data Guard - will this invalidate my Physical standby?

A few days back I noticed that one of my newly created databases had many invalid objects. I checked dba_registry and it showed some of the components as invalid. Running utlrp.sql several times still didn't help compile all the invalid SYS objects. A few Advanced Queuing objects stayed invalid even after manual compilation. I raised an SR with Oracle to check whether these objects were corrupted.
Then Oracle Support asked me to clean up those AQ objects and run catalog/catproc.sql on my Primary DB. I wondered whether this would make my physical standby DB invalid. Since this is a Pre-Production environment and many people are testing on it now, I tried Google and searched Oracle Support for any article about this. I gave up and opened a new SR about the issue. They confirmed it is fine to rerun catalog/catproc.sql in my environment; everything will be replicated to the standby site.

I tested it and it ran well without any issue. Of course, in my case I needed to turn off the cluster parameter temporarily before running it and start the DB in upgrade mode (startup upgrade). Now all my missing/corrupted objects are back and there are no more invalid components in dba_registry. I checked my Data Guard for the whole day and it looks good.
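For reference, the sequence I followed was roughly this (a sketch only, run from the first instance with the other RAC instance shut down; adjust it to your own environment):

-- sketch: disable the cluster parameter, run the scripts in upgrade mode, then revert
SQL> alter system set cluster_database=false scope=spfile;
SQL> shutdown immediate
SQL> startup upgrade
SQL> @?/rdbms/admin/catalog.sql
SQL> @?/rdbms/admin/catproc.sql
SQL> @?/rdbms/admin/utlrp.sql
SQL> alter system set cluster_database=true scope=spfile;
SQL> shutdown immediate
SQL> startup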

Oracle 11gR2 RAC stuck on rootcrs.pl -unlock when applying PSU

About 2 weeks ago I ran patch PSU 12311357 on my UAT environment, which has about the same setup as my Pre-Production: a 2-node RAC running on 11.2.0.2. Everything went fine on UAT. The next day I applied this patch on the Production environment, including the single-instance Data Guard.

The first patch, on the Data Guard site, ran very smoothly. Then I continued with the Pre-Production 1st node, where I got stuck on "rootcrs.pl -unlock"; I waited for about 1 hour and it still was not moving. There had to be something wrong with this process. I checked that my whole CRS stack was gone and had shut down properly. So which files was it still trying so hard to unlock?
What I did at the time was abort that session (CTRL+C). Then I tried running the unlock command again. It still got stuck there.
But this time I kept moving and applied the patch, since I didn't have any more time to open an SR and we needed a critical fix from this patch (I believe many people won't agree with my decision here). The patch ran fine without any issue. The same thing happened on the 2nd node.
I opened an SR about this issue, and it turned out the hang was due to the huge number of log files under GRID_HOME that had to be unlocked.
Here is the error message from when the unlock command hung, taken from rootcrs_prd05istl1swmadb.log:

....
2011-05-29 18:40:45: ###### Begin DIE Stack Trace ######
2011-05-29 18:40:45: Package File Line Calling
2011-05-29 18:40:45: --------------- -------------------- ---- ----------
2011-05-29 18:40:45: 1: main rootcrs.pl 325 crsconfig_lib::dietrap
2011-05-29 18:40:45: 2: crsconfig_lib s_crsconfig_lib.pm 350 main::__ANON__
2011-05-29 18:40:45: 3: crsconfig_lib s_crsconfig_lib.pm 350 (eval)
2011-05-29 18:40:45: 4: File::Find Find.pm 886 crsconfig_lib::reset_perms
2011-05-29 18:40:45: 5: File::Find Find.pm 700 File::Find::_find_dir
2011-05-29 18:40:45: 6: File::Find Find.pm 1229 File::Find::_find_opt
2011-05-29 18:40:45: 7: crsconfig_lib s_crsconfig_lib.pm 332 File::Find::finddepth
2011-05-29 18:40:45: 8: crsconfig_lib crsconfig_lib.pm 11198 crsconfig_lib::s_reset_crshome1
2011-05-29 18:40:45: 9: crsconfig_lib crsconfig_lib.pm 11209 crsconfig_lib::unlockCRSHomefordeinstall
2011-05-29 18:40:45: 10: main rootcrs.pl 382 crsconfig_lib::unlockCRSHome
2011-05-29 18:40:45: ####### End DIE Stack Trace #######
I am still holding this issue, since as far as I have checked there is no way to simulate running this unlock command without affecting the GRID stack. I will update this article again after I apply my next patch, to see whether this is related to my previous post "Oracle 11g R2 ASM Audit log generates a lot of files".

Update:
It is confirmed that the large number of audit log files caused the issue. After removing the files, I applied the next patch and everything went very smoothly. "rootcrs.pl -unlock" actually performs permission changes on the files under GRID_HOME.
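Since then, before running "rootcrs.pl -unlock" I first check how many files the recursive permission reset will have to walk through under the Grid home (the audit directory is the one that blew up in my case; the path is from my environment):

# count the ASM audit files under the Grid home before unlocking
ls /u01/app/11.2/grid/rdbms/audit | wc -l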

Oracle 11g R2 ASM Audit log generates a lot of files

I recently noticed in my 2-node RAC 11gR2 (11.2.0.2) that disk usage on /u01 increases rapidly even when no processes are running. This mount point contains GRID_HOME and ORACLE_HOME on each node.
I managed to find which directory was causing this. It led me to GRID_HOME, under /u01/app/11.2/grid/rdbms/audit/, which generates a lot of files (+asm1_ora*.aud) every minute. When I counted them, guess how many files were there? Around 200K, and they kept being generated.
I remember only enabling SYS auditing at the database level, which writes directly to the OS syslog. Then I checked my ASM instances to see whether auditing was enabled, and guess what? All the audit parameters on all my ASM instances are disabled ('FALSE' value).
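The quick check I ran on each ASM instance (connected as SYSASM) was simply:

SQL> show parameter audit_trail
SQL> show parameter audit_sys_operations
SQL> show parameter audit_file_dest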
Let me see if anything is written in the documentation about this.
 http://oracle.su/docs/11g/network.112/e10574/auditing.htm#CEGJADDC

Activities That Are Always Audited for All Platforms

Oracle Database always audits certain database-related operations and writes them to the operating system audit files. It includes the actions of any user who is logged in with the SYSDBA or SYSOPER privilege. This is called mandatory auditing. Even if you have enabled the database audit trail (that is, setting the AUDIT_TRAIL parameter to DB), Oracle Database still writes mandatory records to operating system files.
By default, the operating system files are in the $ORACLE_HOME/admin/$ORACLE_SID/adump directory on UNIX systems. On Windows systems, Oracle Database writes this information to the Windows Event Viewer. You can change the location of this directory by setting the AUDIT_FILE_DEST initialization parameter, which is described in "Specifying a Directory for the Operating System Audit Trail".
Mandatory auditing includes the following operations:
  • Database startup. An audit record is generated that lists the operating system user starting the instance, the user terminal identifier, and the date and time stamp. This data is stored in the operating system audit trail because the database audit trail is not available until after the startup has successfully completed.
  • SYSDBA and SYSOPER logins. Oracle Database records all SYSDBA and SYSOPER connections.
  • Database shutdown. An audit record is generated that lists the operating system user shutting down the instance, the user terminal identifier, and the date and time stamp.

But in my case it is generated far too aggressively, and a level 12 trace came up with nothing other than user SYS or SYSASM login information in those files. I opened an SR with Oracle Support about this, to check whether I had hit a bug. It came back with a workaround: put a cron job in place to delete the files periodically as needed. In my case I don't need the audit files under the ASM instances, so I put in a cron job to delete them every hour.
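For reference, the crontab entry on each node looks roughly like this (run as the Grid owner; the path and file pattern are from my environment, with +asm1_ora*.aud on node 1 and +asm2_ora*.aud on node 2, and the one-day retention is simply my choice):

# hourly cleanup of old ASM audit files (node 1 pattern shown)
0 * * * * find /u01/app/11.2/grid/rdbms/audit -name '+asm1_ora*.aud' -mtime +1 -exec rm {} \; >/dev/null 2>&1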
Well, maybe this only happens in my environment, but I hope it will help someone who hits the same issue.
My environment :
AIX 6.1
Oracle RAC 11gR2 (11.2.0.2) PSU2 - 2 node (non-shared home)
ASM
Extended Cluster