Missing an hourly snapshot while running AWR?

Have you ever tried to run an AWR report only to find that your hourly snapshots did not get created? A colleague of mine ran into this and worked with Oracle to identify a solution:

select snap_id,BEGIN_INTERVAL_TIME,END_INTERVAL_TIME 
from dba_hist_snapshot 
where BEGIN_INTERVAL_TIME > sysdate - 1 
and instance_number = 1 
order by 1; 

SNAP_ID BEGIN_INTERVAL_TIME END_INTERVAL_TIME 
---------- ------------------------------ ------------------------------ 
239690 2017-06-28-01.17.23.246000 2017-06-28-02.00.34.302000 
239691 2017-06-28-02.00.34.302000 2017-06-28-03.00.28.007000 
239692 2017-06-28-03.00.28.007000 2017-06-28-04.00.39.460000 
239693 2017-06-28-04.00.39.460000 2017-06-28-05.00.03.694000 
239694 2017-06-28-05.00.03.694000 2017-06-28-06.00.13.452000 <--missing 
239696 2017-06-28-07.00.10.342000 2017-06-28-08.00.20.270000 
239697 2017-06-28-08.00.20.270000 2017-06-28-09.00.09.222000 
239698 2017-06-28-09.00.09.222000 2017-06-28-10.00.39.143000 
239699 2017-06-28-10.00.39.143000 2017-06-28-11.00.55.633000 
239700 2017-06-28-11.00.55.633000 2017-06-28-12.00.02.801000 
239701 2017-06-28-12.00.02.801000 2017-06-28-13.00.12.972000 
239702 2017-06-28-13.00.12.972000 2017-06-28-19.21.20.515000 <--missing
12 rows selected.        
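
Gaps like the two above are easy to miss in a long listing. As a rough illustration (not part of the original fix), a short Python sketch can flag non-consecutive snap_ids; the sample values below mirror the output above:

```python
# Sketch: flag gaps in an AWR snapshot listing by looking for
# non-consecutive snap_ids. The sample snap_ids are taken from the
# listing above, where 239695 never got created.

def find_gaps(snap_ids):
    """Return pairs of adjacent snap_ids whose numbers are not consecutive."""
    gaps = []
    ordered = sorted(snap_ids)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > 1:
            gaps.append((prev, cur))
    return gaps

snap_ids = [239690, 239691, 239692, 239693, 239694, 239696, 239697]
print(find_gaps(snap_ids))  # [(239694, 239696)] -> snapshot 239695 is missing
```

A fuller version would also compare END_INTERVAL_TIME to the next BEGIN_INTERVAL_TIME, which would catch the second gap (13:00 to 19:21) even though the snap_ids there are consecutive.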

The issue was caused by MMON being suspended, as confirmed by this command:

oradebug unit_test kebm_dmp_slv_attrs kewrmafsa_ 
Status: 3 
Flags: 0 
Runtime limit: 900 
CPU time limit: 300 
Violations: 4 
Suspended until: 1498817730  <-- non-zero indicates MMON is suspended

To clear the suspension, issue this command:

oradebug unit_test kebm_set_slv_attrs kewrmafsa_ retain retain retain retain 0 0
Modified attributes of kewrmafsa_ (slave id 12)

And now it shows zero and snapshots are being generated again:

oradebug unit_test kebm_dmp_slv_attrs kewrmafsa_
Status: 3
Flags: 0
Runtime limit: 900
CPU time limit: 300
Violations: 0
Suspended until: 0  <-- zero indicates MMON is running normally

The reason for the missing snapshots was SYSAUX cleanup work which locked up some of the tables that snapshot generation needs. But it can happen for a variety of reasons.

Automate the move of a single instance Oracle database to a new EC2 instance

This week we were challenged to come up with a solution that would allow a database administrator to move a single-instance Oracle database from one AWS EC2 instance to another. We were tasked with building as much automation as possible into the process.

A colleague had already written the required CloudFormation template to create the new EC2 instance and an Ansible playbook to install Oracle on it. All of this was executed via Jenkins. The challenge now was to restore the database on the new EC2 instance. To automate the backup and restore part of the process, I created a Python script that accepted the following parameters:

  • Name of the database
  • Location of the script directory that contained maintenance scripts for the database
  • An AWS S3 bucket to be used for transporting the data

The Python script could be downloaded from an S3 bucket that we had set up as a code repository. Once the script had been downloaded onto the source server, the database administrator could run it with the above parameters. The script would then create the required backup and restore scripts, copy the crontab entries, and stage the maintenance scripts for transfer to the new server.

The backup script backs up the database via RMAN commands (the database, archive logs, spfile, and control file) and then uploads all of this to an S3 bucket.
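
A minimal sketch of how the generator might render the backup script from its inputs. The template, bucket name, and paths below are hypothetical; the post does not show the actual script contents:

```python
# Sketch: render an RMAN backup/upload script from the generator's inputs.
# The template text, bucket name, and /backup path are placeholders,
# not the actual script from the post.

BACKUP_TEMPLATE = """\
rman target / <<EOF
run {{
  backup database plus archivelog;
  backup spfile;
  backup current controlfile;
}}
EOF
aws s3 sync /backup/{db_name} s3://{s3_bucket}/{db_name}/
"""

def render_backup_script(db_name, script_dir, s3_bucket):
    """Fill in the backup/upload template for one database.
    script_dir is where the maintenance scripts would be staged;
    it is unused in this simplified sketch."""
    return BACKUP_TEMPLATE.format(db_name=db_name, s3_bucket=s3_bucket)

script = render_backup_script("ORCL", "/home/oracle/scripts", "my-transfer-bucket")
print(script)
```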

The next step would be to run the CloudFormation template (to create a new EC2 instance) and the Ansible playbook (to install Oracle, ASM, etc.). We added a parameter to the Ansible playbook to accept the name of the S3 bucket that holds the backups. At the end of the creation and install, the playbook would

  • Copy the script generated on the source server from the S3 bucket
  • Run the restore script which would
    • Download all the files from the S3 bucket
    • Create the adump directory
    • Create an spfile from the backup
    • Create a control file
    • Restore and recover the database
    • Register the database with the listener
    • Add an entry in /etc/oratab
    • Restore cron entries to the crontab
    • Copy the scripts from the source server to the target server into the same directory
    • Open the database in read/write mode
    • Log all of the above to a file for review

All of this worked as expected. The one weak link is the amount of time it takes to transfer the data to and from the S3 bucket. An alternative would be to FTP the data directly between the servers, but I wanted to try using an S3 bucket.
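
To put that weak link in numbers, a back-of-the-envelope calculation with assumed figures (not measurements from the actual migration):

```python
# Back-of-the-envelope S3 transfer time. Both figures are assumptions
# for illustration, not measurements from the actual migration.
backup_gb = 500           # assumed compressed backup size in GB
throughput_mb_s = 100     # assumed sustained S3 throughput in MB/s

seconds = backup_gb * 1024 / throughput_mb_s
print(round(seconds / 60))  # 85 -> roughly 85 minutes each way
```

At these assumptions the round trip (upload from the source, download on the target) adds close to three hours to the move, which is why the transfer time dominates the end-to-end duration.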

ORA-04031 in the middle of upgrade to 11.2.0.4

A colleague of mine was performing an upgrade of a four node Oracle RAC from 11.2.0.3 to 11.2.0.4 using DBUA in silent mode. About halfway through the upgrade the following error was reported and DBUA failed:

Upgrading Oracle interMedia 
UPGRADE_PROGRESS : 46% 
ORA-00604: error occurred at recursive SQL level 1 
ORA-04031: unable to allocate 4096 bytes of shared memory ("java

I suspected that the solution was to increase the parameters below and restart the upgrade.

ALTER SYSTEM SET SGA_TARGET=2048M scope=spfile sid='*';
ALTER SYSTEM SET pga_aggregate_target=500M  scope=spfile sid='*';
ALTER SYSTEM SET sga_max_size =2048M scope=spfile sid='*';

Also set cluster_database to false for the upgrade

alter system set cluster_database=false scope=spfile;

However, because this was a production system, I opened a severity one service request with Oracle to get their opinion before we continued. Oracle agreed with the above. The steps we followed were:

01.     Set Oracle home to 11.2.0.4
02.     Copy the init.ora file from the 11.2.0.3 home to the 11.2.0.4 home
        (the file contains 1 line pointing to the spfile in ASM)
03.     Some DB processes were running on node 4. Killed 
        these via Unix kill commands
04.     Startup no mount on node 4 via SQL (not srvctl)
05.     Make above alters
06.     Shutdown immediate
07.     Startup no mount on node 4 via SQL (not srvctl)
08.     Confirm above parameter changes were in place
09.     Shutdown immediate
10.     startup upgrade
11.     Exit this session
12.     cd to oracle home
13.     Start a new session as sqlplus " / as sysdba " 
14.     spool /tmp/upgrade.log 
15.     set echo on 
16.     @$ORACLE_HOME/rdbms/admin/catupgrd.sql; 
17.     spool off 
18.     Shutdown immediate (seemed to have been issued 
        in the catupgrd.sql)
19.     Check catupgrd.sql spool file for errors
20.     Restart the database in normal mode
21.     @$ORACLE_HOME/rdbms/admin/catuppst.sql; 
22.     @$ORACLE_HOME/rdbms/admin/utlrp.sql; 
23.     Execute dbupgdiag.sql

After the above was complete, we upgraded the Oracle Clusterware configuration:

srvctl upgrade database -d  -o 

Our time zone file was already current, so we did not have to upgrade it to the latest version.

After all of the above was complete, the database was ready to use. We had to complete some routine tasks such as turning off flashback, confirming that all scripts had been updated to use the new Oracle home, etc.

Refreshing a database from one LPAR/Subsystem to another

I am often tasked with cloning a production DB2 database from one LPAR/Subsystem into another LPAR/Subsystem. This has to be accomplished with minimum impact to the production database. To accomplish this, I created a process based on REXX that generates the required JCL. In summary, my approach is:

Generate a list of the latest image copy files for the production database via SQL/SPUFI

A REXX routine reads in the above list and generates one unload JCL per table space (we have 1 table per table space) that
– unloads the Image copy
– edits the punch file to change the load parameters
– submits MFT processes for the punch and data
– submits the respective load job on the target

The REXX also creates the required load JCL that has logic to
– display the status of the database before and after the load
– load the data
– repair any check pending state
– perform an image copy (there is a requirement to recover the table multiple times during the testing cycle)

The PDS containing the load JCL is copied to the target before any of the unloads start. As each unload finishes, it starts the transfer process for the data and punch files. After the transfer of the files is complete, the process automatically starts the load.

The process also contains logic to throttle the unload and load processes via the use of JES agents. LOB tables are handled via cursor loads.

Using this process I am able to copy approximately 600 tables containing approximately 2.25 billion rows of data (about 1 TB compressed) in about 10 hours without impacting the production system.
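
The generation step can be illustrated with a simplified Python sketch. The job card, dataset names, and control statements below are placeholders, not the real REXX output or our shop's JCL standards:

```python
# Simplified illustration of generating one unload job per table space
# from a list of image copies. The JOB card, DSN patterns, and utility
# control statements are placeholders, not the actual REXX-generated JCL.

JOB_TEMPLATE = """\
//UNLD{n:04d} JOB (ACCT),'UNLOAD {ts}',CLASS=A
//UNLOAD  EXEC PGM=DSNUTILB
//SYSIN   DD *
  UNLOAD TABLESPACE {db}.{ts} FROMCOPY {copy_dsn}
/*
"""

def generate_unload_jobs(db, image_copies):
    """image_copies: list of (tablespace, image_copy_dsn) pairs,
    e.g. the output of the SQL/SPUFI step. One job per table space,
    mirroring the one-table-per-table-space layout."""
    return [JOB_TEMPLATE.format(n=i + 1, db=db, ts=ts, copy_dsn=dsn)
            for i, (ts, dsn) in enumerate(image_copies)]

jobs = generate_unload_jobs("PRODDB", [("TS001", "PROD.IC.TS001.G0001V00")])
print(jobs[0])
```

The real process additionally edits the punch file, drives the MFT transfers, and submits the matching load job on the target, which is where most of the REXX logic lives.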

Upgrading an Oracle database with Oracle Wallet Manager

Today we upgraded a four node RAC from 11.2.0.3 to 11.2.0.4. Unfortunately, we did not know the database was using encryption and Oracle Wallet Manager. Users began to notice issues accessing the encrypted table. I ran the SQL below, and the status indicated there was an issue:

SQL> select wrl_type wallet,status,wrl_parameter wallet_location from v$encryption_wallet;
 
WALLET  STATUS              WALLET_LOCATION 
------  ------------------  ------------------------
file    OPEN_NO_MASTER_KEY /opt/oracle/product/cost

The solution was to:

Back up the sqlnet.ora file in the ora_home_11.2.0.4/network/admin directory
 
On each node of the RAC, copy the sqlnet.ora file 
from the ora_home_11.2.0.3/network/admin directory 
to   the ora_home_11.2.0.4/network/admin directory
 
Bounce the RAC with srvctl commands
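
The copy step, run on each node, can be sketched as follows. The home paths are stand-ins for the actual 11.2.0.3 and 11.2.0.4 homes:

```python
# Sketch: back up the 11.2.0.4 sqlnet.ora (if present) and replace it
# with the 11.2.0.3 copy. Home paths are placeholders; this would be
# run on every node of the RAC.
import shutil
from pathlib import Path

def carry_over_sqlnet(old_home, new_home):
    src = Path(old_home) / "network" / "admin" / "sqlnet.ora"
    dst = Path(new_home) / "network" / "admin" / "sqlnet.ora"
    if dst.exists():
        # preserve the 11.2.0.4 file before overwriting it
        shutil.copy2(dst, dst.with_suffix(".ora.bak"))
    shutil.copy2(src, dst)

# Example (hypothetical paths):
# carry_over_sqlnet("/opt/oracle/product/11.2.0.3",
#                   "/opt/oracle/product/11.2.0.4")
```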

I am not sure if a rolling bounce would have resolved the issue or not. We were in a hurry to get the application back into service and hence bounced the database.

Oracle – Performing silent patching

As part of the process to automate the creation of an Oracle AMI based on an Amazon Linux AMI, I had to create a YAML playbook that would install Oracle and patch it. Since this is being done via Jenkins/Ansible/YAML, all the patching had to be performed in silent mode. In order to run opatch in silent mode, you need a response file. This is created as follows:

Change to the directory in which you want the response file created

cd /tmp

Invoke the command to create the file

$ORACLE_HOME/OPatch/ocm/bin/emocmrsp

You will be prompted as follows:

OCM Installation Response Generator 10.3.7.0.0 - Production
Copyright (c) 2005, 2012, Oracle and/or its affiliates.  All rights reserved.
 
Provide your email address to be informed of security issues, install and
initiate Oracle Configuration Manager. Easier for you if you use your My
Oracle Support Email address/User Name.
Visit http://www.oracle.com/support/policies.html for details.
Email address/User Name: 
 
You have not provided an email address for notification of security issues.
Do you wish to remain uninformed of security issues ([Y]es, [N]o) [N]:  y
The OCM configuration response file (ocm.rsp) was successfully created.

A file named ocm.rsp will be created in /tmp.

You can use this file to perform silent patching. As an example, I am applying patch 16188701:

cd /software/11204/11204_160416_RDBMS_ONEOFF/03_parallel/16188701
opatch apply -local -silent -ocmrf /software/ocm.rsp
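
In an Ansible-driven flow, the same cd/opatch pair gets emitted once per one-off patch. A small sketch of that generation step (the patch list shown is just the example patch above):

```python
# Sketch: emit the cd/opatch command pairs for a list of one-off patch
# directories, as an Ansible task or wrapper script might. opatch is
# not actually invoked here; this only builds the command lines.

def silent_patch_commands(patch_dirs, ocm_rsp):
    """Return the shell lines to apply each patch locally and silently."""
    lines = []
    for patch_dir in patch_dirs:
        lines.append(f"cd {patch_dir}")
        lines.append(f"opatch apply -local -silent -ocmrf {ocm_rsp}")
    return lines

cmds = silent_patch_commands(
    ["/software/11204/11204_160416_RDBMS_ONEOFF/03_parallel/16188701"],
    "/software/ocm.rsp")
print("\n".join(cmds))
```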