The recovery I did not want to do…but ended up doing anyway! (Oracle Point in Time Recovery)

For many years now, I have been secretly dreading ever having to recover one of the databases that I support, as it has an allocated space of 3.1 TB and a used space of 1.3 TB. However, today the day finally arrived: a member of the application team ran a script that accidentally deleted data from multiple tables. The impact of the deletes was made worse by cascading RI. Bottom line, we had to recover the database to a point in time before the bad script was executed.

We checked the log history with the below SQL:

alter session set nls_date_format = 'dd-MON-yy hh24.mi.ss';

select * from V$LOG_HISTORY 
where first_time > to_date('15-OCT-18 18.00','dd-MON-yy hh24.mi')
order by first_time, thread#
;

The database was running on a 4-node RAC. We shut down the database with:

srvctl stop database -d PRODDB -o immediate

And confirmed that the database was down with:

srvctl status database -d PRODDB

Once the database was down, I started the instance on node 4 of the RAC with:

STARTUP MOUNT;

And began the recovery with the below script (run with nohup, as it would take about 4 hours to restore a database of this size):

#!/bin/ksh
PATH=$PATH:/oracle/home/product/bin
export ORACLE_SID=PRODDB4
export ORACLE_HOME=/oracle/home/product
rman <<EOF
connect target /
run
{
    ALLOCATE CHANNEL disk1 DEVICE TYPE DISK FORMAT '/backup02/PRODDB/backup';
    ALLOCATE CHANNEL disk2 DEVICE TYPE DISK FORMAT '/backup02/PRODDB/backup';
    ALLOCATE CHANNEL disk3 DEVICE TYPE DISK FORMAT '/backup03/PRODDB/backup';
    ALLOCATE CHANNEL disk4 DEVICE TYPE DISK FORMAT '/backup03/PRODDB/backup';
    SET UNTIL TIME "TO_DATE('15-OCT-2018 19:00:00','DD-MON-YYYY  HH24:MI:SS')"; 
    restore database;
    recover database;
    release channel disk1;
    release channel disk2;
    release channel disk3;
    release channel disk4;
}
 
exit;
EOF
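The script was launched with nohup so that it would keep running if the session disconnected; a typical invocation (the script name here is an assumption) is:

nohup ./restore_proddb.ksh &
tail -f nohup.out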

The nohup.out will contain output such as:

channel disk3: restored backup piece 1
channel disk3: restore complete, elapsed time: 00:07:35
channel disk3: starting datafile backup set restore
channel disk3: specifying datafile(s) to restore from backup set
channel disk3: restoring datafile 00028 to +ASM/proddb/perfx_08.dbf
channel disk3: restoring datafile 00058 to +ASM/proddb/perfd_24.dbf
channel disk3: reading from backup piece /backup03/PRODDB/backup/PRODDB20181015cutfmu04_1_1.bus
channel disk2: piece handle=/backup02/PRODDB/backup/PRODDB20181015cstfmtrd_1_1.bus tag=HOTBKUPFULL
channel disk2: restored backup piece 1
channel disk2: restore complete, elapsed time: 00:07:45
channel disk2: starting datafile backup set restore
channel disk2: specifying datafile(s) to restore from backup set
channel disk2: restoring datafile 00005 to +ASM/proddb/prefd_20.dbf
channel disk2: restoring datafile 00025 to +ASM/proddb/perfx_05.dbf
channel disk2: reading from backup piece /backup03/PRODDB/backup/PRODDB20181015cvtfmu2q_1_1.bus
channel disk4: piece handle=/backup03/PRODDB/backup/PRODDB20181015cktfmtg5_1_1.bus tag=HOTBKUPFULL

You can also check the status with:

select inst_id, sid, round((elapsed_seconds/60),0) as elap, round((time_remaining/60),0) as remain, message 
from gv$session_longops 
where time_remaining > 0 
and target_desc = 'restore'
order by start_time asc
;

If you want to see what is occurring on each data file, you can comment out the “AND” as shown below:

select inst_id, sid, round((elapsed_seconds/60),0) as elap, round((time_remaining/60),0) as remain, message 
from gv$session_longops 
where time_remaining > 0 
--and target_desc = 'restore'
order by start_time asc
;

In the alert log, you should see messages like:

Tue Oct 16 02:19:39 2018
Incomplete Recovery applied until change 141794746611 time 10/15/2018 19:00:00
Media Recovery Complete (PRODDB4)
Completed: alter database recover logfile '+ASM/proddb/archivelog/4_20404_972607420.arc'
Tue Oct 16 02:22:36 2018

After the recovery is complete, open the database with a RESETLOGS:

alter database open resetlogs;

The alert log will show messages such as:

RESETLOGS after incomplete recovery UNTIL CHANGE 141794757937583
Archived Log entry 164482 added for thread 1 sequence 20270 ID 0x4d158c31 dest 1:
Archived Log entry 164483 added for thread 2 sequence 20003 ID 0x4d158c31 dest 1:
Archived Log entry 164484 added for thread 3 sequence 20643 ID 0x4d158c31 dest 1:
Archived Log entry 164485 added for thread 4 sequence 20420 ID 0x4d158c31 dest 1:
Resetting resetlogs activation ID 1293257777 (0x4d158c31)

The resetlogs leaves this single instance open in read write mode. Since this is a RAC, I issued a

shutdown immediate;

and restarted with SRVCTL

srvctl start database -d PRODDB

and monitored the alert logs to see a successful start of all nodes.

Important: Take a backup of the database as soon as possible, as you no longer have a recovery point after the resetlogs.
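A minimal RMAN sketch of such a backup, assuming the same backup destination as the restore script above (the channel count and format are placeholders, not the exact commands we ran):

rman target /
run
{
    ALLOCATE CHANNEL disk1 DEVICE TYPE DISK FORMAT '/backup02/PRODDB/backup/%U';
    BACKUP DATABASE PLUS ARCHIVELOG;
    RELEASE CHANNEL disk1;
}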

Oracle 12c – Copy password file in ASM on primary to ASM on physical standby

Prior to Oracle 12c, the password file was stored in the ORACLE_HOME/dbs directory. In a RAC environment (without ACFS), we would copy this file to each node of the RAC. In the case of a RAC standby, this file would also be copied to all the nodes of the standby RAC. 12c can now store this file in ASM, which means we only need one copy of the file per RAC cluster.

If you display the database configuration with command

srvctl config database -d CUSTPR

one of the displayed parameters will be the location of the file in ASM:

Password file: +DG_ORA_DATA/CUSTPR/orapwCUSTPR

Despite the above file name, the actual physical file is in a different location, pointed to by an alias such as:

ASMCMD> ls -l +DG_ORA_DATA/CUSTPR/orapwCUSTPR
Type      Redund  Striped  Time             Sys  Name
PASSWORD  UNPROT  COARSE   SEP 28 17:00:00  N    orapwCUSTPR => +DG_ORA_DATA/CUSTPR/PASSWORD/pwdCUSTPR.347.967084532

Storing this file in ASM adds a small complexity when moving the file to the RAC that houses the physical (Data Guard) standby, as we can no longer just FTP the file directly. The file has to be copied from ASM to a regular unix file system, transferred to the destination file system, and then put back into ASM. This is accomplished with the ASMCMD pwcopy command.

On the primary RAC, copy the file to a unix file system with the below command, issued from within ASMCMD:

pwcopy +DG_ORA_DATA/CUSTPR/orapwCUSTPR /oracle_home/prod/12/db_1/dbs/orapwCUSTDR

Note that the destination file has been renamed to the file (database) name required on the standby: orapwCUSTPR becomes orapwCUSTDR.

After the file has been extracted from ASM, transfer it with any method (SFTP, SCP, etc.) to a destination unix folder. After it is transferred, copy it into ASM with the below command, again issued from within ASMCMD:

pwcopy /oracle_home/prod/12/db_1/dbs/orapwCUSTDR +DG_UIIOMPDCLU33_ORA1/A7VZRPR/PASSWORD/orapwCUSTDR

As a precautionary measure, I shut down the standby RAC with the appropriate SRVCTL commands. After the above copy was completed, I restarted the standby and tested connectivity from the primary with

sqlplus sys@CUSTDR as sysdba
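For completeness, the stop/start of the standby RAC used the standard SRVCTL commands, along the lines of (the database name is assumed to be CUSTDR):

srvctl stop database -d CUSTDR -o immediate
srvctl start database -d CUSTDR
srvctl status database -d CUSTDR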

Excessive growth in SYSAUX tablespace

The SYSAUX tablespace on one of our production RACs had grown to over 28 GB and was continuing to generate space alarms. I ran the below report:

@?/rdbms/admin/awrinfo.sql

which indicated that the space was being used by “SM/AWR”:

########################################################
(I) AWR Snapshots Information
########################################################

*****************************************************
(1a) SYSAUX usage - Schema breakdown (dba_segments)
*****************************************************
|
| Total SYSAUX size                     29,967.0 MB ( AUTOEXTEND ON for 1 out of 2 files )
| Fixed limit                              499.0 MB
| Auto Extent limit                     32,768.0 MB
|
| Schema  SYS          occupies         29,684.9 MB (  99.1% )
| Schema  SYSTEM       occupies             94.2 MB (   0.3% )
|
********************************************************
(1b) SYSAUX occupants space usage (v$sysaux_occupants)
********************************************************
|
| Occupant Name        Schema Name               Space Usage
| -------------------- -------------------- ----------------
| SM/AWR               SYS                       28,805.8 MB
| SM/OPTSTAT           SYS                          522.6 MB
| SM/ADVISOR           SYS                          234.1 MB
| LOGMNR               SYSTEM                        92.8 MB

Since we were keeping the AWR data for 30 days, we tried purging a few days at a time to see if that would make a difference with the following SQL:

exec DBMS_STATS.PURGE_STATS(SYSDATE-29);

But that did not seem to reduce the space used. Purging everything did not make any difference either:

exec DBMS_STATS.PURGE_STATS(DBMS_STATS.PURGE_ALL)
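Note that DBMS_STATS.PURGE_STATS purges the optimizer statistics history (the SM/OPTSTAT occupant) rather than the AWR snapshots themselves. The documented way to drop AWR snapshots is DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE; a minimal sketch, with placeholder snapshot IDs:

-- find the range of snapshots currently held
select min(snap_id), max(snap_id) from dba_hist_snapshot;

-- drop a range of snapshots (1000 and 1100 are placeholders)
exec DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE(low_snap_id => 1000, high_snap_id => 1100);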

I suspect that we were encountering the issue documented in “Excessive AWR Growth From Partitioned Objects Such as SYS.WRH$_EVENT_HISTOGRAM Causing Sysaux to Grow (Doc ID 1912201.1)”.

I then decided to look at the sizes of the individual objects with the SQL:

select sum(bytes/1024/1024) Mb, segment_name,segment_type from dba_segments
where  tablespace_name = 'SYSAUX'
group by segment_name,segment_type order by 1 desc
;

This indicated that objects such as WRH$_EVENT_HISTOGRAM, WRH$_LATCH etc. were using up a lot of space.

Before you proceed with the following, please understand that you will be losing AWR data that could be needed for performance tuning, diagnostics, etc.

Based on the output from the above SQL, I started truncating the largest objects (tables) with SQL such as:

truncate table WRH$_EVENT_HISTOGRAM;
truncate table WRH$_LATCH;
truncate table WRH$_SQLSTAT;
truncate table WRH$_SYSSTAT;
truncate table WRH$_ACTIVE_SESSION_HISTORY;

After this completed, the space utilized was significantly reduced:

(1a) SYSAUX usage - Schema breakdown (dba_segments)
*****************************************************
|
| Total SYSAUX size                     12,036.6 MB ( AUTOEXTEND ON for 1 out of 2 files )
| Fixed limit                              499.0 MB
| Auto Extent limit                     32,768.0 MB
|
| Schema  SYS          occupies         11,754.4 MB (  97.7% )
| Schema  SYSTEM       occupies             94.2 MB (   0.8% )
|
********************************************************
(1b) SYSAUX occupants space usage (v$sysaux_occupants)
********************************************************
|
| Occupant Name        Schema Name               Space Usage
| -------------------- -------------------- ----------------
| SM/AWR               SYS                       11,394.5 MB
| SM/ADVISOR           SYS                          234.1 MB
| LOGMNR               SYSTEM                        92.8 MB

ORA-28374: typed master key not found in wallet with Oracle 12c

A user reported the below error:

Select * from users;
ORA-28374: typed master key not found in wallet

gv$encryption_wallet indicated that the wallet was open and backed up on both nodes. It turned out that we had two copies of the wallet files in different locations. In order to resolve the issue, we
- stopped the RAC
- ensured that the wallet files matched and were in the correct location on both nodes
- updated the sqlnet.ora on both nodes to point to the correct wallet location (see the sample entry below)
and restarted the RAC. After this, the user was able to access the USERS table.
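For reference, a typical wallet location entry in sqlnet.ora looks like the below (the directory path is an assumption; it must point to the same wallet files on every node):

# the directory below is a hypothetical example; point it at the actual wallet location
ENCRYPTION_WALLET_LOCATION =
  (SOURCE =
    (METHOD = FILE)
    (METHOD_DATA =
      (DIRECTORY = /oracle/admin/PRODDB/wallet)))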

ORA-04091: table is mutating, trigger/function may not see it

A colleague of mine ran into the annoying mutating table error and worked around it as follows. The requirement was to update TABLE_A with data from TABLE_B. There is no relationship between the two tables other than that rows are inserted into both during the same unit of work.

Original trigger:

CREATE OR REPLACE TRIGGER TABLE_A_TRIGGER_01
before INSERT ON TABLE_A 
FOR EACH ROW 
begin 
update TABLE_A  set customer_id = 
(select customer_id from TABLE_B
where master_key = :new.master_key);
END; 
/

This resulted in error:

insert into TABLE_A select * from TABLE_A_TEST_DATA
            *
ERROR at line 1:
ORA-04091: table TABLE_A is mutating, trigger/function may not see it
ORA-06512: at "TABLE_A_TRIGGER_01", line 2
ORA-04088: error during execution of trigger 'TABLE_A_TRIGGER_01'

The workaround was to read TABLE_B directly (which is not the mutating table) and assign the value to the column in the new row:

CREATE OR REPLACE TRIGGER TABLE_A_TRIGGER_01
BEFORE INSERT ON TABLE_A
FOR EACH ROW
DECLARE
 V_CUSTOMER_ID number(9);
 V_MASTER_KEY varchar2(16);
begin
V_MASTER_KEY := :new.master_key;
select h.customer_id INTO V_CUSTOMER_ID from TABLE_B h
where h.master_key = V_MASTER_KEY;
:new.customer_id := V_CUSTOMER_ID;
END;
/
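One caveat that is not part of the original fix: if TABLE_B might not contain a matching row, the SELECT INTO will raise NO_DATA_FOUND and the insert will fail. A hedged variation that allows for this (defaulting the column to NULL is an assumption) would be:

CREATE OR REPLACE TRIGGER TABLE_A_TRIGGER_01
BEFORE INSERT ON TABLE_A
FOR EACH ROW
BEGIN
  SELECT h.customer_id
    INTO :new.customer_id
    FROM TABLE_B h
   WHERE h.master_key = :new.master_key;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    :new.customer_id := NULL;  -- assumption: a missing TABLE_B row leaves the column null
END;
/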

ORA-16139 Physical standby with a switchover status of “NOT ALLOWED”

A colleague requested assistance with an issue he encountered while attempting to perform a switchover from a physical standby to a primary. The environment was two 4-node RACs, each running Oracle 11.2.0.4. Both RACs had been started as single-node databases before the switchover was attempted.

The status on the standby (being switched to a primary) was as follows:

SELECT database_role, switchover_status FROM v$database;

DATABASE_ROLE    SWITCHOVER_STATUS
PHYSICAL STANDBY NOT ALLOWED

The alert log indicated that not all logs had been applied to this standby, which prevented the switchover to primary:

Database not available for switchover
  End-Of-REDO archived log file has not been recovered
  Incomplete recovery SCN:0:115789224 archive SCN:0:115809474
Database not available for switchover
  End-Of-REDO archived log file has not been recovered
  Incomplete recovery SCN:0:115789224 archive SCN:0:115809474
Switchover: Media recovery required - standby not in limbo
ORA-16139 signalled during: ALTER DATABASE COMMIT TO SWITCHOVER TO PRIMARY WITH SESSION SHUTDOWN...

At this point we brought up the database as a standby with the below commands:

ALTER DATABASE MOUNT STANDBY DATABASE;

alter database recover managed standby database finish force; 

This caused the last archive logs from the primary to be applied and the following to be displayed in the alert log:

Attempt to do a Terminal Recovery (BOSTON1)
Media Recovery Start: Managed Standby Recovery (BOSTON1)
 started logmerger process
Wed Jul 25 22:24:53 2018
Managed Standby Recovery not using Real Time Apply
Parallel Media Recovery started with 64 slaves
Wed Jul 25 22:24:54 2018
Media Recovery Log +DG/BOSTON/1_4912_948659911.arc
Identified End-Of-Redo (switchover) for thread 1 sequence 4912 at SCN 0x0.6e71cc2
Resetting standby activation ID 2219145217 (0x84457801)
Media Recovery End-Of-Redo indicator encountered
Media Recovery Applied through change 115809474
Media Recovery Complete:lter database recover managed standby database finish force End-Of-REDO (BOSTON1)
Attempt to set limbo arscn 0:115809474 irscn 0:115809474
Completed: alter database recover managed standby database finish force

After the above, we issued:

alter database commit to switchover to primary with session shutdown;

And the alert log indicated that it was successful:

alter database commit to switchover to primary with session shutdown
ALTER DATABASE SWITCHOVER TO PRIMARY (BOSTON1)
Maximum wait for role transition is 15 minutes.
Backup controlfile written to trace file /opt/oracle/product/diag/rdbms/BOSTON/BOSTON1/trace/BOSTON1_ora_45373.trc
SwitchOver after complete recovery through change 115809474
Online log +DG/BOSTON/redo01.log: Thread 1 Group 1 was previously cleared
Online log +DG/BOSTON/redo02.log: Thread 1 Group 2 was previously cleared
Online log +DG/BOSTON/redo03.log: Thread 2 Group 3 was previously cleared
Online log +DG/BOSTON/redo04.log: Thread 2 Group 4 was previously cleared
Online log +DG/BOSTON/redo05.log: Thread 3 Group 5 was previously cleared
Online log +DG/BOSTON/redo06.log: Thread 3 Group 6 was previously cleared
Online log +DG/BOSTON/redo07.log: Thread 4 Group 7 was previously cleared
Online log +DG/BOSTON/redo08.log: Thread 4 Group 8 was previously cleared
Standby became primary SCN: 115809472
Switchover: Complete - Database mounted as primary
Completed: alter database commit to switchover to primary with session shutdown

At this point the database was in the below mode and switchover status:

SELECT open_mode, database_role, switchover_status FROM v$database;

OPEN_MODE            DATABASE_ROLE    SWITCHOVER_STATUS
MOUNTED              PRIMARY          NOT ALLOWED

We then enabled log_archive_dest_state_2, as this database was now the primary, and bounced it with the below commands:

shutdown immediate;

startup;
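For reference, enabling the remote archive destination is normally done with a command along these lines (the scope and sid clauses are assumptions for a RAC environment):

alter system set log_archive_dest_state_2=enable scope=both sid='*';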

The database was now in the below mode and switchover status:

SELECT open_mode, database_role, switchover_status FROM v$database;

OPEN_MODE            DATABASE_ROLE    SWITCHOVER_STATUS
READ WRITE           PRIMARY          RESOLVABLE GAP

After a couple of minutes, the status changed to:

SELECT open_mode, database_role, switchover_status FROM v$database;

OPEN_MODE            DATABASE_ROLE    SWITCHOVER_STATUS
READ WRITE           PRIMARY          SESSIONS ACTIVE

The standby status was as follows:

SELECT open_mode, database_role, switchover_status FROM v$database;

OPEN_MODE            DATABASE_ROLE    SWITCHOVER_STATUS
MOUNTED              PHYSICAL STANDBY NOT ALLOWED

This is an expected status, as per “Physical Standby Switchover_status Showing Not Allowed. (Doc ID 1392763.1)”. After the command to switch roles is issued on the primary, the primary generates a special marker called EOR (end-of-redo) that is placed in the header of the online redo log sequence. This online redo log sequence is archived locally and sent to all standby databases. Only after the standby receives and applies the EOR does v$database.switchover_status change from “not allowed” to “to primary” or “sessions active”.

We cut a few logs on the new primary and checked to ensure that they were applied on the standby. We then shut down the standby and brought it up as a 4 node RAC. Then we shut down the primary and brought it up as a 4 node RAC.

ORA-10458, ORA-01196, ORA-01110 errors on a physical standby

This morning we had an issue with a physical standby server. The first notification was that log apply was falling behind. It turned out that the standby server had been rebooted and neither the listener nor the database had been restarted. When we started the database, we saw the below errors in the alert log:

Standby crash recovery aborted due to error 16016.
Errors in file /opt/app/oracle/diag/rdbms/dr/PRODDR/trace/PRODDR_ora_21598.trc:
ORA-16016: archived log for thread 1 sequence# 16661 unavailable
Recovery interrupted!
Some recovered datafiles maybe left media fuzzy
Media recovery may continue but open resetlogs may fail
Completed standby crash recovery.
Errors in file /opt/app/oracle/diag/rdbms/dr/PRODDR/trace/PRODDR_ora_21598.trc:
ORA-10458: standby database requires recovery
ORA-01196: file 1 is inconsistent due to a failed media recovery session
ORA-01110: data file 1: '/appl_ware/oradata/data/PRODDR/system01.dbf'
ORA-10458 signalled during: ALTER DATABASE OPEN...

A lot of posts on Google indicated that a restore and recovery was needed. Before going down that path, we decided to try starting MRP to see if that would resolve the issue. We issued

ALTER DATABASE MOUNT STANDBY DATABASE;

followed by the command to start MRP

alter database recover managed standby database using current logfile disconnect from session;

and the problem was resolved.
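Once MRP was running, a quick way to confirm that the standby was catching up is a query such as the below against v$archived_log on the standby (not the exact SQL we used):

select thread#, max(sequence#) as last_applied
from v$archived_log
where applied = 'YES'
group by thread#
order by thread#
;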

ORA-15032, ORA-15017 and ORA-15040 issues with an ASM disk group

While creating a new Oracle database with ASM on an AWS EC2 platform I encountered the below errors while attempting to mount one of the ASM disk groups:

ALTER DISKGROUP VOTE_DISK mount
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15017: diskgroup "VOTE_DISK" cannot be mounted
ORA-15040: diskgroup is incomplete

When I checked the v$asm_diskgroup, I saw

SQL> select NAME,TOTAL_MB,FREE_MB from v$asm_diskgroup;  

NAME                             TOTAL_MB    FREE_MB
------------------------------ ---------- ----------
ASMDATA1                            716796     716726
DBVOT                                    0          0
ASMDATA2                            716796     716726
ASMARCH                             716796     716726

The goal was to create the VOTE_DISK disk group on the DBVOT volume, which was in turn mapped to the /dev/bqdx disk. The AWS Console indicated that /dev/bqdx was attached to the EC2 instance but it was not visible in the OS for some reason. I issued the below command to partition the disk:

/usr/bin/echo -e "o\nn\np\n1\n\n\nt\n8e\nw" | /usr/sbin/fdisk /dev/bqdx

Which resulted in:

Welcome to fdisk (util-linux 2.23.2).

Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.


Command (m for help): Building a new DOS disklabel with disk identifier 0x29ba6b8d.

Command (m for help): Partition type:
   p   primary (0 primary, 0 extended, 4 free)
   e   extended
Select (default p): Partition number (1-4, default 1): First sector (2048-104857599, default 2048): Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-104857599, default 104857599): Using default value 104857599
Partition 1 of type Linux and of size 50 GiB is set

Command (m for help): Selected partition 1
Hex code (type L to list all codes): Changed type of partition 'Linux' to 'Linux LVM'

Command (m for help): The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.

However, when I attempted to create the disk I got the error:

root@10-118-134-71:/root # /usr/sbin/oracleasm createdisk DBVOT /dev/bqdx1
Device "/dev/bqdx1" is already labeled for ASM disk "DBVOT"

I then attempted to drop the disk group

SQL> drop diskgroup DBVOT;
drop diskgroup DBVOT
*
ERROR at line 1:
ORA-15039: diskgroup not dropped
ORA-15001: diskgroup "DBVOT" does not exist or is not mounted

I then tried with the FORCE option and succeeded

SQL> drop diskgroup DBVOT force including contents;

Diskgroup dropped.

I then deleted the disk

root@10-118-134-71:/root # /usr/sbin/oracleasm deletedisk DBVOT 
Clearing disk header: done
Dropping disk: done

and checked to make sure it was no longer visible

[oracle@10-118-134-71 ~]$ /usr/sbin/oracleasm listdisks
DB1
DG10
DG11
DG12
DG2
DG3
DG4
DG5
DG6
DG7
DG8
DG9

Then I recreated the disk

root@10-118-134-71:/root # /usr/sbin/oracleasm createdisk DBVOT /dev/bqdx1
Writing disk header: done
Instantiating disk: done

and listed the disks to confirm

root@10-118-134-71:/root # /usr/sbin/oracleasm listdisks
DG1
DG10
DG11
DG12
DG2
DG3
DG4
DG5
DG6
DG7
DG8
DG9
DBVOT

Then I created the disk group in ASM:

SQL> CREATE DISKGROUP DBVOT EXTERNAL REDUNDANCY DISK '/dev/oracleasm/disks/DBVOT';

Diskgroup created.
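As a final check, re-querying v$asm_diskgroup from the ASM instance should now show the new disk group with its space allocated; something like:

select name, state, total_mb, free_mb
from v$asm_diskgroup
order by name
;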

TNS-12535, ns secondary err code: 12560 and users unable to log on

This afternoon the users of one of my applications reported that they were unable to sign on using a particular user account. The user account in question was open and unexpired. The users were able to sign on to other accounts without any issues. The alert logs on all nodes of the RAC were being flooded with messages such as:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.4.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 15-JUN-2018 19:17:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505
    
TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=xxx.xxx.xxx.xx)(PORT=nnn))
Fri Jun 15 19:18:37 2018

Despite bouncing all nodes of the RAC, flushing the shared pool, etc., the users were still not able to connect with this particular account, although existing connections using the account continued to work.

After some false leads, a colleague noticed that there were a number of library cache locks with the user account set to “null”. These were being generated by a set of new servers that the application was attempting to configure using the user account that was experiencing the issue. After these new servers were shut down and the database bounced, normal functionality was restored.
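A query along these lines (not the exact SQL we used) will show the sessions waiting on library cache locks and their blockers:

select inst_id, sid, username, event, blocking_instance, blocking_session
from gv$session
where event like 'library cache%'
order by inst_id, sid
;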

Applying an incremental copy to make a physical standby current

One of our standby RACs had fallen behind in the log apply process:

DESTINATION           THREAD#    APPLIED   NOTAPPLD      DELTA
------------------ ---------- ---------- ---------- ----------
1 - From Primary            1       1187       1231         44
                            2       1520       1572         52
******************            ---------- ---------- ----------
Hash                                2707       2803         96

MRP was waiting on the below log sequence:

   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK#
---------- --------- ------------ ---------- ---------- ----------
         2 MRP0      WAIT_FOR_GAP          2       1521          0
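The above output came from a query along the lines of:

select inst_id, process, status, thread#, sequence#, block#
from gv$managed_standby
where process like 'MRP%'
;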

This log file was no longer available on the primary, and hence I had to use an incremental copy to get the standby in sync with the primary. At this point I cancelled MRP on the standby, shut down the RAC via SRVCTL and started node 2 as a standalone in mount mode.

On the primary, I identified the first change number (SCN) of the log just before the sequence that the standby had reported as missing.
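A query along these lines returns it (the thread and sequence numbers are taken from the gap output above):

select thread#, sequence#, first_change#
from v$archived_log
where thread# = 2
and sequence# = 1520   -- the sequence just before the gap reported by MRP
order by first_change#
;

Using this change number, I created an incremental backup: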

backup incremental from scn 14521210213 database format '/backup/dean/backup_%U.bak';

I also created a backup of the control file:

backup current controlfile for standby format '/backup/dean/stnd_ctl_file.ctl';

These files were transferred to the standby via FTP commands.

After the files are transferred to the standby, connect with SQL*Plus and stop MRP with

alter database recover managed standby database cancel;

Stop all nodes of the RAC with

srvctl status database -d PRODDB
srvctl stop database -d PRODDB -o immediate

Bring up the database as a single instance in SQL*Plus

startup nomount;
Alter database mount standby database;

Exit from SQL*Plus, start RMAN, and catalog the directory to which the backups were transferred:

catalog start with '/backup/dean';

Then recover the database with:

recover database noredo;

After the recovery was complete, I restored the standby control file:

shutdown immediate;
startup nomount;
restore standby controlfile from '/backup/dean/stnd_ctl_file.ctl';

Since the data file locations on the standby were different from those on the primary, the standby control file has to be updated. Start the database in mount mode:

alter database mount standby database;

At this point, the instance thinks the data file locations are those of the primary, because the restored control file came from the primary. Switch the data file locations in RMAN with the below commands (the catalog command should point to the data file location on the standby):

catalog start with '+DG_CLU_ORA1/D99Xsn';
switch database to copy;

In my experience, the “switch database to copy;” will switch the undo, sysaux, users and system data files successfully. The others (i.e., application-specific data files) tend to fail with the below error:

RMAN> switch database to copy;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 08/27/2018 17:40:53
RMAN-06571: datafile 41 does not have recoverable copy

Use REPORT SCHEMA to check whether the data file exists on the standby:

RMAN>  report schema;

RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
Report of database schema for database with db_unique_name IPTSDV

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    0        SYSTEM               ***     +DG_CLU_ORA1/D99Xsn/system.259.841589723
2    0        SYSAUX               ***     +DG_CLU_ORA1/D99Xsn/sysaux.260.841589731
.
.
.
41   1024     emp_stats_DAT_001_DC  ***     +DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989

If the file does exist, catalog it with:

RMAN> catalog datafilecopy '+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989';

cataloged datafile copy
datafile copy file name=+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc RECID=197 STAMP=985283856

The switch database to copy should work after that:

RMAN> switch database to copy;

datafile 1 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/system.1853.983049983"
datafile 2 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/sysaux.1793.983050037"
datafile 3 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/undotbs1.1805.983050783"
.
.
.
datafile 41 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989"

After this, I started MRP on the last node with:

alter database recover managed standby database using current logfile disconnect from session;

The standby database was now current with the primary

DESTINATION           THREAD#    APPLIED   NOTAPPLD      DELTA
------------------ ---------- ---------- ---------- ----------
1 - From Primary            1       1233       1234          1
                            2       1575       1575          0
******************            ---------- ---------- ----------
Hash                                2808       2809          1

To make sure everything was working, I cut a log on the primary and monitored to ensure that it was applied on the standby.

Another issue I faced was that the log files were pointing to the primary’s log file location. This can be changed by running the output of the below SQL:

set line 200
select 'ALTER DATABASE RENAME FILE ' || '''' || member || '''' || ' to ' || '''' || replace(member,'primary_log_location','standby_log_location') || '''' || ';'
from v$logfile where member like '%primary db name%';

Output will look like:

ALTER DATABASE RENAME FILE '+DG1/primary/ONLINELOG/group_2_a.log' to '+DG3/standby/ONLINELOG/group_2_a.log';