The recovery I did not want to do…but ended up doing anyway! (Oracle Point in Time Recovery)

For many years now, I have been secretly dreading to ever have to recover one of the databases that I support as it has allocated space of 3.1 TB and used space of 1.3 TB. However, today the day finally arrived when a member of the application team ran a script that accidently deleted data from multiple tables. The impact of the deletes was made worse by cascading RI. Bottom line, we had to recover the database to a moment in time before the bad script was executed.

We checked the log history with the below SQL:

alter session set nls_date_format = 'dd-MON-yy hh24.mi.ss';

select * from V$LOG_HISTORY 
where first_time > to_date('15-OCT-18 18.00','dd-MON-yy hh24.mi')
order by first_time, thread#
;

The database was running on a 4 node RAC. We shut down the RAC with:

srvctl stop database –d PRODDB –o immediate

And confirmed that the database was down with:

srvctl status database –d PRODDB

When the database was down, started node 4 of the RAC up with:

STARTUP MOUNT;

And began the recovery with the below script (run as nohup as it would take about 4 hours to restore a database of this size)

#!/bin/ksh
PATH=$PATH:/oracle/home/product/bin
export ORACLE_SID=PRODDB4
export ORACLE_HOME=/oracle/home/product
rman <<EOF
connect target /
run
{
    ALLOCATE CHANNEL disk1 DEVICE TYPE DISK FORMAT '/backup02/PRODDB/backup';
    ALLOCATE CHANNEL disk2 DEVICE TYPE DISK FORMAT '/backup02/PRODDB/backup';
    ALLOCATE CHANNEL disk3 DEVICE TYPE DISK FORMAT '/backup03/PRODDB/backup';
    ALLOCATE CHANNEL disk4 DEVICE TYPE DISK FORMAT '/backup03/PRODDB/backup';
    SET UNTIL TIME "TO_DATE('15-OCT-2018 19:00:00','DD-MON-YYYY  HH24:MI:SS')"; 
    restore database;
    recover database;
    release channel disk1;
    release channel disk2;
    release channel disk3;
    release channel disk4;
}
 
exit;
EOF

The nohup.out will contain output such as:

channel disk3: restored backup piece 1
channel disk3: restore complete, elapsed time: 00:07:35
channel disk3: starting datafile backup set restore
channel disk3: specifying datafile(s) to restore from backup set
channel disk3: restoring datafile 00028 to +ASM/proddb/perfx_08.dbf
channel disk3: restoring datafile 00058 to +ASM/proddb/perfd_24.dbf
channel disk3: reading from backup piece /backup03/PRODDB/backup/PRODDB20181015cutfmu04_1_1.bus
channel disk2: piece handle=/backup02/PRODDB/backup/PRODDB20181015cstfmtrd_1_1.bus tag=HOTBKUPFULL
channel disk2: restored backup piece 1
channel disk2: restore complete, elapsed time: 00:07:45
channel disk2: starting datafile backup set restore
channel disk2: specifying datafile(s) to restore from backup set
channel disk2: restoring datafile 00005 to +ASM/proddb/prefd_20.dbf
channel disk2: restoring datafile 00025 to +ASM/proddb/perfx_05.dbf
channel disk2: reading from backup piece /backup03/PRODDB/backup/PRODDB20181015cvtfmu2q_1_1.bus
channel disk4: piece handle=/backup03/PRODDB/backup/PRODDB20181015cktfmtg5_1_1.bus tag=HOTBKUPFULL

You can also check the status with:

select inst_id, sid, round((elapsed_seconds/60),0) as elap, round((time_remaining/60),0) as remain, message 
from gv$session_longops 
where time_remaining > 0 
and target_desc = 'restore'
order by start_time asc
;

If you want to see what is occurring on each data file, you can comment out the “AND” as shown below:

select inst_id, sid, round((elapsed_seconds/60),0) as elap, round((time_remaining/60),0) as remain, message 
from gv$session_longops 
where time_remaining > 0 
--and target_desc = 'restore'
order by start_time asc
;

In the alert log, you should see messages like:

Tue Oct 16 02:19:39 2018
Incomplete Recovery applied until change 141794746611 time 10/15/2018 19:00:00
Media Recovery Complete (PRODDB4)
Completed: alter database recover logfile '+ASM/proddb/archivelog/4_20404_972607420.arc'
Tue Oct 16 02:22:36 2018

After the recover is complete, open the database with a resetlogs:

alter database open resetlogs

The alert log will show messages such as:

RESETLOGS after incomplete recovery UNTIL CHANGE 141794757937583
Archived Log entry 164482 added for thread 1 sequence 20270 ID 0x4d158c31 dest 1:
Archived Log entry 164483 added for thread 2 sequence 20003 ID 0x4d158c31 dest 1:
Archived Log entry 164484 added for thread 3 sequence 20643 ID 0x4d158c31 dest 1:
Archived Log entry 164485 added for thread 4 sequence 20420 ID 0x4d158c31 dest 1:
Resetting resetlogs activation ID 1293257777 (0x4d158c31)

The instance will be shut down and opened in read write mode. Since this is a RAC, I issued a

shut down immediate;

and restarted with SRVCTL

srvctl start database –d PRODDB

and monitored the alert logs to see a successful start of all nodes.

Important – Take a backup of the database as soon as possible as you no longer have a recovery point after the reset logs.

Delete a backup file from RMAN

Yesterday I had a requirement to remove a backup file from the RMAN catalog because we did not want recover to it. After some “research” on Google and the RMAN documentation, I found the below method:

Connect to RMAN an issue command:

list backup;

This will list all the backups cataloged in RMAN. Find the one that you want to “uncatalog”. For example, I wanted to remove the below file:

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
27249   Full    3.75G      DISK        00:00:00     28-MAR-18      
        BP Key: 27249   Status: AVAILABLE  Compressed: NO  Tag: TAG20180328
        Piece Name: /backup/move/dean/PRODDBbtsuuqm6_1_1.dbc
  List of Datafiles in backup set 27249
  File LV Type Ckp SCN    Ckp Time  Name
  ---- -- ---- ---------- --------- ----
  120     Full 132807363910 29-MAR-18 +DG_ORA64/proddb/perfx_52.dbf

I then issued the below command in RMAN:

CHANGE BACKUPpiece 27249 uncatalog;

Response was

uncataloged backup piece
backup piece handle=/backup/move/dean/PRODDBbtsuuqm6_1_1.dbc RECID=27249 STAMP=972007418
Uncataloged 1 objects

The file was no longer part of the subsequent RECOVER and RESTORE commands.

Stuck in a loop between restore and recover database

This week a colleague of mine was building a physical standby. The method she had chosen was to copy the backups from the primary server to the standby server and then restore the database followed by a recover. The database was on the larger size; approximately 4 TB. After the restore completed successfully, she issued the recover database command and Oracle responded that data file 114 needed to be restored. At this point in time I got involved and issued a restore datafile 114 command which completed successfully.

We then tried the recover database command and were informed that data file 177 needed to be restored. We issued a restore data file 177 command which also completed successfully. The next recover database command pointed to a different file that needed to be restored. This continued alternating between a series of six data files.

At this point in time we looked at the V$datafile table with the below SQL

select name, count(*) from v$datafile group by name having count(*) > 1;

The SQL stated indicated that there were three pairs of data files, each with two identically named filenames. This has occurred because the source database had many more ASM disk groups than the target (i.e. the standby database). As a result of this mismatch between the number of ASM disk groups, three pairs of files were acquiring identical names in the same ASM disk groups on the standby. Each time we restored one of the duplicate pairs the other file of the duplicate pair would then need to be restored causing us to continuously loop between restores and recovers.

We resolve the issues by issuing set new name for file ID into new location commands and then restoring and recovering the database.

Lost connection to your session during a restore database

If you are restoring a database in rman and you lose connection to your session due to a network blip, you can resume the restore after the network connection is reestablished.

For example, the restore is being run as follows:

[prodsrvr4-RACNODE4] rman target=/
Recovery Manager: Release 11.2.0.4.0 - Production on Wed Feb 14 10:01:36 2018
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
connected to target database: PRODDB (DBID=1057984567, not open)
RMAN> restore database;
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00120 to +DG_PRD2/proddb/customer_52.dbf
channel ORA_DISK_1: reading from backup piece /dbbackup06/PRODDB/backup/PRODDB201802137lsr3u55_1_1.bus

and you lose network connectivity, you can log back on and restart (i.e. resume) as follows:

 [prodsrvr4-RACNODE4] rman target=/
 
Recovery Manager: Release 11.2.0.4.0 - Production on Wed Feb 14 10:01:36 2018
 
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
 
connected to target database: PRODDB (DBID=1057984567, not open)
 
RMAN> restore database ;
 
Starting restore at 14-FEB-18
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=177 instance=RACNODE4 device type=DISK
 
skipping datafile 1; already restored to file +DG_PRD2/proddb/system01.dbf
skipping datafile 140; already restored to file +DG_PRD1/proddb/order_65.dbf
.
.
.
channel ORA_DISK_1: starting datafile backup set restore
channel ORA_DISK_1: specifying datafile(s) to restore from backup set
channel ORA_DISK_1: restoring datafile 00069 to +DG_PRD2/proddb/customer_29.dbf
channel ORA_DISK_1: restoring datafile 00203 to +DG_PRD2/proddb/undotbs102.dbf
channel ORA_DISK_1: reading from backup piece /dbbackup05/PRODDB/backup/PRODDB201802136bsr3ph4_1_1.bus
channel ORA_DISK_1: piece handle=/dbbackup05/PRODDB/backup/PRODDB201802136bsr3ph4_1_1.bus tag=HOTBKUPFULL
channel ORA_DISK_1: restored backup piece 1
channel ORA_DISK_1: restore complete, elapsed time: 00:15:55

RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process

I was in the process of building a physical standby that had to be put on hold for a few days while other more pressing priorities were handled. I thought it would be sufficient to change the below parameter:

NAME                                 TYPE        VALUE
log_archive_dest_state_2             string      DEFER

However, I then noticed that the time taken by the archived log backup script had increased from a couple of minutes to over three hours. This was because the archived logs were no longer being deleted as RMAN was displaying the below message:

channel disk2: backup set complete, elapsed time: 00:01:35
channel disk2: deleting archived log(s)
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=/archlogs/1_3426_959714474.arc thread=1 sequence=3426
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=/archlogs/1_3427_959714474.arc thread=1 sequence=3427

In order to resolve this issue I had to remove the entry for the physical standby in the below parameter:

NAME                                 TYPE        VALUE
log_archive_dest_2                   string

After this was done, the next run of the archived log backup process completed in a few minutes and subsequent runs returned to the expected couple of minutes execution time.

Moving a database to a new server while attempting to use the old remote catalog

As part of a hardware upgrade, I moved a database to a new RAC. I was attempting to use my backup scripts from the old RAC. The scripts contained connections to a remote catalog that I wanted to reuse. On the new RAC, I attempted to register and resync the catalog and got the below error messages:

[new_server-DBNAME2] rman catalog=rman_cat/rman_password@remote_catalog target=/ 

Recovery Manager: Release 11.2.0.4.0 - Production on Fri Sep 8 10:47:33 2017

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: DBNAME (DBID=599767446)
connected to recovery catalog database

RMAN> register database;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of register command on default channel at 09/08/2017 10:47:44
RMAN-20002: target database already registered in recovery catalog

RMAN> RESYNC CATALOG;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of resync command on default channel at 09/08/2017 10:47:51
RMAN-20020: database incarnation not set

To resolve this issue, I executed the commands:

RMAN> unregister database;

database name is "DBNAME" and DBID is 599767446

Do you really want to unregister the database (enter YES or NO)? yes
database unregistered from the recovery catalog

After the database was unregistered, I was able to register the database:

RMAN> register database;

database registered in recovery catalog
starting full resync of recovery catalog
full resync complete

RMAN> RESYNC CATALOG;

starting full resync of recovery catalog
full resync complete

The resync catalog command may be redundant, but it never hurts to make sure.

RMAN – Mismatched objects

One of my production databases was scripted to write backups to /backup01. For reasons not germane to this post, this mount point became full and I had to move the backups to /backup02. I copied the old backups and modified the script to write to /backup02 starting with the next run. However, the next execution gave me the below error.

RMAN-06207: WARNING: 1 objects could not be deleted for DISK channel(s) due
RMAN-06208:          to mismatched status.  Use CROSSCHECK command to fix status
RMAN-06210: List of Mismatched objects
RMAN-06211: ==========================
RMAN-06212:   Object Type   Filename/Handle
RMAN-06213: --------------- ---------------------------------------------------
RMAN-06214: Backup Piece    /backup01/GPn4sbh1bm_1_1.bus

This was a direct result of my moving the backups from one location to another external to RMAN. To resolve this error, I issued a cross check

RMAN> crosscheck backuppiece '/backup01/GPn4sbh1bm_1_1.bus';
 
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1011 instance=instance name device type=DISK
crosschecked backup piece: found to be 'EXPIRED'
backup piece handle=/backup01/GPn4sbh1bm_1_1.bus RECID=125099 STAMP=951616886
Crosschecked 1 objects

followed by a delete obsolete

RMAN> delete obsolete;                         
 
RMAN retention policy will be applied to the command
RMAN retention policy is set to recovery window of 7 days
using channel ORA_DISK_1
Deleting the following obsolete backups and copies:
Type                 Key    Completion Time    Filename/Handle
-------------------- ------ ------------------ --------------------
Backup Set           125099 10-AUG-17         
  Backup Piece       125099 10-AUG-17          /backup01/GPn4sbh1bm_1_1.bus
 
Do you really want to delete the above objects (enter YES or NO)? y
deleted backup piece
backup piece handle=/backup01/GPn4sbh1bm_1_1.bus RECID=125099 STAMP=951616886
Deleted 1 objects

This resolved the issue.