DR test with Snapshot Standby

This blog post contains the steps required to convert the physical standby to an open (read/write) database and allow the application to perform a DR test without impacting the primary database and the users.

This test was conducted on a four-node primary RAC with a four-node physical standby, both running Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 – 64bit Production.

We did not make any changes to the primary. As a result, users were able to work on the primary and it was able to ship logs to that standby. However, as we reserved all the space on the standby for DB_RECOVERY_FILE_DEST_SIZE the primary was unable to ship and issued the below error messages. For next time, DO NOT reserve all the space for DB_RECOVERY_FILE_DEST_SIZE. Instead, maybe use 50% or 60% of the available space and allow the archive logs from the primary to continue to land on the standby.

These errors were reported in the alert logs of the primary database because we had reserved all the space on the standby:

Tue Jun 09 12:33:44 2020
TT00: Standby redo logfile selected for thread 4 sequence 144397 for destination LOG_ARCHIVE_DEST_2
Tue Jun 09 12:35:13 2020
TT00: Attempting destination LOG_ARCHIVE_DEST_2 network reconnect (3135)
TT00: Destination LOG_ARCHIVE_DEST_2 network reconnect abandoned
Tue Jun 09 12:35:13 2020
Errors in file /oracle_home/product/diag/rdbms/prprimy/PRPRIMY4/trace/PRPRIMY4_tt00_53355.trc:
ORA-03135: connection lost contact
Error 3135 for archive log file 8 to 'PRSTNBY'

All of the below commands/SQLs were run on the standby RAC.

Check the recovery file destination:

SELECT  NAME,FLOOR (space_limit/1024/1024) "Size MB",CEIL (space_used/1024/1024) "Used MB"
FROM    v$recovery_file_dest
;

NAME                              Size MB    Used MB
------------------------------ ---------- ----------
+DG_ARCHIVE                          5535        408

Confirm space in ASM:

ASMCMD [+] > lsdg
State    Type    Rebal  Sector  Block       AU  Total_MB  Free_MB  Req_mir_free_MB  Usable_file_MB  Offline_disks  Voting_files  Name
MOUNTED  EXTERN  N         512   4096  1048576    208896   204136                0          204136              0             N  DG_ARCHIVE/

Modify the below parameters:

alter system set DB_RECOVERY_FILE_DEST_SIZE= '190G' scope=both sid='*';
alter system set DB_RECOVERY_FILE_DEST='+DG_ARCHIVE' scope=both sid='*';
alter system set undo_retention = 21600 scope=both sid='*';

Stop managed recovery process:

alter database recover managed standby database cancel;

Stop the database via srvctl:

srvctl stop database –d PRSTNBY –o immediate

In older versions of Oracle, we used to turn on flashback and activate the standby database. This is no longer required because the CONVERT TO SNAPSHOT STANDBY will turn on flashback. At this point the value of flashback will be:

select FLASHBACK_ON from gv$database;

FLASHBACK_ON
------------------
NO
NO
NO
NO

alter database flashback on;    <--- this was deemed not needed, so we didn't execute this
select flashback_on from gv$database;
select * from gv$tablespace where FLASHBACK_ON='NO';

On the standby, bring up one node in mount status to convert to snapshot:

startup nomount
alter database mount standby database;
select name, guarantee_flashback_database from gv$restore_point;

At this point, the above SQL will say no rows selected

Convert to snapshot:

ALTER DATABASE CONVERT TO SNAPSHOT STANDBY;

Check the flashback status:

select name, guarantee_flashback_database from gv$restore_point;

NAME                                            GUA
---------------------------------------------   ---
SNAPSHOT_STANDBY_REQUIRED_06/09/2020 12:35:05   YES


select flashback_on from gv$database;

FLASHBACK_ON
------------------
RESTORE POINT ONLY

Sanity check - Stop here !! -- verify that restore point EXISTs, then open database

Try to open the one node that we are working on to make sure that the open works successfully:

ALTER DATABASE OPEN;

select status, instance_name, database_role, open_mode from gv$database, gv$instance;
STATUS       INSTANCE_NAME    DATABASE_ROLE    OPEN_MODE
------------ ---------------- ---------------- --------------------
MOUNTED      PRSTNBY4         SNAPSHOT STANDBY MOUNTED

Shutdown immediate as we had only started one node to convert to snapshot

SHUTDOWN IMMEDIATE;

Bring up the database via SRVCTL in mount mode

srvctl start database –d PRSTNBY -o mount

Log on to each node (instance) and open that instance

ALTER DATABASE OPEN;

Inform users to point their application server to the former standby database and begin the DR Exercise.

After the applications inform us that they have completed their DR Exercise, continue with the below steps.

Stop the database via SRVCTL:

srvctl stop database –d PRSTNBY –o immediate

Confirm all nodes are down

srvctl status database –d PRSTNBY

Bring up one node to convert back to physical standby (Reduce the DB_RECOVERY_FILE_DEST_SIZE else we will face the error:

Waiting for all non-current ORLs to be archived...
Tue Jun 09 17:50:17 2020
Waiting for the ORL for thread 1 sequence 1 to be archived...
ARC3: Archiving not possible: error count exceeded

as the archive destination is being reserved for the DB_RECOVERY_FILE_DEST_SIZE):

startup mount
ALTER DATABASE CONVERT TO PHYSICAL STANDBY;
alter database mount standby database;
alter system set DB_RECOVERY_FILE_DEST_SIZE= '5535M' scope=both sid='*';
alter system set undo_retention = 14400 scope=both sid='*';
shutdown immediate;

Start the standby via SRVCTL

srvctl start database –d PRSTNBY

Confirm that MRP is running and log apply is progressing.

Issues with multiple outer joins causing high elapsed time

Apologies in advance as I cannot share the original query as the information is proprietary. I hope to be able to explain as much as possible to make it useful.

One of our application had converted from Sybase to Oracle Database 18c Enterprise Edition Release 18.0.0.0.0. This (obfuscated) query used to run in under 2 seconds in Sybase but was taking between 15 to 25 seconds to execute in Oracle. The 25 second result tended to be on the first execution with subsequent executions running in between 15 to 17 seconds. I put aside the 25 second result as being attributable to the creation of the access path and focused on the 15 to 17 second executions.

Original query

SELECT 
       many,
       columns,
       from,
       all,
       the,
       joined,
       tables
FROM   customer_order ao
       full outer join sales_notes sn
                    ON ao.order_id = sn.order_id
       full outer join order_contacts poc
                    ON ao.order_id = poc.order_id
       full outer join contact c
                    ON poc.contact = c.contact,
       site_order so
       full outer join order_info oori
                    ON so.site = oori.site
       full outer join tax_info t
                    ON so.site = t.site
                       AND so.order_id = t.order_id
       full outer join order_notes son
                    ON so.site = son.site
       full outer join tax_notes tn
                    ON so.site = tn.site
       full outer join tax_order_detail tod
                    ON so.site = tod.site
       full outer join tax_sub_order tso
                    ON so.site = tso.site
       full outer join other_vax_order ovo
                    ON so.site = ovo.site
       address a
WHERE  so.site          IS NOT NULL
  AND  so.site          = 'some value here'
  AND  so.site          = ao.site
  AND  so.acct_order_id = ao.acct_order_id
  AND  so.site          = a.site
ORDER  BY 
       so.site,
       so.acct_order_id 
;

All of these tables were below 100k rows with no complexity of LOB columns or anything out of the ordinary. The cost estimate for this SQL was 365,047,136.

The filter predicates in the where clause on the so.site will always result in less than 10 rows. Site is the primary index on the site table. All the other tables have an index to satisfy the join columns. The tunning advisor reported that no better plan was available.

In order to understand what was occurring, I ran an explain on the below SQL:

SELECT 
       many,
       columns,
       from,
       all,
       the,
       joined,
       tables
FROM   customer_order ao,
       site_order so,
       address a
WHERE  so.site          IS NOT NULL
  AND  so.site          = 'some value here'
  AND  so.site          = ao.site
  AND  so.acct_order_id = ao.acct_order_id
  AND  so.site          = a.site
ORDER  BY 
       so.site,
       so.acct_order_id 
;

The optimizer estimated the cost at 9 and the query ran consistently in less than a second. I then experimented with adding the other FULL OUTER JOIN tables one by one. The addition of any two tables as FULL OUTER JOINS did not impact the query cost. Additionally, these other two tables would be joined as NESTED LOOP joins. The addition of a third table as a FULL OUTER JOIN would raise the query cost significantly and also use a hash join.

After some trial and error, I added in the optimizer hint:
Modified query

SELECT /*+ USE_NL(so ovo) */
       many,
       columns,
       from,
       all,
       the,
       joined,
       tables
FROM   customer_order ao
       full outer join sales_notes sn
                    ON ao.order_id = sn.order_id
       full outer join order_contacts poc
                    ON ao.order_id = poc.order_id
       full outer join contact c
                    ON poc.contact = c.contact,
       site_order so
       full outer join order_info oori
                    ON so.site = oori.site
       full outer join tax_info t
                    ON so.site = t.site
                       AND so.order_id = t.order_id
       full outer join order_notes son
                    ON so.site = son.site
       full outer join tax_notes tn
                    ON so.site = tn.site
       full outer join tax_order_detail tod
                    ON so.site = tod.site
       full outer join tax_sub_order tso
                    ON so.site = tso.site
       full outer join other_vax_order ovo
                    ON so.site = ovo.site
       address a
WHERE  so.site          IS NOT NULL
  AND  so.site          = 'some value here'
  AND  so.site          = ao.site
  AND  so.acct_order_id = ao.acct_order_id
  AND  so.site          = a.site
ORDER  BY 
       so.site,
       so.acct_order_id 
;

And the cost went down to around 70 with the query consistently performing in under two seconds. I just happened to be testing the order of joins with the table aliased as OVO and hence the hint lists OVO. Almost any other combination results in the same (i.e. improved) access path.