TNS-12535, ns secondary err code: 12560 and users unable to log on

This afternoon the users of one of my applications reported that they were unable to sign on using a particular user account. The user account in question was open and unexpired. The users were able to sign on to other accounts without any issues. The alert logs on all nodes of the RAC were being flooded with messages such as:

Fatal NI connect error 12170.

  VERSION INFORMATION:
        TNS for Linux: Version 11.2.0.4.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.4.0 - Production
  Time: 15-JUN-2018 19:17:55
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535
    
TNS-12535: TNS:operation timed out
    ns secondary err code: 12560
    nt main err code: 505
    
TNS-00505: Operation timed out
    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=xxx.xxx.xxx.xx)(PORT=nnn))
Fri Jun 15 19:18:37 2018

Despite bouncing all nodes of the RAC, flushing the shared pool, etc., the users were still not able to connect via this particular user account. Existing connections using this account continued to work.

After some false leads, a colleague noticed that there were a number of library cache locks with the user account set to “null”. These were being generated by a set of new servers that the application was attempting to configure using the user account that was experiencing the issue. After these new servers were shut down and the database bounced, normal functionality was restored.
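
For anyone chasing a similar symptom, the library cache locks can be spotted with a query along these lines (a sketch; the exact query my colleague used is not recorded here):

select inst_id, sid, serial#, username, event, seconds_in_wait
from   gv$session
where  event like 'library cache lock%'
order  by inst_id, sid;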

Applying an incremental copy to make a physical standby current

One of our standby RACs had fallen behind in the log apply process:

DESTINATION           THREAD#    APPLIED   NOTAPPLD      DELTA
------------------ ---------- ---------- ---------- ----------
1 - From Primary            1       1187       1231         44
                            2       1520       1572         52
******************            ---------- ---------- ----------
Hash                                2707       2803         96
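
The gap report above comes from a homegrown monitoring script; a rough equivalent (a sketch, assuming the standby is log_archive_dest_2, i.e. dest_id = 2) is:

select thread#,
       max(case when applied = 'YES' then sequence# end) as applied,
       max(sequence#) as notappld,
       max(sequence#) - max(case when applied = 'YES' then sequence# end) as delta
from   v$archived_log
where  dest_id = 2
group  by thread#;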

MRP was waiting on the log sequence below:

   INST_ID PROCESS   STATUS          THREAD#  SEQUENCE#     BLOCK#
---------- --------- ------------ ---------- ---------- ----------
         2 MRP0      WAIT_FOR_GAP          2       1521          0
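
The MRP status above is from gv$managed_standby; a query along these lines produces it:

select inst_id, process, status, thread#, sequence#, block#
from   gv$managed_standby
where  process like 'MRP%';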

This log file was no longer available on the primary, and hence I had to use an incremental copy to get the standby in sync with the primary. At this point I cancelled MRP on the standby, shut down the RAC via SRVCTL, and started node 2 as a standalone in mount mode.

On the primary, I identified the first change number (FIRST_CHANGE#) of the n-1th log, i.e. the log before the one the standby had reported as missing. Using this change number, I created an incremental backup:

backup incremental from scn 14521210213 database format '/backup/dean/backup_%U.bak';
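
My notes do not capture the exact queries used to find that SCN, but it can be identified along these lines (sequence 1520 being the n-1th log in this example):

-- On the standby: the lowest SCN still required by the data files
select min(checkpoint_change#) from v$datafile_header;

-- On the primary: the first change number of the log before the reported gap
select first_change# from v$archived_log where thread# = 2 and sequence# = 1520;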

I also created a backup of the control file:

backup current controlfile for standby format '/backup/dean/stnd_ctl_file.ctl';

These files were transferred to the standby via FTP commands.

After the files are transferred to the standby, sign in to SQL*Plus and stop MRP with:

alter database recover managed standby database cancel;

Stop all nodes of the RAC with

srvctl status database -d PRODDB
srvctl stop database -d PRODDB -o immediate

Bring up the database as a single node in SQL*Plus:

startup nomount;
Alter database mount standby database;

Exit from SQL*Plus, start RMAN, and catalog the directory to which the backups were FTPed:

catalog start with '/backup/dean';

Then recover the database with:

recover database noredo;

After the recovery was complete, I restored the standby control file:

shutdown immediate;
startup nomount;
restore standby controlfile from '/backup/dean/stnd_ctl_file.ctl';

Since the data file locations on the standby were different from the primary, the standby control file has to be updated. Start the database in mount mode:

alter database mount standby database;

At this point, the instance thinks the data file locations are those of the primary, as the restored control file came from the primary. Switch the data file locations in RMAN with the commands below (the catalog command should point to the data file locations on the standby):

catalog start with '+DG_CLU_ORA1/D99Xsn';
switch database to copy;

In my experience, the “switch database to copy;” will switch the undo, sysaux, users and system data files successfully. The others (i.e. application-specific database files) tend to fail with the below error:

RMAN> switch database to copy;

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of switch to copy command at 08/27/2018 17:40:53
RMAN-06571: datafile 41 does not have recoverable copy

Use REPORT SCHEMA to confirm that the data file exists on the standby:

RMAN>  report schema;

RMAN-06139: WARNING: control file is not current for REPORT SCHEMA
Report of database schema for database with db_unique_name IPTSDV

List of Permanent Datafiles
===========================
File Size(MB) Tablespace           RB segs Datafile Name
---- -------- -------------------- ------- ------------------------
1    0        SYSTEM               ***     +DG_CLU_ORA1/D99Xsn/system.259.841589723
2    0        SYSAUX               ***     +DG_CLU_ORA1/D99Xsn/sysaux.260.841589731
.
.
.
41   1024     emp_stats_DAT_001_DC  ***     +DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989

If the file does exist, catalog it with:

RMAN> catalog datafilecopy '+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989';

cataloged datafile copy
datafile copy file name=+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc RECID=197 STAMP=985283856

The "switch database to copy" should work after that:

RMAN> switch database to copy;

datafile 1 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/system.1853.983049983"
datafile 2 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/sysaux.1793.983050037"
datafile 3 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/undotbs1.1805.983050783"
.
.
.
datafile 41 switched to datafile copy "+DG_CLU_ORA1/D99Xsn/emp_stats_dat_001_dc.2405.983050989"

After this, I started MRP on the last node with:

alter database recover managed standby database using current logfile disconnect from session;

The standby database was now current with the primary:

DESTINATION           THREAD#    APPLIED   NOTAPPLD      DELTA
------------------ ---------- ---------- ---------- ----------
1 - From Primary            1       1233       1234          1
                            2       1575       1575          0
******************            ---------- ---------- ----------
Hash                                2808       2809          1

To make sure everything was working, I switched a log on the primary and monitored the standby to ensure it was applied.
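
Cutting a log is just a log switch on the primary, e.g.:

alter system archive log current;

followed by checking v$archived_log (or the gap report above) on the standby for the new sequence.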

Another issue I faced was that the log files were pointing to the primary’s log file location. This can be changed by running the output of the below SQL:

set line 200
select 'ALTER DATABASE RENAME FILE ' || '''' || member || '''' || ' to ' || '''' || replace(member,'primary_log_location','standby_log_location') || '''' || ';'
from v$logfile where member like '%primary db name%';

Output will look like:

ALTER DATABASE RENAME FILE '+DG1/primary/ONLINELOG/group_2_a.log' to '+DG3/standby/ONLINELOG/group_2_a.log';

Splitting a partition

I encountered a strange setup in which a table was created as partitioned but with only a MAXVALUE partition, resulting in what was essentially a single-partition table. An extract of the original DDL follows:

  CREATE TABLE "SCHEMA"."TABLE_NAME" 
   (	"VERSION_NUM" VARCHAR2(5 BYTE), 
       .
       .
       .    
	"PARTITION_COLUMN" DATE, 
	.
      .
 (PARTITION "PARTITION_MAX" 
  TABLESPACE "DE_SUSP_IND_001" )  ENABLE
   ) PCTFREE 10 PCTUSED 40 INITRANS 1 MAXTRANS 255 
  STORAGE(
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "TABLE_DAT_001" 
  PARTITION BY RANGE ("PARTITION_COLUMN") 
 (PARTITION "PARTITION_MAX"  VALUES LESS THAN (MAXVALUE) SEGMENT CREATION IMMEDIATE
  ENABLE ROW MOVEMENT ;

I needed to split this into two partitions such that all rows with a value less than 2017-11-12 would go into a separate partition. I split the partition as shown:

ALTER TABLE "SCHEMA"."TABLE_NAME" 
      SPLIT PARTITION "PARTITION_MAX" AT (TO_DATE('2017-11-12','YYYY-MM-DD')) 
      INTO (PARTITION "PARTITION_20171111", PARTITION "PARTITION_MAX") UPDATE INDEXES PARALLEL
;

The above resulted in two partitions,

PARTITION "PARTITION_20171111"  VALUES LESS THAN (TO_DATE(' 2017-11-12 00:00:00', 'SYYYY-MM-DD HH24:MI:SS', 'NLS_CALENDAR=GREGORIAN'))

And

PARTITION "PARTITION_MAX"  VALUES LESS THAN (MAXVALUE) SEGMENT CREATION IMMEDIATE

After this was complete, I gathered statistics to check that the row count in the partitions matched the actual distribution of data.
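
A sketch of that check, using the placeholder names from the DDL above:

exec dbms_stats.gather_table_stats(ownname => 'SCHEMA', tabname => 'TABLE_NAME');

select partition_name, num_rows
from   dba_tab_partitions
where  table_owner = 'SCHEMA'
and    table_name  = 'TABLE_NAME';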

Recovering a tablespace with a LOB tablespace/auxiliary table

Due to a miscommunication with the development team, the DBA team accidentally refreshed the data in an environment that was currently being used for a test cycle. The users requested that we recover the database back to the most recent copy, which we had taken earlier that morning. I created a REXX routine to generate the approximately 600 recover jobs needed to perform this as fast as possible. Some of the jobs failed because the base table had a LOB column that pointed to an auxiliary table in a LOB tablespace.

The table structure was defined as follows (DDL has been edited to remove options not relevant):

     CREATE TABLE                                             
       PRODA.MY_TABLE_01                                      
        (                                                     
        ACCOUNT_NAME   CHAR(50) NOT NULL WITH DEFAULT          
         FOR SBCS DATA                                        
       ,ACCOUNT_TXT   CLOB(1 M) NOT NULL WITH DEFAULT          
         FOR SBCS DATA                                        
        )                                                     
         IN PRODAB00.TSPACE57  
         CCSID EBCDIC                                         
         NOT VOLATILE                                         
         APPEND NO                                                                           
     ;                                
     CREATE                           
       LOB TABLESPACE TSPACE58        
         IN PRODAB00                  
         USING STOGROUP PRODASMS      
         PRIQTY -1                    
         SECQTY -1                    
         ERASE NO                         
         GBPCACHE SYSTEM                  
         NOT LOGGED                       
         DSSIZE 4 G                       
         BUFFERPOOL BP32K1                
         LOCKSIZE ANY                     
         LOCKMAX SYSTEM                   
         CLOSE NO                         
     ;                                    
     CREATE AUXILIARY TABLE               
         PRODA.MY_TABLE_01_AUX_01         
         IN PRODAB00.TSPACE58             
         STORES PRODA.MY_TABLE_01         
         APPEND NO                        
         COLUMN ACCOUNT_TXT                
         PART 1                           
     ;                                    

The initial recover was coded as follows:

//SYSIN    DD  *                                   
  RECOVER  TABLESPACE  PRODAB00.TSPACE57           
           TORBA       X'0000000015D79FB6B7E0'     
/*                                                 

This generated the error:

DSNUGUTC -  RECOVER TABLESPACE PRODAB00.TSPACE57 TORBA X'0000000015D79FB6B7E0'
DSNU1316I -DBXC 102 15:41:03.83 DSNUCAIN - THE FOLLOWING TABLESPACES ARE MISSING FROM THE RECOVERY LIST
PRODAB00.TSPACE58
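
The LOB table spaces that belong in the recovery list can also be found in the DB2 catalog; a sketch against SYSIBM.SYSAUXRELS, using the names from the DDL above:

SELECT A.AUXTBOWNER, A.AUXTBNAME, T.DBNAME, T.TSNAME
  FROM SYSIBM.SYSAUXRELS A
      ,SYSIBM.SYSTABLES  T
 WHERE T.CREATOR = A.AUXTBOWNER
   AND T.NAME    = A.AUXTBNAME
   AND A.TBOWNER = 'PRODA'
   AND A.TBNAME  = 'MY_TABLE_01';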

I changed the JCL as follows:

//SYSIN    DD  *                                  
  RECOVER  TABLESPACE  PRODAB00.TSPACE57          
           TABLESPACE  PRODAB00.TSPACE58          
           TORBA       X'0000000015D79FB6B7E0'    
/*                                                

And the recover completed:

DSNUCBMD - RECOVER TABLESPACE PRODAB00.TSPACE57   START
DSNUCBAL - THE IMAGE COPY DATA SET HLQ.PRODAB00.TSPACE57.NVE.F1010801 WITH DATE=20180411 AND TIME=071133 IS PARTICIPATING IN RECOVERY OF TABLESPACE PRODAB00.TSPACE57

DSNUCBMD - RECOVER TABLESPACE PRODAB00.TSPACE58   START                    
DSNUCBAL - THE IMAGE COPY DATA SET HLQ.PRODAB00.TSPACE58.NVE.F1010801 WITH DATE=20180411 AND TIME=071133 IS PARTICIPATING IN RECOVERY OF TABLESPACE PRODAB00.TSPACE58

Moving temp table spaces on a RAC with users online

Please see the blog post entitled “Moving undo table spaces on a RAC with users online” for some additional background. In summary, we had to free up space on a certain ASM disk group. We had about 175 GB allocated to temp and decided to move it with users actively accessing the database.

First, take the file offline:

ALTER DATABASE TEMPFILE '+ASM_DG_01/PRODPRM/temp01.dbf' OFFLINE;

Copy the file from the old ASM disk group to the new disk group. Set oracle home to your ASM instance and then issue ASMCMD.

cp +ASM_DG_01/PRODPRM/temp01.dbf +ASM_DG_04/PRODPRM/temp01.dbf

(Note that this is a copy, not a move.)

Rename the temp file in the database in SQL*Plus. If you have changed the oracle home to the ASM instance, remember to change back to your RAC instance’s oracle home:

ALTER DATABASE RENAME FILE '+ASM_DG_01/PRODPRM/temp01.dbf' TO '+ASM_DG_04/PRODPRM/temp01.dbf';

Bring it back online:

ALTER DATABASE TEMPFILE '+ASM_DG_04/PRODPRM/temp01.dbf' ONLINE;

Now delete the file from the old ASM disk group. Set oracle home to your ASM instance and then issue ASMCMD.

rm +ASM_DG_01/PRODPRM/temp01.dbf

The space utilized in the ASM disk group should now be reduced.
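
The freed space can be confirmed from the ASM instance, for example:

select name, total_mb, free_mb from v$asm_diskgroup;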

Moving undo table spaces on a RAC with users online

We were running out of space on an ASM DG and had to free up space ASAP. Since our UNDO tablespaces are sized at about 200 GB and were on the ASM DG that was full, we decided to move them.

This was the primary database with a physical (Data Guard) standby, so I needed to turn off standby file management first:

ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT='MANUAL';

Create a new undo tablespace for RAC node 01 in the new ASM disk group:

create undo tablespace UNDOTBS1A datafile '+ASM_DG_01/PRODPRM/undotbs1a.dbf' size 20G;

Alter the system to point RAC node 01 at the new undo tablespace:

ALTER SYSTEM SET UNDO_TABLESPACE=UNDOTBS1A SCOPE=BOTH SID='PRODPRM1';

Since this is a multi-node RAC, we can take node 01 offline to disconnect any users still attached to the old undo tablespace:

srvctl stop instance -d PRODPRM -i PRODPRM1 -o immediate

To be safe, make sure the RAC node is down with:

srvctl status database -d PRODPRM

Also, check the alert log on this particular node to make sure it was a clean shutdown.
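
Before dropping the old undo tablespace, it is also worth confirming that no transactions still reference its segments; a sketch:

select count(*)
from   v$transaction t, dba_rollback_segs r
where  t.xidusn = r.segment_id
and    r.tablespace_name = 'UNDOTBS1';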

With the RAC node down, drop the old undo tablespace and its data files:

drop tablespace UNDOTBS1 including contents and datafiles;

Bring up RAC node 01 and it will be using the new undo table space:

srvctl start instance -d PRODPRM -i PRODPRM1

Turn standby file management back on:

ALTER SYSTEM SET STANDBY_FILE_MANAGEMENT='AUTO';

Recovering in the event a user issues an incorrect SQL in DB2 for z/OS

If a user inadvertently updates, inserts or deletes data in a table during the day and you need to recover the table to a prior point in time, you can recover to a checkpoint rather than an Image Copy or a quiesce point to reduce the amount of data loss. For example, if the user issued a “bad” query at 17:18 on March 29, 2018, you can print the checkpoint queue with DB2 utility DSNJU004. The output will look similar to:

                    CHECKPOINT QUEUE                
                18:40:10 MARCH 29, 2018             
TIME OF CHECKPOINT       18:30:04 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDD4187405
END CHECKPOINT RBA              0000000023FDD4194000
END CHECKPOINT STCK             00D41A198B683F8E0A00
TIME OF CHECKPOINT       18:15:04 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDCE6AA615
END CHECKPOINT RBA              0000000023FDCE6B73E7
END CHECKPOINT STCK             00D41A16311B02B53A00
TIME OF CHECKPOINT       18:02:44 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDC976C806
END CHECKPOINT RBA              0000000023FDC9785E4F
END CHECKPOINT STCK             00D41A136F4ACE910800
TIME OF CHECKPOINT       17:47:44 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDC3A02EF2
END CHECKPOINT RBA              0000000023FDC3A11000
END CHECKPOINT STCK             00D41A1014FBDC3B0800
TIME OF CHECKPOINT       17:32:44 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDBDA9FBDC
END CHECKPOINT RBA              0000000023FDBDAAD967
END CHECKPOINT STCK             00D41A0CBAAD71430800
TIME OF CHECKPOINT       17:17:44 MARCH 29, 2018    
BEGIN CHECKPOINT RBA            0000000023FDB7EAA020
END CHECKPOINT RBA              0000000023FDB7EC4622
END CHECKPOINT STCK             00D41A096060B5B53000

There is a checkpoint at 17:17 on March 29, 2018 that may be a good recovery point.

The tablespace can be recovered with

//SYSIN    DD  *                        
  RECOVER  TABLESPACE proddb00.prodts33 
           TORBA      X'23FDB7EC4622'   
/*                                      

If you have an old table space with multiple tables, they will all be recovered to the RBA. Rebuild all the associated indices after the recovery completes.
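
The rebuild can be done with the REBUILD INDEX utility, for example (same tablespace as above):

//SYSIN    DD  *                                  
  REBUILD INDEX (ALL) TABLESPACE proddb00.prodts33
/*                                                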

Delete a backup file from RMAN

Yesterday I had a requirement to remove a backup file from the RMAN catalog because we did not want to recover to it. After some “research” on Google and in the RMAN documentation, I found the below method:

Connect to RMAN and issue the command:

list backup;

This will list all the backups cataloged in RMAN. Find the one that you want to “uncatalog”. For example, I wanted to remove the below file:

BS Key  Type LV Size       Device Type Elapsed Time Completion Time
------- ---- -- ---------- ----------- ------------ ---------------
27249   Full    3.75G      DISK        00:00:00     28-MAR-18      
        BP Key: 27249   Status: AVAILABLE  Compressed: NO  Tag: TAG20180328
        Piece Name: /backup/move/dean/PRODDBbtsuuqm6_1_1.dbc
  List of Datafiles in backup set 27249
  File LV Type Ckp SCN    Ckp Time  Name
  ---- -- ---- ---------- --------- ----
  120     Full 132807363910 29-MAR-18 +DG_ORA64/proddb/perfx_52.dbf

I then issued the below command in RMAN:

change backuppiece 27249 uncatalog;

The response was:

uncataloged backup piece
backup piece handle=/backup/move/dean/PRODDBbtsuuqm6_1_1.dbc RECID=27249 STAMP=972007418
Uncataloged 1 objects

The file was no longer part of the subsequent RECOVER and RESTORE commands.
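
Note that UNCATALOG only removes the piece from the RMAN repository; the file itself stays on disk. If the piece should be physically removed as well, DELETE is the command for that, e.g.:

RMAN> delete backuppiece 27249;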

ORA-31693 and ORA-06502 on an Import Datapump (IMPDP)

While importing data via an IMPDP job we encountered the error:

Import: Release 11.2.0.4.0 - Production on Tue Mar 20 13:59:30 2018
 
Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.
;;; 
Connected to: Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, Oracle Label Security,
OLAP, Data Mining and Real Application Testing options
Master table "SYS"."SYS_IMPORT_FULL_01" successfully loaded/unloaded
Starting "SYS"."SYS_IMPORT_FULL_01":  "/******** AS SYSDBA" directory=dpump dumpfile=dpump_data.dmp 
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
. . 
ORA-31693: Table data object "SCHEMA_NAME"."TABLE_NAME" failed to load/unload and is being skipped due to error:
ORA-06502: PL/SQL: numeric or value error
. . 
. . 
. . 
. . 
. . 
Job "SYS"."SYS_IMPORT_FULL_01" completed with 1 error(s) at Tue Mar 20 13:59:32 2018 elapsed 0 00:00:02

Per Oracle Support, this was documented in “Bug 18665004 – ORA-31693/ORA-6502 from Datapump import (impdp) into an existing table having a LONG or LONG RAW column (Doc ID 18665004.8)”. This table did have a “LONG” column.

We were not in a position to install the suggested fix and instead decided to try the workaround. We dropped the only index on the table and reran the import successfully. After the import, we recreated the index.
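
A sketch of the workaround, with a hypothetical index name (the real index was application-specific; capture its DDL before dropping, e.g. via dbms_metadata.get_ddl):

select dbms_metadata.get_ddl('INDEX', 'TABLE_NAME_IX1', 'SCHEMA_NAME') from dual;

drop index SCHEMA_NAME.TABLE_NAME_IX1;

-- rerun the impdp job, then recreate the index from the saved DDL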

ORA-16047: DGID mismatch

I was building a new standby, and after I enabled log_archive_dest_state_2 on the primary, the below errors were reported in the primary’s alert log:

ALTER SYSTEM SET log_archive_dest_state_2='ENABLE' SCOPE=BOTH SID='*';
Tue Mar 20 00:20:08 2018
Errors in file /oracle/home/trace/PRODDB_tt01_438.trc:
ORA-16047: DGID mismatch between destination setting and target database
Error 16047 for archive log file 6 to 'STDBYDB'
Tue Mar 20 00:20:08 2018
Errors in file /oracle/home/trace/PRODDB_tt01_438.trc:
ORA-16047: DGID mismatch between destination setting and target database
Tue Mar 20 00:20:08 2018
Errors in file /oracle/home/trace/PRODDB_tt01_438.trc:
ORA-16047: DGID mismatch between destination setting and target database
Tue Mar 20 00:20:08 2018
Errors in file /oracle/home/trace/PRODDB_arc1_8368.trc:
ORA-16047: DGID mismatch between destination setting and target database
Tue Mar 20 00:20:08 2018
PING[ARC1]: Heartbeat failed to connect to standby 'STDBYDB'. Error is 16047.

After a few minutes of consternation, I realized that I had an incorrect parameter on the standby:

SQL> show parameter db_unique_name;
 
NAME                                 TYPE        VALUE
db_unique_name                       string      PRODDB

Basically, I had inadvertently set the db_unique_name to PRODDB on both the primary and standby database. I changed the standby as follows:

alter system set db_unique_name=STDBYDB scope=spfile sid='*';

And bounced the standby. This resolved the issue.
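
The DGID check compares the DB_UNIQUE_NAME attribute of the primary’s log_archive_dest_2 with the db_unique_name of the database it connects to, so both ends can be verified quickly with:

-- On the primary
show parameter log_archive_dest_2
-- On the standby
show parameter db_unique_name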
