PURPOSE
=======
This note describes how to diagnose and resolve a hung RMAN job.
SCOPE & APPLICATION
===================
Anyone involved in running RMAN jobs
Resolving an RMAN Hung Job
==========================
Components of an RMAN Session
The nature of an RMAN session depends on the operating system. In UNIX,
an RMAN session has the following processes associated with it:
– The RMAN process itself.
– The catalog connection to the recovery catalog database, if a recovery
catalog is in use; otherwise there is no catalog connection.
– The connection to the target database, also called the default channel.
– A polling connection to the target database, used for RPC testing. There is
one polling connection for each distinct connect string used in the allocate
channel command. By default, allocate channel has no connect string, so there
is only one polling connection.
– One target connection to the target database corresponding to each
allocated channel.
Process Behavior During a Hung Job
RMAN usually hangs because one of the channel connections is waiting in the
media manager code for a tape resource. The catalog connection and the default
channel seem to hang because they are waiting for RMAN to tell them what to do.
Polling connections seem to be in an infinite loop while polling the RPC under
the control of the RMAN process.
If you kill the RMAN process itself, then you also kill the catalog connection,
the default channel, and the polling connections. Target connections that are not
hung in the media manager code also terminate: only the target connection executing
in the media management layer remains active. You must manually kill this process
because terminating its session does not kill it. Even after termination, the media
manager may keep resources busy or continue processing because it does not realize
that the Oracle process is gone. This behavior is media manager-dependent.
Terminating the catalog connection does not cause RMAN to finish because RMAN is
not performing catalog operations. Removing the default channel or a polling
connection causes the RMAN process to detect that one of the channels has died and
proceed to exit. In this case, the connections to the hung channels remain active,
as described above.
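The termination rules described above can be summarized as a small model. This is only an illustrative sketch in Python, not Oracle code: the Conn class, the survivors function, and the process names (poll, t1, t2) are hypothetical, and the model assumes RMAN exits whenever any connection other than the catalog connection is killed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conn:
    name: str
    kind: str  # "rman", "catalog", "default", "polling", or "channel"
    in_media_manager: bool = False  # only meaningful for "channel"


def survivors(conns, killed):
    """Names of the processes still alive after `killed` is terminated."""
    victim = next(c for c in conns if c.name == killed)
    others = [c for c in conns if c.name != killed]
    # Killing the catalog connection does not make RMAN finish.
    if victim.kind == "catalog":
        return [c.name for c in others]
    # In every other case RMAN exits (killed directly, or after detecting a
    # dead connection); only channels inside the media manager stay behind.
    return [c.name for c in others if c.kind == "channel" and c.in_media_manager]


# Hypothetical session: channel t1 is stuck in the media manager, t2 is not.
session = [
    Conn("rman", "rman"),
    Conn("catalog", "catalog"),
    Conn("default", "default"),
    Conn("poll", "polling"),
    Conn("t1", "channel", in_media_manager=True),
    Conn("t2", "channel"),
]

for target in ("rman", "catalog", "default", "t1"):
    print(target, "->", survivors(session, target))
```

Any process name left in the result for a non-catalog kill is one that must be killed manually at the operating system level.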
Terminating an RMAN Session
The best way to terminate RMAN when the connections for the allocated channels
are hung in the media manager is to kill the Oracle process of the connections.
The RMAN process detects this termination and proceeds to exit, removing all
connections except target connections that are still operative in the media
management layer. The caveat about the media manager resources still applies
in this case.
To identify and terminate an Oracle process that is hung in the media manager code:
This procedure is system-specific. See your operating system-specific documentation
for the relevant commands.
1. Obtain the current stack trace for the desired process id using a system-specific
utility. For example, on Sun Solaris you can use the command pstack located in
/usr/proc/bin to obtain the stack.
2. After the stack is obtained, look for the process with SBTxxxx (normally sbtopen)
as one of its top calls. Note that other layers may appear on top of it.
3. Obtain the stack again after a few minutes. If the same stack trace is returned,
then you have identified the hung process.
4. Kill the hung process using a system-specific utility. For example,
on Sun Solaris execute a kill -9 command.
5. Repeat this procedure for all hung channels in the media management code.
6. Check that the media manager also clears its processes, otherwise the next
backup or restore may still hang due to the previous hang. In some media
managers, the only solution is to shut down and restart the media manager
daemons. If the documentation from the media manager is unhelpful, ask the
media manager technical support for the correct solution.
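Steps 1 through 3 amount to comparing two stack samples taken a few minutes apart. The following Python sketch shows one way to automate that comparison on stack text you have already captured (for example, saved pstack output). The function names, the depth of five frames, and the sample stacks are assumptions made for illustration; only sbtopen comes from the procedure above, and real stack output will look different on each platform.

```python
import re

# An sbt* call (sbtopen and similar), allowing leading underscores.
SBT_CALL = re.compile(r"(?<![A-Za-z0-9])sbt\w+", re.IGNORECASE)


def frames(stack_text):
    """Split captured stack output into non-empty frame lines, top first."""
    return [line.strip() for line in stack_text.splitlines() if line.strip()]


def in_media_manager(stack_text, depth=5):
    """True if an sbt* call appears among the top `depth` frames."""
    return any(SBT_CALL.search(f) for f in frames(stack_text)[:depth])


def looks_hung(first, second, depth=5):
    """True if both samples are identical and sit inside an sbt* call."""
    return frames(first) == frames(second) and in_media_manager(first, depth)


# Made-up stacks; every function name except sbtopen is hypothetical.
HUNG = """\
 ef5b2210 hypothetical_mm_wait (4c3e00, 0)
 0012ab40 sbtopen  (efffe2c0, 4c1000, 1)
 0011f0c8 hypothetical_oracle_caller (0, 0)
"""
MOVING = HUNG.replace("hypothetical_mm_wait", "hypothetical_mm_write")
NOT_MML = " 0011f0c8 hypothetical_oracle_caller (0, 0)\n"

print(looks_hung(HUNG, HUNG))
print(looks_hung(HUNG, MOVING))
```

A process whose two samples differ is still making progress and should not be killed on this evidence alone.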
Backup Job Is Hanging
In this scenario, an RMAN backup job starts as normal and then pauses inexplicably:
Recovery Manager: Release 8.1.5.0.0 - Production
RMAN-06005: connected to target database: TORPEDO
RMAN-06008: connected to recovery catalog database
RMAN> run {
2> allocate channel t1 type "SBT_TAPE";
3> backup
4> tablespace system,users; }
RMAN-03022: compiling command: allocate
RMAN-03023: executing command: allocate
RMAN-08030: allocated channel: t1
RMAN-08500: channel t1: sid=16 devtype=SBT_TAPE
RMAN-03022: compiling command: backup
RMAN-03023: executing command: backup
RMAN-08008: channel t1: starting datafile backupset
RMAN-08502: set_count=15 set_stamp=338309600
RMAN-08010: channel t1: including datafile 2 in backupset
RMAN-08010: channel t1: including datafile 1 in backupset
RMAN-08011: channel t1: including current controlfile in backupset
# Hanging here for 30 min now
Diagnosis of the Cause
If a backup job is hanging, that is, not proceeding, then several scenarios
are possible:
– The job abnormally terminated.
– A server-side or media management error occurred.
– RMAN is waiting for an event such as the insertion of a new cassette into
the tape device.
Your first task is to try to determine which of these scenarios is the most
likely cause.
To determine the cause of the hang:
1. If you are using a media manager, examine media manager process, log, and trace
files for signs of abnormal termination or other errors (see the description of
message files in “Identifying Types of Message Output”). If this information is
not helpful, proceed to the next step.
2. Restart RMAN and turn on debugging, making sure to specify a trace file to
contain the output. For example, enter:
% rman target / catalog rman/rman@catdb debug trace = /oracle/log
3. Re-execute the job:
run {
allocate channel c1 type 'sbt_tape';
backup tablespace system;
}
4. Examine the debugging output to determine where RMAN is hanging.
The output will most likely indicate that the last RPC sent from the
client to the server was SYS.DBMS_BACKUP_RESTORE.BACKUPPIECECREATE,
which is the call that causes the server to interact with the media
manager to write the backup data:
krmxrpc: xc=6897512 starting long running RPC #13 to target: DBMS_BACKUP_RESTORE.BACKUPPIECECREATE
krmxr: xc=6897512 started long running rpc
5. Check to see what the server processes performing the backup are doing.
How many processes are hanging? If only one, check to see what it is doing
by querying V$SESSION_WAIT. For example, to determine what the session with
SID 12 is doing, enter:
SELECT * FROM v$session_wait WHERE wait_time = 0 AND sid = 12;
6. If a backup to tape stalls at the beginning, issue the query for your release:
SELECT * FROM v$session_longops WHERE compnam = 'dbms_backup_restore'; --> for 8.0
SELECT * FROM v$session_longops WHERE substr(opname,1,4)='RMAN'; --> for 8.1 & 9.0
If Oracle returns no information, then the PL/SQL program performing the backup
is hung.
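For step 4, a large debug trace can be scanned for the last long-running RPC instead of being read by hand. The Python sketch below assumes the trace contains lines shaped like the krmxrpc sample shown in that step, possibly wrapped after the package name. The function name and the returned dictionary layout are hypothetical, and the rest of the trace format is not assumed.

```python
import re

# Matches the "starting long running RPC" line from the sample in step 4.
RPC_START = re.compile(
    r"krmxrpc: xc=(\d+) starting long running RPC #(\d+) to target:\s*([\w.$]+)"
)


def last_long_running_rpc(trace_text):
    """Return details of the last long-running RPC started, or None."""
    # Rejoin a call name that was wrapped after the trailing "." of the package.
    joined = re.sub(r"(to target:\s*[\w$.]*\.)\s*\n\s*", r"\1", trace_text)
    matches = RPC_START.findall(joined)
    if not matches:
        return None
    xc, number, call = matches[-1]
    return {"xc": int(xc), "rpc": int(number), "call": call}


# The two trace lines quoted in step 4, in their wrapped form.
SAMPLE = """\
krmxrpc: xc=6897512 starting long running RPC #13 to target: DBMS_BACKUP_RESTORE.
BACKUPPIECECREATE
krmxr: xc=6897512 started long running rpc
"""

print(last_long_running_rpc(SAMPLE))
```

If the reported call is BACKUPPIECECREATE, the server is most likely waiting on the media manager, which points back to the media manager logs and the stack checks described earlier.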
Solution
Because a hung backup job can have many causes, the solutions vary as well.
The best practice is to look for the simplest solutions first. For example,
it is quite common for backup jobs to hang simply because the tape device has
completely filled the current cassette and is waiting for a new tape to be
inserted. Look for the obvious in all components used for the backup when
problems occur.