CRS 10g Diagnostic Collection Guide
===================================
This document contains guidelines for collecting diagnostic information from a CRS installation.
The diagnostics listed in the sections below are needed by development to help resolve TARs,
bugs, and other problems that arise in the field.
CRS
Collect all CRS log files, trace files and core dumps from the following three directory trees, on every node in the cluster.
Use a tool such as tar, gzip or some equivalent method. Note that some of the files under these directories are binary files
and as such will not transfer well unless they are zipped.
The following commands will accomplish this on a Unix system. First log in as the root user:
HOST=`hostname`
# Clusterware home: daemon logs, init output files and core dumps
cd $CRS_HOME
tar cvf crsData_${HOST}.tar css/log css/init crs/log crs/init evm/log evm/init srvm/log racg/dump log
gzip crsData_${HOST}.tar
# RDBMS home: RACG traces and hdump directories
cd $ORACLE_HOME
tar cvf oraData_${HOST}.tar racg/dump admin/*/hdump
gzip oraData_${HOST}.tar
# Oracle base: hdump directories
cd $ORACLE_BASE
tar cvf basData_${HOST}.tar admin/*/hdump
gzip basData_${HOST}.tar
If the CRS stack does not start after root.sh has been run, check the following:
/tmp/crsctl.
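As a quick check, list these files and review the most recent one (a minimal sketch; it assumes the output files carry a suffix after the dot, which varies by release):
# list CRS startup output files, oldest first
ls -ltr /tmp/crsctl.*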
Another way to get all the trace files packaged is to use $CRS_HOME/bin/diagcollection.pl (see the TRACE COLLECTION section at the end of this document).
OCR
Check for the existence of the ocr.loc file and collect it. On Linux this file is in /etc/oracle/ and
on Solaris it is in /var/opt/oracle.
The content should be:
ocrconfig_loc=
local_only=FALSE
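For example, on Linux (a sketch; the device path recorded in your file will differ):
# display the OCR location file (use /var/opt/oracle/ocr.loc on Solaris)
cat /etc/oracle/ocr.loc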
Perform an ocrdump as root so we can access the configuration data. This is done as follows:
cd $CRS_HOME/bin
ocrdump ${HOST}_OCRDUMP
This is an ASCII file. It is useful to repeat this on every node, in case one of the nodes is configured incorrectly
and is referring to a different OCR device.
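Because the dumps are plain ASCII, they can be compared directly across nodes; for example (hypothetical file names from two nodes):
# any difference may indicate a node referring to the wrong OCR device
diff node1_OCRDUMP node2_OCRDUMP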
Run $CRS_HOME/bin/ocrconfig -showbackup to find out the location of the most recent backup. Go to the host in question,
and back up the file. The default location is $CRS_HOME/cdata/
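A minimal sketch of collecting the most recent automatic backup (the cluster name and backup file name below are placeholders; take the actual path from the -showbackup output):
cd $CRS_HOME/bin
./ocrconfig -showbackup
# copy the newest backup reported above; the path below is a placeholder
cp $CRS_HOME/cdata/<cluster_name>/backup00.ocr /tmp/ocrBackup_${HOST}.ocr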
CSS
In all cases the following data will be required from each of the configured nodes:
CSS daemon logs, found in $CRS_HOME/css/log; both ocssd*.log and ocssd*.blg
CSS daemon init output files, found in the $CRS_HOME/css/init directory
CSS daemon startup files (if any), e.g. /etc/init.d/init.cssd; also /etc/inittab and /etc/hosts for system configuration data
Stack traces of any relevant core files found in a subdirectory of the $CRS_HOME/css/init directory.
Some core files may be old and irrelevant; compare the core file timestamp to the time of the error.
The stacks of all threads will be required.
Startup output files: /tmp/crsctl.
CRSD
The relevant data is in the following files. You may want to examine these files for information about the problem.
CRS daemon logs: $CRS_HOME/crs/log/
CRS daemon core files: $CRS_HOME/crs/init/
PSTACKs or GCOREs or other available stack information for running daemons.
This is useful in situations where crs commands are hanging.
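A minimal sketch for a Unix system (the grep pattern and <pid> are illustrative; pstack is available on Solaris and most Linux distributions):
# locate the daemon process
ps -ef | grep crsd | grep -v grep
# capture the stacks of all threads without stopping the process
pstack <pid>
# optionally write a core image of the running process
gcore <pid>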
EVMD
The relevant data is in the following files. You may want to examine these files for information about the problem.
EVM daemon logs: $CRS_HOME/evm/log/
EVM daemon init output files: $CRS_HOME/evm/init/
EVM event log archives for relevant timeframes: $CRS_HOME/evm/log/
On at least one occasion in the past, EVM logs have been corrupted in transit. To avoid this, you can produce text versions of
the event traces with this command:
$CRS_HOME/bin/evmshow -t "@timestamp @@"
or
$CRS_HOME/bin/evmshow -t "@timestamp [@priority] @name"
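For example, to convert one event log to plain text before transfer (a sketch; the event log file name is a placeholder):
# render a binary EVM event log as text so it survives transfer intact
$CRS_HOME/bin/evmshow -t "@timestamp @@" <event_log_file> > evmlog_${HOST}.txt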
For more detailed information on how to read the EVM log files, please see Note 279165.1.
CORE DUMPS
Core dumps may be found in multiple locations. For each core dump that occurred at a relevant time, obtain a stack trace for each thread.
$CRS_HOME/css/init/ may contain CSS daemon core dumps.
$CRS_HOME/crs/init/${HOST}/ may contain CRS daemon (crsd) core dumps.
$CRS_HOME/evm/init may contain EVM daemon core dumps.
$CRS_HOME/bin may contain racgmain or racgimon core dumps.
$ORACLE_HOME/bin may contain racgmain or racgimon core dumps.
To be thorough, you may wish to use:
cd $CRS_HOME ; find . -name "*core*"
cd $ORACLE_HOME ; find . -name "*core*"
The "strings" command may be useful to determine the pathname of the file that generated the core dump. The command
"strings core | more" typically prints the pathname of the binary in the first few lines.
A command like "gdb $CRS_HOME/bin/crsd core" can be used to examine the core dump. The "bt" command will generate a stack trace.
To switch threads, list them with "info threads", select one with a command like "thread 2", and then use the "where" command to dump the stack of that thread.
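Since the stacks of all threads are required, a non-interactive invocation can capture them in one pass (a sketch, assuming gdb is available and the core was produced by crsd):
# dump every thread's stack from the core file into a text file
gdb --batch -ex "thread apply all bt" $CRS_HOME/bin/crsd core > crsd_core_stacks.txt 2>&1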
In some cases it is desirable to obtain PSTACKs or GCOREs or other available stack information for running CRS and CSS daemons.
This is particularly useful when operations are hanging.
NETWORK
In some non-default network diagnostic modes, additional trace files will be created in the following locations.
CSS NETWORK tracing: $CRS_HOME/css/log/ocsns*.log
Other NETWORK tracing: $CRS_HOME/log/clscns*.trc
CLSC tracing: $CRS_HOME/log/clsc*.trc
OPROCD
Oprocd is another daemon, frequently run in configurations without vendor clusterware.
If the $CRS_HOME/log directory has been created, oprocd will write a small amount of diagnostics there.
OPROCD log: $CRS_HOME/log/oprocd*.log
RACX
Collect RACX trace files on all configured nodes. The files are in the following directories
(some directories may not exist in some configurations):
$CRS_HOME/racg/dump
$ORACLE_HOME/racg/dump
hdump directories under $ORACLE_HOME/admin and $ORACLE_BASE/admin
On UNIX platforms, collect any core files found in the following directories:
$CRS_HOME/bin
$ORACLE_HOME/bin
RDBMS
Data collection should include data from all instances, not just the instance that experienced the problem.
In all cases the alert logs and background process trace files from all instances will be required.
If the problem is experienced by one or more foreground processes, the trace files of those foregrounds will also be required.
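A minimal sketch for packaging these, assuming the default 10g layout in which alert logs and background traces are written to bdump and foreground traces to udump:
cd $ORACLE_BASE
tar cvf rdbmsData_${HOST}.tar admin/*/bdump admin/*/udump admin/*/cdump
gzip rdbmsData_${HOST}.tar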
NETWORK AND VIP
The following information is required from all configured nodes (a collection sketch follows the list):
The NIC configuration: this varies by platform; on Solaris it is the output of the "ifconfig -a" command, on Linux the output of the "ifconfig" command.
Information from /etc/hosts or equivalent data.
Information about the various NICs and how they are configured.
netstat -ia / netstat -i for the customer network information.
netstat -r
/usr/sbin/traceroute 172.16.1.31 (substitute the private interconnect IP)
/usr/bin/find /etc/rc.d -type l
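A minimal sketch that gathers this output into one file on Linux (adjust the commands per platform; the interconnect IP is a placeholder):
# collect network configuration and routing data in one pass
(
  ifconfig
  cat /etc/hosts
  netstat -i
  netstat -r
  /usr/sbin/traceroute <private_interconnect_ip>
  /usr/bin/find /etc/rc.d -type l
) > netinfo_${HOST}.txt 2>&1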
CLUSTER
Please collect information on any vendor clusterware in use. The version of the clusterware, the configuration it sees, and the output of its diagnostic tools will be useful.
In some cases it is useful to know whether any resources are being heavily consumed. Normally, DBAs and system administrators track at least disk, RAM,
and CPU utilization via a process like sar, or by periodically running commands like "df" or "ps -leaf". If possible, this type of information should
be made available from around the time of the event and earlier. (Historical information is useful so that we can attempt to correlate the failure
with spikes in resource consumption.) The exact data to collect will depend on the normal system administration processes. The types
of questions we want to answer are listed below, with a simple capture sketch following the list:
Did we run out of disk space?
Did we run out of memory to allocate?
Were too many processes running preventing additional processes from spawning?
Were too many file descriptors open in a single process or system wide?
Was the cpu being heavily used?
Were I/O rates extremely high?
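Where no monitoring is already in place, even a simple periodic capture helps answer these questions (a minimal sketch; the interval and command set are illustrative):
# append basic resource usage every 5 minutes; stop with Ctrl-C
while true
do
  date
  df -k
  vmstat 1 2
  ps -elf | wc -l
  sleep 300
done >> sysres_${HOST}.log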
TRACE COLLECTION 11.1 ONWARDS
From 10.2 and 11.1 onwards, we recommend using $CRS_HOME/bin/diagcollection.pl to collect all Clusterware-related log files.
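A minimal sketch of running it (as root; option names vary slightly between releases, so check the script's usage output first):
# gathers Clusterware logs into archive files in the current directory
cd /tmp
$CRS_HOME/bin/diagcollection.pl --collect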