How to analyze statspack or AWR report.

http://oraperf.sourceforge.net/seminar/sevt_anal.html

A typical waits event section of a report.txt looks like

SQLDBA> Rem System wide wait events.

SQLDBA> select n1.event “Event Name”,

2> n1.event_count “Count”,

3> n1.time_waited “Total Time”,

4> (n1.time_waited/n1.event_count) “Average Time”

5> from stats$event n1

6> where n1.event_count > 0

7> order by n1.time_waited desc;

Event Name Count Total Time Average Time

————————— ———— ———— ————

client message 1770692 26676762 15.065726846

rdbms ipc message 7626 782385 102.59441385

pmon timer 535 158481 296.22616822

smon timer 7 146587 20941

db file sequential read 83662 79166 .94625995075

db file scattered read 5224 26120 5

db file parallel write 6982 6848 .98080779146

inactive session 60 6060 101

latch free 15653 4442 .28377946719

enqueue 28 2268 81

log file sync 3638 2245 .61709730621

log file parallel write 3892 2004 .51490236382

rdbms ipc reply 19 1813 95.421052632

control file parallel write 843 850 1.0083036773

log file space/switch 5 236 47.2

db file single write 726 210 .28925619835

control file sequential rea 1196 22 .01839464883

library cache pin 24 10 .41666666667

buffer busy waits 28 6 .21428571429

library cache load lock 2 5 2.5

write complete waits 5 5 1

log file sequential read 1 2 2

log file single write 2 0 0

23 rows selected.

SQLDBA>

(To see and interpretation of all the wait events possible see: wait events )

We can immediately simplify this list by eliminating the wait events that are not related to users

Event Name Count Total Time Average Time

————————— ———— ———— ————

client message 1770692 26676762 15.065726846

rdbms ipc message 7626 782385 102.59441385

pmon timer 535 158481 296.22616822

smon timer 7 146587 20941

db file sequential read 83662 79166 .94625995075

db file scattered read 5224 26120 5

db file parallel write 6982 6848 .98080779146

inactive session 60 6060 101

latch free 15653 4442 .28377946719

enqueue 28 2268 81

log file sync 3638 2245 .61709730621

log file parallel write 3892 2004 .51490236382

rdbms ipc reply 19 1813 95.421052632

control file parallel write 843 850 1.0083036773

log file space/switch 5 236 47.2

db file single write 726 210 .28925619835

control file sequential rea 1196 22 .01839464883

library cache pin 24 10 .41666666667

buffer busy waits 28 6 .21428571429

library cache load lock 2 5 2.5

write complete waits 5 5 1

log file sequential read 1 2 2

log file single write 2 0 0

The above output gives a profile of the “bottleknecks” in the database. We can elemenate most of the lines because they are usually irrelevant. For example:

grey colored lines are idle events. The processes are idle and waiting for work to do.

violete colored lines are background waits, and not users shadow process waits, thus, at least in the first analysis, we can ignore them. We are interested in why users have to wait and not why the background process (pmon,smon,lgwr, dbwr) have to wait

red colored lines are the events that users have to wait for.

To see a list of wait events broken down by type see: wait events by type

This leaves us with:

Event Name Count Total Time Average Time

————————— ———— ———— ————

db file sequential read 83662 79166 .94625995075

db file scattered read 5224 26120 5

latch free 15653 4442 .28377946719

enqueue 28 2268 81

log file sync 3638 2245 .61709730621

rdbms ipc reply 19 1813 95.421052632

log file space/switch 5 236 47.2

library cache pin 24 10 .41666666667

buffer busy waits 28 6 .21428571429

library cache load lock 2 5 2.5

write complete waits 5 5 1

The TIME_WAITED is measured in 100ths of seconds.

The basic easily tunable events are:

buffer_busy_waits – rbs problems, or need free list, or maybe just I/O throughput

free_buffer_waits – dbwr not working hard enough or buffer cache too small

write_complete_waits – dbwr maybe working too hard

db_file_scattered_read – read (full table scan) disk wait, buffer cache too small

db_file_sequential_read – rowid (index) block read, read disk wait, buffer cache too small

enqueue – lock waits

latch_free – waits for latch, see v$latch to see which latch exactly

log_buffer_space – log buffer needs to be larger

log_file_space_switch – log buffer space wait or log file switch wait (pre 7.3)

log_file_switch_checkpoint_incomplete – redo logs too small or not enough redo logs

log_file_switch_completion – wait for log file while switching log file

log_file_sync – write to log file wait, raw devices are faster for log files (no RAID5!!)

So by far the biggest waits from the example above are on

db file scattered read

db file sequential read

and the sum of every time any users ever waited for this event during the period of bstat/estat is

This leaves us with:

Event Name Count Total Time Average Time

————————— ———— ———— ————

db file sequential read 83662 79166 .94625995075

db file scattered read 5224 26120 5

= 79166 + 26120

= 105286 hundreths of second

= 1052 seconds

= 17.5 minutes

Now the question, so is 17.5 minutes significant? Well thats harder to decide.

Probably the easiest thing to do is look at the statistic

CPU used by this session

in the statistic section of the bstat/estat report . Compare the ratio of the CPU time to the WAIT time.

for the lovers of generic rules:

CPU > 2 * WAIT : probably ok

CPU > WAIT : probably some tuning to do

CPU < WAIT : problem area
Another way to look at the relative importance of the wait events is to look at how many users were connected during the bstat/estat to figure out the average time waited by users., First one needs to know over what period of time the bstat/estat was run. This is given at the bottom of the report.txt … time period of stats

Second, one needs to know how many users were working (thus possible waiting for these event). The file report.txt doesn’t really help here. It does give how many users logged on during the period, but not how many were connected. You can run the following for the number of users connected:

select value from v$sysstat s, v$statname n where s.statistic#=n.statistic# and n.name=’logons current’;

OK, say you know over what period the bstat/estat was taken and you know roughly how many users were connected, then you can get some idea of whether the wait time is important. For example:

stat time 18 minutes with one user ==> probably needs tuning

stat time 18 hours with 1000 users ==> probaly not a big deal

The importance of the time waited is relative to the time we spent working. So in order to determine the importance of the wait events we need to know how much time the users spent active or working (and not idle). There are two ways to determine this

1. assume active time is given by “CPU used by this session” statistic, which is actually CPU used by all users in v$sysstat table

2. assume active time is ( connect time – idle time). This is difficult because we not only need to know how many process were connected for but also the idle time. THe idle time comes from “client message” and “rdbms ipc message” in both the table “v$session_waits” and “v$system_event” which is not exactly obvious

The first method I think is probably the best and easiest to use. The stat “CPU used by this session” is the sum of time spent by all Oracle processes using the CPU measured in 100ths of a second. So:

wait_time/(wait_time+cpu_time)

So assuming user active time = wait_time+cpu_time, then the amount waited So for generic rules lovers given the above formula, if the result is:

< .5 : probably ok
> .5 and < 1 : probably some tuning to do
> 1 : problem area

Related Posts

Leave a Reply Cancel reply