Flashback logging only has a 2% performance decrease…or not.

Firstly, apologies to both people who have previously read my blog; things have been a little hectic over the past 12 months. Working away from home in the Netherlands has taken its toll on my spare time, and weekends were spent either with the family or working. Things have settled into a more stable pattern now, so here goes..

Some background: I’ve been assigned to a project to configure Data Guard on a very busy production system running a mixed bag of batch processing, messaging and DSS loads. Performance and availability of the system are critical for the business, so everyone was happy when flashback was proposed as an addition to the system, because they could quote this bit of marketing: http://www.oracle.com/technology/deploy/availability/htdocs/Flashback_Overview.htm (search for 2% and you will see the bit I mean). Great, everyone was happy. Then I get moved onto the project to help with configuration and testing, and start upsetting people..

Prior to starting the formal tests we had a fairly informal period in which we built the test databases, configured the standby, got the team used to the Data Guard broker’s foibles, failed over, rebuilt the primary using Flashback Database, performed some switchovers and switchbacks, broke the configuration with an ill-advised flashback on the primary, fixed the configuration and switched over a couple more times. Then we hit a bug (5937899, handily unpublished) involving flashback, Data Guard and switchovers. Fortunately this was recoverable, as we had set guaranteed restore points, which is where our problems really got started.

As part of the test plan we have a set of processes to run. The plan is to run everything, measure performance of the network, host O/S and SAN, generate AWR reports etc., then flash back to our guaranteed restore point, alter the configuration, and re-run the same processes. That gives us data we can compare: e.g. no Data Guard, archive log shipping, LGWR ASYNC and finally LGWR SYNC. The plan was good right up until we tried some Mickey Mouse testing with simple scripts creating tables by selecting from dba_objects, then delete/insert/commit, really simple stuff (sketched below the wait-event table). First run through, with no Data Guard and no flashback, everything is good. Enable Data Guard, configure Flashback Database, set a guaranteed restore point, and performance nosedives; the top 5 wait events look like:

Top 5 Timed Events

Event                         Waits   Time(s)  Avg Wait(ms)  % Total Call Time  Wait Class
log file sync                 2,055       557           271               35.7  Commit
CPU time                                  497                             31.8
db file parallel write       33,547       437            13               28.0  System I/O
flashback buf free by RVWR      430       402           935               25.7  Other
log file parallel write         777        96           124                6.2  System I/O
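
For reference, the Mickey Mouse load was nothing more sophisticated than something along these lines (an illustrative sketch, not our exact scripts; the table name and loop count are made up):

    -- crude load generator: churn out redo (and flashback data)
    create table load_test as select * from dba_objects;

    begin
      for i in 1 .. 100 loop
        delete from load_test;
        insert into load_test select * from dba_objects;
        commit;
      end loop;
    end;
    /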

This is not good. The log file sync time isn’t unexpected (in fact it is not much greater than without Data Guard or flashback on), but flashback buf free by RVWR is not something we expected to see at this level. Finding this really doesn’t bode well for the rest of the testing, or for the impact on production, so we start digging around: switch off flashback, configure a restore point with flashback logging not enabled, and so on. Eventually I’m looking at the session wait events and the flash recovery area at the same time, and notice that all the sessions go into waiting on flashback buf free by RVWR at the same time, and that this coincides with a new flashback log being created. Because of the load we have generated, flashback logs are created ~500MB in size, and every session we are running to generate load waits while this is happening.
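
Watching the two side by side is easy enough; something like the following (a sketch against the 10g dynamic views, run repeatedly while the load is active) shows the stalled sessions alongside the flashback logs as they appear:

    -- who is currently stuck waiting on the RVWR?
    select sid, seconds_in_wait
    from   v$session_wait
    where  event = 'flashback buf free by RVWR';

    -- how many flashback logs exist so far, and how big they are
    select log#, bytes, first_time
    from   v$flashback_database_logfile
    order  by first_time;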

Right, so what is the real problem? Well, it’s the way we have set things up. With flashback logging enabled and no restore points configured, flashback logs get reused (how soon depends on db_flashback_retention_target); once the system hits a steady state it shouldn’t be creating new flashback logs, just reusing old ones. In our test environment we enabled flashback logging and configured a guaranteed restore point, so we never reuse a flashback log, we just create new ones. This isn’t representative of the live system, as we will not be setting guaranteed restore points. So our next step is to find a way to enable flashback to a guaranteed point in test without inflicting the waits we have seen in our quick and dirty testing. Testing on a small system, the following seems to get around the problem:

Configure the test database and standby server.
Set flashback database on, and set db_flashback_retention_target to a big number (even the default of 1440 will work).
Execute one of our scripts to generate a lot of changes, and wait until we have enough flashback logs (yes, I know: how many is enough?).
Set a guaranteed restore point.
Set db_flashback_retention_target to a small number (<60).
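
In SQL*Plus terms the sequence looks roughly like this (a sketch; the restore point name and load script are placeholders, and on 10g the flashback on step needs the database mounted, not open):

    -- 1. enable flashback with a generous retention target (minutes)
    alter database flashback on;
    alter system set db_flashback_retention_target = 1440;

    -- 2. run the load until a decent pool of flashback logs exists
    --    (check v$flashback_database_logfile to see how many)
    @generate_load.sql

    -- 3. pin the data with a guaranteed restore point
    create restore point before_test guarantee flashback database;

    -- 4. drop the retention target so existing logs get reused
    alter system set db_flashback_retention_target = 30;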

What will then happen is that the existing flashback logs will be reused, removing the extreme waits on new flashback log files being created, while still allowing us to flash back and re-run the production-like process tests against the same data.
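
A quick way to confirm you are in the good state (again a sketch, using the 10g views): the guaranteed restore point should be visible, and FLASHBACK_SIZE should level off rather than grow continuously once the logs start being reused:

    -- is anything pinning the flashback logs?
    select name, guarantee_flashback_database, time
    from   v$restore_point;

    -- space used; a continuously growing flashback_size means
    -- new logs are still being created rather than reused
    select retention_target, flashback_size
    from   v$flashback_database_log;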

The real point here isn’t about flashback logging or a specific process, it’s about testing. In our case we have delayed the start of the formal tests by a day but have saved time overall, as we will not end up having to re-run tests that take 4-5 hours to complete, well, at least not for this reason :). In summary:

Validating your environment before you start time-consuming and possibly expensive tests is a good idea.

Small, fairly crude tests can show up problems in the configuration, shortening the test cycle.

Having time to mess about and quickly retry things is invaluable for DBAs implementing features that haven’t been used by the team before.

Never, ever quote a performance figure pulled from marketing material, no matter what the source is; always verify in your own environment.


2 Responses to Flashback logging only has a 2% performance decrease…or not.

  1. OracleMagician says:

    Have seen huge performance degradation similar to what you described, when Oracle is not reusing flashback logs. Massive hit on performance (about 4x).

    However, I see nothing in Oracle support/Metalink that hints at this issue. Perhaps because, like you state, the problem disappears when “steady state” is reached.

    Would like to hear others’ experience on this.

    • randomdba says:

      One thing that has occurred to me is that this may be a problem with async I/O not being enabled. The environment I was working on was using Veritas filesystems, but without either Quick I/O or the Storage Foundation for Oracle. Unfortunately I’m not onsite anymore so I can’t test this one out, but it’s worth a look. It’s on my list of things to try on my home server environment.
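
      For anyone wanting to check the same thing on their own system, the I/O settings are visible from SQL*Plus (a sketch; values and their meaning vary by platform):

          -- what I/O code path was the instance started with?
          show parameter filesystemio_options
          show parameter disk_asynch_io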
