Making life hard for yourself.

April 21, 2013

Sometimes it seems like a developer has gone out of their way to make life difficult for themselves (and us poor long suffering DBAs). For example we have an audit trail of user actions in the database that contains xml. This almost makes sense except that the xml is stored in a blob, something like this for example….


As we have limited space these records are removed after 90 days but can be restored from the backups from up to ten years ago if required, however getting the data back is only part of the problem as we now need to read it. Being a helpful sort of chap I wrote a simple blob to clob function that allows us to get the xml out of the blob, what we get for the above is this:-

<field Name="I">DO</field>
<field Name="NOT">LIKE</field>
<field Name="DATA">PERvY3VtZW50Pg0KICA8ZmllbGQgTmFtZT0iQkVJTkciPkFibGUgdG8gcXVlcnkgbXkgZGF0YTwvZmllbGQ+DQo8L0RvY3VtZW50Pg==</filed>

So what is that DATA field in the middle?
Its a base64 encoded xml document, at this point I gave up and advised the developer that they would have to sort this out themselves as we have an xml document encoded into base64, put in a field in another xml document and then the whole thing is stored in a blob field. Attempting to get meaning out of this data via sql is going to be difficult and frustrating when we know exactly which record is required if someone wants to search the xml for a specific field then we are heading down the rabbit hole into madness especially as we have only attempted to decode a couple of entries so who knows what else is hiding in there.
So instead of shipping this data into the data warehouse we a stuck with backing it up onto tape and restoring it on request along with the front end tool that can read and display it, hopefully the development team won’t mind supporting ten years worth of front ends as well.

Remember to check the obvious.

December 14, 2012

Its easy to forget to check the obvious sometimes, a user complains that performance of a RAC cluster is appalling, a quick dive into OEm shows everything is spending forever in cluster waits, its easy to dive into sql performance tuning mode. Alternatively you can start looking for hot blocks or small tables being pinged back and forth across the interconnect. Luckily I noticed something odd in SAR first, my interconnect only transferring 1.2mb/s but everything still hung up on cluster waits a quick check of the network card:-

racnode1:~ # ethtool eth4
Settings for eth4:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 10Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000003 (3)

Link detected: yes

10mb/s? so much for the fancy switch management and alerting software, failing the bond to the other network link fixed things until a networking engineer can get to it and move the cable.

Sometimes its not a complex sql tuning task sometimes it is a database performance issue and not the fault of the developers.

Bind Variable mismatch, nulls and oci

November 24, 2011

A short post on a frustrating issue that took a while to find the solution to. The developers made a small change to a table allowing nulls in 23 columns previously configured as not null. This resulted in a large number of bind mismatches and child cursors being generated, eventually giving rise to a serious performance issue.

After much discussion of the issue test cases, being told to tune the database flushing the shared pool finally a developer opened the code and took a look at the problem and found the reason for the bind mismatch was the oci null indicator, when the column value is set null a value of -1 was being passed, when a value was present a null pointer was passed. The optimiser sees these as bind mismatches giving 2^23 combinations of bind settings. Patching the code to pass a zero when a value is present rather than a null pointer prevents the bind mismatch and returned everything to normal.

One other issue found was the test configuration, due to the data distribution in test not generating the same variance in nulls/values in the binds this problem wasn’t found until the production release making the assorted workarounds necessary.

How expensive are literals?

June 26, 2011

As a follow-up to my last post I finally got around to writing a small program to demonstrate the difference between bind variables and literals, the test is a simple c++ program the reads a text file and updates 27,000 rows in a single table. The code runs the updates in three different ways.

Method 1: Bulk bind 200 rows at a time into the update statement.

Method 2: Bind 1 row at a time

Method 3: Use literals in the updates.

Tkprof output from the test looks like this.

Method 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.01       0.01          0          0          0           0
Execute    271      5.44       6.05         23      28661      27706       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      273      5.46       6.07         23      28661      27706       27001
Method 2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute  27001     12.19      14.18          0      54015      27826       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    27003     12.19      14.18          0      54015      27826       27001
Method 3
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    27001     19.06      25.33          0          0          0           0
Execute  27001     54.99      66.41          0      54002      27706       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    54002     74.06      91.75          0      54002      27706       27001

From this simple test the literals uses 13.5 times as much CPU compared to the first method in cost terms if I need 2 x86 cores of CPU power to power my database for method 1 I need 27 cores for method 3 (assuming it would scale) in cost terms thats 1*£31,839 vs 14*£31,839 or £445,746. aside from the cost there is no way to improve the speed of method 3 and the code is harder to write as well.

Bind Variables, almost right.

June 1, 2011

Sometimes I get a bit sarcastic with and about developers, this is unfair as for the most part the ones I have worked closely have either been very good oracle developers or very good c++/java/VB6/whatever developers with a real interest in doing things the right way and have been receptive to constructive (but grumpy feedback).

This gem in a vendors application however brought out the sarcasm.

  column_1 = :1, column2 = :2, ..............
  column_125 = :125, column_126 = :126
   column_127 = 'GotBoredWithBindVariablesDidWe?';

I can’t work the logic out here, the code is already written to use bind variables it took extra code to stick a literal in there, as the application is written in c using the oci libraries these guys are supposed to be rocket surgeons. There isn’t even a justification based on skewed data column_127 is the primary key so every update updates one row and there is only one plan ever used. Added to that the design of the process is such that the application caches updates then send in 400 at a time , committing after each update just to make sure performance really suffers except for end of day processing where all 27,000 rows are updated at least there is only a single commit during that specific piece.

So far so bad, however the worst impact we see is because we are using Automatic Memory management the shared pool grows larger over time as it tries to cache all the execution plans. After three weeks we get to the point of other processes suffering form either not enough buffers or increases in mutex and library cache latch waits, the solution? restart the database every two weeks and the spfile to reset the values of __db_cache_size and __shared_pool_size. Yes we could use manual memory management but the vendors application compute guide specifies AMM changing to manual memory management is still being assessed by their performance team.

SQL Injection

February 22, 2011

Sometimes I wonder why I still find stuff like this, there are so many articles/blogs/documents about sql injection how does a developer come up with the following…

create or replace function row_count (p_table varchar2)
return number
v_str VARCHAR2(1000);
v_row NUMBER;
v_str := ‘select count(*) from ‘||p_table;
execute immediate v_str into v_row;
return v_row;

My simple explanation of why this is bad was that passing in the parameter ‘all_objects,all_objects,all_objects’ will result in a fairly nasty cartesian join and an attacker who knew the names of the database tables could potentially pick a really large table. Run enough copies and it starts to look like a denial of service attack, I’m pretty sure someone smarter than me could come up with something worse but I would rather not have to think how to defend from sql injection when simply doing it right (with technology from the 90’s possibly 80’s) is so much easier.

Flashback logging only has a 2% performance decrease…or not.

February 17, 2010

Firstly apologies to both people who have previously read my blog, things have been a little hectic over the past 12 months working away from home in the Netherlands has taken its toll on my spare time, and weekends are to be spent either with the family or working, things have settled into a more stable pattern now so here goes..

Some background, I’ve been assigned into a project to configure dataguard on a very busy production system performing a mixed bag of batch processing, messaging and DSS loads. Performance and availability of the system is critical for the business so everyone was happy that when flashback was proposed as an addition to the system they could quote this bit of marketing (search for 2% and you will see the bit I mean).  Great everyone was happy then i get moved onto the project to help with configuration and testing and start upsetting people..

Prior to starting the formal tests we have a fairly informal period where we have built the test databases configured the standby got the team used to data guard brokers foibles, failed over, rebuilt the primary using flash back database, performed some switch overs and switch backs, broken the configuration by an ill advised flashback on the primary fixed the configuration switched over a couple more times, then hit a bug (5937899) handily unpublished regarding flashback data guard and switch overs fortunately this was recoverable as we had set guaranteed restore points which is where our problems really got started.

As part of the test plan we have a set of processes to run, the plan is to run everything, measure performance of the network, host o/s, SAN generate awr reports etc and then flashback to our guaranteed restore point alter the configuration and re-run the same processes, giving us data we can compare e.g. no data guard, archive log shipping, lgwr async and finally lgwr sync. The plan was good right up until we tried some Micky Mouse testing with simple scripts creating tables by selecting from dba objects, delete insert commit really simple stuff, so first run through no data guard no flashback everything is good, enable  data guard configure flashback database set a guaranteed restore point, performance nosedives and top 5 wait events looks like

Top 5 Timed Events

Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
log file sync 2,055 557 271 35.7 Commit
CPU time 497 31.8
db file parallel write 33,547 437 13 28.0 System I/O
flashback buf free by RVWR 430 402 935 25.7 Other
log file parallel write 777 96 124 6.2 System I/O

This is not good, the log file sync time isn’t unexpected (in fact is not much greater than without dataguard or flashback on), but the flashback buf free by RVWR is not something we expected to see at this level.  Finding this really doesn’t bode well for the rest of the testing and the impact on production, so we start digging around, switch of flash back, configure a restore point with flashback logging not enabled. Until eventually I’m looking at the session wait events and the flash recovery area at the same time, and notice that all the sessions go into waiting on flashback buf free by RVWR at the same time and this coincides with a new flashback log being created. because of the load we have generated flashback logs are created ~500MB in size, every session we are running to generate load waits while this is happening.

Right so what is the real problem? well its the way we have set things up, with flash back logging enabled and no restore points configured flash back logs will get reused (how soon depends on db_flashback_retention_target) once the system hits a steady state it shouldn’t be creating new flashback logs just reusing old ones. In our test environment we enable flashback logging and configure a guaranteed restore point so we don’t ever reuse a flashback log we just create new ones this isn’t representative of the live system as we will not be setting guaranteed restore points, so our next step is try to find a way to enable flashback to a guaranteed point in test without inflicting the waits we have seen in our quick and dirty testing. testing on a small system the following seems to get around the problem.

Configure the test database and standby server.
Set flashback database on, set db_flashback_retention_target to a big number (even the default of 1440 will work)
execute one of our scripts to generate a lot of changes and wait until we have enough flash back logs (yes i know how many is enough?)
set a guaranteed restore point
set db_flashback_retention_target to a small number (<60)

what will then happen is the existing flashback logs will be reused removing the extreme waits we are getting waiting on new flashback log files being created while still allowing us to flashback and re-run the production like process tests against the same data.

The real point here isn’t about flash back logging or a specific process its about testing in our case we have delayed the start of the formal tests by a day but have saved time overall as we will not end up having to re-run tests that take 4-5 hours to complete, well at least not for this reason :),  in summary :-

validating your environment before you start time consuming and possibly expensive tests is a good idea

Small fairly crude tests can show up problems in the configuration shortening the test cycle.

Having time to mess about and quickly retry things is invaluable for DBA’s implementing features that haven’t been used by the team before.

Never, ever quote a performance figure pulled from marketing material no matter what the source is, always verify in your own environment.