Making life hard for yourself.

April 21, 2013

Sometimes it seems like a developer has gone out of their way to make life difficult for themselves (and for us poor, long-suffering DBAs). For example, we have an audit trail of user actions in the database that contains XML. This almost makes sense, except that the XML is stored in a blob, something like this for example….


As we have limited space, these records are removed after 90 days, but they can be restored from backups going back up to ten years if required. Getting the data back is only part of the problem, however, as we then need to read it. Being a helpful sort of chap I wrote a simple blob-to-clob function that lets us get the XML out of the blob; for the record above we get this:-

<field Name="I">DO</field>
<field Name="NOT">LIKE</field>
<field Name="DATA">PERvY3VtZW50Pg0KICA8ZmllbGQgTmFtZT0iQkVJTkciPkFibGUgdG8gcXVlcnkgbXkgZGF0YTwvZmllbGQ+DQo8L0RvY3VtZW50Pg==</field>

So what is that DATA field in the middle?

It's a base64-encoded XML document. At this point I gave up and advised the developer that they would have to sort this out themselves: we have an XML document encoded into base64, put into a field in another XML document, and then the whole thing stored in a blob field. Attempting to get meaning out of this data via SQL is going to be difficult and frustrating even when we know exactly which record is required; if someone wants to search the XML for a specific field then we are heading down the rabbit hole into madness, especially as we have only attempted to decode a couple of entries, so who knows what else is hiding in there.

So instead of shipping this data into the data warehouse we are stuck with backing it up onto tape and restoring it on request, along with the front-end tool that can read and display it. Hopefully the development team won't mind supporting ten years' worth of front ends as well.
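To show just how many layers there are, here is a minimal Python sketch that peels the onion: parse the outer XML, base64-decode the DATA field, then parse the result as XML all over again. The outer fragment is wrapped in a hypothetical root element so it parses as a single document; tag and attribute names are taken from the example above.

```python
import base64
import xml.etree.ElementTree as ET

# The outer document as it comes back from the blob-to-clob conversion,
# wrapped in a root element for parsing.
outer = """<Document>
<field Name="I">DO</field>
<field Name="NOT">LIKE</field>
<field Name="DATA">PERvY3VtZW50Pg0KICA8ZmllbGQgTmFtZT0iQkVJTkciPkFibGUgdG8gcXVlcnkgbXkgZGF0YTwvZmllbGQ+DQo8L0RvY3VtZW50Pg==</field>
</Document>"""

root = ET.fromstring(outer)
data = next(f.text for f in root.iter("field") if f.get("Name") == "DATA")

# Second pass: the field value is itself a base64-encoded XML document.
inner = ET.fromstring(base64.b64decode(data).decode("utf-8"))
for f in inner.iter("field"):
    print(f.get("Name"), "->", f.text)   # BEING -> Able to query my data
```

Doing the equivalent double unwrap inside a SQL query, for every row, is exactly the rabbit hole described above.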


Remember to check the obvious.

December 14, 2012

It's easy to forget to check the obvious sometimes. A user complains that performance of a RAC cluster is appalling, a quick dive into OEM shows everything spending forever in cluster waits, and it's easy to drop straight into SQL performance tuning mode. Alternatively you can start looking for hot blocks or small tables being pinged back and forth across the interconnect. Luckily I noticed something odd in sar first: my interconnect was only transferring 1.2 MB/s but everything was still hung up on cluster waits. A quick check of the network card:-

racnode1:~ # ethtool eth4
Settings for eth4:
        Supported ports: [ TP ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 10Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: umbg
        Wake-on: g
        Current message level: 0x00000003 (3)
        Link detected: yes

10 Mb/s? So much for the fancy switch management and alerting software. Failing the bond over to the other network link fixed things until a network engineer can get to it and move the cable.
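Rather than waiting to notice this in sar, the "obvious" check is easy to script across all cluster nodes. A small sketch that parses `ethtool` output and flags a slow link (the interface name and the 1000 Mb/s threshold are assumptions for illustration):

```python
import re

def link_speed_mb(ethtool_output: str) -> int:
    """Extract the negotiated speed (in Mb/s) from `ethtool <dev>` output."""
    m = re.search(r"Speed:\s*(\d+)Mb/s", ethtool_output)
    if m is None:
        raise ValueError("no Speed line found - link may be down")
    return int(m.group(1))

# Abbreviated sample of the output shown above.
sample = """Settings for eth4:
        Speed: 10Mb/s
        Duplex: Full
        Auto-negotiation: on
        Link detected: yes"""

speed = link_speed_mb(sample)
if speed < 1000:
    print(f"WARNING: interconnect at {speed} Mb/s - check cabling/negotiation")
```

Run against each node's interconnect interface, this would have flagged the mis-negotiated card long before the cluster waits piled up.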

Sometimes it's not a complex SQL tuning task; sometimes it is a genuine database performance issue, and not the fault of the developers.

Bind Variable mismatch, nulls and oci

November 24, 2011

A short post on a frustrating issue that took a while to solve. The developers made a small change to a table, allowing nulls in 23 columns previously configured as NOT NULL. This resulted in a large number of bind mismatches and child cursors being generated, eventually giving rise to a serious performance issue.

After much discussion of the issue, test cases, being told to tune the database, and flushing of the shared pool, a developer finally opened the code, took a look at the problem, and found that the reason for the bind mismatch was the OCI null indicator: when the column value was set null an indicator of -1 was being passed, but when a value was present a null pointer was passed instead. The optimiser sees these as bind mismatches, giving 2^23 possible combinations of bind settings. Patching the code to pass a zero indicator when a value is present, rather than a null pointer, prevents the bind mismatch and returned everything to normal.
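The arithmetic behind that 2^23 is worth spelling out: each of the 23 nullable columns is bound in one of two indicator states, and every distinct combination can show up as its own bind-metadata variant. A quick sketch:

```python
# Each of the 23 newly-nullable columns is bound in one of two ways:
# indicator -1 (null, value ignored) or a pointer to a real value.
columns = 23
states_per_bind = 2

# Worst case: every combination of "null here, value there" that the
# application's data can produce becomes a distinct bind-mismatch variant.
worst_case_children = states_per_bind ** columns
print(worst_case_children)  # 8388608
```

No shared pool is going to cope gracefully with millions of potential child cursors for one statement, which is why the fix had to be in the client code, not the database.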

One other issue found was the test configuration: because the data distribution in test didn't generate the same variance of nulls and values in the binds, this problem wasn't found until the production release, making the assorted workarounds necessary.

How expensive are literals?

June 26, 2011

As a follow-up to my last post I finally got around to writing a small program to demonstrate the difference between bind variables and literals. The test is a simple C++ program that reads a text file and updates 27,000 rows in a single table. The code runs the updates in three different ways.

Method 1: Bulk bind 200 rows at a time into the update statement.

Method 2: Bind 1 row at a time

Method 3: Use literals in the updates.

Tkprof output from the test looks like this.

Method 1
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.01       0.01          0          0          0           0
Execute    271      5.44       6.05         23      28661      27706       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total      273      5.46       6.07         23      28661      27706       27001

Method 2
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        2      0.00       0.00          0          0          0           0
Execute  27001     12.19      14.18          0      54015      27826       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    27003     12.19      14.18          0      54015      27826       27001

Method 3
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse    27001     19.06      25.33          0          0          0           0
Execute  27001     54.99      66.41          0      54002      27706       27001
Fetch        0      0.00       0.00          0          0          0           0
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total    54002     74.06      91.75          0      54002      27706       27001

From this simple test, the literal version uses about 13.5 times as much CPU as the first method (74.06 s vs 5.46 s). In cost terms, if I need 2 x86 cores of CPU power to run my database with method 1, I need 27 cores for method 3 (assuming it would scale); with the 0.5 core factor that is 1 processor licence vs 14, or 1 × £31,839 vs 14 × £31,839 = £445,746. Aside from the cost, there is no way to improve the speed of method 3, and the code is harder to write as well.
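The same three patterns are easy to reproduce outside Oracle. A hedged sketch using Python's sqlite3 (a different engine, so the absolute numbers won't match tkprof, but the shape of the work is the same: one statement reused with arrays, one statement reused row by row, and a fresh statement text per row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, val integer)")
conn.executemany("insert into t values (?, ?)",
                 [(i, 0) for i in range(27000)])

updates = [(i, i) for i in range(27000)]  # (new value, key)

# Method 1: bulk bind - one parse, array execution.
conn.executemany("update t set val = ? where id = ?", updates)

# Method 2: bind one row at a time - one statement text, 27,000 executions.
for val, key in updates:
    conn.execute("update t set val = ? where id = ?", (val, key))

# Method 3: literals - 27,000 distinct statement texts, each parsed anew.
for val, key in updates:
    conn.execute(f"update t set val = {val} where id = {key}")

assert conn.execute("select count(*) from t where val = id").fetchone()[0] == 27000
```

Time the three loops with `time.perf_counter()` and method 3 loses on any engine, because the per-statement parse can never be amortised.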

Bind Variables, almost right.

June 1, 2011

Sometimes I get a bit sarcastic with, and about, developers. This is unfair, as for the most part the ones I have worked closely with have either been very good Oracle developers or very good C++/Java/VB6/whatever developers with a real interest in doing things the right way, and they have been receptive to constructive (but grumpy) feedback.

This gem in a vendor's application, however, brought out the sarcasm.

  column_1 = :1, column_2 = :2, ..............
  column_125 = :125, column_126 = :126,
  column_127 = 'GotBoredWithBindVariablesDidWe?';

I can't work out the logic here. The code is already written to use bind variables, so it took extra code to stick a literal in there, and as the application is written in C using the OCI libraries these guys are supposed to be rocket surgeons. There isn't even a justification based on skewed data: column_127 is the primary key, so every update updates one row and there is only ever one plan. Added to that, the design of the process is such that the application caches updates then sends in 400 at a time, committing after each update just to make sure performance really suffers. The exception is end-of-day processing, where all 27,000 rows are updated; at least there is only a single commit during that specific piece.
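The damage from that one literal is easy to quantify by counting distinct statement texts: with all 127 columns bound there is exactly one SQL text to cache, but with the key as a literal every row mints its own. A tiny sketch (column names abbreviated and hypothetical):

```python
# With binds everywhere, every execution shares one statement text.
bound = {"update t set c1 = :1 where pk = :2" for _ in range(27000)}

# With the primary key inlined as a literal, each row produces a new text.
literal = {f"update t set c1 = :1 where pk = '{k}'" for k in range(27000)}

print(len(bound), len(literal))  # 1 vs 27000 texts for the shared pool to cache
```

Those 27,000 texts, each with its own cursor and plan, are exactly what the shared pool tries to keep, which is where the next paragraph's memory problem comes from.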

So far so bad, but the worst impact we see comes from using Automatic Memory Management: the shared pool grows larger over time as it tries to cache all the execution plans. After three weeks we get to the point of other processes suffering from either not enough buffers or increases in mutex and library cache latch waits. The solution? Restart the database every two weeks and edit the spfile to reset the values of __db_cache_size and __shared_pool_size. Yes, we could use manual memory management, but the vendor's application configuration guide specifies AMM; changing to manual memory management is still being assessed by their performance team.

SQL Injection

February 22, 2011

Sometimes I wonder why I still find stuff like this. There are so many articles/blogs/documents about SQL injection, so how does a developer come up with the following…

create or replace function row_count (p_table varchar2)
return number
is
  v_str VARCHAR2(1000);
  v_row NUMBER;
begin
  v_str := 'select count(*) from '||p_table;
  execute immediate v_str into v_row;
  return v_row;
end;
/

My simple explanation of why this is bad was that passing in the parameter 'all_objects,all_objects,all_objects' will result in a fairly nasty cartesian join, and an attacker who knew the names of the database tables could potentially pick a really large table. Run enough copies and it starts to look like a denial-of-service attack. I'm pretty sure someone smarter than me could come up with something worse, but I would rather not have to think about how to defend against SQL injection when simply doing it right (with technology from the 90's, possibly the 80's) is so much easier.
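The cartesian blow-up is easy to demonstrate in miniature. A sketch using Python with sqlite3 standing in for the dynamic SQL, with a three-row table standing in for all_objects (aliases are added in the malicious parameter purely so the repeated table name is unambiguous here):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table t (id integer);
insert into t select 1 from (select 0 union all select 1 union all select 2);
""")

def row_count(p_table):
    # The same flaw as the PL/SQL function: the parameter is pasted
    # straight into the statement text.
    return conn.execute(f"select count(*) from {p_table}").fetchone()[0]

print(row_count("t"))              # 3  - the intended use
print(row_count("t t1, t t2, t t3"))  # 27 - a cartesian join, n**3 rows
```

With a three-row table that is 27 rows; with a multi-million-row table it is effectively a CPU-burning denial of service from one innocent-looking call.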

A single procedure to “hang” the database.

February 14, 2011

I really enjoy it when people don't understand the features they are using; it makes life really interesting. Automatic parallelism is a great new feature of 11.2, but we found a new and interesting way to break things. As part of the application we are implementing, customers may re-request fairly large volumes of data. To limit the impact on the system as a whole, the design is such that a cursor is opened and the data returned in chunks of 500 rows; if two customers execute re-requests at the same time then two cursors are opened, customer1 gets 500 rows, then customer2 gets 500 rows, then back to customer1 and so on. If ten customers issue re-requests, results are sent back in chunks of 500 to each customer in turn.

The intention is to prevent customers complaining that they are waiting too long to receive data, while at the same time preventing re-requests from flooding the database and making the other, critical parts of the system unresponsive.

Nice idea, but someone didn't read enough about automatic parallelism. Here is a simple demo. First create a table, e.g. (this code has been shamelessly adapted from Jonathan Lewis' site):

create table t1
as
with generator as (
    select  --+ materialize
        rownum id
    from    dual
    connect by
        rownum <= 10000
)
select
    rownum                              id,
    trunc(dbms_random.value(1,1000))    n1,
    trunc(dbms_random.value(1,1000))    n2,
    lpad(rownum,30,'0')                 small_vc
from
    generator   v1,
    generator   v2
where
    rownum <= 100000000;

begin
    dbms_stats.gather_table_stats(
        ownname      => user,
        tabname      => 'T1',
        method_opt   => 'for all columns size 1'
    );
end;
/

Then a nice simple piece of PL/SQL to demonstrate the issue (database version 11.2, 8 cores, parallel_servers_target set to 64, parallel_min_time_threshold=1; you may need different values to force the behaviour):

declare
    cursor cursor1 is select * from t1 order by id;
    cursor cursor2 is select * from t1 order by id;
    cursor cursor3 is select * from t1 order by id;
    cursor cursor4 is select * from t1 order by id;
    cursor cursor5 is select * from t1 order by id;
    cursor cursor6 is select * from t1 order by id;
    cursor cursor7 is select * from t1 order by id;
    cursor cursor8 is select * from t1 order by id;
    cursor cursor9 is select * from t1 order by id;
    cursor cursor10 is select * from t1 order by id;
    id        dbms_sql.number_table;
    n1        dbms_sql.number_table;
    n2        dbms_sql.number_table;
    small_vc  dbms_sql.varchar2_table;
begin
    open cursor1;
    open cursor2;
    open cursor3;
    open cursor4;
    open cursor5;
    open cursor6;
    open cursor7;
    open cursor8;
    open cursor9;
    open cursor10;

    loop
        fetch cursor1 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor2 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor3 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor4 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor5 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor6 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor7 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor8 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor9 bulk collect into id,n1,n2,small_vc limit 500;
        fetch cursor10 bulk collect into id,n1,n2,small_vc limit 500;
        exit when cursor1%notfound;
    end loop;

    close cursor1;
    close cursor2;
    close cursor3;
    close cursor4;
    close cursor5;
    close cursor6;
    close cursor7;
    close cursor8;
    close cursor9;
    close cursor10;
end;
/

Now execute the block and wait, and wait, and continue waiting forever…..

The problem is simple (once we knew what was going on): auto parallelism will queue statements if it can't execute them with the requested degree of parallelism. By about cursor 8 our database was unable to assign the required parallel slaves, so it queued the statement, which means no one else ever gets any rows until we kill the process. Typically what we were seeing was one session waiting on the "PX Queuing: statement queue" event. Unfortunately getting the code fixed is going to be a long slow process, so at the moment automatic parallelism and statement queuing are switched off.
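The starvation mechanism can be simulated without a database. A hedged Python sketch of the design: ten "cursors" served round-robin in 500-row chunks, with the queued statement modelled as a fetch that never returns (here it raises instead of blocking forever, so the demo actually terminates):

```python
def normal_cursor(rows=2000, limit=500):
    # Yields 500-row chunks, like "fetch ... bulk collect ... limit 500".
    for start in range(0, rows, limit):
        yield list(range(start, start + limit))

def queued_cursor():
    # Stand-in for a statement stuck on "PX Queuing: statement queue".
    # The real session simply waits; we raise so the simulation ends.
    raise TimeoutError("PX Queuing: statement queue")
    yield  # makes this a generator; never reached

# Customer 8's statement gets queued; everyone is served in strict rotation.
cursors = ([normal_cursor() for _ in range(7)]
           + [queued_cursor()]
           + [normal_cursor() for _ in range(2)])
chunks_served = [0] * 10

try:
    while True:
        for i, cur in enumerate(cursors):
            next(cur)              # fetch the next 500 rows for customer i+1
            chunks_served[i] += 1
except TimeoutError:
    pass  # in production there is no timeout: the loop just hangs here

print(chunks_served)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

One queued fetch in a strict round-robin loop and every customer behind it gets nothing, which is exactly the hang we saw: one session on the queue wait, everyone else starved.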