A recent incident at home reminded me of an interaction I had a few months ago with a developer. First, the incident at home…
I’d been working on my laptop in the back room (commonly referred to as the Xbox room, the office, and occasionally the pit). Having finished, I left the laptop open on the floor and went off to do something else. An hour or so later, while my son was playing the Xbox, I heard a loud crash from the back room and asked “what was that?”; the reply came back “nothing”. Obviously this wasn’t true, and on investigating I found that the metal box file I’d left on the couch had fallen off and landed on the laptop, leaving a really impressive scratch on the screen. What’s interesting is the reaction this caused: the instantaneous denial that anything had happened and the desire to avoid being in trouble led to a response of “nothing”. And while I was angry, it was obviously my own fault, as I should have put the laptop away rather than leave it lying out.
OK, now how was this similar to the interaction at work? Well, basically we put a new application release containing a significant number of changes into a test environment. The system is primarily a batch-driven data warehouse, and the initial results all looked good: all the package updates worked, so everyone went home happy. The next day everything was fine too; the overnight loads completed slightly faster than previously, and everyone was happy. The day after that I got a call.
Developer: “The overnight batch hasn’t finished, can you see what’s wrong with the database?”
Me: “How far has it got?”
Developer: “It’s still loading the security reference tables right at the start of the batch.”
Me: “OK, I’ll take a look. Did you make any changes to this? I thought the changes were in the widget data loads.”
Developer: “We didn’t change any of this code in this release.”
So it’s off into an investigation of why we suddenly have a performance problem. We started with the usual checks: nothing in the alert log, no hardware issues reported, nothing in the messages logs. A look at what was actually running showed the process was inserting into the second of a set of tables used for row-level security, and checking the explain plan showed it was using a really inefficient execution path.
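For anyone wanting to repeat this kind of check, a rough sketch in Oracle terms looks like the following (the schema name is illustrative, not from the real system):

```sql
-- Find the active batch session and the SQL it is currently running
SELECT s.sid, s.serial#, s.sql_id, s.last_call_et
FROM   v$session s
WHERE  s.status = 'ACTIVE'
AND    s.username = 'BATCH_USER';  -- hypothetical batch schema

-- Pull the actual execution plan for that sql_id from the cursor cache
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR('&sql_id', NULL, 'BASIC'));
```

Looking at the plan from the cursor cache rather than a fresh EXPLAIN PLAN shows you the path the running statement is really using.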
A review of the process showed that we load the first of two tables with about 10 million rows, then load the second with around 35 million rows based on the first table joined to some other static tables. Reviewing these, we found there were no stats on the first table. At this point I was starting to get an idea of what was going on, so we looked at the code, which looks like this:
TRUNCATE TABLE a;
TRUNCATE TABLE b;
INSERT INTO a SELECT blah, blah, blah FROM …;
INSERT INTO b SELECT a.blah, a.blah, … FROM a, …;
A quick dig into the code repository showed this had been changed: the two truncates used to be deletes. This explains a couple of things. The first loads were faster because removing the data from the lookup tables is now a truncate, so it no longer generates the undo that it used to. However, this change coincided with the gather stats job, so the statistics showed table a had zero rows instead of 10 million, and the following set of loads started to use a really bad execution path, full scanning table a for every row inserted into table b. Mystery solved, and fixed by adding a gather stats step after the data is loaded into each table (alternatively we could have locked the stats).
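The fix, sketched in Oracle terms (schema and table names here are illustrative, not the real ones):

```sql
-- After each load, refresh the optimizer statistics so the next step
-- in the batch sees realistic row counts rather than zero rows
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => 'BATCH_USER',  -- hypothetical schema
                                tabname => 'A');
END;
/

-- Alternative: lock the statistics while they hold representative values,
-- so the scheduled gather job cannot overwrite them while the table is empty
BEGIN
  DBMS_STATS.LOCK_TABLE_STATS(ownname => 'BATCH_USER', tabname => 'A');
END;
/
```

Gathering after each load is the more robust choice when the row counts change dramatically between steps; locking the stats trades that flexibility for predictability.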
What does this show? Well, you should always tell your DBA the truth, as it may save time. Also, we really are not trying to shift blame when we ask questions about what has changed; it’s a tool that allows us to narrow the search for the problem. Finally, avoid a culture of blame, as it leads to individuals covering things up, meaning you find out something is going wrong much later, when it may be too late to fix the problem or cost much more to put right.