DB2 and other Technical Tips: Function

Monday, July 21, 2014

DB2 Performance Improvement tip using COALESCE function

One of the developers at our shop fed his query to an optimizer tool and the optimized query looked kind of weird.

This is the original query:
DELETE FROM TABLE_A
WHERE COL1 IN (SELECT COL1
               FROM TABLE_A R
               INNER JOIN TABLE_B O
                  ON O.COL2 = R.COL1
               WHERE O.COL3 != R.COL4)

Query suggested by the optimizer:
DELETE
FROM TABLE_A
WHERE EXISTS (SELECT 'X'
              FROM TABLE_A R
              INNER JOIN TABLE_B O
                 ON R.COL4 <> COALESCE (O.COL3, O.COL3)
                AND R.COL1 = COALESCE (O.COL2, O.COL2)
              WHERE COL1 = TABLE_A.COL1
              ORDER BY R.COL4)

The puzzling part of the revised query is COALESCE (O.COL3, O.COL3). Why would anyone pass the same column to the COALESCE function twice? COALESCE returns its first non-null argument, so the result is always just O.COL3.

Here is the explanation by IBM though: http://www-01.ibm.com/support/docview.wss?uid=swg21259831

In their own words:

Question

Why might adding a special form of the COALESCE predicate produce cheaper data access plans and result in better performance?

Cause

A "no-op" coalesce() predicate of the form "COALESCE(X, X) = X" introduces an estimation error into the planning of any query using it. Currently the DB2 query compiler doesn't have the capability of dissecting that predicate and determining that all rows actually satisfy it. As a result, the predicate artificially reduces the estimated number of rows coming from some part of a query plan. This smaller row estimate usually reduces the row and cost estimates for the rest of query planning, and sometimes results in a different plan being chosen because relative estimates between different candidate plans have changed.

Why can this do-nothing predicate sometimes improve query performance? The addition of the "no-op" coalesce() predicate introduces an error that masks something else that is preventing optimal performance.

What some performance enhancement tools do is a brute-force test: the tool repeatedly introduces the predicate into different places in a query, operating on different columns, to try to find a case where, by introducing an error, it stumbles onto a better-performing plan. This is also true of a query developer hand-coding the "no-op" predicate into a query. Typically, the developer will have some insight on the data to guide the placement of the predicate.

Using this method to improve query performance is a short-term solution which does not address root cause. It hides potential areas for performance improvements and could have the following implications:

This workaround does not guarantee permanent performance improvements. The DB2 query compiler might eventually handle the predicate better, or it might be affected by other random factors.

Other queries might be affected by the same root cause. The performance of your system in general might be suffering as a result.

Answer

Try to identify and address root cause by determining why the original plan chosen by the query compiler did not perform optimally.

Although the actual estimates in a better-performing plan using "COALESCE(X, X) = X" cannot be trusted, such a plan can still be useful because its "shape" (for example, its join order and access methods) can be used as clues to determine where a problem might lie with the original plan.

Here are some examples of questions that can help to better identify the source of performance issues:

Are the statistics out-of-date?

Should more statistics be collected?

Are there statistical correlations that have not been detected?

Are there other characteristics of the data that are not yet captured or modeled?

Are there hardware problems? (e.g. a disk is misbehaving)

Are there configuration problems?

Are any queries stuck behind a lock?
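One way to follow IBM's advice about comparing plan "shape" is to EXPLAIN both versions of the DELETE and read the results side by side. The following is only a sketch: it assumes DB2 for z/OS with a PLAN_TABLE available under the current SQLID, and the QUERYNO values 101 and 102 are arbitrary labels chosen for illustration.

```sql
-- Assumption: DB2 for z/OS, PLAN_TABLE exists for the current SQLID.
-- 101 and 102 are arbitrary labels used only to tell the two plans apart.

-- Original statement
EXPLAIN PLAN SET QUERYNO = 101 FOR
DELETE FROM TABLE_A
WHERE COL1 IN (SELECT COL1
               FROM TABLE_A R
               INNER JOIN TABLE_B O
                  ON O.COL2 = R.COL1
               WHERE O.COL3 != R.COL4);

-- Statement suggested by the tool, with the "no-op" COALESCE predicates
EXPLAIN PLAN SET QUERYNO = 102 FOR
DELETE
FROM TABLE_A
WHERE EXISTS (SELECT 'X'
              FROM TABLE_A R
              INNER JOIN TABLE_B O
                 ON R.COL4 <> COALESCE (O.COL3, O.COL3)
                AND R.COL1 = COALESCE (O.COL2, O.COL2)
              WHERE COL1 = TABLE_A.COL1
              ORDER BY R.COL4);

-- Compare join method, table order, and index usage between the two plans.
SELECT QUERYNO, QBLOCKNO, PLANNO, METHOD, TNAME,
       ACCESSTYPE, MATCHCOLS, ACCESSNAME, INDEXONLY
FROM PLAN_TABLE
WHERE QUERYNO IN (101, 102)
ORDER BY QUERYNO, QBLOCKNO, PLANNO, MIXOPSEQ;
```

If the two plans differ in join order or access path, that difference is a hint about which table's statistics or indexes deserve a closer look. The cost numbers of the COALESCE version should not be trusted, for the reasons IBM gives above.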

Saturday, January 21, 2012

DB2 for z/OS Index on Expression

DB2 for z/OS version 9 brought in a new feature called "Index on Expression". Yes, you can create indexes on frequently used expressions in your queries. For example, if you often do a SELECT FIRST_NAME WHERE UPPER(LAST_NAME) = 'SMITH', then creating an index on UPPER(LAST_NAME) will be useful.

Through V8, you could create an index only on columns, not on expressions or functions. So you would have created an index on LAST_NAME, but DB2 would not have used it for this query, because applying a function to a column in a predicate prevents DB2 from using an index on that column to match the predicate. Starting in V9, you can create an index on UPPER(LAST_NAME) and DB2 will use it.

Now, to the tricky part ...

When I tried to create an index on UPPER(LAST_NAME), DB2 complained that I needed to specify a locale name. After reading a little about locale names, I specified UPPER(LAST_NAME, 'En_US') and DB2 happily created the index. I also ran RUNSTATS on the table. However, when I ran EXPLAIN on the query SELECT FIRST_NAME WHERE UPPER(LAST_NAME) = 'SMITH', DB2 didn't pick up the newly created index. That puzzled me.

After reading a little further, I found that the query has to specify the same locale as the index expression. Once I added the locale, I ran EXPLAIN again and DB2 picked up the index. This is the modified query:

SELECT FIRST_NAME WHERE UPPER(LAST_NAME, 'En_US') = 'SMITH'
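Put together, the index and the query that matches it might look like the sketch below. The table name PERSON and the index name IX_PERSON_UPPER_LNAME are hypothetical, since the post does not name them; only the UPPER expression and the locale come from the text above.

```sql
-- PERSON and IX_PERSON_UPPER_LNAME are hypothetical names; substitute your own.
-- Requires DB2 for z/OS V9 or later (index on expression).
CREATE INDEX IX_PERSON_UPPER_LNAME
    ON PERSON (UPPER(LAST_NAME, 'En_US'));

-- The predicate repeats the indexed expression, including the locale argument,
-- so that DB2 can match it to the index.
SELECT FIRST_NAME
FROM PERSON
WHERE UPPER(LAST_NAME, 'En_US') = 'SMITH';
```

Writing the predicate as UPPER(LAST_NAME) without the locale is a different expression from the one in the index, which is consistent with the behavior described above.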

Keywords: DB2, z/OS, Index, Expression, Upper, Locale, function

Saturday, February 28, 2009

DB2 "FIRST_DAY" function

DB2 comes with a lot of built-in functions. One of them, LAST_DAY, returns the last day of a month, but there is no corresponding FIRST_DAY function. Some of my colleagues asked about it, so here is a UDF (user-defined function) that returns the first day of the month:

CREATE FUNCTION schema.FIRST_DAY (IN_DATE DATE)
RETURNS DATE
LANGUAGE SQL
CONTAINS SQL
NO EXTERNAL ACTION
DETERMINISTIC
RETURN DATE(LAST_DAY(IN_DATE - 1 MONTH) + 1 DAY);
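A quick way to check the function is to call it with a month-end date, where the month arithmetic is least obvious. The calls below are a sketch that assumes the function was created under the placeholder qualifier `schema` used above (replace it with your own schema name). The second statement is an alternative expression, not part of the original UDF, that subtracts the day-of-month offset instead.

```sql
-- 2008-03-31 - 1 MONTH is adjusted to 2008-02-29, LAST_DAY leaves it at
-- 2008-02-29, and + 1 DAY gives 2008-03-01.
SELECT schema.FIRST_DAY(DATE('2008-03-31'))
FROM SYSIBM.SYSDUMMY1;

-- Alternative without a UDF: step back (day-of-month - 1) days.
-- For 2008-03-31 that is 30 days, which also lands on 2008-03-01.
SELECT DATE('2008-03-31') - (DAY(DATE('2008-03-31')) - 1) DAYS
FROM SYSIBM.SYSDUMMY1;
```

Both forms give the same date. The UDF reads better in application SQL, while the inline expression avoids creating and maintaining a function.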

Thursday, October 16, 2008

DB2 ROUND_TIMESTAMP function

ROUND_TIMESTAMP(CURRENT TIMESTAMP,'MI'):

The ROUND_TIMESTAMP function returns its timestamp argument rounded to the unit named by the format string. Here, 'MI' rounds to the nearest minute.

For example, the output of

SELECT ROUND_TIMESTAMP('2008-10-16-14.46.22.121222','MI')
FROM SYSIBM.SYSDUMMY1

looks like this:
2008-10-16-14.46.00.000000
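The example above rounds down because the seconds part (22) is below the half-minute. The sketch below uses a hypothetical input with 42 seconds to show the other direction, and puts TRUNC_TIMESTAMP next to it for contrast. It assumes a DB2 version that provides both functions.

```sql
-- 42 seconds is past the half-minute, so ROUND_TIMESTAMP is expected to move
-- up to 14.47.00, while TRUNC_TIMESTAMP drops the seconds and stays at 14.46.00.
SELECT ROUND_TIMESTAMP(TIMESTAMP('2008-10-16-14.46.42.121222'), 'MI') AS ROUNDED,
       TRUNC_TIMESTAMP(TIMESTAMP('2008-10-16-14.46.42.121222'), 'MI') AS TRUNCATED
FROM SYSIBM.SYSDUMMY1;
```

Use ROUND_TIMESTAMP when the nearest unit is wanted and TRUNC_TIMESTAMP when values should be bucketed by the start of the unit, for example when grouping events per minute.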