Conditional aggregation

Substitution SQL Puzzle

May 17, 2020Zahar HilkevichLeave a comment

Level: Advanced

A colleague of mine approached me recently with a puzzle he struggled with: you have a table (let’s call it data_table) with id and val (i.e. value) columns. You are given two parameters: value_to_overwrite and value_to_use that should transform the content of the data_table in a special way:

If both parameters exist in the data_table in the val column for the same id, then the one that is equal to value_to_overwrite should be substituted with value_to_use
If none or just one of the parameters exist in the data_table.val column, than the val column should remain the same
List all the rows from the data_table after the transformation.

Let’s create the data_table using the following DDL command:

CREATE TABLE data_table AS
SELECT 1 id, 'a' val FROM dual
UNION ALL
SELECT 1 id, 'b' val FROM dual
UNION ALL
SELECT 1 id, 'c' val FROM dual
UNION ALL
SELECT 2 id, 'b' val FROM dual
UNION ALL
SELECT 2 id, 'd' val FROM dual

ID	VAL
1	a
1	b
1	c
2	b
2	d

For parameters value_to_overwrite = ‘a’ and value_to_use = ‘b’ the expected result should look like this:

ID	ORIGINAL_VALUE	NEW_VALUE
1	a	a
1	b	a
1	c	c
2	b	b
2	d	d

Note, that for id = 1, value ‘b’ is substituted with new value ‘a’ because both, value_to_overwrite (‘a’) and value_to_use (‘b’) exist in the val column. All other values should remain the same as substitution condition is not met.

To mimic the parameter use in the query we will create another table (rule_table) with a single row in it.

CREATE TABLE rule_table AS
SELECT 'a' value_to_use, 'b' value_to_overwrite
FROM dual

Translating requirements from English to SQL will likely result in a bulky and inefficient query. Let’s demonstrate that:

/* Values that need to be substituted */
SELECT d.id, d.val AS original_value, r.value_to_use AS new_value
FROM data_table d JOIN rule_table r ON d.val = r.value_to_overwrite
WHERE r.value_to_use IN (SELECT val
                         FROM data_table
                         WHERE id = d.id)
UNION ALL
/* Values that remain the same as only value_to_overwrite exist for given id */
SELECT d.id, d.val, d.val
FROM data_table d JOIN rule_table r ON d.val = r.value_to_overwrite
WHERE r.value_to_use NOT IN (SELECT val
                             FROM data_table
                             WHERE id = d.id)
UNION ALL
/* Values that remain the same as value_to_overwrite does not match val */
SELECT d.id, d.val, d.val
FROM data_table d
WHERE d.val NOT IN (SELECT value_to_overwrite
                    FROM rule_table)

As you can see, there are multiple (five) copies of the data_table used, which will lead to a poor performance when the size of the table increases dramatically.

A way better approach is to take the first SELECT from the UNIONed statement above and turn the INNER JOIN into an LEFT OUTER JOIN. At the same time, we need to move the filtering condition from the WHERE clause to the JOIN (otherwise, the LEFT JOIN will work as INNER JOIN):

SELECT d.id,
       d.val                      AS original_value,
       NVL(r.value_to_use, d.val) AS new_value
FROM data_table d LEFT JOIN rule_table r 
                  ON d.val = r.value_to_overwrite
                 AND r.value_to_use IN (SELECT val
                                        FROM data_table
                                        WHERE id = d.id)

This is a quite efficient and fairly short query that uses only two copies of the data_table. Can we do better than that? Yes, we can!

WITH x AS (
SELECT id, val,
       MIN(CASE WHEN val IN (value_to_use, value_to_overwrite) 
                THEN val 
           END)
       OVER(PARTITION BY id, value_to_overwrite)  min_val,
       MAX(CASE WHEN val IN (value_to_use, value_to_overwrite) 
                THEN val 
           END)
       OVER(PARTITION BY id, value_to_overwrite)  max_val,
       LEAST(value_to_use, value_to_overwrite)    min_ow,
       GREATEST(value_to_use, value_to_overwrite) max_ow,
       value_to_use, value_to_overwrite
FROM data_table CROSS JOIN rule_table 
)
SELECT id, val AS original_value,
       CASE WHEN min_val=min_ow AND
                 max_val=max_ow AND
                 val=value_to_overwrite THEN value_to_use
       ELSE val
       END AS new_value
FROM x

Analytic functions MIN and MAX let us scan the data_table vertically while LEAST and GREATEST do the same horizontally. The later pair of functions come very handy when you need to compare pairs of values, so the smaller of the values should match LEAST and the other – GREATEST.

And still, the last strategy has one flaw: we used a Cartesian Product (CROSS JOIN) which means that had we have more than one substitution rule, the method would not work properly. Let’s fix it.

First, we will add one more rule:

INSERT INTO rule_table VALUES('b', 'c')

Now, the expected result should looks as the following:

ID	ORIGINAL_VALUE	NEW_VALUE
1	a	a
1	b	a
1	c	b
2	b	b
2	d	d

Note, that the second rule turns original ‘c’ value into ‘b’.

And again, Analytic functions do all the magic:

WITH x AS (
SELECT id, val, value_to_overwrite, value_to_use,
       LEAST(value_to_overwrite, value_to_use) || '|' ||
       GREATEST(value_to_overwrite, value_to_use) rule_vals,
       LISTAGG(DISTINCT val, '|') WITHIN GROUP(ORDER BY val)
       OVER(PARTITION BY id) vals
FROM data_table LEFT JOIN rule_table ON val = value_to_overwrite
)
SELECT id, val AS original_value,
       CASE WHEN value_to_overwrite IS NULL THEN val
            WHEN INSTR(vals, rule_vals)=0 THEN val
            ELSE value_to_use
       END     AS new_value
FROM x

This time, LISTAGG analytic function (with DISTINCT option – recently supported by Oracle) helps matching the val against value_to_overwrite and value_to_use pair.

I strongly recommend executing parts of the above queries to gain a better understanding of the demonstrated strategies. livesql.oracle.com site offers you a great query tool with the latest version of Oracle database.

***

If you find this post useful, please press the LIKE button and subscribe.

My Oracle Group on Facebook:

Also, you may want to join my Oracle group on Facebook: https://www.facebook.com/groups/sqlpatterns/

Suggested Reading:

Would you like to read about many more tricks and puzzles? For more clever tricks and cool techniques check my book “Oracle SQL Tricks and Workarounds”.

Explore the power of analytic functions

April 29, 2020Zahar Hilkevich1 Comment

Analytic functions are still underutilized among database developers. The goal of this publication is to demonstrate hidden opportunities to improve query performance by using analytic functions where traditional approach is still dominating.

Let’s consider a few examples. We will use Oracle’s traditional educational schema scott and its famous emp and dept tables.

Problem #1: Find all employees from the department where a top paid clerk works.

A quick look at the clerk records reveals the expected department (10):

SELECT deptno, job, sal
FROM scott.emp
WHERE job = 'CLERK'
ORDER BY sal DESC

Result:

DEPTNO	JOB	SAL
10	CLERK	1300
20	CLERK	1100
30	CLERK	950
20	CLERK	800

Traditional approach suggests finding the department (10) first and then the task becomes trivial:

SELECT ename, deptno, job, sal
FROM scott.emp
WHERE deptno IN (<department that we found>)

OK, that works, but now we need to produce a query that will get us that department number (of several ones if they all have top paid clerks).

There are many ways of finding a top record with and without analytic functions. If we don’t use analytic functions we will end up using two copies of the emp table as in the following example:

SELECT deptno
FROM scott.emp
WHERE job = 'CLERK'
  AND sal = (SELECT MAX(sal) FROM scott.emp WHERE job = 'CLERK')

or this:

SELECT a.deptno
FROM scott.emp a LEFT JOIN scott.emp b ON a.sal < b.sal
AND b.job = 'CLERK'
WHERE b.deptno IS NULL 
  AND a.job = 'CLERK'

The use of analytic functions reduces the number of table copies as we can get all the necessary details in a single table scan as in the following query:

WITH x AS (
SELECT deptno, MAX(sal) max_sal, RANK() OVER(ORDER BY MAX(sal) DESC) rk
FROM scott.emp
WHERE job = 'CLERK'
GROUP BY deptno
)
SELECT deptno
FROM x
WHERE rk = 1

Such SQL looks even shorter in databases, such as Snowflake and Terdata, that support QUALIFY clause in SELECT statement:

SELECT deptno
FROM scott.emp
WHERE job = 'CLERK'
GROUP BY deptno
QUALIFY RANK() OVER(ORDER BY MAX(sal) DESC) = 1

So a complete traditional approach will use 3 copies of the emp table as in the following query:

SELECT ename, deptno, job, sal
FROM scott.emp                                 -- copy #1
WHERE deptno IN (SELECT deptno
                 FROM scott.emp                -- copy #2
                 WHERE job = 'CLERK'
                   AND sal = (SELECT MAX(sal) 
                              FROM scott.emp   -- copy #3
                              WHERE job = 'CLERK')
                 )

Yes, it works, but what an overkill! Is it still possible to use a single emp table scan to solve this problem? The answer is YES:

WITH x AS (
SELECT ename, deptno, job, sal, 
       MAX(DECODE(job,'CLERK',sal)) OVER() max_sal_global,
       MAX(DECODE(job,'CLERK',sal)) OVER(PARTITION BY deptno) max_sal_dept
FROM scott.emp
)
SELECT ename, deptno, job, sal
FROM x
WHERE max_sal_global=max_sal_dept

And of course, in the Snowflake/Teradata SQL we would not even have to use a CTE (common table expression) , thanks to the QUALIFY clause.

MAX analytic functions combined with DECODE (CASE would work as well) here ignore all non-Clerk rows . The OVER clause gives us either department level top salary or global top salary value for the clerks. And since it is done on a row level, which means that the these analytic functions return the same values for all employees in the same department, we can achieve our goal easily.

MAX function is not the only one analytic function that can be used here. The following example demonstrates the FIRST_VALUE function to achieve the same result:

WITH x AS (
SELECT ename, deptno, job, sal, 
       FIRST_VALUE(DECODE(job,'CLERK',sal)) 
         OVER(ORDER BY DECODE(job,'CLERK',sal) DESC NULLS LAST) max_sal_global,
       FIRST_VALUE(DECODE(job,'CLERK',sal)) 
         OVER(PARTITION BY deptno
              ORDER BY DECODE(job,'CLERK',sal) DESC NULLS LAST) max_sal_dept
FROM scott.emp
)
SELECT ename, deptno, job, sal
FROM x
WHERE max_sal_global=max_sal_dept

As an exercise, try to use LISTAGG analytic function to solve this problem.

Problem #2: Find employees who are paid above the average salary in their respective department.

Again, we will start with a “traditional” approach:

SELECT ename, deptno, job, sal
FROM scott.emp a
WHERE sal > (SELECT AVG(sal)
             FROM scott.emp
             WHERE deptno = a.deptno)
ORDER BY deptno, sal

Result:

ENAME	DEPTNO	JOB	SAL
KING	10	PRESIDENT	5000
JONES	20	MANAGER	2975
SCOTT	20	ANALYST	3000
FORD	20	ANALYST	3000
ALLEN	30	SALESMAN	1600
BLAKE	30	MANAGER	2850

An experienced developer would quickly see that this problem can be solved with the same approach as the problem #1 (above):

WITH x AS (
SELECT ename, deptno, job, sal, AVG(sal) OVER(PARTITION BY deptno) avg_sal
FROM scott.emp
)
SELECT *
FROM x
WHERE sal > avg_sal
ORDER BY deptno, sal

Result:

ENAME	DEPTNO	JOB	SAL	AVG_SAL
KING	10	PRESIDENT	5000	2916.66667
JONES	20	MANAGER	2975	2175
FORD	20	ANALYST	3000	2175
SCOTT	20	ANALYST	3000	2175
ALLEN	30	SALESMAN	1600	1566.66667
BLAKE	30	MANAGER	2850	1566.66667

With this strategy we can even see the department average salary value.

Problem #3: List all employees who work in the same department as the president.

The problem was discussed in one of my old blog posts and I strongly suggest you to check it out: A trick that helps avoiding multiple table scans.

***

If you find this post useful, please press the LIKE button and subscribe.

My Oracle Group on Facebook:

Also, you may want to join my Oracle group on Facebook: https://www.facebook.com/groups/sqlpatterns/

Suggested Reading:

Would you like to read about many more tricks and puzzles? For more clever tricks and cool techniques check my book “Oracle SQL Tricks and Workarounds”.

Monday Hiring SQL Puzzle and Lateral View Usage Nuance

October 27, 2018Zahar HilkevichLeave a comment

Let’s solve a fairly simple SQL problem:

For each department count the number of people hired on Monday

• Use scott.emp table
• Show department number and count columns
• If no employees from a given department was hired on Monday, we should list such
department with 0 in the count column
• Sort the result by the department number

Expected Result:

DEPTNO	MON_HIRES
10	0
20	0
30	1

We will start with the in-line scalar subquery approach as it is probably one of the most intuitive:

Strategy #1: In-Line Scalar Subquery

SELECT deptno, (SELECT COUNT(*)
                FROM scott.emp
                WHERE deptno=e.deptno
                AND TO_CHAR(hiredate, 'DY')='MON') mon_hires
FROM scott.emp e
GROUP BY deptno
ORDER BY 1

When you only need a single value/column/expression from a correlated subquery, the in-line subquery in SELECT clause works just fine. If we needed more than one: count and let say total salary, we would need to use LATERAL view (or a completely different approach – see below).

Strategy #2: Lateral View

WITH x AS (
SELECT DISTINCT deptno 
FROM scott.emp
)
SELECT x.deptno, m.mon_hires, m.total_sal
FROM x, LATERAL (SELECT COUNT(*) mon_hires, SUM(sal) total_sal
                 FROM scott.emp e
                 WHERE e.deptno=x.deptno
                   AND TO_CHAR(e.hiredate, 'DY')='MON') m
ORDER BY 1

Result:

DEPTNO	MON_HIRES	TOTAL_SAL
10	0	–
20	0	–
30	1	1250

So far, all is good. I know that some database developers don’t like using table aliases too much and when an opportunity comes they use ANSI standard JOIN syntax with USING clause. Can it be applied in the lateral view?

WITH x AS (
SELECT DISTINCT deptno 
FROM scott.emp
)
SELECT x.deptno, m.mon_hires
FROM x, LATERAL (SELECT COUNT(empno) mon_hires
                 FROM scott.emp JOIN x USING (deptno)
                 WHERE TO_CHAR(hiredate, 'DY')='MON') m
ORDER BY 1

Result:

DEPTNO	MON_HIRES
10	1
20	1
30	1

The syntax is correct but the result is apparently not! What happened?

We replaced the WHERE clause condition of e.deptno=x.deptno with the JOIN x USING(deptno) – it first looked legitimate to me until I realized that we have just added an extra join instead of a reference to an external table (CTE x). Essentially, our last (incorrect) query is the same as the following:

WITH x AS (
SELECT DISTINCT deptno 
FROM scott.emp
)
SELECT x.deptno, m.mon_hires
FROM x, LATERAL (SELECT COUNT(*) mon_hires
                 FROM scott.emp e, x
                 WHERE e.deptno=x.deptno
                   AND TO_CHAR(e.hiredate, 'DY')='MON') m
ORDER BY 1

We simply introduced a new (and unwanted) join on the CTE x and turned the correlated reference (in the WHERE clause) into an old-style joining condition which did not even require the LATERAL view functionality:

WITH x AS (
SELECT DISTINCT deptno 
FROM scott.emp
)
SELECT x.deptno, m.mon_hires
FROM x, (SELECT COUNT(*) mon_hires
         FROM scott.emp e, x
         WHERE e.deptno=x.deptno
           AND TO_CHAR(e.hiredate, 'DY')='MON') m
ORDER BY 1

Result:

DEPTNO	MON_HIRES
10	1
20	1
30	1

The above examples prove that we need to be very careful when mixing up different techniques and new syntax options.

Finally, the last approach will show that only a single scan of the emp table is needed to get the result:

Strategy #3: Conditional Counting

SELECT deptno, COUNT(DECODE(TO_CHAR(hiredate, 'DY'), 'MON', 1)) mon_hires
FROM scott.emp
GROUP BY deptno
ORDER BY 1

Result:

DEPTNO	MON_HIRES
10	0
20	0
30	1

This is the best and incidentally the shortest solution that once again demonstrates the power of conditional counting (aggregation) right in SELECT clause.

Likewise we can also show the total salary of those hired on Monday:

SELECT deptno, 
       COUNT(DECODE(TO_CHAR(hiredate, 'DY'), 'MON', 1)) mon_hires,
       SUM(DECODE(TO_CHAR(hiredate, 'DY'), 'MON', sal)) total_sal
FROM scott.emp
GROUP BY deptno
ORDER BY 1

Result:

DEPTNO	MON_HIRES	TOTAL_SAL
10	0	–
20	0	–
30	1	1250

***

If you find this post useful, please press the LIKE button and subscribe.

My Oracle Group on Facebook:

Also, you may want to join my Oracle group on Facebook: https://www.facebook.com/groups/sqlpatterns/

Suggested Reading:

Would you like to read about many more tricks and puzzles? For more clever tricks and cool techniques check my book “Oracle SQL Tricks and Workarounds”.

Interview Question: For each department count the number of employees who get no commission.

September 10, 2018Zahar HilkevichLeave a comment

Interview Question:

Write a single SELECT statement that returns the number of employees who get no commission broken down by department. (Use scott.emp table)

Level:

Intermediate

Expected Result:

DEPTNO	NO_COMM_COUNT
10	3
20	5
30	3

Solutions

A very typical attempt to solve this problem results in the following query:

SELECT deptno, COUNT(*) no_comm_count
FROM scott.emp
WHERE comm IS NULL OR comm=0
GROUP BY deptno
ORDER BY 1

Yes, the result looks correct, but is the query correct?

The answer is NO! It would become apparent if we had a department where all employees get paid commission, so the number of those who does not would be 0.

Let’s change the requirement a bit – we will show all department and number of employees hired on Friday:

SELECT deptno, COUNT(*) fri_count
FROM scott.emp
WHERE TO_CHAR(hiredate, 'DY')='FRI'
GROUP BY deptno 
ORDER BY 1

The result of this query is clearly not what we want:

DEPTNO	FRI_COUNT
30	2

We would expect the following instead:

DEPTNO	FRI_COUNT
10	0
20	0
30	2

Why don’t we get the departments 10 and 20? The answer is very simple – because we filter “all” those department rows with our WHERE clause. So how should we work around?

Let’s start with more intuitive but less efficient approaches – we will use the same query as before and UNION it with another query that returns “empty” departments. Essentially, the original problem transforms into a new one – find all department where no employees were hired on Friday.

Strategy #1: Using UNION ALL with multi-column non-correlated subquery:

SELECT deptno, COUNT(*) fri_count 
FROM scott.emp 
WHERE TO_CHAR(hiredate, 'DY')='FRI' 
GROUP BY deptno
UNION ALL
SELECT deptno, 0 fri_count 
FROM scott.emp
WHERE (deptno, 'FRI') NOT IN (SELECT deptno, TO_CHAR(hiredate, 'DY')
                              FROM scott.emp)
GROUP BY deptno
ORDER BY 1

DEPTNO	FRI_COUNT
10	0
20	0
30	2

Strategy #2: Using UNION ALL with ALL predicate on correlated subquery:

SELECT deptno, COUNT(*) fri_count
FROM scott.emp
WHERE TO_CHAR(hiredate, 'DY')='FRI'
GROUP BY deptno
UNION ALL 
SELECT deptno, 0 no_comm_count 
FROM scott.emp a
WHERE 'FRI'!=ALL(SELECT TO_CHAR(hiredate, 'DY')
                 FROM scott.emp b
                 WHERE a.deptno=b.deptno) 
GROUP BY deptno
ORDER BY 1

It is apparent that the ALL predicate ensures that no employees were hired on Friday.

Now we will mimic the behavior of the UNION ALL operator using LEFT JOIN:

Strategy #3: Using LEFT JOIN:

SELECT a.deptno, COUNT(DISTINCT b.empno) fri_count
FROM scott.emp a LEFT JOIN scott.emp b ON a.deptno=b.deptno
                                      AND TO_CHAR(b.hiredate, 'DY')='FRI'
GROUP BY a.deptno
ORDER BY 1

COUNT(DISTINCT …) is needed to handle a Cartesian Product as the join by deptno column produces many to many relationship, i.e. Cartesian product.

Strategy #4: Generic substitution technique for an outer-join using UNION ALL

WITH e AS (
SELECT deptno, COUNT(*) fri_count
FROM scott.emp
WHERE TO_CHAR(hiredate, 'DY') = 'FRI'
GROUP BY deptno
UNION ALL
SELECT deptno, 0
FROM scott.emp
GROUP BY deptno
)
SELECT deptno, MAX(fri_count) fri_count
FROM e
GROUP BY deptno
ORDER BY 1

All the above techniques may look cool but they are clearly an overkill for such a simple problem. There is a simple rule worth remembering:

If you need to conditionally aggregate all records in the table but you fail doing so due to a WHERE clause filter, consider moving the filter into the GROUP function you use in SELECT.

Strategy #5: Conditional Aggregation

SELECT deptno, COUNT(DECODE(TO_CHAR(hiredate, 'DY'), 'FRI', 1)) fri_count
FROM scott.emp
GROUP BY deptno
ORDER BY 1

Alternatively, you can use CASE function inside of COUNT. It is especially convenient for our original question/problem, i.e. to count employees who is not paid a commission:

SELECT deptno, COUNT(CASE WHEN LNNVL(comm>0) THEN 1 END) no_comm_count
FROM scott.emp
GROUP BY deptno
ORDER BY 1

DEPTNO	NO_COMM_COUNT
10	3
20	5
30	3

This approach is the most efficient as it makes Oracle scanning the emp table only once.

Notice the use of the LNNVL function. You can read more about it in my recent post here.

My Oracle Group on Facebook:

If you like this post, you may want to join my Oracle group on Facebook: https://www.facebook.com/groups/sqlpatterns/

Suggested Reading:

Would you like to read about many more tricks and puzzles? For more tricks and cool techniques check my book “Oracle SQL Tricks and Workarounds”.

Tip of the day: use LNNVL function

April 20, 2018April 20, 2018Zahar Hilkevich1 Comment

Recently, I came across LNNVL Oracle function and decided to assess its usability.

According to Oracle documentation, it “provides a concise way to evaluate a condition when one or both operands of the condition may be null.”

Let’s see a couple of problems this function can be applied to:

For each department count the number of employees who get no commission.

There are over 10 possible solutions that I am aware of, I will focus only on a couple where we can leverage LNNVL function.

Solution #1. Filter with LNNVL in WHERE clause:

SELECT deptno, COUNT(*) cnt 
FROM scott.emp 
WHERE LNNVL(comm>0)
GROUP BY deptno
ORDER BY 1;

Result:

DEPTNO        CNT
------ ----------
    10          3
    20          5
    30          3

In the above query, LNNVL(comm>0) filter is equivalent to: (comm<=0 OR comm IS NULL) and since comm cannot be negative, we can say that it is the same as NVL(comm,0)=0.

According to that same Oracle documentation, LNNVL “can be used only in the WHERE clause of a query.” This does not seem to be true, at least in 12g+ releases:

Solution #2. Using LNNVL in conditional aggregation:

SELECT deptno, SUM(CASE WHEN LNNVL(comm>0) THEN 1 ELSE 0 END) cnt
FROM scott.emp 
GROUP BY deptno
ORDER BY 1;

LNNVL can also be used with IN operator. We will illustrate it with a solution (one of the many) to the following problem:

List all employees who is not a manager of somebody else.

SELECT empno, ename, deptno, job, mgr
FROM scott.emp
WHERE LNNVL(empno IN (SELECT mgr FROM scott.emp))
ORDER BY 1;

The following boolean expression defines a condition to see managers only:

empno IN (SELECT mgr FROM scott.emp)

We need the opposite one that handles NULLs – as mgr is a nullable column. And this is where LNNVL helps us.

If you like this post, you may want to join my new Oracle group on Facebook: https://www.facebook.com/groups/sqlpatterns/

sqlpatterns

Oracle SQL Tricks, Tips, and Workarounds

Conditional aggregation

Substitution SQL Puzzle

Explore the power of analytic functions

Monday Hiring SQL Puzzle and Lateral View Usage Nuance

Interview Question: For each department count the number of employees who get no commission.

Solutions

Tip of the day: use LNNVL function

Further Reading: