Don’t Make DeDuplicating Dastardly Difficult – A devious demystification of the dreaded duplicate
Jun 11, 2024
1 min read
0
67
0
Hey Lucas,
Yes boss?
Can you tell me which tables have duplicates in the Sales schema in the Adventures Works database by lunchtime?
But boss, it’s quarter to 12!
Lucky for you, you read crypticsql.com.
That sounds difficult right?
That’s a lot of aggregating to do!
We can do it in one query:
WITH BASE AS (
SELECT
CONCAT(
CONCAT('SELECT ',STRING_AGG(COLUMN_NAME, ', '))
, ', ROW_NUMBER () OVER (PARTITION BY ') as stmt_x, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'Sales'AND LEFT(TABLE_NAME,1) != 'v'
GROUP BY TABLE_NAME
)
,
ORDINAL AS (
SELECT COLUMN_NAME, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE 1=1 AND
TABLE_SCHEMA = 'Sales'AND LEFT(TABLE_NAME,1) != 'v'
AND ORDINAL_POSITION = 1
)
,
TWO AS (
SELECT COLUMN_NAME, TABLE_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE 1=1 AND
TABLE_SCHEMA = 'Sales'AND LEFT(TABLE_NAME,1) != 'v'
AND ORDINAL_POSITION = 2
)
SELECT
CONCAT(CONCAT('SELECT * FROM (',
CONCAT (stmt_x, o.column_name,' ORDER BY ', t.COLUMN_NAME,') AS RN FROM Sales.', B.TABLE_NAME)),') as A WHERE RN > 1')
FROM BASE as B INNER JOIN ORDINAL AS O
ON UPPER(B.TABLE_NAME) = UPPER(O.TABLE_NAME)
INNER JOIN TWO as T ON T.TABLE_NAME = B.TABLE_NAME
A bit of revision here:
Recursive Common Table expression used here:
We can see that the BASE set is aggregating the columns,
The ORDINAL set is returning the first column (the partition – in this case the primary column),
THE TWO set is selecting the order column condition
They are then joined together the table name, which is the aggregating column in all three sets.
Running this we can output the query:
It has generated the queries for us.
Some tables have duplicates. Some don't.
Mission accomplished.
Time for a kebab.
Don't forget to pay the lady.