Lance England

Aggregate Changes with T-SQL Window Functions

Recently I was presented an interesting data challenge. A database would import data feeds at the detail level of one row per patient, per month, per amount type. The output of the database would be a summary, where one row would represent the start and end period of unchanged values. Each change in amount (in chronological order) or a skipped period (or periods) would trigger a new row.

For example, in the table below

  1. PatientId ‘1234’ no changes for January 2020 - December 2020
  2. PatientId ‘2345’ has a new Amount starting in June 2020
  3. PatientId ‘3456’ has a gap for June - July 2020

The table below shows mocked up sample input data.

PatientId PeriodStartDate PeriodEndDate AmountType Amount
1234 1/1/2020 1/31/2020 EW 100
1234 2/1/2020 2/29/2020 EW 100
1234 3/1/2020 3/31/2020 EW 100
1234 4/1/2020 4/30/2020 EW 100
1234 5/1/2020 5/31/2020 EW 100
1234 6/1/2020 6/30/2020 EW 100
1234 7/1/2020 7/31/2020 EW 100
1234 8/1/2020 8/31/2020 EW 100
1234 9/1/2020 9/30/2020 EW 100
1234 10/1/2020 10/31/2020 EW 100
1234 11/1/2020 11/30/2020 EW 100
1234 12/1/2020 12/31/2020 EW 100
2345 1/1/2020 1/31/2020 EW 100
2345 2/1/2020 2/29/2020 EW 100
2345 3/1/2020 3/31/2020 EW 100
2345 4/1/2020 4/30/2020 EW 100
2345 5/1/2020 5/31/2020 EW 100
2345 6/1/2020 6/30/2020 EW 125
2345 7/1/2020 7/31/2020 EW 125
2345 8/1/2020 8/31/2020 EW 125
2345 9/1/2020 9/30/2020 EW 125
2345 10/1/2020 10/31/2020 EW 125
2345 11/1/2020 11/30/2020 EW 125
2345 12/1/2020 12/31/2020 EW 125
2345 1/1/2020 12/31/2020 C5 5
3456 1/1/2020 1/31/2020 EW 100
3456 2/1/2020 2/29/2020 EW 100
3456 3/1/2020 3/31/2020 EW 100
3456 4/1/2020 4/30/2020 EW 100
3456 5/1/2020 5/31/2020 EW 100
3456 8/1/2020 8/31/2020 EW 100
3456 9/1/2020 9/30/2020 EW 100
3456 10/1/2020 10/31/2020 EW 100
3456 11/1/2020 11/30/2020 EW 100
3456 12/1/2020 12/31/2020 EW 100

The table below shows the output summary with changes data.

PatientId PeriodStartDate PeriodEndDate AmountType Amount
1234 1/1/2020 12/31/2020 EW 100
2345 1/1/2020 12/31/2020 C5 5
2345 1/1/2020 5/31/2020 EW 100
2345 6/1/2020 12/31/2020 EW 125
3456 1/1/2020 5/31/2020 EW 100
3456 8/1/2020 12/31/2020 EW 100

The solution involved a few common table expressions and a few T-SQL window functions. Let’s walk through each step/CTE.

Step 1: Detect Changes

The important bits are on lines 7 and 8. Line 7 uses the LAG function to calculate the amount change between the previous row and current row. The first row would return NULL. We want the first row to be flagged as a ‘change’, so the ISNULL function replaces NULL with -1 ( i.e. not equals $0 to indicate a change).

Line 8 is similar in that it calculates the date difference in months between the previous row and current row. Each row is expected to represent a month, so the expected value is 1. We also use ISNULL to add -1 to the first row i.e. any value not equal to 1 indicates a change.

SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       ISNULL(Amount - LAG(Amount, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), -1) AS AmtDiff,
       ISNULL(DATEDIFF(month, LAG(PeriodStartDate, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), PeriodStartDate), -1) AS PeriodDiff
FROM
       dbo.PatientSubsidy
WHERE
       PeriodStartDate BETWEEN @PeriodStartDate AND @PeriodStartDate

Step 2: Assign a Group Number

The query from step 1 is referenced in step 2 as ‘cte_detectChanges’. The important bit in step 2 is on line 7-9. Line 7 conditionally sums a ‘1’ if any change in amount or period was detected from step 1, or a ‘0’ if no change. This effectively assigns a group number for each sequential block with no change.

SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       SUM(
              CASE WHEN AmtDiff <> 0 or PeriodDiff <> 1 THEN 1 ELSE 0 END
       ) OVER (ORDER BY PatientId, AmountType, PeriodStartDate) AS GroupNumber
FROM
       cte_detectChanges

The un-aggregated results of step 2 would look like the following table:

PatientId PeriodStartDate PeriodEndDate AmountType Amount GroupNumber
1234 1/1/2020 1/31/2020 EW 100 1
1234 2/1/2020 2/29/2020 EW 100 1
1234 3/1/2020 3/31/2020 EW 100 1
1234 4/1/2020 4/30/2020 EW 100 1
1234 5/1/2020 5/31/2020 EW 100 1
1234 6/1/2020 6/30/2020 EW 100 1
1234 7/1/2020 7/31/2020 EW 100 1
1234 8/1/2020 8/31/2020 EW 100 1
1234 9/1/2020 9/30/2020 EW 100 1
1234 10/1/2020 10/31/2020 EW 100 1
1234 11/1/2020 11/30/2020 EW 100 1
1234 12/1/2020 12/31/2020 EW 100 1
2345 1/1/2020 12/31/2020 C5 5 2
2345 1/1/2020 1/31/2020 EW 100 3
2345 2/1/2020 2/29/2020 EW 100 3
2345 3/1/2020 3/31/2020 EW 100 3
2345 4/1/2020 4/30/2020 EW 100 3
2345 5/1/2020 5/31/2020 EW 100 3
2345 6/1/2020 6/30/2020 EW 125 4
2345 7/1/2020 7/31/2020 EW 125 4
2345 8/1/2020 8/31/2020 EW 125 4
2345 9/1/2020 9/30/2020 EW 125 4
2345 10/1/2020 10/31/2020 EW 125 4
2345 11/1/2020 11/30/2020 EW 125 4
2345 12/1/2020 12/31/2020 EW 125 4
3456 1/1/2020 1/31/2020 EW 100 5
3456 2/1/2020 2/29/2020 EW 100 5
3456 3/1/2020 3/31/2020 EW 100 5
3456 4/1/2020 4/30/2020 EW 100 5
3456 5/1/2020 5/31/2020 EW 100 5
3456 8/1/2020 8/31/2020 EW 100 6
3456 9/1/2020 9/30/2020 EW 100 6
3456 10/1/2020 10/31/2020 EW 100 6
3456 11/1/2020 11/30/2020 EW 100 6
3456 12/1/2020 12/31/2020 EW 100 6

Step 3: Aggregate the Data

The query from step 2 is referenced in the final step as ‘cte_AssignGroupNumber’. The important bits here are grouping on PatientId, AmountType, Amount, and the new GroupNumber created in step 2. We get the minimum start period and maximum end period for each group.

SELECT
       PatientId,
       AmountType,
       MIN(PeriodStartDate) AS PeriodStartDate,
       MAX(PeriodEndDate) AS PeriodEndDate,
       Amount
FROM
       cte_AssignGroupNumber
GROUP BY
       PatientId,
       AmountType,
       Amount,
       GroupNumber
ORDER BY
       PatientId,
       AmountType,
       PeriodStartDate
;

Complete Query

The complete query is below. Volume and usage patterns will dictate how to organize the clustered index. It is anticipated that this will be a batch load/extract process with no additional non-clustered indexes, so the initial clustered index will probably be on PeriodStartDate.

DECLARE
       @PeriodStartDate DATE = '2020-01-01',
       @PeriodStartDate DATE = '2020-12-31'
;

WITH cte_detectChanges AS (
SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       ISNULL(Amount - LAG(Amount, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), -1) AS AmtDiff,
       ISNULL(DATEDIFF(month, LAG(PeriodStartDate, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), PeriodStartDate), -1) AS PeriodDiff
FROM
       dbo.PatientSubsidy
WHERE
       PeriodStartDate BETWEEN @PeriodStartDate AND @PeriodStartDate
),
cte_AssignGroupNumber as (
       SELECT
              PatientId,
              PeriodStartDate,
              PeriodEndDate,
              AmountType,
              Amount,
              SUM(
                     CASE WHEN AmtDiff <> 0 or PeriodDiff <> 1 THEN 1 ELSE 0 END
              ) OVER (ORDER BY PatientId, AmountType, PeriodStartDate) AS GroupNumber
       FROM
              cte_detectChanges
)
SELECT
       PatientId,
       AmountType,
       MIN(PeriodStartDate) AS PeriodStartDate,
       MAX(PeriodEndDate) AS PeriodEndDate,
       Amount
FROM
       cte_AssignGroupNumber
GROUP BY
       PatientId,
       AmountType,
       Amount,
       GroupNumber
ORDER BY
       PatientId,
       AmountType,
       PeriodStartDate
;
03 May 2021 Permalink analysis