Lance England

Sharpen the Saw

2023-11-11T00:00:00+00:00

I am a big proponent of The 7 Habits of Highly Effective People, particularly Habit 7: Sharpen the Saw. At the risk of sounding slightly self-promoting, I wanted to share my focus this past year, and current goals for the next few months.

The Certification Gauntlet

Some time ago my interest was piqued by the DevOps methodology, where development and operations become more integrated to increase the delivery of value to the business. My work experience through Improving has given me an opportunity to serve in different roles on multiple projects, from architecting and developing, to deployment and tier-2 operations support. I have seen first-hand the challenges to both initial deployments and subsequent deployments for bug fixes and enhancements. DevOps sounded like a creative solution for addressing these challenges.

AZ-400

I signed up for a DevOps boot camp and started working through the material with the goal of passing Exam AZ-400: Designing and Implementing Microsoft DevOps Solutions and obtaining the Microsoft Certified: DevOps Engineer Expert certification. It didn’t take long for me to realize I had jumped straight into the deep end and needed to take a few steps back to focus on some foundational knowledge first. I also discovered the certification has a prerequisite exam that I had missed.

The AZ-400 exam has a choice of two prerequisite exams: either Exam AZ-104: Microsoft Azure Administrator, or Exam AZ-204: Developing Solutions for Microsoft Azure. I felt I had more gaps on the administration path, and decided to pursue the AZ-104.

AZ-104

I put the AZ-400 on pause and poured myself into the AZ-104 material, using both the Microsoft Learning Paths and James Lee’s AZ-104 course for my preparation. However, as I neared completion of the course, an opportunity/diversion presented itself.

Microsoft was running a cloud skills challenge during their Build conference, and completing their Azure Solutions Architect learning path earned a free voucher to sit Exam AZ-305: Designing Microsoft Azure Infrastructure Solutions. The AZ-305 is one of two exams required for the Microsoft Certified: Azure Solutions Architect Expert certification, and it has the same choice of prerequisites as the DevOps Engineer Expert certification. Passing the AZ-104 would give me double the “bang for my buck”, since I could apply it toward two different expert certifications.

AZ-305

So, you guessed it: I paused my preparation for AZ-104 to focus on AZ-305. Both the skills challenge and the free exam voucher itself were limited time offers, so it kept me hyper-focused on getting through the material and scheduling the exam. Man, the things I’ll do for a free voucher!

My preparation for AZ-305 was through my A Cloud Guru subscription, with the instructor yet again James Lee. I can’t recommend James enough as he is thorough, clear, and encouraging in his videos. I passed the AZ-305 exam, but did not yet earn the Azure Solutions Architect certification as I still needed to pass the AZ-104 prerequisite.

With a little confidence boost from passing the AZ-305 exam, I resumed my preparation for the AZ-104. Upon passing the exam, I earned two certifications at once! Passing AZ-104 by itself earns the Microsoft Certified: Azure Administrator Associate, and combined with my AZ-305 pass it also earned the Microsoft Certified: Azure Solutions Architect Expert. Whew!

Back to AZ-400

Finally, it was time to return to the AZ-400 exam. I again used my virtual tutor James Lee to learn all about continuously integrating code with Git repos, and using pipelines to continuously deploy solutions. I felt well-prepared for the exam, but I was caught a little off-guard during the hands-on lab section. It involved more tasks than expected, and I did not allow myself enough time to complete them all. Time management is a key part of sitting an exam. Regardless, I was able to pass and earned the Microsoft Certified: DevOps Engineer Expert certification. Finally!

Reflection

I have jokingly said to a few people that I went a little crazy earning three Microsoft certifications this past year, but honestly, I enjoy the process. Yes, they are stressful. Yes, they require time and energy and focus. However, the exam is a measurable goal with a nice recognition upon completion. The goal gives me a direction for purposeful study, and the preparation guides me through learning new useful skills. Certifications are not a guarantee of anything, but like most things, you get out of it what you put into it.

Next Steps

What now? After some consideration that involved falconry, cryptozoology, or starting a Chris Gaines cover band, I have zeroed in on three specific goals for the next few months.

One, renew my Microsoft Certified: Power BI Data Analyst Associate certification. I will always be a data person at heart, and I still get a thrill out of building good data models, shaping data with Power Query, writing DAX measures, and seeing it all come together in reports and dashboards. The new Microsoft renewal process is easy; passing an online assessment will renew my certification for another year. I’ll get to go through the renewal process a lot next year!

Two, obtain the Microsoft Certified: Azure Database Administrator Associate certification. SQL Server is like an old friend, but my friend is spending a lot of time in the cloud these days. I want to stay on top of the different Azure SQL options (serverless, elastic pools, etc.) and the exam will guide me through everything I need to know.

Three, and possibly most importantly, focus on hands-on practice building side-projects. The lab portion of the AZ-400 exam reminded me of the importance of this. Instructional learning is great but combining it with practice is the key to deep learning. I have already started my first side-project and will be blogging about the process.

Longer-term, I have some other ideas: diving deeper into modern data engineering, branching into Amazon Web Services, and developing solid Linux fundamentals.

Sharpen the saw.

Quicksilver Project - Database

2023-10-18T00:00:00+00:00

This post is part of a series of getting hands-on practice on a DevOps project:

Intro
Database
API (Coming soon)

Data will be stored in a relational database, specifically an Azure SQL Database. The data model should be fairly straightforward. My initial model will be intentionally bare-bones so I can implement a CI/CD pipeline and then add more to it.

Entities

Customer - id, first, last, email. The person requesting a delivery.

Package - id, name, size, weight. The item to be delivered.

Delivery - id, address, geocode. The destination for the package.

Courier - id, first, last, email. The person delivering the package.

Future work will include events (e.g. picked up, out-for-delivery, delivered).
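
As a rough sketch of this bare-bones model, the four entities could be written as plain Python dataclasses. The attribute names follow the list above; the types, the unit for weight, and the shape of the geocode are assumptions for illustration only, not the actual table definitions. Relationships between the entities are left out because they have not been decided yet.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass
class Customer:
    """The person requesting a delivery."""
    id: int
    first: str
    last: str
    email: str


@dataclass
class Package:
    """The item to be delivered."""
    id: int
    name: str
    size: str      # assumed: free-text size such as "small"
    weight: float  # assumed unit: kilograms


@dataclass
class Delivery:
    """The destination for the package."""
    id: int
    address: str
    geocode: Tuple[float, float]  # assumed: (latitude, longitude)


@dataclass
class Courier:
    """The person delivering the package."""
    id: int
    first: str
    last: str
    email: str


# Column names per entity, usable as a checklist when stubbing out tables.
columns = {cls.__name__: [f.name for f in fields(cls)]
           for cls in (Customer, Package, Delivery, Courier)}
print(columns)
```

Customer and Courier share the same attributes for now, which is one reason the authentication question below matters for both tables.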

Questions

Before I get started on design and coding, I have some questions that I will hopefully answer as I’m working through it.

Authentication/Authorization: This could be a good use case for Azure AD B2C (business-to-consumer). How do I integrate that into the Customer and Courier tables, and enforce row-level security?
Does the VSCode SSDT extension work like the Visual Studio SSDT projects? Does it create a DACPAC that can be deployed? Does it support publish profiles?
How difficult will it be to implement an Azure SQL Database with a Bicep file?

Getting Started

Development

First, I opened VSCode and made sure the SQL Database Projects extension was installed. Then I opened the Command Palette (Ctrl+Shift+P) and chose “Database Projects: New”, which gave an option of Azure SQL Database or SQL Server. I’m not exactly sure what the difference means in terms of the extension functionality, probably the deployment method(?).

Right-clicking the extension toolbar gives options for New Table, New Stored Procedure, etc., so this looks familiar enough to quickly stub out a handful of tables and move on to the deployment step.

Deployment

I’m going to try this multiple ways, each time getting closer to the goal of CI/CD pipeline deployment.

From VSCode

First, I’ll manually create an Azure SQL Database through the portal and deploy the code from VSCode, to verify the DACPAC deploys as expected.

Azure SQL Databases now have a free tier, perfect for these types of learning exercises. I picked my Entra ID/Azure AD account as the administrator. After the database was created, I set the server firewall rule to allow my client IP address.

In VSCode, I chose ‘Publish’ and it launched a basic wizard in the Command Palette area, asking for fully-qualified server name, database name, and authentication method. It also allowed me to save a publish profile. Everything worked, so on to the next deployment method.

From Azure DevOps Pipeline

From the Azure portal, I dropped/re-created a new empty database. Note: In order to apply the free offer again, you have to delete the logical server too.

Next, I created a new project in Azure DevOps. Back in VSCode, I used the Git extension to initialize a repo. Under Repos in Azure DevOps, I got the URL of the repo and added it as a Remote for my local git repo (the database project) and published it up to Azure Repos.

Next step: creating the pipeline and setting it up to publish to Azure SQL Database. The first challenge was getting the build path correct. This took multiple attempts, including using echo $() to see the path and finding this article on Building Git Repos that mentions the default checkout location of /s. Finally, I was able to build it with the DotNetCoreCLI@2 task.

Moving on to deployment, at time of writing (October 2023) the SqlAzureDacpacDeployment@1 task only works on Windows agents, so be sure your YAML pipeline targets vmImage: windows-latest.

Azure DevOps needs a way to deploy to your Azure subscription, so first you need to set up an Azure Resource Manager service connection. The first option is the authentication type that Azure DevOps (AdO) will use when connecting to the Azure subscription. There are six (6) options, and I only have knowledge of two (2) of them, so that is another area to learn more about. I chose Service principal (automatic), in which AdO creates a service principal in the related Azure AD (aka Entra ID) instance. It lets you pick the scope (Management Group, Subscription, or Resource Group) and assigns the Contributor RBAC role.

My first decision point before creating the pipeline was how much Infrastructure-as-Code to automate. Should I try to automate the creation of the Azure SQL Database/logical server instance in the pipeline? When learning something new, I tend to go for the lowest “friction” route at first; I’d prefer to get a minimum-viable concept working, and can improve/productionize it later. So, I decided to create the Azure SQL Database manually.

Next step, SQL Database permissions. The Subscription-level RBAC Contributor role is for management-plane permissions. In order to allow the AdO service principal to create objects inside the database, it must have some level of SQL Server permissions, which I assigned using T-SQL statements after connecting with SQL Server Management Studio.

Note: In the Azure portal, browse to the SQL Database and choose “Set Firewall Rules” and choose the option to allow your client IP address.

I created a database user mapped to a server login, then assigned it the db_owner role. In the future, this could be reduced to some combination of db_ddladmin, db_datawriter, etc. However, if you plan on defining custom database roles you’ll need to have db_owner or db_securityadmin.

To build a pipeline to deploy the SQL Database Project, I used two tasks:

DotNetCoreCLI@2 - to build the project and produce a DACPAC
SqlAzureDacpacDeployment@1 - to deploy the DACPAC against the target Azure SQL Database

It took multiple attempts to get the file paths figured out, but these are exactly the type of things you just have to fumble through and learn. I used the system variable Agent.BuildDirectory and, by running DIR in a script task, discovered my repo code was under /s. After the fact, I discovered a different system variable, Build.Repository.LocalPath, that gives the path to the repo.
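
To make the relationship between the two variables concrete, here is a small Python sketch of how the paths fit together. The agent working folder is a made-up example value; the s subfolder, the Quicksilver project folder, and the DACPAC name come from this post.

```python
from pathlib import PureWindowsPath

# Hypothetical value of Agent.BuildDirectory on a Windows agent.
agent_build_directory = PureWindowsPath(r"D:\a\1")

# By default the repo is checked out into the "s" subfolder, which is
# the folder that Build.Repository.LocalPath refers to.
repo_local_path = agent_build_directory / "s"

# A Debug build of the Quicksilver project leaves the DACPAC under bin\Debug.
dacpac = repo_local_path / "Quicksilver" / "bin" / "Debug" / "Quicksilver.dacpac"

print(dacpac)
```

Starting from the repo path rather than the build directory saves having to remember the extra s segment.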

After some trial and error, I was able to build and deploy, hurrah! My YAML pipeline (at this point):

trigger:
- main

pool:
  vmImage: windows-latest

steps:
- task: DotNetCoreCLI@2
  inputs:
    command: 'build'
    projects: '**/*.sqlproj'

- task: SqlAzureDacpacDeployment@1
  inputs:
    azureSubscription: 'Visual Studio Enterprise Subscription – MPN(guid redacted)'
    AuthenticationType: 'servicePrincipal'
    ServerName: 'quicksilver-dev-sqlsrv01.database.windows.net'
    DatabaseName: 'QuicksilverDb'
    deployType: 'DacpacTask'
    DeploymentAction: 'Publish'
    DacpacFile: '$(Build.Repository.LocalPath)\Quicksilver\bin\Debug\Quicksilver.dacpac'
    IpDetectionMethod: 'AutoDetect'

Future Steps

Change the pipeline to trigger off merging a change request.
Add parameters for values that will change between environments (subscription, resource group, etc.).
Learn about different service principal authentication options.
Determine how much Infrastructure-as-Code should be in the pipeline. For databases, it’s harder to define than application code.
Reduce the permissions of the service principal in the database (db_ddladmin, etc.).
Research the Database Project publish profile options.
Add a task for SSDT to generate scripts and add a manual review/approval deployment gate.
Learn the differences between the two VSCode Database Project types, Azure SQL Database vs. SQL Server.

Quicksilver Project - Intro

2023-10-14T00:00:00+00:00

This post is part of a series of getting hands-on practice on a DevOps project:

Intro
Database
API (Coming soon)

This Quicksilver project is simply to get hands-on practice in cloud technologies. Quicksilver is a fictional courier company from an 80s movie (credit to co-worker Bill Delaune for coming up with the idea). The goal is to build a basic front-end client and back-end system for customers and couriers to have packages delivered. The code will be a very basic proof-of-concept, but will be built with DevOps principles of source control, continuous integration (CI) and continuous delivery (CD).

Back-end technologies used:

Azure DevOps
Azure SQL Database
ASP.NET API
Azure Container Instances
Azure Maps
Entra ID (aka Azure AD)
Azure Key Vault

Front-end technologies:

To be determined, probably some super basic web app

User Stories

Customer

Create/edit account
Create/edit package/delivery request
Authentication/Authorization

Courier

Create/edit account
Create/edit delivery job of package
Authentication/Authorization

Implementation

Data will be persisted in Azure SQL Database using an API interface. The API will be a C# ASP.NET Core container hosted in Azure Container Instances. Azure Maps will be used to translate addresses to geocodes, and to provide route directions to couriers.

Code will be hosted in either GitHub or Azure DevOps, with CI/CD using Azure Pipelines.

Plan and Random Thoughts

I’m starting with my wheelhouse, the data model. However, my initial model will be incomplete by design, so that I can then hook up a CI/CD pipeline for further development.

Next, I will implement a very basic API for CRUD operations against the data model. Again, because the data model will be incomplete, this will also evolve with CI/CD.

I need to determine where and how to handle authentication/authorization for the API. I could put an API Management in front of it, but also want to explore other options.

I prefer the repo to be in GitHub, but will initially use Azure Repos to simplify implementing the pipelines in Azure DevOps. Later, I may try moving the repo to GitHub and either integrating with Azure DevOps for the pipelines, or using GitHub Actions. Or, maybe both just for the experience.

I will use VSCode, including the SQL Database Project extension, and for the API use the Developing Inside a Container functionality.

Infrastructure-as-Code will be defined with Bicep files.

The development and production pipelines will target different resources, optimized for cost in development and optimized for scaling in production, while still aiming for cost effectiveness.

Viewing Column Sizes in Power BI Desktop

2021-06-05T00:00:00+00:00

When performance tuning Power BI data models, the most important consideration is the size of each column. The tabular model is a columnar database which means the data values are organized by column, allowing for increased compression. Analytic workloads, such as Power BI, benefit greatly from this.

Because the entire data model is loaded in memory, the smaller the data model, the greater the performance gain. The first step in performance tuning is always measurement, and the first measurement is the size of each column. Luckily there is an easy way to do this using DAX Studio and the DISCOVER_STORAGE_TABLE_COLUMNS dynamic management view (DMV).

Open DAX Studio and connect to your Power BI Desktop data model. Behind the scenes, the data model is a SQL Server Analysis Services tabular model, so it supports a host of dynamic management views. The query below lists table name, column name, and dictionary size, largest first. Once you identify the largest columns, you can decide if and how to deal with them.

Your optimization options are:

Delete it if you don’t need it
Split it (for example, split a datetime into separate date and time columns, or split order numbers that follow a pattern so that each part repeats values and compresses better)
Limit the number of distinct values (meaning group outliers into a catch-all bucket)
Do nothing, because you need the column
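
To see why splitting helps, it is enough to count distinct values, since fewer distinct values per column generally means better compression. The Python sketch below uses made-up data (one timestamp per minute for 30 days) and only counts values; it does not measure actual Power BI storage.

```python
from datetime import datetime, timedelta

# Made-up sample: one timestamp per minute for 30 days.
start = datetime(2021, 1, 1)
timestamps = [start + timedelta(minutes=i) for i in range(30 * 24 * 60)]

# One combined datetime column: every value is distinct.
combined_distinct = len(set(timestamps))

# The same data split into a date column and a time column.
date_distinct = len({ts.date() for ts in timestamps})
time_distinct = len({ts.time() for ts in timestamps})

print(combined_distinct)              # 43200
print(date_distinct + time_distinct)  # 30 + 1440 = 1470
```

The two split columns together hold far fewer distinct values than the single combined column, and the gap widens as more days are added.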

The query:

SELECT
    DIMENSION_NAME,
    ATTRIBUTE_NAME,
    DICTIONARY_SIZE
FROM
    $System.DISCOVER_STORAGE_TABLE_COLUMNS
WHERE
    COLUMN_TYPE = 'BASIC_DATA'
ORDER BY
    DICTIONARY_SIZE DESC

Aggregate Changes with T-SQL Window Functions

2021-05-03T00:00:00+00:00

Recently I was presented with an interesting data challenge. A database would import data feeds at the detail level of one row per patient, per month, per amount type. The output of the database would be a summary, where one row would represent the start and end period of unchanged values. Each change in amount (in chronological order) or a skipped period (or periods) would trigger a new row.

For example, in the table below

PatientId ‘1234’ has no changes for January 2020 - December 2020
PatientId ‘2345’ has a new Amount starting in June 2020
PatientId ‘3456’ has a gap for June - July 2020

The table below shows mocked up sample input data.

PatientId	PeriodStartDate	PeriodEndDate	AmountType	Amount
1234	1/1/2020	1/31/2020	EW	100
1234	2/1/2020	2/29/2020	EW	100
1234	3/1/2020	3/31/2020	EW	100
1234	4/1/2020	4/30/2020	EW	100
1234	5/1/2020	5/31/2020	EW	100
1234	6/1/2020	6/30/2020	EW	100
1234	7/1/2020	7/31/2020	EW	100
1234	8/1/2020	8/31/2020	EW	100
1234	9/1/2020	9/30/2020	EW	100
1234	10/1/2020	10/31/2020	EW	100
1234	11/1/2020	11/30/2020	EW	100
1234	12/1/2020	12/31/2020	EW	100
2345	1/1/2020	1/31/2020	EW	100
2345	2/1/2020	2/29/2020	EW	100
2345	3/1/2020	3/31/2020	EW	100
2345	4/1/2020	4/30/2020	EW	100
2345	5/1/2020	5/31/2020	EW	100
2345	6/1/2020	6/30/2020	EW	125
2345	7/1/2020	7/31/2020	EW	125
2345	8/1/2020	8/31/2020	EW	125
2345	9/1/2020	9/30/2020	EW	125
2345	10/1/2020	10/31/2020	EW	125
2345	11/1/2020	11/30/2020	EW	125
2345	12/1/2020	12/31/2020	EW	125
2345	1/1/2020	12/31/2020	C5	5
3456	1/1/2020	1/31/2020	EW	100
3456	2/1/2020	2/29/2020	EW	100
3456	3/1/2020	3/31/2020	EW	100
3456	4/1/2020	4/30/2020	EW	100
3456	5/1/2020	5/31/2020	EW	100
3456	8/1/2020	8/31/2020	EW	100
3456	9/1/2020	9/30/2020	EW	100
3456	10/1/2020	10/31/2020	EW	100
3456	11/1/2020	11/30/2020	EW	100
3456	12/1/2020	12/31/2020	EW	100

The table below shows the output summary, with one row per period of unchanged values.

PatientId	PeriodStartDate	PeriodEndDate	AmountType	Amount
1234	1/1/2020	12/31/2020	EW	100
2345	1/1/2020	12/31/2020	C5	5
2345	1/1/2020	5/31/2020	EW	100
2345	6/1/2020	12/31/2020	EW	125
3456	1/1/2020	5/31/2020	EW	100
3456	8/1/2020	12/31/2020	EW	100

The solution involved a few common table expressions and a few T-SQL window functions. Let’s walk through each step/CTE.

Step 1: Detect Changes

The important bits are on lines 7 and 8. Line 7 uses the LAG function to calculate the change in amount between the previous row and the current row. The first row in each partition has no previous row, so LAG returns NULL. We want that first row flagged as a ‘change’, so the ISNULL function replaces NULL with -1 (i.e. any value other than 0 indicates a change).

Line 8 is similar: it calculates the difference in months between the previous row’s start date and the current row’s. Each row is expected to represent one month, so the expected value is 1. ISNULL again substitutes -1 for the first row, so any value not equal to 1 indicates a change or a gap.

SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       ISNULL(Amount - LAG(Amount, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), -1) AS AmtDiff,
       ISNULL(DATEDIFF(month, LAG(PeriodStartDate, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), PeriodStartDate), -1) AS PeriodDiff
FROM
       dbo.PatientSubsidy
WHERE
       PeriodStartDate BETWEEN @PeriodStartDate AND @PeriodEndDate
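
For readers less familiar with LAG, the same logic can be sketched in plain Python for a single PatientId/AmountType partition. This is only an illustration of the idea, not the production code; the rows mirror part of patient 3456 from the sample data, and month_diff is a helper written here to imitate DATEDIFF(month, ...).

```python
from datetime import date

# Rows for one partition, already sorted by period start:
# (PeriodStartDate, Amount). June and July are missing.
rows = [
    (date(2020, 4, 1), 100),
    (date(2020, 5, 1), 100),
    (date(2020, 8, 1), 100),
    (date(2020, 9, 1), 100),
]


def month_diff(earlier, later):
    """Count month boundaries crossed, like T-SQL DATEDIFF(month, ...)."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


detected = []
previous = None  # stands in for LAG(..., 1): the first row has no previous row
for period_start, amount in rows:
    if previous is None:
        amt_diff, period_diff = -1, -1  # ISNULL(..., -1) for the first row
    else:
        amt_diff = amount - previous[1]
        period_diff = month_diff(previous[0], period_start)
    detected.append((period_start, amount, amt_diff, period_diff))
    previous = (period_start, amount)

for row in detected:
    print(row)
```

The August row comes out with a PeriodDiff of 3 rather than 1, which is what marks the gap.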

Step 2: Assign a Group Number

The query from step 1 is referenced in step 2 as ‘cte_detectChanges’. The important bit in step 2 is on lines 7-9, which keep a running total that adds ‘1’ for each row where step 1 detected a change in amount or period, and ‘0’ where it did not. This effectively assigns a group number to each sequential block with no change.

SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       SUM(
              CASE WHEN AmtDiff <> 0 or PeriodDiff <> 1 THEN 1 ELSE 0 END
       ) OVER (ORDER BY PatientId, AmountType, PeriodStartDate) AS GroupNumber
FROM
       cte_detectChanges

The un-aggregated results of step 2 would look like the following table:

PatientId	PeriodStartDate	PeriodEndDate	AmountType	Amount	GroupNumber
1234	1/1/2020	1/31/2020	EW	100	1
1234	2/1/2020	2/29/2020	EW	100	1
1234	3/1/2020	3/31/2020	EW	100	1
1234	4/1/2020	4/30/2020	EW	100	1
1234	5/1/2020	5/31/2020	EW	100	1
1234	6/1/2020	6/30/2020	EW	100	1
1234	7/1/2020	7/31/2020	EW	100	1
1234	8/1/2020	8/31/2020	EW	100	1
1234	9/1/2020	9/30/2020	EW	100	1
1234	10/1/2020	10/31/2020	EW	100	1
1234	11/1/2020	11/30/2020	EW	100	1
1234	12/1/2020	12/31/2020	EW	100	1
2345	1/1/2020	12/31/2020	C5	5	2
2345	1/1/2020	1/31/2020	EW	100	3
2345	2/1/2020	2/29/2020	EW	100	3
2345	3/1/2020	3/31/2020	EW	100	3
2345	4/1/2020	4/30/2020	EW	100	3
2345	5/1/2020	5/31/2020	EW	100	3
2345	6/1/2020	6/30/2020	EW	125	4
2345	7/1/2020	7/31/2020	EW	125	4
2345	8/1/2020	8/31/2020	EW	125	4
2345	9/1/2020	9/30/2020	EW	125	4
2345	10/1/2020	10/31/2020	EW	125	4
2345	11/1/2020	11/30/2020	EW	125	4
2345	12/1/2020	12/31/2020	EW	125	4
3456	1/1/2020	1/31/2020	EW	100	5
3456	2/1/2020	2/29/2020	EW	100	5
3456	3/1/2020	3/31/2020	EW	100	5
3456	4/1/2020	4/30/2020	EW	100	5
3456	5/1/2020	5/31/2020	EW	100	5
3456	8/1/2020	8/31/2020	EW	100	6
3456	9/1/2020	9/30/2020	EW	100	6
3456	10/1/2020	10/31/2020	EW	100	6
3456	11/1/2020	11/30/2020	EW	100	6
3456	12/1/2020	12/31/2020	EW	100	6
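
The running total is the heart of the technique, so here it is in isolation as a short Python sketch. The AmtDiff/PeriodDiff pairs are illustrative values (first row, steady, amount change, steady, gap), not rows taken from the table above.

```python
from itertools import accumulate

# (AmtDiff, PeriodDiff) per row, in PatientId/AmountType/PeriodStartDate order.
diffs = [(-1, -1), (0, 1), (25, 1), (0, 1), (0, 3)]

# 1 when a row starts a new block, otherwise 0 (the CASE expression).
flags = [1 if amt != 0 or period != 1 else 0 for amt, period in diffs]

# Running total of the flags (the SUM(...) OVER (ORDER BY ...) part).
group_numbers = list(accumulate(flags))

print(flags)          # [1, 0, 1, 0, 1]
print(group_numbers)  # [1, 1, 2, 2, 3]
```

Every row keeps the group number of the most recent change, so consecutive unchanged rows end up sharing a number.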

Step 3: Aggregate the Data

The query from step 2 is referenced in the final step as ‘cte_AssignGroupNumber’. The important bits here are grouping on PatientId, AmountType, Amount, and the new GroupNumber created in step 2. We get the minimum start period and maximum end period for each group.

SELECT
       PatientId,
       AmountType,
       MIN(PeriodStartDate) AS PeriodStartDate,
       MAX(PeriodEndDate) AS PeriodEndDate,
       Amount
FROM
       cte_AssignGroupNumber
GROUP BY
       PatientId,
       AmountType,
       Amount,
       GroupNumber
ORDER BY
       PatientId,
       AmountType,
       PeriodStartDate
;
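
The same aggregation can be sketched in Python with a dictionary keyed on the GROUP BY columns. The input rows are a hand-picked subset of patient 3456 with the group numbers from step 2; the variable names are only for illustration.

```python
from datetime import date

# Rows as produced by step 2:
# (PatientId, AmountType, PeriodStartDate, PeriodEndDate, Amount, GroupNumber)
rows = [
    ("3456", "EW", date(2020, 4, 1), date(2020, 4, 30), 100, 5),
    ("3456", "EW", date(2020, 5, 1), date(2020, 5, 31), 100, 5),
    ("3456", "EW", date(2020, 8, 1), date(2020, 8, 31), 100, 6),
    ("3456", "EW", date(2020, 9, 1), date(2020, 9, 30), 100, 6),
]

# GROUP BY PatientId, AmountType, Amount, GroupNumber
groups = {}
for patient, amount_type, start, end, amount, group in rows:
    key = (patient, amount_type, amount, group)
    first, last = groups.get(key, (start, end))
    groups[key] = (min(first, start), max(last, end))  # MIN / MAX

# ORDER BY PatientId, AmountType, PeriodStartDate
summary = sorted(
    (patient, amount_type, first, last, amount)
    for (patient, amount_type, amount, _), (first, last) in groups.items()
)
for line in summary:
    print(line)
```

The four detail rows collapse into two summary rows, one on each side of the June - July gap.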

Complete Query

The complete query is below. Volume and usage patterns will dictate how to organize the clustered index. It is anticipated that this will be a batch load/extract process with no additional non-clustered indexes, so the initial clustered index will probably be on PeriodStartDate.

DECLARE
       @PeriodStartDate DATE = '2020-01-01',
       @PeriodEndDate DATE = '2020-12-31'
;

WITH cte_detectChanges AS (
SELECT
       PatientId,
       PeriodStartDate,
       PeriodEndDate,
       AmountType,
       Amount,
       ISNULL(Amount - LAG(Amount, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), -1) AS AmtDiff,
       ISNULL(DATEDIFF(month, LAG(PeriodStartDate, 1) OVER (PARTITION BY PatientId, AmountType ORDER BY PeriodStartDate), PeriodStartDate), -1) AS PeriodDiff
FROM
       dbo.PatientSubsidy
WHERE
       PeriodStartDate BETWEEN @PeriodStartDate AND @PeriodEndDate
),
cte_AssignGroupNumber as (
       SELECT
              PatientId,
              PeriodStartDate,
              PeriodEndDate,
              AmountType,
              Amount,
              SUM(
                     CASE WHEN AmtDiff <> 0 or PeriodDiff <> 1 THEN 1 ELSE 0 END
              ) OVER (ORDER BY PatientId, AmountType, PeriodStartDate) AS GroupNumber
       FROM
              cte_detectChanges
)
SELECT
       PatientId,
       AmountType,
       MIN(PeriodStartDate) AS PeriodStartDate,
       MAX(PeriodEndDate) AS PeriodEndDate,
       Amount
FROM
       cte_AssignGroupNumber
GROUP BY
       PatientId,
       AmountType,
       Amount,
       GroupNumber
ORDER BY
       PatientId,
       AmountType,
       PeriodStartDate
;

Goodbye, SQL PASS

2021-01-14T00:00:00+00:00

SQL Pass (or PASS) has been a great amplifier of both community and professional development for thousands. I was saddened by their recent announcement:

We are saddened to tell you that, due to the impact of COVID-19, PASS is ceasing all regular operations, effective January 15, 2021.

I have spoken at three different SQL Saturday events. I was SO nervous the first time that after my session I just had to go sit in my car in the parking lot and close my eyes. But it was a thrill. My favorite SQL Saturday was the last one I spoke at, because I was there not just representing myself, but also Improving - Atlanta. Experiencing the event as a team was that much more rewarding.

Ultimately, the community is bigger than just SQL PASS. Already virtual groups are migrating to Meetup and other platforms. SQL Saturdays will likely re-emerge under a different name. That said, I still hate to see SQL PASS cease operations. Thank you to all who served as board members, volunteers, and staff.

Goodbye, SQL PASS, and thanks for the memories.

Passing the DA-100 Exam - Study Notes

2020-12-22T00:00:00+00:00

I recently passed Exam DA-100: Analyzing Data with Microsoft Power BI and earned Microsoft Certified: Data Analyst Associate. The DA-100 is the new role-based replacement for Exam 70-778: Analyzing and Visualizing Data with Microsoft Power BI, which retires January 31, 2021. While much of the exam content overlaps, the skills measured have expanded and the weights have been adjusted. Luckily, Microsoft has done a really nice job with their free learning paths to prepare for the exam.

Preparation

The learning paths are listed at the bottom of the exam page. They walk through all the features and link to related Microsoft Docs pages for more information. In addition, they offer ‘sandbox’ labs to give hands-on practice for the topics. While not a replacement for working experience with Power BI, they are still an excellent way to work through the presented topics.

The following are notes I compiled during exam preparation. All notes are from the learning paths and related documentation. They are NOT comprehensive of everything on the exam, and shouldn’t be considered a replacement for preparation. I wanted to create a quick way to review the learning objectives the morning of the exam. To all pursuing the DA-100, I wish you the best of luck!

Get started with Microsoft data analytics

2 Modules

This learning path summarizes the types of analytics, related roles, and parts of the Power BI platform.

I. Discover data analysis

Analytics categories:

Descriptive - what
Diagnostic - why
Predictive - trends, e.g. neural networks, decision trees, regression
Prescriptive - how to hit target, e.g. machine learning on past data sets
Cognitive - Draw inferences from data, and add the findings back to knowledge base for self-learning loop

Roles:

Business analyst - closer to the business
Data analyst - reporting specialist, manage reports, data sets, dashboards, etc. The exam is on Data Analyst! The DA tasks are the objectives of the exam
Data engineer - data flow and integration
Data scientist - extract value from data through analytics
Database administrator - operational aspect of data platform

II. Get started building with Power BI

Power BI is:

Power BI Desktop
Power BI service
Power BI mobile

Building blocks in Power BI:

Visualizations - visual representation of data
Datasets - collection of data
Reports - collection of visualizations on one or more pages
Dashboards - collection of visualizations on one page
Tiles - single visualization

App - a collection of preset visuals, data, reports packaged to share

Prepare data for analysis

2 modules

This learning path covers how to connect to various data sources and transform the data.

I. Get data in Power BI

A variety of data source connectors are built in, e.g. flat file, relational, NoSQL, multidimensional, online

Storage modes:

Import
DirectQuery
Dual (Composite)

Additional option for connecting to multidimensional i.e. Analysis Services is called Connect Live (formerly named Live Connect).

Both Import and DirectQuery go through Power Query (Connect Live does NOT). Power Query can optimize source queries in some circumstances. The process is called query folding. Any transformation that has an analogous SQL clause, e.g. WHERE, GROUP BY, ORDER BY, UNION ALL, and JOIN, can generally be folded.

Query Diagnostics is a tool new to me that is part of Power Query. Enable diagnostics, refresh the source(s), then stop the diagnostics. It presents information for each transformation step!

Optimization techniques:

Process as much at source as possible
For DirectQuery, use native SQL and not stored procs or CTEs
For Import from relational, don’t use embedded SQL as Power Query will NOT perform query folding
Split date and time columns (this is really a data modeling optimization) as it increases compression (smaller memory footprint).

Power Query also has tools to view data quality, column distribution, and column profile.

Distinct values count makes sense. Unique values count sounds like the same thing, but it’s the number of values that only appear once.
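
A tiny Python sketch makes the difference easy to remember. The column values are made up; the two counts follow the definitions in the note above.

```python
from collections import Counter

values = ["red", "blue", "red", "green", "blue", "yellow"]  # made-up column

counts = Counter(values)
distinct_count = len(counts)                              # different values
unique_count = sum(1 for n in counts.values() if n == 1)  # values seen once

print(distinct_count)  # 4
print(unique_count)    # 2
```

Here red and blue are distinct values but not unique ones, because each appears twice.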

Power Query examines the first 1000 rows by default. Change this by clicking the profiling status in the status bar and selecting ‘Column profiling based on entire data set’.

II. Clean, transform, and load data in Power BI

Append queries are like SQL UNION ALL.

Merge queries are like SQL JOIN (INNER, LEFT OUTER, FULL OUTER)
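
As a memory aid, the two operations can be imitated in Python with lists of dictionaries. The table and column names are made up, and this is only an analogy for how the rows combine, not how Power Query implements them.

```python
# Made-up tables as lists of dicts.
sales_2019 = [{"customer_id": 1, "amount": 10}]
sales_2020 = [{"customer_id": 2, "amount": 20}, {"customer_id": 3, "amount": 30}]
customers = {1: "Ann", 2: "Bob"}  # customer_id -> name lookup table

# Append: stack the rows of both queries, keeping duplicates (UNION ALL).
appended = sales_2019 + sales_2020

# Merge, inner: keep only rows that have a match in the other table.
inner = [{**row, "name": customers[row["customer_id"]]}
         for row in appended if row["customer_id"] in customers]

# Merge, left outer: keep every left row, None where there is no match.
left_outer = [{**row, "name": customers.get(row["customer_id"])}
              for row in appended]

print(len(appended), len(inner), len(left_outer))  # 3 2 3
```

Customer 3 has no match, so it drops out of the inner merge and gets an empty name in the left outer merge.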

Model data in Power BI

3 modules

I. Design a data model in Power BI

Date Table - can be imported from a DW (best), built in Power Query (great), built with DAX (good).

Power Query - = List.Dates(#date(2011,05,31), 365*10, #duration(1,0,0,0)) Then convert the List to Table, then add columns as needed. There are many built-in date conversions in the UI. Highlight the date column, Add Column -> Date (in the From Date & Time group).

DAX - CALENDAR or CALENDARAUTO, then add each column e.g. DayOfMonth, Year, etc. The table won’t compress quite as well, but date tables are small enough to have negligible difference.

Mark Table as Date Table. Dates must not duplicate, and not have gaps.
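As a sanity check on the two rules above, this Python sketch mirrors the List.Dates call (start, count, step) and then tests for duplicates and gaps. The helper names are mine, and the validation is only my reading of the "no duplicates, no gaps" requirement.

```python
from datetime import date, timedelta

def build_dates(start, count, step=timedelta(days=1)):
    """Mirror List.Dates: count values beginning at start, spaced by step."""
    return [start + i * step for i in range(count)]

def is_valid_date_table(dates):
    """True when the dates have no duplicates and no gaps between first and last."""
    if not dates:
        return False
    distinct = set(dates)
    span = (max(distinct) - min(distinct)).days + 1
    return len(distinct) == len(dates) == span

dates = build_dates(date(2011, 5, 31), 365 * 10)
print(len(dates), is_valid_date_table(dates))  # 3650 True

# Dropping one day in the middle creates a gap.
print(is_valid_date_table(dates[:100] + dates[101:]))  # False
```

Note that 365*10 ignores leap days, so the list ends a few days short of a full ten calendar years.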

Best Practice - Turn off Auto DateTime, as it builds a hidden date table for each datetime column in the entire data model.

Composite models allow a combination of imported tables and DirectQuery tables. Relationships between imported and DirectQuery tables default to many-to-many, but can be changed. It defaults to many-to-many because it can neither detect nor assume uniqueness for the DirectQuery table.

Many-to-many relationships are now supported and can replace the traditional bridge table method. The important bit to remember is to still use single-direction filter propagation. Marco Russo has a very good presentation on this.

Aggregations are a powerful feature for combining DirectQuery detail data with the performance of imported data. Aggregations import some higher grain of the detail data. A key supporting feature is Dual mode for related dimension tables. Dual mode imports the data for DAX queries against the aggregation, and treats the table as DirectQuery when relating to the detail data. This is a very important performance feature to understand.

II. Introduction to creating measures using DAX in Power BI

Calculated columns are materialized at data refresh. They do not compress as well as imported columns, meaning they consume more memory.

A simple formula to demonstrate non-additive measure pattern:

Last Inventory Count =
CALCULATE (
    SUM ( 'Warehouse'[Inventory Count] ),
    LASTDATE ( 'Date'[Date] ))

Every time the measure is evaluated, it uses the latest date visible in the filter context.
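The same pattern as a Python sketch, to show why a plain sum is wrong for snapshot data. The snapshot rows are invented, and returning None when the last visible date has no rows is my stand-in for a blank result.

```python
from datetime import date

# Hypothetical month-end snapshot rows: (date, warehouse, inventory count)
SNAPSHOTS = [
    (date(2021, 1, 31), "East", 100),
    (date(2021, 1, 31), "West", 50),
    (date(2021, 2, 28), "East", 80),
    (date(2021, 2, 28), "West", 60),
]

def last_inventory_count(rows, visible_dates):
    """Sum counts on the latest visible date only, instead of across all dates."""
    if not visible_dates:
        return None
    last = max(visible_dates)
    matching = [count for d, _, count in rows if d == last]
    return sum(matching) if matching else None

january = {date(2021, 1, d) for d in range(1, 32)}
february = {date(2021, 2, d) for d in range(1, 29)}

print(last_inventory_count(SNAPSHOTS, january))             # 150
print(last_inventory_count(SNAPSHOTS, february))            # 140
print(last_inventory_count(SNAPSHOTS, january | february))  # 140
print(sum(count for _, _, count in SNAPSHOTS))              # 290
```

With both months visible, the result is the February snapshot (140) rather than the additive total (290).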

Tables that only have measures visible (all columns hidden) are displayed at the top of the fields list with a different icon.

III. Optimize a model for performance in Power BI

A good data model is THE key to accurate numbers and good performance.

In general, reduce the size of the data model by:

Removing unnecessary columns
Removing unnecessary Date/Time hierarchies (turn off Auto date/time option)
Reducing cardinality as much as possible. For example, store quantity and price and compute sales as qty * price instead of storing a higher-cardinality sales column
Avoiding calculated columns
Using the star schema design pattern
Pushing detail grain to DirectQuery and storing higher-grain data as aggregations

Power BI Desktop has a Performance Analyzer to show time elapsed for each visual and the generated DAX. Be aware of the visual cache and the storage cache. To clear both:

Create a new blank report page in Power BI Desktop.
Close and re-open Power BI Desktop
Start the Performance Analyzer (while still on the blank report page)
Finally, open the report page to analyze

Direct Query does not support Quick Insights and Q&A in the Power BI service.

To minimize lag when visuals refresh, you can reduce the number of queries sent by using the Query Reduction options. These allow disabling automatic visual interaction, applying slicer selections manually (via a button), and applying filters via buttons (either as you go or all at once).

Visualize data in Power BI

4 modules

I. Work with Power BI visuals

Types of visuals:

Clustered vs stacked bar chart - clustered shows the total while stacked shows the breakdown of the total.

Line vs area chart - area chart is just a line chart filled in.

Pie chart vs donut vs tree map - A donut chart is a pie chart with the middle missing for a label. A tree map is a square pie chart essentially.

Combo chart - combines a bar chart and a line chart.

Card - single data point.

Funnel charts - Good for showing values for a sequential process e.g. sales lead conversion.

Gauge chart - progress towards a goal.

Waterfall/bridge chart - shows a running total as increases/decreases.

Scatter chart - good for analyzing a large set of data points over time.

Maps - requires the specific data sources to be categorized for use.

Slicer - For filtering data on other visuals.

Q&A - Allows natural language questions and answers. A good data model and adding metadata (synonyms) can improve this experience. Q&A has these components:

The question box, for typing questions (and showing suggestions)
A pre-populated list of suggested questions
An icon that users can select to convert the Q&A visual into a standard visual.
An icon that users can select to open Q&A tooling, which allows designers to configure the underlying natural language engine.

Tooltips - Tooltips support data fields. Also, they support displaying a report (that has been registered as a tooltip).

Custom visuals - Can be imported from AppSource or Your Organizations gallery. Custom visuals must be imported each time you start developing a new report.

R/Python visual - Need to enable script visual and configure the runtime.

KPIs - Needs three items: 1) a goal, 2) a unit of measurement, and 3) a time-series.

II. Create a data-driven story with Power BI reports

Power BI Desktop provides tools for aligning and resizing the report and the layout within it.

Accessibility features built-in:

Keyboard navigation
Screen-reader compatibility
High contrast colors view
Focus mode
Show data table

Accessibility features to configure:

Alt text
Tab order
Titles and labels
Markers
Themes

The following features work together:

Bookmarks capture the state of a view of the report, allowing you to quickly return to or restore it.
Selections allow you to enable/disable items from the bookmark. This is viewable in the selection pane.
Buttons have actions, one of which is to display a bookmark.

Button: conditional formatting based on a measure can include the action.

Cross-report drill-through - Can be enabled/disabled in Power BI Desktop options and/or the Power BI service. It allows a report to drill through to a different report. Note: A navigation ‘Back’ button is created automatically but should be deleted, because ‘Back’ only works for navigation within a single report.

Conditional formatting has many Excel-like options like highlighting and color bars.

Slicers vs filters - a slicer is a visual, so it generates an additional DAX query to populate its values before any selection is made. Filters are not visuals and do not generate an additional DAX query.

Types of slicers:

Numeric range slicers
Relative date slicers
Relative time slicers
Responsive, resizable slicers
Hierarchy slicers with multiple fields

III. Create dashboards in Power BI

Dashboards vs Reports

Dashboards can be created from multiple datasets or reports.
Dashboards do not have the Filter, Visualization, and Fields panes that are in Power BI Desktop, meaning that you can’t add new filters and slicers, and you can’t make edits.
Dashboards can only be a single page, whereas reports can be multiple pages.
You can’t see the underlying dataset directly in a dashboard, while you can see the dataset in a report - under the Data tab in Power BI Desktop.
Both dashboards and reports can be refreshed to show the latest data.

Besides pinned tiles, tiles can be created directly on the dashboard for text boxes, images, videos, streaming data, and web content.

Dashboards support pinned reports named “Live Page” because as the data is refreshed, so will the tile in the dashboard. You can also interact with them, unlike other tiles.

Dashboards support themes, both out-of-the-box themes and custom themes authored as JSON.

Question in Learning Path I got wrong - In both reports and dashboards, you can use the slicers and filter by selecting a data point. Hmmm.

Data alerts are only in the Power BI service for certain visuals: KPI visuals, gauges, and cards. Alerts will show an alert icon over the tile (though sometimes the browser cache requires an F5 reload) and can optionally send an email.

Report users can configure their own set of alerts.

Q&A - Natural-language interface for the data. Three main components:

Question box
Pre-populated suggestion tiles
Pin visual icon

Real-time data is supported through streaming datasets. These are stored in cache and do not support data modeling. Tiles on a dashboard are bound directly to the streaming dataset.

Data classification - a way to tag reports for informational awareness ONLY; no actual security is enforced by the Power BI service.

Phone view in the dashboard is customizable for each user.

When pinning visuals to a dashboard, they retain whatever filter context is selected. Tip: Use relative date slicers e.g. Current Week, Current Month, etc.

IV. Create paginated reports

Paginated reports are basically SSRS, and only for Premium Capacity (or the new Premium-per-user license). Best use: Operational reports and tabular formatted data.

Not built in Power BI Desktop or the service; built with Power BI Report Builder, basically a newer version of the SSRS Report Builder app.

Paginated reports can connect to Power BI datasets. It uses the MDX editor, not the SQL editor in those cases.

Data analysis in Power BI

2 modules

I. Perform analytics in Power BI

Power BI has several tools for advanced analytics:

Groups - manually group attributes.
Bins - create bins on numeric data.
Clustering - an option on scatter chart visual (any others?) to color clusters of points using statistical analysis
Time-series analysis - Scatter chart has a play axis visual to animate data changing over time
Analyze feature - On visuals, the analyze feature can explain an increase/decrease or why a distribution is different. Read more about Analyze
Quick Insights - Machine-learning algorithms run on the dataset (Power BI service only; imported datasets only). The insights results cards let you drill down more with scoped insights.
AI Insights - Text Analytics, Vision, and Azure Machine Learning. Text and vision require Premium Capacity license.
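For the bins item in the list above, a tiny Python sketch of fixed-size binning. Labeling each bin by its lower bound is my assumption for illustration, and the ages list is made up; I have not checked exactly how Power BI labels its bins.

```python
import math
from collections import Counter

def bin_start(value, size):
    """Return the lower bound of the fixed-size bin that contains value."""
    return math.floor(value / size) * size

def bin_counts(values, size):
    """Count how many values fall into each bin, ordered by bin start."""
    return dict(sorted(Counter(bin_start(v, size) for v in values).items()))

ages = [3, 12, 17, 25, 29, 41]
print(bin_counts(ages, 10))  # {0: 1, 10: 2, 20: 2, 40: 1}
```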

II. Work with AI visuals in Power BI

The Q&A feature is part of both the Power BI service and Power BI Desktop! Q&A is meta-data driven. Q&A will underline unrecognized words with a squiggly red line. Some metadata is derived, but Q&A setup lets you:

Review questions - see all questions asked
Teach Q&A - an easy way to define unknown terms
Manage terms - Review/edit/add/delete terms

Q&A is displayed above dashboards automatically. You can also add a Q&A visual or a Q&A button to a report in Power BI Desktop.

Key influencers - a visual that illustrates the factors affecting a metric

Decomposition trees - a visual that will automatically break down/drill down into measures. Selecting the AI ‘high value’ or ‘low value’ option will determine the most relevant combinations for each.

AI splits consider all available fields and determine which one to drill into to get the highest/lowest value of the measure analyzed.

Manage workspaces and datasets in Power BI

3 modules

Create and manage workspaces in Power BI

Workspaces are centralized repositories of reports, dashboards, and datasets that offer collaboration and security.

Workspace built-in roles:

Admin
- Publish/edit/delete content
- Add/remove users
- Publish/edit/delete app (bundled content)
- Schedule data refreshes
Member
- Publish/edit/delete content
- Publish/edit/delete app (bundled content)
- Schedule data refreshes
- Cannot add/remove users, delete workspace, or edit workspace metadata
Contributor
- Publish/edit/delete content
- Schedule data refreshes
Viewer
- View reports/dashboards
- Read dataflow data

If the workspace is backed by a Premium capacity, a non-Pro user can view content within the workspace under the Viewer role.

The following are assignable to roles:

individual users
mail-enabled security groups
distribution lists
Microsoft 365 groups
regular security groups (AD, I assume)

App - a published, read-only window into your data for mass distribution and viewing. Apps require a Pro license to create AND view. As noted above, if the workspace is backed by Premium Capacity, a Pro license is needed to create/publish but not to consume.

App permissions include:

Build - allow consumers to build new content from the app datasets
Copy - copy the app into another workspace.
Share - Allow to share with others

After publishing the app, you can edit the app in the workspace, but changes are not published until the ‘Update app’ action

Usage and performance metrics are available only with a Pro license and only to the Admin, Member, or Contributor roles. They include variations on views and time to open.

Premium Capacity has deployment pipelines where workspaces can have development, test, and production designations. Dev has ‘Deploy to Test’ option. Test allows creating testing rules and ‘Deploy to Prod’ option. Test and prod give the option of replacing the data source (so that prod can point to a prod datasource, for example).

The data lineage view illustrates dataset to report to dashboard dependencies. The dataset card also displays the last refresh date and offers a manual refresh option. The card also has an impact analysis option, which can show cross-workspace dependencies.

Data protection features:

Sensitivity labels - built-in (None, Personal, General, Confidential, and Highly confidential), with custom labels also supported (through Microsoft 365 security center). Labels can be assigned to content. A label does not prevent export, but it is carried into the exported file for supporting applications (PDF, Excel, PowerPoint).

Not on exam, but Power BI supports both classic and new workspaces. Read more: New Workspaces

Manage datasets in Power BI

Data sources support parameters, defined in Power Query editor. Even the data source connection string/properties can be parameterized. Parameter values can be pre-defined, or come from another query.

What-if parameters - enable running scenarios by generating a sequence of values (start, end, and increment defined by report author). Then create a new ‘forecast’ measure that uses the parameter with a defined measure (typically multiply). On the report, the value of the What-If parameter can be changed via a slicer. All this could be done manually, but the What-If parameter functionality streamlines the process.
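A rough Python sketch of what the What-If parameter sets up: a generated series of values plus a forecast measure that multiplies a base measure by the selected value. The function names, the percentage framing, and the base sales figure are all hypothetical.

```python
def generate_series(start, end, step):
    """Inclusive sequence of values from start to end, spaced by step."""
    values = []
    i = 0
    while start + i * step <= end:
        values.append(start + i * step)
        i += 1
    return values

def forecast(base_sales, growth_pct):
    """Hypothetical forecast measure: base sales grown by the selected percentage."""
    return base_sales * (100 + growth_pct) / 100

# The slicer would let the user pick one of these values.
growth_options = generate_series(0, 20, 5)
print(growth_options)  # [0, 5, 10, 15, 20]

for pct in growth_options:
    print(pct, forecast(1000, pct))
```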

Power BI gateway - Organizational and Personal modes. Installed on an on-prem server or workstation. Actually named On-Prem Gateway, and the same tool supports many Microsoft cloud integration tools.

Scheduled refresh - 8 per day, or 48 per day on Premium Capacity. The schedule disables after four consecutive refresh failures.

Incremental refresh - a simplified way to do partition refresh (the traditional incremental refresh tactic). The data source MUST support query folding. The steps:

Define the filter parameters (must be named RangeStart and RangeEnd)
Use the parameters to apply a filter (in Power Query, greater than or equal to RangeStart and less than RangeEnd)
Define the incremental refresh policy (on the table in the dataset, define rows to keep and rows to refresh. Power BI figures out the actual partition logic)
Publish changes to Power BI service.
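The filter in step two is worth a second look: greater than or equal to RangeStart and strictly less than RangeEnd. This Python sketch, with made-up rows and a hypothetical OrderDate column, shows that the half-open range keeps a row that lands exactly on a boundary from being loaded into two adjacent partitions.

```python
from datetime import datetime

def partition_rows(rows, range_start, range_end):
    """Keep rows where range_start <= OrderDate < range_end (half-open interval)."""
    return [r for r in rows if range_start <= r["OrderDate"] < range_end]

# Hypothetical rows, including one exactly at midnight on the boundary.
rows = [
    {"id": 1, "OrderDate": datetime(2021, 1, 15, 9, 30)},
    {"id": 2, "OrderDate": datetime(2021, 2, 1, 0, 0)},
    {"id": 3, "OrderDate": datetime(2021, 2, 20, 14, 0)},
]

jan = partition_rows(rows, datetime(2021, 1, 1), datetime(2021, 2, 1))
feb = partition_rows(rows, datetime(2021, 2, 1), datetime(2021, 3, 1))

print([r["id"] for r in jan])  # [1]
print([r["id"] for r in feb])  # [2, 3]
```

If both comparisons were inclusive, row 2 would show up in January and February.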

Endorsement - badge icons on datasets for confidence and clarity

Promoted - any role but Viewer
Certified - request certification/endorsement. Power BI admin can define who can certify.

Query caching (Premium Capacity only) - cache is specific to a user and report page.

Implement row-level security

Row-level security (RLS) uses a DAX filter as the core logic mechanism. The RLS can be static or dynamic. The roles (and RLS) can be tested in Power BI Desktop or Power BI service via ‘View As Role’ functionality. The steps are similar in both:

Define a role
Add a DAX expression to limit table(s)
Add users to role

Static - Hard-coded value e.g. [department] = "game"

Dynamic - Uses the USERPRINCIPALNAME() function, which returns a different value per user, e.g. [emailAddress] = USERPRINCIPALNAME()
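Here is how I picture the two flavors, as a Python sketch over an invented table. The row data and email addresses are hypothetical, and the case-insensitive comparison is my assumption about how the DAX string comparison behaves.

```python
# Hypothetical table rows.
ROWS = [
    {"department": "game", "emailAddress": "ann@example.com", "sales": 10},
    {"department": "game", "emailAddress": "bob@example.com", "sales": 20},
    {"department": "toys", "emailAddress": "ann@example.com", "sales": 30},
]

def static_role(rows):
    """Static RLS: every member of the role sees the same hard-coded slice."""
    return [r for r in rows if r["department"] == "game"]

def dynamic_role(rows, user_principal_name):
    """Dynamic RLS: the slice depends on who is signed in."""
    upn = user_principal_name.lower()
    return [r for r in rows if r["emailAddress"].lower() == upn]

print(sum(r["sales"] for r in static_role(ROWS)))                      # 30
print(sum(r["sales"] for r in dynamic_role(ROWS, "ann@example.com")))  # 40
print(sum(r["sales"] for r in dynamic_role(ROWS, "bob@example.com")))  # 20
```

One dynamic role covers every user, where the static approach needs a role per department.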

Embrace the Side-Quests

2020-11-06T00:00:00+00:00

What are side-quests? In video games, they are tasks given to the player that have no direct bearing on the main story/campaign of the game. Completion of a side-quest often results in the acquisition of money or items, or unlocks another side-quest.

Careers in technology are filled with goals and side-quests. However, unlike video games, I believe career side-quests DO have a direct bearing on your journey.

For example, I am currently working on my cloud and DevOps skills, perhaps even sitting the exam for the Azure DevOps Engineer certification. Before I begin studying for the AZ-400 exam, I need to study for one of the prerequisites. I’ve chosen the AZ-104 Azure Administrator exam. As I work through the areas of knowledge, I realize I really need to improve my networking skills. Right, fine, so I stop AZ-104 preparation to learn some networking fundamentals on PluralSight. I’ve been watching a great course by Ross Bagurdes called Networking Concepts and Protocols. When I get through the section on subnetting, I realize that this is important and I want to dive in a little more. Fine, pause there, and watch another great PluralSight course by Ross called Network Layer Addressing and Subnetting.

Whew. To summarize, my current side-quest is AZ-400 -> AZ-104 -> Networking fundamentals -> Subnetting. Once I finish up the networking fundamentals, I will get back to working through the AZ-104 material and I’m sure there will be more areas I will want to pause and dive a little deeper.

My point is, you should always be prepared to take side-quests because it means acquiring valuable skills. Be patient. Embrace the process. There is a literally infinite amount of knowledge to learn, so focus on a direction, and understand the journey is part of the quest.

PowerShell and Zip files

2020-02-06T00:00:00+00:00

Another day, another problem that PowerShell helped solve quickly and easily.

During the course of a typical work day, often I will have a problem in front of me that just needs a quick and easy solution. In these cases, more often than not, PowerShell has been my tool of choice.

Yesterday, I needed to cross-reference a large set of files in an archive directory against data from a SQL Server. The challenge was that the files were zipped, and the field to cross-reference was a string of text embedded in a fixed-width format.

The data I needed from SQL Server I queried from SSMS and copy/pasted into Excel. For the zipped file data, I used the (slightly scrubbed) script below.

Now, I’ll admit, if I wanted to be extra fancy I would have queried the SQL Server from PowerShell (maybe with the dbatools module) and then created the Excel file with Doug Finke’s ImportExcel module. However, time was a priority, so I just added all the cross-reference data I needed to a StringBuilder object, copied the full string to the clipboard, and pasted it into Excel. A quick VLOOKUP formula later and I had what I needed.

Notes

You have to reference the System.IO.Compression.FileSystem assembly. The call to System.IO.Path.GetTempFileName() creates a file and returns the path. I did not see an option to overwrite files during the zip extraction, so I delete the temp file first, and also after for cleanup. Also, I used Write-Host for me, and it no longer kills puppies.

If anybody has suggestions for improvement, reach out via one of the contact links at the bottom of the web site.

Clear-Host
# ZipFile lives in System.IO.Compression.FileSystem; Clipboard lives in System.Windows.Forms.
Add-Type -AssemblyName 'System.IO.Compression.FileSystem'
Add-Type -AssemblyName 'System.Windows.Forms'
$sb = New-Object -TypeName System.Text.StringBuilder

Write-Host "$(Get-Date)"
Get-ChildItem -Path '\\company.fileshare\blahblah\file\archive' |
    Select-Object FullName |
    ForEach-Object {
        # GetTempFileName() creates the file, so delete it before extracting to that path.
        $tmpFilePath = [System.IO.Path]::GetTempFileName()
        [System.IO.File]::Delete($tmpFilePath)

        $archive = [System.IO.Compression.ZipFile]::OpenRead($_.FullName)
        try {
            # Only the first entry of each archive is read.
            [System.IO.Compression.ZipFileExtensions]::ExtractToFile($archive.Entries[0], $tmpFilePath)
        }
        finally {
            # Release the handle on the zip file.
            $archive.Dispose()
        }

        # The key is 17 characters starting at offset 4 of the first fixed-width line.
        $contents = [System.IO.File]::ReadAllLines($tmpFilePath)
        $filekey = $contents[0].Substring(4, 17)
        $sb.AppendLine($filekey) | Out-Null
        [System.IO.File]::Delete($tmpFilePath)
    }

if ($sb.Length -gt 0) {
    [System.Windows.Forms.Clipboard]::SetText($sb.ToString())
    Write-Host "List set to clipboard"
}
Write-Host "$(Get-Date)"
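For comparison, here is a sketch of the same key extraction in Python, which reads the first entry in memory and so avoids the temp file juggling. This is not what I ran that day; the function name, the ASCII encoding, and the sample content are assumptions for illustration.

```python
import io
import zipfile

def file_key(zip_source, start=4, length=17):
    """Read the first line of the first entry and return the fixed-width key."""
    with zipfile.ZipFile(zip_source) as archive:
        first_entry = archive.infolist()[0]
        with archive.open(first_entry) as handle:
            # Assumes the fixed-width file is ASCII text.
            first_line = handle.readline().decode("ascii").rstrip("\r\n")
    return first_line[start:start + length]

# Build a small in-memory zip so the sketch runs without any real archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("sample.txt", "HDR ABCDEFGHIJKLMNOPQ trailing\nsecond line\n")
buffer.seek(0)

print(file_key(buffer))  # ABCDEFGHIJKLMNOPQ
```

file_key also accepts a file path, so it could be called once per file in the archive directory.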

Photo by Steshka Willems from Pexels

SQL Saturday #919 Recap

2019-10-21T00:00:00+00:00

October 19, 2019 was SQL Saturday Atlanta #919 BI Edition, and the event was really fun and educational. Thanks to the organizers, sponsors, speakers, and attendees.

I presented on DAX Fundamentals. The slide deck and demos can be found on the event session page, and also on my About Me page. Thank you to all who attended. I hope I was able to make some of the tricky parts of DAX less tricky.

While there, I attended great sessions on PySpark by Brad Llewellyn, Row Level Security Patterns in Power BI by Reza Rad, AI for the Masses by Ryan Wade, and Aggregations in Power BI by Shabnam Watson. All presenters did a really good job.

The Improving Atlanta team was represented well as a main sponsor, part of the organizer team, and with seven session speakers.