Low-Code Data Quality Automation using Dataflows
This article was authored by Gil Raviv, a community author. Please reach out to your Microsoft contact if you are interested in contributing to the Power Query Blog.
Author: Gil Raviv, Microsoft MVP, DataChant.com, Power BI Capability Lead at Avanade
Maintaining high data quality is one of the top challenges in any data-driven solution. Business Intelligence (BI) developers and data analysts spend up to 80% of their time in data cleansing & preparations. Through manual, error-prone, repetitive audits or complex, expensive high-end data quality automation tools, organizations often lose the battle against messy datasets. It is almost impossible to win this battle as you are constantly hammered by ever-changing, or lack of business requirements while your datasets keep changing.
As a result, with the lack of proper data quality validation in place, we see many delays in delivery of BI solutions, long mitigation of new data quality issues, and failures to deliver the highest value to the end-users over time. But, this can all be mitigated when we choose to use Power Query’s low-code ETL capabilities to deliver agile and low-cost data quality automation.
Now, when Power Query technology is available as a low-code ETL service in dataflows, we can use its ground-breaking, data shaping capabilities to introduce low-code Enterprise ETL and persist the prepared data outside Power BI or Excel reports. For example, with dataflows, you can store the prepared data on Azure Data Lake Storage (ADLS) Gen 2, or on Microsoft Dataverse and can apply it to a wide range of business applications and data science use cases.
Low-code Data Quality Automation
Let’s say you have three hundred mission-critical operational Excel workbooks, that are prone to data quality issues whenever their owner changes an Excel formula by mistake, or perhaps you have multiple data marts with common data quality issues done by ETL developers. With dataflows, you can create simple queries that will import a sample of the data from these workbooks or data marts, and every time refresh the dataflow, you will have the snapshots of the data stored on ADLS Gen2. From here, you can analyze changes in your data and profile the snapshots over time to detect data quality issues. This is really one of the neat features of Dataflows. After you connect it to your own ADLS Gen2, every time you refresh the dataflow a snapshot is created with a copy of your data. And here is a great potential in generating your own low-code data quality automation.
But, wouldn’t it be great, if you can have an easy solution that will detect breaking changes in any one of these three hundred Excel workbooks or unstable data marts?
Dataflow Snapshots Analysis
To help you take dataflows to the next level in building low-code data quality and auditing solution, I released two Power BI apps on Microsoft AppSource: Dataflow Snapshots Analysis here and Dataflow Snapshots Analysis – Power Apps here.
The free apps run on top of your snapshots, and perform three main objectives:
- The apps profile your snapshots and extract key data profiling metrics such as Min, Max, Average, Standard Deviation, Null Count. Over time, you can learn more about each column in your tables, identify data patterns, trends, and outliers.
- For every pair of consecutive snapshots, the apps compare between each record to detect changes. You can also detect new and missing records. Detecting changes in records is a key capability for auditing purposes and for automated regression tests.
- Analytics capabilities and report customization of the apps’ interactive Power BI reports.
In the main report page of the app, you can see an overview of main trends in record changes and data profiles.
In the Differences page, you can analyze record changes, missing and new records over time, drill down to actual changes in values and understand the context of the changes (which tables, columns and rows were changed).
In the Profiles page, you can analyze any profiled column over time by one of the key metrics: Average, Count, Distinct Count, Max, Min, Null Count and Standard Deviation.
I would like to thank the Power Query, Dataflows and Template Apps teams at Microsoft that acknowledged the value of the apps and offered me to write you this post and have featured it here on the official Power BI blog as part of May 2021 Feature Summary. To see a demo of the Dataflow Snapshots Analysis app, you can watch the recording from my last presentation in Power BI User Group Italy here or my recorded session with Reza Rad here.
I hope you find it useful and start incorporating Power Query and Dataflows as a low-code data quality automation solution. From my experience, this can be a game changer.
The Power Query team is interested in your feedback! Feel free to leave a comment below with your thoughts.