Why is Data Preparation hard and how can Power Query help you?
In a nutshell, Data Preparation is hard because Data in the world is messy.
Users need to connect to data from a wide range of data sources, each with its own characteristics and challenges (e.g. query interfaces, data sizes, shapes, latency and performance characteristics, etc.). Experiences to connect and query from each of these sources can vary across tools, which makes it even harder for users to ramp up and become successful.
In an ideal world, raw data in these data sources would be correct and “clean”, but in reality usually it is not, requiring the user to identify and fix issues in the data. Even when raw data is correct, most often it not be in the exact shape that users need for the downstream task they’re trying to complete, which means that users need to reshape this data prior to starting their downstream task – filtering, aggregating or tweaking data formatting to meet their needs.
Taking it a step further, now imagine how much harder this problem becomes when working with data coming from multiple data sources – reshaping data from each of the sources and, most challenging, reconciling and combining this data into a single structure. Matching data with different identifiers on each source/table or appending data with slightly different schemas into a single logical table are just some examples of the challenges; there are many more.
As a result of these challenges, extracting insights and value out of data is an extremely intimidating, time-consuming and complex task for users.
We live in the “Data Decade” – due to the increasing need for businesses to derive value and make decisions based on data, there is also a growing demand for Data Preparation tools to become simpler so that even more users (ranging from advanced to novice in terms of their Data Preparation skills) can become successful, empowering each and every one to achieve more with their data.
How does Power Query help with Data Preparation?
Power Query is Microsoft’s self-service data preparation technology which provides an intuitive and highly visual experience for non-technical users to easily connect to hundreds of data sources, reshape and clean data from these sources in a homogeneous and consistent way, and seamlessly combine data from one or multiple data sources together so that it can be used in downstream scenarios such as data reporting, analytics, low-code application development, and more.
Let’s look at each of the Data Preparation challenges that we covered in the previous section and analyze how Power Query helps overcoming them.
Finding and connecting to data is too difficult
Power Query enables connectivity to a wide range of data sources, including data of all sizes and shapes.
With over 150 out-of-the-box data connectors, Power Query provides market-leading Data Connectivity options, ranging from different types of files, folders, relational and multidimensional databases and warehouses, SaaS services and more.
Not only does Power Query provide many out-of-the-box data connectors, it also supports generic interfaces such as REST APIs, OData services, and ODBC/OLE DB drivers which enable connecting to many other sources of data. Power Query even supports scraping data from Web pages with minimal user effort, making your data connectivity options almost limitless.
You can find a detailed reference of all Power Query connectors in our documentation at: List of all Power Query connectors | Microsoft Docs.
Experiences for data connectivity are too fragmented
No matter which data source you are trying to connect to, Power Query provides a consistent user experience to allow users to securely connect to the data they need, explore their data, and perform any necessary data transformations.
When you are connecting to data sources capable of executing queries (such as a relational database, a data warehouse, etc.), Power Query can “push down” the execution of your data transformations to the underlying data source engine so they can be run faster and at larger scale. In cases when you are connecting to basic data sources such as text files, or scraping data from a web page, Power Query will still give you the same set of capabilities, but will instead execute those queries in its underlying query engine – the Mashup Engine.
You can learn more about Power Query’s query folding capabilities in our documentation at: Query folding basics | Microsoft Docs
Data often needs to be reshaped before consumption
Regardless of how clean and curated the data you are connecting to is, you almost always will need to reshape it to match the needs for your task. Power Query provides a highly interactive and intuitive experience for rapidly and iteratively building queries over any data source, of any size.
With over 350 supported data transformations, the possibilities for reshaping the data using Power Query in the way you need are endless.
Learn more about how to transform data using the Power Query Editor in our documentation at: Shape and combine data from multiple sources using Power Query | Microsoft Docs
Any shaping is one-off and not repeatable
When using Power Query to access and transform data, you define a repeatable process (query) that can be easily refreshed in the future to get up-to-date data.
In the event that you need to modify the data preparation process to account for underlying data or schema changes, you can use the same interactive and intuitive experience you used when you initially defined the query to refine and iterate over it.
Learn more about how to use the Applied Steps pane in Power Query to review and modify your existing queries as needed in our documentation: Applied steps | Microsoft Docs
Data Volume, Variety and Velocity
The first word that comes to people’s minds when thinking about Big Data is “Volume”. But there are actually three V words that matter: Volume, Variety and Velocity.
Working against tables with millions of rows is clearly one of the main challenges. Power Query offers the ability to work against a subset of the entire dataset to define the required data transformations, allowing you to easily filter down and transform your data to a manageable size.
But Big Data is also about “Variety” – having to work against data coming from a wide variety of data sources, each of them with different schemas that need to be reshaped, normalized, combined, etc. in order to get the right data needed for the job. Because Power Query provides connectivity to hundreds of data sources and over 350 different types of data transformations for each of these sources, you can work with data from any source and in any shape.
Last but not least, another important aspect in working with data from multiple sources is the differences in “Velocity” or rate of change of data. There are some data sources that change only every few days, or weeks, even monthly, while others can change several times per minute – No matter which data sources you are working against, Power Query queries can be refreshed manually or by taking advantage of scheduled refresh capabilities in specific products (such as Power BI) or even programmatically (by using the Excel object model).
When using Power Query to author dataflows, you can also leverage out-of-the-box composition and orchestration capabilities, ensuring that the latest data from each data source and query is leveraged in complex ETL scenarios, without any additional configuration needed from the end users. Read more about it in our documentation: Link entities between dataflows | Microsoft Docs
Get Started with Power Query
There are many resources available for you to get started with Power Query! Learn more about Power Query with the following links:
To try Power Query, get started at https://www.powerquery.com!