The Universal Data Consolidator
This article was authored by Sameer Bhide, a community author. Please reach out to your Microsoft contact if you are interested in contributing to the Power Query Blog.
Author: Sameer Bhide (Sam) currently works as a freelance Consultant and Trainer. He delivers Trainings on Power BI, Excel, VBA and has been developing applications using these technologies for his Clients since 2007. He has a Degree in Electrical Engineering.
You can read more here on LinkedIn and on the sheet About of the Universal Data Consolidator file.
During an interaction with the Power Query team, I mentioned about the lack of articles on this blog and enquired if they would be open to me doing a guest post. I was pleasantly surprised when they agreed and here, we are!
Dave the Datasmith frequently consolidates Data from a variety of sources. Some are system exports (CSV’s, Other Delimited Text), others could be manually maintained Excel tables or tables in PDF’s
His go to tool in the past has been the From Folder connector in Power Query that has served him well and saved him hundreds of hours already.
But one bright Monday morning Dave gets a curve ball thrown at him that needs something a bit more powerful than the out of the box experience that Power Query offers.
A tool that can consolidate data but is independent of Sheet Names, Column Names, Order of columns, can include/exclude sheets, files and allow users to specify folder path, file type, start point of data or determine it automagically, all this without having to fire up the Power Query editor!
Before we start looking at the scenario, download the attachments.
Extract the Data Files from the zip archive in a folder called Data on your C: Drive (you can change the folder path later). You should see folders named Input1 …Input 10
The data is fictious and is used to illustrate different scenarios that Universal DC can handle.
The Excel Version of the Universal DC is used in the illustrations in the blog post. A Power BI Desktop version is also provided that has a slightly different interface and functionality.
Store the UniversalDC.xlsm on your Desktop – Open it – Click on Enable content. Also, under Data – Get Data – Query Options – Current Workbook – Privacy choose – the 2nd option “Ignore….”. All the action is going to happen on 2 Sheets – SET PARAMETERS and DATA
Dave’s data is in Folder Input6. It currently consists of 3 Excel Files and Dave’s expecting a lot more!
Each File has multiple sheets.
As you can see the number of sheets and the naming convention is different across files and there are sheets that need to be excluded from consolidation – like Summary and Notes
The starting point of Data is also different across sheets/files. In the sheet BRUCE-EAST – It starts from B21 – and if we go through the other sheets it starts at different points.
Certain columns are called by different names across sheets/files – e.g., the Column CLIENT on one sheet is called CUSTOMER in another and ACCOUNT in the third. The only way to determine that a particular record belongs to a person is from the file name as there is no column that has the employee’s name
…. even superheroes need help at times.
Enter the “Universal Data Consolidator” – which is going to solve Dave’s problems with a few clicks. Dave opens the Universal DC and on the SET PARAMETERS sheet specifies a few parameters related to the Data in the 3 Tables.
Table 1 (Left) – Lets you specify information pertaining to your data set.
- Changing the Folder path to Input6 will trigger a refresh in the adjacent table and show you a list of files in the folder
- We can set this to INV_NO if you are sure that it is the first field name across tables else you can choose “Guess” and let Power Query do its Magic
- If you have more than one sheet to keep or exclude you need to separate names with a comma
Table 2 (Middle)
- If the Files in the folders are not listed – Click on this Refresh File List button.
- Choose “NO” in the Dropdown to exclude a file from consolidation
Table 3 (Right) – Gives a unique list of columns across sheets/files
- Click on the Refresh Columns button and use the table on the right to Exclude, Remap and format Columns
- Choose NO in column REQ to exclude columns from the output
- Use COLSREMAP to rename columns in the Output. So, Account and Customer are just synonyms for Client
- The Column FORMAT helps you change the data type of a column from any to one of the options in the dropdown
Now he heads over to the Sheet DATA and Click on Refresh – a few seconds later gets consolidated data from the 3 files into a single table !! – and “The Datasmith Rises” once again
Under the Hood
There are a few Custom M functions and queries sprinkled with a bit of VBA that powers the Universal DC. But at the core is the fantastic built-in function – Table.Combine(ListofTables, [ListOfColumns])
All it needs is a List of Tables that you wish to combine. The Columns to select is optional and when skipped all columns get included
The Universal DC Consolidates data in 3 Stages.
- Go to a sheet in a file -and determine the starting point of Data, remove rows above and columns to the left (if required) to get to that point and promote the first row as headers.
- Now repeat the same actions on all sheets (if appliable) of the file and use Table.Combine to combine all the Tables from Individual sheets into a single Table.
Now repeat this on all files in the Folder and use Table.Combine again to combine all files into a single table. That’s it!!!.
This is achieved via two functions, fGetDataSheet and fGetDataFile.
fGetDataFile – works on a list of Files provided by Folder.Files() or Sharepoint.Files (if the user chooses Sharepoint/OneDriveFB as the location) and then calls fGetDataSheet.
There are a few additional helper functions that remove Top Rows/Left Columns to reach the starting point of the Data when these parameters are specified by the end user.
When user chooses “Guess” the same logic that powers “Suggested Tables” in Power BI Desktop flavor of the Excel Connector is used to “Guess” the starting point of Data.
I encourage you to try the Universal DC against data sets in the other Input Folders, before you try it on your data and would love to hear from you about how it fared with your data. If you have suggestions on features that you would like to see included or on the M code to make it perform better, do let me know in the comments.
- For PDF files the query only looks at Tables and not Pages. It works well for machine generated PDF’s and not so well for Scanned documents.
- For Text files does not accept RowsToSkip/ColumsToSkip/TextToFind Parameters
- Does not handle a scenario where a folder has some files that are Excel, some are text and some are PDF, but does work when there are Excel files of different formats present in the same folder (.xlsx,.xlsm, .xlsb)
- The Excel File has been tested in M365 Current Channel (Preview) – but should ideally work in the older builds / versions as well right up 2010. Do let me know if you face issues with the earlier versions.
- In the Power BI Version of the file – parameters can be set by clicking on Edit Parameters inside Transform Data on the Home Tab. Certain functionality is implemented differently compared to the Excel version due to the inherent limitations in Power BI Desktop.
- In cases where there is a single Excel File with multiple sheets as in Input1, I initially thought that implementing a function using Table.View (where columns are passed dynamically) to trick the model to read the file only once would improve performance – to my surprise when I timed it using SQL Profiler it took slightly longer with Table.View than without!!! and I have no idea why. Please leave a comment if you do.