Automatically clean-up excel data with multiple entries per cell separated by comma or line-breaks

tree
This is a follow-up from a previous post where I did the same using PowerShell.
A short recap first. The goal is to turn something like this:
CleanUpData
Into this:
CleanUpData2
In the original state some of the cells have multiple entries that are either separated by comma or line-breaks (via Alt+Enter). Furthermore several of those entries contain extraneous spaces. In order to tabulate the data the columns for those rows that contain multiple entries per cell also need to be cross-joined (or Cartesian product) to ensure all possible combinations for the entries are accounted for.
Rather than merely translating the recursive CartesianProduct function from the previous post into VBA I decided to follow a different approach.
Utilizing ADO to build a cross-join (without duplicates) across columns for rows that contain multiple entries. In order do that (I’m not really good at VBA and there might be better ways, that I’d love to hear about) the columns need to be copied to separate sheets so that the ADODB adapter recognizes them as separate tablesI actually found that there is no need to copy the columns to separate sheets since ADO also accepts range references.The SQL for the cross-join with only unique entries is very simple. Assuming the following setup (for the separated entries of the second row from our example):
cross-join3
The Macro to build the cross-join looks like this:

If you like to follow along here are the steps:

  1. Setup up the workbook as in the screenshot and save it (as .xlsm)
  2. Press Alt+F11 to open the Visual Basic Editor (VBE)
  3. Locate the project for your workbook (VBAProject(NAME.xlsm)) within the project tree view on the left hand side
  4. Right-Click the project entry and pick ‘Insert -> Module’
  5. Copy and paste the code into the new window
  6. From the menu at the top select Tools -> References…
  7. Tick the box for the entry ‘Microsoft ActiveX Data Objects 2.x Library’ and click ‘OK’
  8. Close the VBE window
  9. Run the Macro by hitting Alt+F8 and picking the entry for ‘CrossJoinRanges’ and clicking on ‘Run’

If everything worked out (in case it didn’t you can download the workbook via GitHub) the workbook should now contain a new sheet with the cross-joined content of the other sheets:
cross-join2
To turn this into a re-usable generic Macro the columns for the rows containing cells with multiple-entries need to be copied to a temporary sheet and the SQL statement needs to be build dynamically based on the number of columns and rows. Furthermore the Macro should also take care of rows that do not contain multiple entries, add the header to the output, and remove extraneous spaces from all entries. The final result uses two Subs one for the cross-join (CrossJoinRangesWithoutDupes) and another one that acts as the main entry point and to do the rest of the job (CleanData) and a little helper function (isSaved) to determine whether the workbook has ever been saved (otherwise I didn’t get the ADODB connection to work):

A workbook containing the Macro that produces the output shown at the top of the post can be downloaded from my GitHub repo. If you’d like to use the function more frequently I would recommend adding it to the Personal Macro Book as outlined here in order to have it available across all Excel files.

shareThoughts


Photo Credit: byb64 via Compfight cc

Using PowerShell to clean-up excel data with multiple entries per cell

8231960108_b07671cb72_m
How many times did you come across a situation where you were supposed to work with Data that looks like this?:
CleanUpData
Some of the cells have multiple entries that are either separated by comma or line-breaks (via Alt+Enter). Furthermore several of those entries contain extraneous spaces. Happy days! What would be actually needed in order to work with the data is one clean entry per cell. In order to do that the columns for those rows that contain multiple entries per cell also need to be cross-joined (or Cartesian product) so that all possible combinations for the entries are accounted for. The end result should look like this:
CleanUpData2
How could we do the same using PowerShell? Let’s first have a look on how to do the cross-join part. This can be done quite easily with nested loops. Taking the second row from the example, the following will lead to the desired result:

$name= @('Nigel')
$products = 'Product 1', 'Product 2'
$preferences = 'Fast Delivery', 'Product Quality'
foreach($n in $name){
    foreach($product in $products){
        foreach($preference in $preferences){
            "$n, $product, $preference"
        }
    }
}

One way to turn this into a more generic solution is using recursion (You need to understand recursion in order to understand recursion 😉 ).
Here is an implementation of the same:

Ok, having covered the difficult part we now only need to read the data from excel clean it up and apply the Cartesian product function to it. Here is the full code to automate the whole process:

The above contains a modified version of the CartesianProduct function in order handle objects (actually ordered hashtables since they preserve the column order). If time permits I would like to implement the same as an Excel macro and share it as part of another post.
Update: I’ve added another post outlining how to do the same (using another approach) via an Excel Macro

shareThoughts


Photo Credit: pni via Compfight cc