Automatically clean-up excel data with multiple entries per cell separated by comma or line-breaks

This is a follow-up from a previous post where I did the same using PowerShell.
A short recap first. The goal is to turn something like this:
Into this:
In the original state some of the cells have multiple entries that are either separated by comma or line-breaks (via Alt+Enter). Furthermore several of those entries contain extraneous spaces. In order to tabulate the data the columns for those rows that contain multiple entries per cell also need to be cross-joined (or Cartesian product) to ensure all possible combinations for the entries are accounted for.
Rather than merely translating the recursive CartesianProduct function from the previous post into VBA I decided to follow a different approach.
Utilizing ADO to build a cross-join (without duplicates) across columns for rows that contain multiple entries. In order do that (I’m not really good at VBA and there might be better ways, that I’d love to hear about) the columns need to be copied to separate sheets so that the ADODB adapter recognizes them as separate tablesI actually found that there is no need to copy the columns to separate sheets since ADO also accepts range references.The SQL for the cross-join with only unique entries is very simple. Assuming the following setup (for the separated entries of the second row from our example):
The Macro to build the cross-join looks like this:

Sub CrossJoinRanges()
Dim cn As ADODB.Connection
Dim sql As String
Dim outputSheet As Worksheet
Dim rs As ADODB.Recordset
Set rs = New ADODB.Recordset
Set cn = New ADODB.Connection
With cn
.ConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" & _
"Data Source=" & ActiveWorkbook.FullName & ";" & _
"Extended Properties=""Excel 12.0 XML;HDR=Yes"""
End With
sql = "SELECT DISTINCT * FROM [Sheet1$A1:A2], [Sheet1$B1:B3], [Sheet1$C1:C3]"
rs.Open sql, cn
Set outputSheet = Sheets.Add
outputSheet.Name = "CrossJoined"
outputSheet.Range("A1").CopyFromRecordset rs
End Sub

If you like to follow along here are the steps:

  1. Setup up the workbook as in the screenshot and save it (as .xlsm)
  2. Press Alt+F11 to open the Visual Basic Editor (VBE)
  3. Locate the project for your workbook (VBAProject(NAME.xlsm)) within the project tree view on the left hand side
  4. Right-Click the project entry and pick ‘Insert -> Module’
  5. Copy and paste the code into the new window
  6. From the menu at the top select Tools -> References…
  7. Tick the box for the entry ‘Microsoft ActiveX Data Objects 2.x Library’ and click ‘OK’
  8. Close the VBE window
  9. Run the Macro by hitting Alt+F8 and picking the entry for ‘CrossJoinRanges’ and clicking on ‘Run’

If everything worked out (in case it didn’t you can download the workbook via GitHub) the workbook should now contain a new sheet with the cross-joined content of the other sheets:
To turn this into a re-usable generic Macro the columns for the rows containing cells with multiple-entries need to be copied to a temporary sheet and the SQL statement needs to be build dynamically based on the number of columns and rows. Furthermore the Macro should also take care of rows that do not contain multiple entries, add the header to the output, and remove extraneous spaces from all entries. The final result uses two Subs one for the cross-join (CrossJoinRangesWithoutDupes) and another one that acts as the main entry point and to do the rest of the job (CleanData) and a little helper function (isSaved) to determine whether the workbook has ever been saved (otherwise I didn’t get the ADODB connection to work):

Private Function isSaved() As Boolean
Dim lastSaved As String
On Error GoTo EHandler
s = ActiveWorkbook.BuiltinDocumentProperties("last save time")
isSaved = True
Exit Function
isSaved = False
End Function
Private Sub CrossJoinRangesWithoutDupes(colRanges() As Variant, destSheetName As String, srcSheetName As String)
Dim cn As ADODB.Connection
Dim sql As String
Dim sqlRanges() As String, startAddress As String, endAddress As String
Dim tempSheet As Worksheet
Dim lastCol As Long, col As Long, endRow As Long
Dim rs As ADODB.Recordset
lastCol = UBound(colRanges)
ReDim sqlRanges(1 To lastCol)
Set tempSheet = Sheets.Add
'copy each column to a tempSheet and create the sql dynamically
For col = 1 To lastCol
'add the columnheader
tempSheet.Cells(1, col) = Sheets(srcSheetName).Cells(1, col).Value
endRow = UBound(colRanges(col)) + 2
startAddress = Cells(2, col).Address(False, False)
endAddress = Cells(endRow, col).Address(False, False)
tempSheet.Range(startAddress & ":" & endAddress) = WorksheetFunction.Transpose(colRanges(col))
startAddress = Cells(1, col).Address(False, False)
sqlRanges(col) = "[" & tempSheet.Name & "$" & startAddress & ":" & endAddress & "]"
Next col
sql = sql + Join(sqlRanges, ",")
Set rs = New ADODB.Recordset
Set cn = New ADODB.Connection
With cn
.ConnectionString = "Provider=Microsoft.ACE.OLEDB.12.0;" & _
"Data Source=" & ActiveWorkbook.FullName & ";" & _
"Extended Properties=""Excel 12.0 XML;HDR=Yes"""
End With
rs.Open sql, cn
'append to outputSheet
endRow = Sheets(destSheetName).Cells(Rows.Count, 1).End(xlUp).Row
If Sheets(destSheetName).Range("A1").Value <> "" Then
endRow = endRow + 1
End If
Sheets(destSheetName).Cells(endRow, 1).CopyFromRecordset rs
'delete the temp sheet
Application.DisplayAlerts = False
Application.DisplayAlerts = True
End Sub
Sub CleanData()
Dim colRanges() As Variant
Dim workingRange As Range, currCell As Range, currRow As Range
Dim currRowIndexIndex As Long, currColIndexIndex As Long, endRow As Long
Dim endCol As Long
Dim i As Integer
Dim activeSheetName As String
Dim hasMultiple As Boolean
Dim outputSheet As Worksheet
Dim objRegex As Object
If Not isSaved Then
MsgBox "Please save the WorkBook prior to running this macro", vbExclamation, "WorkBook not saved!"
Exit Sub
End If
Application.ScreenUpdating = False
'Define the working range
Set workingRange = Range("A1").CurrentRegion
activeSheetName = ActiveSheet.Name
Set outputSheet = Sheets.Add
outputSheet.Name = "Clean Data"
For Each currRow In workingRange.Rows
hasMultiple = False
ReDim colRanges(1 To currRow.Columns.Count)
currColIndex = 1
For Each currCol In currRow.Columns
'split column by Alt Enter or comma
colRanges(currColIndex) = Split(currCol, vbLf)
'if no alt enter check for comma
If UBound(colRanges(currColIndex)) = 0 And InStr(currCol, ",") Then
colRanges(currColIndex) = Split(currCol, ",")
End If
'check if if the current row contains any columns with multiple entries
If UBound(colRanges(currColIndex)) > 0 Then
hasMultiple = True
End If
currColIndex = currColIndex + 1
'get rid of extra spaces
For i = 1 To UBound(colRanges)
For col = 0 To UBound(colRanges(i))
colRanges(i)(col) = Trim(colRanges(i)(col))
Next col
Next i
If Not hasMultiple Then
'output row as is
endRow = outputSheet.Cells(Rows.Count, 1).End(xlUp).Row
If outputSheet.Range("A1").Value <> "" Then
endRow = endRow + 1
End If
endCol = UBound(colRanges)
outputSheet.Range(Cells(endRow, 1), Cells(endRow, endCol)) = WorksheetFunction.Transpose(colRanges)
'output cross-join (without dupes) of columns
CrossJoinRangesWithoutDupes colRanges, "Clean Data", activeSheetName
End If
End Sub

A workbook containing the Macro that produces the output shown at the top of the post can be downloaded from my GitHub repo. If you’d like to use the function more frequently I would recommend adding it to the Personal Macro Book as outlined here in order to have it available across all Excel files.


Photo Credit: byb64 via Compfight cc

I'd love to hear what you think

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s