How many times did you come across a situation where you were supposed to work with Data that looks like this?:
Some of the cells have multiple entries that are either separated by comma or line-breaks (via Alt+Enter). Furthermore several of those entries contain extraneous spaces. Happy days! What would be actually needed in order to work with the data is one clean entry per cell. In order to do that the columns for those rows that contain multiple entries per cell also need to be cross-joined (or Cartesian product) so that all possible combinations for the entries are accounted for. The end result should look like this:
How could we do the same using PowerShell? Let’s first have a look on how to do the cross-join part. This can be done quite easily with nested loops. Taking the second row from the example, the following will lead to the desired result:
$name= @('Nigel') $products = 'Product 1', 'Product 2' $preferences = 'Fast Delivery', 'Product Quality' foreach($n in $name){ foreach($product in $products){ foreach($preference in $preferences){ "$n, $product, $preference" } } }
One way to turn this into a more generic solution is using recursion (You need to understand recursion in order to understand recursion 😉 ).
Here is an implementation of the same:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#build the cartesian product for an array of arrays | |
function CartesianProduct($row, $currCol=0){ | |
if ($currCol -eq 0){ | |
$wordIndices = New-Object int[] $row.Length | |
} | |
$wordIndex = 0 | |
#walk through the items in the current column | |
foreach($word in $row[$currCol]){ | |
#add the index to the indices for the current column | |
$wordIndices[$currCol] = $wordIndex | |
$wordIndex++ | |
#if we reach the end of the row | |
if ($currCol -eq ($row.Length – 1)) { | |
$cartesianSet = @() | |
$colIndex = 0 | |
foreach($column in $row){ | |
#add the items to the result set based on the collected indices | |
$cartesianSet += $row[$colIndex][$wordIndices[$colIndex]] | |
$colIndex++ | |
} | |
$cartesianSet -join ',' | |
} | |
#do this for every column | |
else { | |
CartesianProduct $row ($currCol + 1) | |
} | |
} | |
} |
Ok, having covered the difficult part we now only need to read the data from excel clean it up and apply the Cartesian product function to it. Here is the full code to automate the whole process:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
function CartesianProduct($htRow, $currCol=0){ | |
$colCount = $htRow.Keys.Count | |
if ($currCol -eq 0){ | |
$wordIndices = New-Object int[] $colCount | |
} | |
$wordCount = ($htRow.Values | select)[$currCol].Count | |
#walk through the items in the current | |
for ($wordIndex = 0; $wordIndex -lt $wordCount; $wordIndex++){ | |
#add the index to the indices for the current column | |
$wordIndices[$currCol] = $wordIndex | |
#if we reach the end of the row | |
if ($currCol -eq ($colCount – 1)) { | |
$htCartesianSet = [ordered]@{} | |
for ($colIndex = 0; $colIndex -lt $colCount; $colIndex++){ | |
#add the items to the result set based on the collected indices | |
$key = ($htRow.Keys | select)[$colIndex] | |
$value = ($htRow.Values | select)[$colIndex][$wordIndices[$colIndex]] | |
$htCartesianSet.Add($key, $value) | |
} | |
[PSCustomObject]$htCartesianSet | |
} | |
#do this for every column | |
else { | |
CartesianProduct $htRow ($currCol + 1) | |
} | |
} | |
} | |
function Remove-ComObject { | |
end { | |
Start-Sleep –Milliseconds 500 | |
[Management.Automation.ScopedItemOptions]$scopedOpt = 'ReadOnly, Constant' | |
Get-Variable –Scope 1 | Where-Object { | |
$_.Value.PSTypeNames -contains 'System.__ComObject' -and -not ($scopedOpt -band $_.Options) | |
} | Remove-Variable –Scope 1 –Verbose:([Bool]$PSBoundParameters['Verbose'].IsPresent) | |
[GC]::Collect() | |
} | |
} | |
function ImportAndCrossJoin($path){ | |
$xls = New-Object –ComObject Excel.Application | |
$xls.Visible = $false | |
$wb = $xls.Workbooks.Open($path) | |
$ws = $wb.Sheets.Item(1) | |
$lastRow = ($ws.UsedRange.Rows).Count | |
$lastCol = ($ws.UsedRange.Columns).Count | |
foreach ($row in (2..$lastRow)){ | |
$ht = [ordered]@{} | |
foreach ($col in (1..$lastCol)){ | |
$heading = $ws.Cells.Item(1,$col).value2 | |
#split the cell by crlf or comma | |
$parts = ($ws.Cells.Item($row,$col).value2 -split ",|`n").Trim() | |
$ht."$heading" = @($parts) | |
} | |
CartesianProduct $ht | |
} | |
$wb.Close() | |
$xls.Quit() | |
} | |
ImportAndCrossJoin C:\test.xlsx | Export-CSV –NoTypeInformation –path C:\test2.csv | |
ii C:\test2.csv |
The above contains a modified version of the CartesianProduct function in order handle objects (actually ordered hashtables since they preserve the column order). If time permits I would like to implement the same as an Excel macro and share it as part of another post.
Update: I’ve added another post outlining how to do the same (using another approach) via an Excel Macro