Full text search using PowerShell, Everything, and Lucene

Searching for files is something everyone does on a regular basis. Although Windows changes the way this is done with every new operating system, the built-in functionality is still far from sufficient. I’m therefore always looking for ways to improve it (you can also find several blog posts related to file searches around here). For searching by file name or path, I’m pretty happy with the performance of Everything. For searching by file content (i.e., full-text search), there is still room for improvement in my opinion.
Recently I’ve been watching the session recordings from the PowerShell Conference Europe 2016 (I can highly recommend them to anyone who is interested in PowerShell).

In one of the videos, Bruce Payette talks about how to use Lucene.net through PowerShell. Doug Finke subsequently picked up the topic and wrapped all of it in a GUI. Lucene is basically the gold standard when it comes to full-text search.

Naturally, I also wanted to see how Lucene could help improve the Windows search capabilities. My goal was to put it to the test and, where possible, improve the implementations so that I could index and query text-based files on my entire drive.
Using Bruce’s and Doug’s implementation, searches returned almost instantaneously, even against an index covering a huge number of files. Only creating the index takes quite some time, because the enumeration of the files to be indexed is based on either Get-ChildItem or System.IO.Directory.EnumerateFiles.
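To make the two enumeration strategies concrete, here is a minimal sketch of what they can look like. It is not the code from Bruce's or Doug's implementation; the function name Get-IndexCandidate and its parameters are made up for illustration.

```powershell
# Hypothetical helper (not part of the original implementations): list files
# below $Path that match any wildcard pattern in $Include, using one of two strategies.
function Get-IndexCandidate {
    param(
        [Parameter(Mandatory = $true)]
        [string]   $Path,
        [string[]] $Include = @('*.ps1', '*.psm1', '*.txt'),
        [switch]   $UseDotNet
    )

    if ($UseDotNet) {
        # Lazy .NET enumeration, one pass per pattern. It throws as soon as it
        # reaches a folder the current user is not allowed to read.
        foreach ($pattern in $Include) {
            [System.IO.Directory]::EnumerateFiles($Path, $pattern, [System.IO.SearchOption]::AllDirectories)
        }
    }
    else {
        # Cmdlet-based enumeration; folders that cannot be read are skipped silently.
        Get-ChildItem -Path $Path -Include $Include -Recurse -File -ErrorAction SilentlyContinue |
            Select-Object -ExpandProperty FullName
    }
}

# Time one strategy on a small folder first (the temp folder is only an example)
$root = [System.IO.Path]::GetTempPath()
Measure-Command { Get-IndexCandidate -Path $root } | Select-Object TotalSeconds
```

Either way, the whole directory tree has to be walked at indexing time, which is why this step dominates the index creation on a full drive.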

I’ve refactored the implementation into a new module (SearchLucene.psm1), in which I based the file enumeration on the Everything command-line interface and made several additional changes. As a result, creating the index for all .txt, .ps1, and .psm1 files on my c: drive (SSD) now takes less than a minute.
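The module itself is not reproduced here, so the following is only a rough sketch of how a file enumeration through the Everything command-line interface could look. Both function names are hypothetical. It assumes that es.exe is on the PATH, that Everything is running, and that es.exe accepts a -path option plus Everything's ext: search function; please check this against the version you have installed.

```powershell
# Hypothetical helper: turn '*.ps1','*.psm1' into Everything's 'ext:ps1;psm1' filter.
function ConvertTo-EverythingExtFilter {
    param(
        [Parameter(Mandatory = $true)]
        [string[]] $Include
    )
    # Strip the leading '*.' (or '.') so that only the bare extension remains
    $extensions = $Include | ForEach-Object { $_ -replace '^\*?\.', '' }
    'ext:' + ($extensions -join ';')
}

# Hypothetical helper: ask Everything's existing index for matching files
# instead of walking the directory tree.
function Get-FileViaEverything {
    param(
        [string]   $Path    = 'c:\',
        [string[]] $Include = @('*.ps1', '*.psm1', '*.txt')
    )

    $es = Get-Command -Name 'es.exe' -CommandType Application -ErrorAction SilentlyContinue |
        Select-Object -First 1
    if (-not $es) {
        throw 'es.exe was not found on the PATH.'
    }

    # Assumption: -path limits the results to $Path, and es.exe writes one full path per line
    & $es.Path -path $Path (ConvertTo-EverythingExtFilter -Include $Include)
}
```

The speed-up comes from the fact that Everything already maintains its own index of file names, so no folder has to be traversed when the Lucene index is built.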
Usage:
Prerequisites:

  • SearchLucene.psm1 module installed (the example assumes that you have put the downloaded files into a folder called ‘SearchLucene’ that resides within one of the $env:PSModulePath folders)
  • Everything command-line interface installed (requires the GUI version to be installed)
Import-Module SearchLucene
#Create the index on disk within the $env:TEMP\test folder and index all .ps1 and .psm1 files on the c: drive
<#
default values for each parameter are:
- DirectoryToBeIndexed = 'c:\',
- Include = @('*.ps1','*.psm1','*.txt*')
- IndexDirectory = "$env:temp\luceneIndex"
#>
New-LuceneIndex -DirectoryToBeIndexed 'c:\' -Include @('*.ps1','*.psm1') -IndexDirectory "$env:TEMP\test"

#Search all indexed .ps1 files for the word 'kw2016'
Find-FileLucene 'kw2016' -Include '.ps1'
#outputs a list of file paths that include the word 'kw2016'

#Search all indexed .ps1 files for the word 'test' and output the matching line and line number for each match found within the file
Find-FileLucene 'test' -Include '.ps1' -Detailed

#Same as above, but for the word 'kw2016', with the result output as a table grouped by folder
Find-FileLucene 'kw2016' -Include '.ps1' -Detailed | 
	Sort-Object {Split-Path $_.Path} | 
	Format-Table -GroupBy {Split-Path $_.Path}

This is just a small example of how Lucene.net can be used to implement full-text search. The solution could be further improved by including other file types, or by re-creating or updating the index on a schedule or whenever files are modified.
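As one possible take on the scheduling idea, the index could be re-created by a daily scheduled task. This is only a sketch and not part of the module: the function name, the task name, and the time of day are my own assumptions. It relies on the ScheduledTasks cmdlets (Windows 8 / Server 2012 or later) and on the New-LuceneIndex parameters shown in the usage example.

```powershell
# Hypothetical helper: register a daily task that re-creates the Lucene index.
function Register-LuceneIndexRefresh {
    param(
        [string]   $DirectoryToBeIndexed = 'c:\',
        [string]   $IndexDirectory       = "$env:TEMP\luceneIndex",
        [datetime] $At                   = '03:00',               # assumed time of day
        [string]   $TaskName             = 'Refresh-LuceneIndex'  # assumed task name
    )

    # Command executed by the task; the module must be discoverable via
    # $env:PSModulePath for the account the task runs under.
    $command = "Import-Module SearchLucene; " +
               "New-LuceneIndex -DirectoryToBeIndexed '$DirectoryToBeIndexed' -IndexDirectory '$IndexDirectory'"

    $action  = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument "-NoProfile -Command `"$command`""
    $trigger = New-ScheduledTaskTrigger -Daily -At $At
    Register-ScheduledTask -TaskName $TaskName -Action $action -Trigger $trigger
}

# Usage (registers the task for the current user):
# Register-LuceneIndexRefresh -DirectoryToBeIndexed 'c:\' -IndexDirectory "$env:TEMP\test"
```

A full rebuild once a day is the simplest option. Reacting to individual file modifications would need incremental index updates, which this sketch does not attempt.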

Photo Credit: Cho Shane via Compfight cc
