PowerShell : Decompiling Compiled HTML Help (.CHM) files and Data Wrangling


WHAT IS COMPILED HTML HELP (.CHM)?

Microsoft Compiled HTML Help is a Microsoft proprietary online help format, consisting of a collection of HTML pages, an index and other navigation tools. The files are compressed and deployed in a binary format with the extension .CHM, for Compiled HTML. The format is often used for software documentation, such as the help files that ship with the Sysinternals tools.

APPROACH :

Today a friend and I were looking for a way to decompile .chm files into HTML and then parse the HTML DOM to extract some information. After some googling I found that Windows ships with a command-line utility, HH.exe, that can decompile .CHM files to HTML through its -decompile switch.
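
Before wrapping anything in a function, the bare call can be sketched as follows. This is only an illustration: the .chm location is a hypothetical path, the output folder is borrowed from the sample path used later in this post, and the guard simply skips the call when either file is missing.

```powershell
# Paths below are assumptions for illustration; substitute your own
$HhExe       = 'C:\Windows\hh.exe'
$Destination = 'C:\Temp\DecompiledHTML'   # output folder
$ChmFile     = 'C:\Temp\procmon.chm'      # hypothetical location of the help file

# Argument order: the switch, then the output folder, then the .chm file
$Arguments = '-decompile', $Destination, $ChmFile

if ((Test-Path $HhExe) -and (Test-Path $ChmFile))
{
    # -Wait keeps the session blocked until hh.exe has exited
    Start-Process -FilePath $HhExe -ArgumentList $Arguments -Wait
}
```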

So I wrapped the command in a PowerShell function, like below:

Function Get-DecompiledHTMLHelp
{
    [CmdletBinding()]
    param(
        [String] $Destination,
        [String] $Filename
    )
    # hh.exe is the HTML Help executable that ships with Windows
    $EXE = 'C:\Windows\hh.exe'
    If (-not (Test-Path $Destination))
    {
        "Destination folder doesn't exist"
    }
    ElseIf (-not (Test-Path $Filename))
    {
        "Target .chm file not found, please make sure you're entering the full path and file name"
    }
    Else
    {
        # -Wait blocks until hh.exe exits, so the counts below see the finished output
        Start-Process -FilePath $EXE -ArgumentList "-decompile $Destination $Filename" -Wait
        # Group the extracted items into folders (PSIsContainer = True) and files (False)
        $FilesAndFolder = Get-ChildItem $Destination -Recurse | Group-Object PSIsContainer
        $FolderCount = ($FilesAndFolder | Where-Object { $_.Name -eq $true }).Count
        $FileCount = ($FilesAndFolder | Where-Object { $_.Name -eq $false }).Count
        Write-Host "Decompiled into $(if($FolderCount -gt 0){$FolderCount}else{0}) Folders and $(if($FileCount){$FileCount}else{0}) Files to Destination $Destination" -ForegroundColor Yellow
    }
}

I then extracted the required information using the following piece of code:

Function Create-HTMLDOMFromFile
{
    Param(
        [String] $FileName,
        [String] $TagName,
        [Int] $OuputCount = 11
    )
    # HTMLFile is the COM object that exposes the MSHTML parser
    $HTML = New-Object -ComObject "HTMLFile"
    $Content = Get-Content -Path $FileName -Raw
    # Load the raw content into an HTML DOM (Document Object Model);
    # IHTMLDocument2 is the name of the MSHTML COM interface that provides write()
    $HTML.IHTMLDocument2_write($Content)
    # Some data wrangling: return the inner text of the first $OuputCount matching tags
    $HTML.all.tags($TagName) | Select-Object -ExpandProperty innerText -First $OuputCount
}
#Create-HTMLDOMFromFile -FileName 'C:\Temp\DecompiledHTML\Command_Line_Options.htm' -TagName 'P'
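
The inner text that comes back from help pages often carries stray line breaks and blank entries, so a little extra wrangling can help. The helper below is a minimal sketch and not part of the original functions: Format-ExtractedText is a hypothetical name, and the sample strings merely stand in for extracted paragraphs.

```powershell
# Hypothetical helper (an assumption, not from the original post): tidies strings
# such as the inner text values returned by Create-HTMLDOMFromFile
Function Format-ExtractedText
{
    Param(
        [Parameter(ValueFromPipeline = $true)]
        [String] $Text
    )
    Process
    {
        # Skip null, empty, or whitespace-only entries
        if ([String]::IsNullOrWhiteSpace($Text)) { return }
        # Collapse runs of whitespace (including line breaks) into single spaces
        ($Text -replace '\s+', ' ').Trim()
    }
}

# Sample input standing in for extracted paragraphs
$Sample = "  Process Monitor`r`n   options ", '', '   ', 'Second   paragraph'
$Clean  = $Sample | Format-ExtractedText
```

With the function from this post loaded, the same helper could sit at the end of the pipeline, for example by piping the output of Create-HTMLDOMFromFile into Format-ExtractedText.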

HOW TO RUN : 

Here I chose the Compiled HTML Help file of ProcMon.exe (Process Monitor, a Sysinternals tool) as the sample .chm file.
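
A run might look roughly like the sketch below. It assumes both functions from this post are already loaded in the session and that the help file was saved as C:\Temp\procmon.chm, which is a made-up location; the output folder and page name are taken from the commented sample call above.

```powershell
# Assumed location of the sample help file; adjust to wherever yours lives
$ChmFile     = 'C:\Temp\procmon.chm'
$Destination = 'C:\Temp\DecompiledHTML'
$HtmlFile    = "$Destination\Command_Line_Options.htm"

if (Test-Path $ChmFile)
{
    # Get-DecompiledHTMLHelp expects the destination folder to exist already
    New-Item -ItemType Directory -Path $Destination -Force | Out-Null
    Get-DecompiledHTMLHelp -Destination $Destination -Filename $ChmFile

    # Parse only once the decompiled page is actually on disk
    if (Test-Path $HtmlFile)
    {
        Create-HTMLDOMFromFile -FileName $HtmlFile -TagName 'P'
    }
}
else
{
    Write-Warning "Sample file $ChmFile not found"
}
```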

Hope you find it useful, happy learning 🙂

Prateek Singh  

3 thoughts on “Powershell : Decompiling – Compiled HTML Help (.CHM) files and Data Wrangling”

  1. You just saved me weeks of time getting data definitions for a master data management tool by being able to read in column definitions from a .chm file. After decompiling the .chm file, I used Power Query to feed in the file and get the database schema, table name, column name, and definition via an iterative function in Power Query. Absolutely awesome! Couldn’t have done it without your post!
