Introduction
A few days ago, in a discussion on Facebook’s PowerShell group, I realized that many people don’t know how to tokenize a script or use the Abstract Syntax Tree (AST) to understand their PowerShell scripts better.
I think this is mostly because it doesn’t directly affect your ability to write scripts, but you have to agree that it helps to understand how the PowerShell engine reads your script at the lexical level. Hence this quick blog post.
What is Tokenization/Lexical Analysis?
Lexical analysis is the process of converting a sequence of characters into a sequence of tokens, that is, units with an assigned and thus identified meaning. The PowerShell engine performs all of this for you under the hood. The following picture shows what a PowerShell script looks like once it is tokenized.
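To see the same thing in text form, here is a minimal sketch that tokenizes a one-line script and lists each token. The sample script and the variable names are only illustrative.

```powershell
# Tokenize a one-line script held in a string (the sample script is illustrative).
$script = '$c = $a + $b'
$tokens = [System.Management.Automation.PSParser]::Tokenize($script, [ref]$null)

# Each token records what kind of lexical unit it is and the text it covers.
$tokens | Select-Object Type, Content, Start, Length | Format-Table -AutoSize
```

The variables come back as tokens of type Variable and the `=` and `+` signs as tokens of type Operator, so the engine has already attached a meaning to every piece of the string.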
First, your script is tokenized into meaningful units (lexical units). The parser then processes each token and builds a tree from them, which is called a parse tree. It looks something like the image below and contains all the tokens of the script.
The parse tree helps the execution engine understand how to execute the script, that is, the order in which it should evaluate the expressions and statements in the script.
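As a rough illustration of how the tree encodes evaluation order, the sketch below parses an arithmetic expression and inspects the outermost binary expression. It uses the AST that PowerShell exposes (PowerShell does not hand you a separate parse tree object), and the sample expression is an assumption chosen for the demo.

```powershell
# Parse an arithmetic expression; the sample input is illustrative.
$ast = [System.Management.Automation.Language.Parser]::ParseInput('1 + 2 * 3', [ref]$null, [ref]$null)

# Find the first binary expression in a top-down search, which is the outermost one.
$root = $ast.Find({ param($node) $node -is [System.Management.Automation.Language.BinaryExpressionAst] }, $true)

# Show the left operand, the operator and the right operand of the outermost node.
'{0}  [{1}]  {2}' -f $root.Left, $root.Operator, $root.Right

# The right operand is itself a binary expression: the multiplication.
$root.Right.GetType().Name
$root.Right.Operator
```

The multiplication sits underneath the addition as its right operand, so it has to be evaluated first. That nesting is how operator precedence ends up recorded in the tree.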
What is an Abstract Syntax Tree?
An abstract syntax tree is a parse tree that does not present every detail of the real syntax, which means that things like ‘(‘ (parentheses) and ‘[{}]’ (brackets and braces) are omitted. Sometimes the parser creates the AST directly from the grammar of the language; otherwise it converts the tokens into a parse tree and then converts that into an AST.
The following is an image of an abstract syntax tree.
In PowerShell it looks like this:
The overall process, from source string to parse tree/abstract syntax tree to execution and output, looks somewhat like the following image.
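If you prefer text to pictures, here is a small sketch that prints the AST of a one-line script as an indented outline. The sample script is illustrative, and the depth calculation (counting ancestors through the Parent property) is just one simple way to do it.

```powershell
$ast = [System.Management.Automation.Language.Parser]::ParseInput('$c = $a + $b', [ref]$null, [ref]$null)

# FindAll with an always-true predicate returns every node in the tree.
$lines = foreach ($node in $ast.FindAll({ $true }, $true)) {
    # Depth = number of ancestors, found by following Parent up to the root.
    $depth = 0
    for ($p = $node.Parent; $null -ne $p; $p = $p.Parent) { $depth++ }
    '{0}{1}: {2}' -f ('  ' * $depth), $node.GetType().Name, $node.Extent.Text
}
$lines
```

Each line shows the node type followed by the piece of source text that the node covers, so you can see the assignment containing a binary expression, which in turn contains the variables.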
Tokenization and making an Abstract Syntax Tree in PowerShell
# Tokenizing a PowerShell script
# 1. From the content of a file
[System.Management.Automation.PSParser]::Tokenize((Get-Content $path), [ref]$null)
# 2. From the content of the current file in PowerShell ISE
[System.Management.Automation.PSParser]::Tokenize($psISE.CurrentFile.Editor.Text, [ref]$null)
# 3. From a string
[System.Management.Automation.PSParser]::Tokenize('$c= $a+$b', [ref]$null)
# 4. Current file in VS Code using PowerShell Editor Services (only in VS Code)
$psEditor.GetEditorContext().CurrentFile.Tokens
# Making an abstract syntax tree (AST)
# 1. From the content of a file (-Raw reads the file as one string, which is what ParseInput expects)
[System.Management.Automation.Language.Parser]::ParseInput((Get-Content $path -Raw), [ref]$null, [ref]$null).FindAll({$true}, $true)
# 2. From a string
[System.Management.Automation.Language.Parser]::ParseInput('$C =$a+$b', [ref]$null, [ref]$null).FindAll({$true}, $true)
# 3. From a file
[System.Management.Automation.Language.Parser]::ParseFile($path, [ref]$null, [ref]$null).FindAll({$true}, $true)
# 4. From a script block
{$sum = $a+$b}.Ast # Method 1
([scriptblock]{$sum = $a+$b}).Ast # Method 2
([scriptblock]::Create('$sum = $a+$b')).Ast # Method 3
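The calls above pass [ref]$null for the two out parameters and therefore throw away the tokens and the parse errors. As a sketch, assuming you want both in the same call, you can pass real variables instead. The sample script below is deliberately broken and purely illustrative.

```powershell
# A deliberately broken sample script: the closing brace is missing.
$source = 'function Get-Sum { $a + $b'

# Variables that receive the out parameters.
$tokens = $null
$errors = $null
$ast = [System.Management.Automation.Language.Parser]::ParseInput($source, [ref]$tokens, [ref]$errors)

# Tokens from this parser expose Kind and TokenFlags rather than PSParser's Type.
$tokens | Select-Object Kind, Text, TokenFlags

# Parse errors are returned rather than thrown, each with a message and an extent.
$errors | ForEach-Object { '{0} (line {1})' -f $_.Message, $_.Extent.StartLineNumber }
```

This gives you the AST, the token list and any syntax errors from a single parse, which is handy when you want to check a script without running it.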
Use cases
- Forensics: to understand how the PowerShell engine perceives each lexical unit. For example, you can see how the foreach statement and the ForEach-Object cmdlet differ when tokenized, even when the latter is used through its alias ‘foreach’.
In the example below, the tokens produced by the parser have a different type on each line, even though we used the same ‘foreach’ in both.
- Finding comments, functions or variables in a script: the following is a link to one of my old blog posts, where I used tokenization to extract comments from a PowerShell script.
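Both use cases can be sketched in a few lines. The sample script, the function name Get-Sum and the comments inside it are made up for illustration; only the parser calls are real APIs.

```powershell
# Illustrative sample script; the function name and comments are made up.
$sample = @'
# Adds two numbers
function Get-Sum { param($a, $b) $a + $b }
foreach ($n in 1..3) { Get-Sum $n 1 }   # foreach statement
1..3 | foreach { Get-Sum $_ 1 }         # alias of ForEach-Object
'@

$tokens = [System.Management.Automation.PSParser]::Tokenize($sample, [ref]$null)

# Same word, different token type depending on where it appears.
$tokens | Where-Object { $_.Content -eq 'foreach' } | Select-Object Content, Type, StartLine

# Comments are plain tokens of type Comment.
$comments = $tokens | Where-Object { $_.Type -eq 'Comment' } | ForEach-Object { $_.Content }
$comments

# Function definitions are easier to find in the AST than in the token list.
$ast = [System.Management.Automation.Language.Parser]::ParseInput($sample, [ref]$null, [ref]$null)
$functions = $ast.FindAll({ param($node) $node -is [System.Management.Automation.Language.FunctionDefinitionAst] }, $true) |
    ForEach-Object { $_.Name }
$functions
```

The first ‘foreach’ is reported as a Keyword and the second as a Command, which is the difference described in the forensics use case. Swapping FunctionDefinitionAst for VariableExpressionAst in the predicate lists the variables instead.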
I hope you find this article useful. Thanks for reading!
Interesting distinction of lexical analysis and syntactic analysis.
Some of the token types are different from what I’m used to. Were you using PowerShell Core?
There are many more use cases for a parsing API. For example, a tool like PSScriptAnalyzer wouldn’t be possible without the AST parser.
A tool like this wouldn’t be possible either:
https://github.com/MathieuBuisson/PSCodeHealth
I’m using PowerShell v5.1, and I agree that tools like PSScriptAnalyzer wouldn’t be possible without it!
[…] (2) Singh, P. (2017, June 6). PowerShell: Tokenization and Abstract Syntax Tree. Geekeefy. https://geekeefy.wordpress.com/2017/06/07/powershell-tokenization-and-abstract-syntax-tree/ […]