Get-Summary : Algorithm in PowerShell to summarize Text, Document(s) within a word limit.


INTRODUCTION : 

This is a PowerShell script that summarizes long text document(s) down to a word limit of your choice. It uses an algorithm that scores each sentence on parameters such as important words and common content, then builds the summary from the highest-scored sentences, in the sequence of their occurrence in the content.

HOW IT WORKS :

  1. GET THE CONTENT :

    Get the content from a file or from the clipboard and store it in a temporary variable.

  2. SPLIT INTO SENTENCES :

    Split the complete document into sentences on the newline character and remove empty or blank lines.
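Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `Split-IntoSentence` and the example file path are assumptions, and, as described above, every non-blank line is treated as one sentence.

```powershell
# Hypothetical helper: returns the non-blank lines of a text as "sentences".
function Split-IntoSentence {
    param([string]$Text)

    # Split on Windows (CRLF) or Unix (LF) newlines, trim whitespace,
    # and drop lines that are empty or contain only whitespace.
    $Text -split '\r?\n' |
        ForEach-Object { $_.Trim() } |
        Where-Object { $_ -ne '' }
}

# Step 1: read the content, either from a file or from the clipboard.
# $content = Get-Content -Path '.\article.txt' -Raw   # example path, an assumption
# $content = Get-Clipboard -Raw                       # clipboard variant

# Inline sample text so the sketch runs on its own.
$content   = "First line of text.`r`n`r`n   `r`nSecond line of text."

# Step 2: split into sentences and discard blank lines.
$sentences = @(Split-IntoSentence -Text $content)
$sentences.Count   # 2
```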


  3. RANK EACH SENTENCE :

    Once you have all the sentences, rank each one with a score that depends on the following two criteria:

    1. IMPORTANT WORDS : To identify the important words in the content, calculate the frequency distribution of every word, remove short words of three letters or fewer (for example “The”, “Are”, “For”, “As”), group the remaining words, sort them by count and select the top 10. Each important word gets a weight that is a multiple of its frequency in the content.

    2. COMMON WORDS IN EACH SENTENCE : To get an idea of how many words a sentence shares with all the other sentences, find the intersection of each sentence with every other sentence in the content.

      i.e. each sentence is scored on the words it has in common with every other sentence; the more words a sentence shares with the rest, the better it defines/summarizes the complete document.
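The two criteria could be sketched like this. The helper names (`Get-Word`, `Get-ImportantWord`, `Get-SentenceScore`) are hypothetical, and two details are assumptions rather than the module's confirmed behaviour: an important word's weight is taken to be exactly its frequency, and each important word counts once per sentence.

```powershell
# Hypothetical helper: lower-cased words of a string.
function Get-Word {
    param([string]$Text)
    [regex]::Matches($Text.ToLowerInvariant(), '\w+') | ForEach-Object { $_.Value }
}

# Top-N frequent words of at least $MinLength letters, mapped to their frequency.
function Get-ImportantWord {
    param([string[]]$Sentences, [int]$Top = 10, [int]$MinLength = 4)

    $weights = @{}
    Get-Word -Text ($Sentences -join ' ') |
        Where-Object { $_.Length -ge $MinLength } |
        Group-Object |
        Sort-Object -Property Count -Descending |
        Select-Object -First $Top |
        ForEach-Object { $weights[$_.Name] = $_.Count }
    $weights
}

function Get-SentenceScore {
    param([string[]]$Sentences)

    $weights = Get-ImportantWord -Sentences $Sentences

    # Unique word set per sentence, computed once and keyed by sentence index.
    $wordSets = @{}
    for ($i = 0; $i -lt $Sentences.Count; $i++) {
        $wordSets[$i] = @(Get-Word -Text $Sentences[$i] | Select-Object -Unique)
    }

    for ($i = 0; $i -lt $Sentences.Count; $i++) {
        # Importance: sum of the weights of the important words in this sentence.
        $importance = 0
        foreach ($w in $wordSets[$i]) {
            if ($weights.ContainsKey($w)) { $importance += $weights[$w] }
        }

        # Common content: size of the word intersection with every other sentence.
        $common = 0
        for ($j = 0; $j -lt $Sentences.Count; $j++) {
            if ($j -eq $i) { continue }
            $common += @($wordSets[$i] | Where-Object { $wordSets[$j] -contains $_ }).Count
        }

        [pscustomobject]@{
            Index              = $i
            Sentence           = $Sentences[$i]
            ImportanceScore    = $importance
            CommonContentScore = $common
            SentenceScore      = $importance + $common
        }
    }
}

$demo = @(
    'PowerShell scripts automate server tasks.',
    'Admins write PowerShell scripts daily.',
    'Coffee is nice.'
)
$scores = @(Get-SentenceScore -Sentences $demo)
$scores | Format-Table Index, ImportanceScore, CommonContentScore, SentenceScore -AutoSize
```

In this small sample the first two sentences share "powershell" and "scripts", so they score well above the unrelated third one.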

  4. SELECT THE BEST SENTENCES :

    Once we’ve scored each sentence using the above two parameters, we add the individual scores ( CommonContentScore + ImportanceScore = SentenceScore ) and sort the sentences from highest scored to lowest scored.


    Count the words in each sentence and select only the highest-scored ones that fit within the word limit.

    NOTE : The best sentences must be ordered in the sequence of their actual occurrence in the content so that they make sense. Otherwise they will look jumbled and won’t read like a summary.
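A minimal sketch of this selection step, assuming each scored sentence is an object with `Index`, `Sentence` and `SentenceScore` properties (hypothetical names). One design choice here is an assumption: a sentence that would exceed the limit is skipped and lower-scored, shorter sentences may still be picked; the module may instead stop at the first sentence that does not fit.

```powershell
# Hypothetical helper: picks the highest-scored sentences that fit the word limit
# and returns them joined in their original order.
function Select-BestSentence {
    param(
        [object[]]$Scored,      # objects with Index, Sentence, SentenceScore
        [int]$WordLimit = 100
    )

    $picked = @()
    $used   = 0
    foreach ($item in ($Scored | Sort-Object -Property SentenceScore -Descending)) {
        # Count words by splitting on runs of whitespace.
        $count = @($item.Sentence -split '\s+' | Where-Object { $_ }).Count
        if ($used + $count -le $WordLimit) {
            $picked += $item
            $used   += $count
        }
    }

    # Restore the original order so the summary reads naturally.
    ($picked | Sort-Object -Property Index | ForEach-Object { $_.Sentence }) -join ' '
}

$scored = @(
    [pscustomobject]@{ Index = 0; Sentence = 'Alpha beta gamma delta.'; SentenceScore = 5 }
    [pscustomobject]@{ Index = 1; Sentence = 'Epsilon zeta.';           SentenceScore = 1 }
    [pscustomobject]@{ Index = 2; Sentence = 'Eta theta iota.';         SentenceScore = 9 }
)
$summary = Select-BestSentence -Scored $scored -WordLimit 7
$summary   # Alpha beta gamma delta. Eta theta iota.
```

The sentence at index 2 is picked first because it has the highest score, yet it appears last in the output because the result is re-sorted by original position.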

  5. OUTPUT SUMMARY :

    Display the best sentences on the screen as a paragraph, which is the summary of the complete document.


SCRIPT : 

Download the module from TechNet or from my GitHub repository.

HOW TO RUN IT :

Once you’ve downloaded the module, import it into your PowerShell host session.

Provide the path to a text file to the cmdlet and it will generate a summary for you. By default the summary is limited to 100 words or fewer.

You can also provide a value to the ‘-WordLimit’ parameter to increase or decrease the length of the summary.

Or, add the ‘-Verbose’ switch to view the summarization ratio, i.e. the original number of words compared to the number of words in the summary.

You can also use the ‘-FromClipboard’ switch to summarize content copied to the clipboard.
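The calls described above might look like the following. The module file name, the sample file `article.txt`, and passing the file path positionally are all assumptions; check the downloaded module for the actual file name and path parameter. Only the ‘-WordLimit’, ‘-Verbose’ and ‘-FromClipboard’ names come from the description above.

```powershell
# Assumed file name; use the actual path of the module you downloaded.
Import-Module .\Get-Summary.psm1

# Default behaviour: a summary of 100 words or fewer.
# Passing the path positionally is an assumption.
Get-Summary .\article.txt

# Shorter summary.
Get-Summary .\article.txt -WordLimit 50

# Also report the summarization ratio.
Get-Summary .\article.txt -Verbose

# Summarize whatever text is currently on the clipboard.
Get-Summary -FromClipboard
```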


If you find this script useful, give me a shout on Twitter. Thanks! 🙂

Prateek Singh
