PowerShell fiddling around Web scraping, Twitter – User Profiles, Images and much more


INTRODUCTION

I’m a big fan of REST APIs; they are efficient, reliable and fun.

Recently I have been playing with the Twitter REST APIs and wondered: is it possible to get the required information from Twitter without using the API, that is, without setting up the authentication model (OAuth tokens) or connecting to the right endpoint?

The point of this post is to see what we can achieve with shell scripting, NOT to suggest that APIs are bad in any way.

INITIAL HYPOTHESIS :

All the information on a Twitter web page sits under specific HTML tags, and can be extracted by parsing the HTML content returned in response to a request for that page. Let’s take https://twitter.com/followers as the example.
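To make the hypothesis concrete, here is a minimal sketch of pulling values out of specific tags. The markup below is a hypothetical, heavily simplified stand-in (the class names "name" and "handle" are made up for illustration), and it only works with the [xml] cast because the sample is well-formed; real Twitter pages are not guaranteed to be.

```powershell
# Hypothetical, simplified markup; NOT the real Twitter page structure.
$sample = @'
<div>
  <div class="profilecard-content"><a class="name">Alice Example</a><span class="handle">alice</span></div>
  <div class="profilecard-content"><a class="name">Bob Example</a><span class="handle">bob</span></div>
</div>
'@

# Parse the snippet and select every profile card by its class attribute
$doc = [xml]$sample
$profiles = foreach ($card in $doc.SelectNodes("//div[@class='profilecard-content']")) {
    [pscustomobject]@{
        DisplayName   = $card.SelectSingleNode("a[@class='name']").InnerText
        TwitterHandle = '@' + $card.SelectSingleNode("span[@class='handle']").InnerText
    }
}
$profiles
```

The same idea (find the container tag, then read the child tags inside it) is what the script further down does against the live page document.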

But I would need a web control mechanism (a property or method) to keep scrolling the web page so that the full list of user profiles is populated; these profiles only appear when you scroll down to the bottom of the page. We can’t use Invoke-WebRequest, as it only returns the HTML response and has no way to scroll the page it is requesting.

Once we figure this out and the web page is fully populated, we can use web scraping techniques to get information from any web page (Followers, Following, Timeline, etc.).

STEPS BREAKDOWN :

Following are the steps I used to harvest the user profiles of all my followers on Twitter:

  1. Create an InternetExplorer.Application COM (Component Object Model) object and navigate to the URL; wait until the page has fully loaded in IE.
  2. Programmatically scroll to the bottom of the page so that all user profiles are populated.
  3. Once all profiles are populated in the Internet Explorer window, use the Internet Explorer COM object to access the web page Document (parsed HTML).
  4. Filter out the required data sitting in specific HTML tags.
  5. Convert the raw filtered data to PowerShell objects and generate presentable output on screen.

NOTE : 

  • It’s a prerequisite to log in to Twitter in Internet Explorer and check the ‘Remember me’ checkbox, so that IE opens your Twitter profile by default when the script is running.
  • The infinite scrolling has to be stopped once all data is populated. For me a window of at most 30 seconds worked, but this may change depending on the length of the page under your profile and the speed of your internet connection.
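As an alternative to a fixed 30-second window, one could stop scrolling once the page height stops growing. The sketch below is not what the script in this post does; Invoke-ScrollUntilStable and its parameters are hypothetical names, and the two script blocks are supplied by the caller so the loop itself does not depend on Internet Explorer.

```powershell
function Invoke-ScrollUntilStable {
    param(
        [scriptblock]$GetHeight,        # returns the current page height
        [scriptblock]$ScrollTo,         # scrolls to the given vertical offset
        [int]$MaxSeconds = 30,          # hard upper limit, like the fixed window
        [int]$StableChecks = 3,         # unchanged readings needed before stopping
        [int]$PauseMilliseconds = 500   # time given to the page to load more items
    )
    $deadline   = (Get-Date).AddSeconds($MaxSeconds)
    $lastHeight = -1
    $stable     = 0
    while ((Get-Date) -lt $deadline -and $stable -lt $StableChecks) {
        $height = & $GetHeight
        & $ScrollTo $height | Out-Null
        Start-Sleep -Milliseconds $PauseMilliseconds
        if ($height -eq $lastHeight) { $stable++ } else { $stable = 0; $lastHeight = $height }
    }
    $lastHeight   # last height seen before the page stopped growing (or time ran out)
}

# Possible usage with the $ie object from the script (assumes the page document
# exposes body.scrollHeight, which is not verified here):
# Invoke-ScrollUntilStable -GetHeight { $ie.Document.body.scrollHeight } `
#                          -ScrollTo  { param($y) $ie.Document.parentWindow.scrollTo(0, $y) }
```

The MaxSeconds limit keeps the original safety net, so a page that never stops growing still ends the loop.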

SCRIPT :

Function Get-TwitterProfile
{
    [CmdletBinding()]
    Param(
        $URL = "https://twitter.com/followers"
    )
    Begin
    {}
    Process
    {
        Write-Verbose "Instantiating InternetExplorer.Application COM object"
        # -Property sets the Visible property and invokes the Navigate method with $URL
        $ie = New-Object -ComObject "InternetExplorer.Application" -Property @{
            Navigate = $URL
            Visible  = $true
        }
        # Wait while IE is busy loading the page
        While ($ie.Busy) { Write-Verbose "Internet Explorer is busy, waiting for a few seconds"; Start-Sleep -Seconds 5 }
        $start = Get-Date
        $VerticalScroll = 0
        Write-Verbose "Scrolling the web page $URL for the next 30 seconds to auto-populate all profiles"
        # Keep scrolling for 30 seconds so that items which only load on scroll are populated
        While ((Get-Date) -lt $start.AddSeconds(30))
        {
            $ie.Document.parentWindow.scrollTo(0, $VerticalScroll)
            $VerticalScroll = $VerticalScroll + 100
        }
        Write-Verbose "Scraping user profile info from the web page and converting it to [PSObjects]"
        # Grab the target HTML tags that hold the user profile info and convert them to [PSObjects]
        # Verbose output is silenced while the COM objects are enumerated, then restored
        $SavePreference = $VerbosePreference
        $VerbosePreference = "SilentlyContinue"
        $ie.Document.getElementsByTagName('div') |
            Where-Object { $_.className -eq "profilecard-content" } |
            ForEach-Object {
                $item = $_
                # Load the card's inner HTML into a separate document so it can be queried by tag
                $HTML = New-Object -ComObject "HTMLFile"
                $HTML.IHTMLDocument2_write($item.innerHTML)
                [pscustomobject][ordered]@{
                    # Replacing 'bigger' with '400x400' is a hack to get larger user thumbnails
                    ImageURL      = $HTML.all.tags('img') | ForEach-Object { $_.src -replace "bigger", "400x400" }
                    DisplayName   = $HTML.all.tags('a') | Where-Object { $_.className -eq "ProfileNameTruncated-link u-textInheritColor js-nav js-action-profile-name" } | ForEach-Object innerText
                    Twitterhandle = "@$($HTML.all.tags('span') | Where-Object { $_.className -eq 'u-linkComplex-target' } | ForEach-Object innerText)"
                    UserBIO       = $HTML.all.tags('p') | ForEach-Object outerText
                    Followstatus  = $HTML.all.tags('span') | Where-Object { $_.className -eq 'followStatus' } | ForEach-Object innerText
                }
            }
        $VerbosePreference = $SavePreference
    }
    End
    {
        # Close the IE window and drop the reference to the COM object
        $ie.Quit()
        Remove-Variable ie
        [GC]::Collect()
    }
}
Get-TwitterProfile -URL "https://twitter.com/followers" -Verbose -OutVariable Result
# Saving an image from Twitter to your local drive (uses the $Result variable filled by -OutVariable above)
#iwr $Result[0].ImageURL -OutFile ".\ProfilePictures\$($Result[0].DisplayName.trim()).png" -Verbose

HOW TO RUN :

Run the function as I did in the animation below and it will scrape user profile information from your Twitter web page.

[Animation: running Get-TwitterProfile]

OK, let’s check how many user profiles I was able to harvest

[Screenshot: count of harvested profiles]

Perfect! That looks good 🙂 and is exactly the number of followers I have (only 286! what a shame 😀).

[Screenshot: followers]

Now let’s filter out the user profiles of Microsoft MVP (Most Valuable Professional) awardees from my followers. I guess most of them have the “MVP” keyword in their user bio, so a simple “where” filter would do the work for us, as in the following screenshot.

[Screenshot: filtering followers whose bio contains “MVP”]
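For readers who want to try the filter without running the scraper first, here is a small sketch. $SampleResult is made-up data standing in for the $Result collection produced by Get-TwitterProfile; only the property names come from the script above.

```powershell
# Hypothetical sample data with the same property names the script emits
$SampleResult = @(
    [pscustomobject]@{ DisplayName = 'Alice Example'; Twitterhandle = '@alice'; UserBIO = 'PowerShell MVP, automation fan' }
    [pscustomobject]@{ DisplayName = 'Bob Example';   Twitterhandle = '@bob';   UserBIO = 'Coffee and code' }
    [pscustomobject]@{ DisplayName = 'Carol Example'; Twitterhandle = '@carol'; UserBIO = 'Microsoft mvp | speaker' }
)

# How many profiles were harvested
"Total profiles: $(@($SampleResult).Count)"

# Keep only profiles whose bio contains MVP as a whole word (-match ignores case)
$MVPs = $SampleResult | Where-Object { $_.UserBIO -match '\bMVP\b' }
$MVPs | Select-Object DisplayName, Twitterhandle
```

The word boundaries (\b) are a small precaution so that a bio containing the letters "mvp" inside a longer word is not picked up by accident.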

Though I’m harvesting only five properties from the web page (ImageURL, DisplayName, Twitterhandle, UserBIO and Followstatus), you can tweak the script as desired to get more information from the web page HTML content.

[Image: jsfu]

Since I now know which MVPs follow me on Twitter, how about downloading their profile pictures from Twitter to my local drive using the user profile data we have harvested? To achieve this, follow the steps in the animation below.

[Animation: downloading MVP profile pictures]
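The download step can be sketched roughly as follows. Get-SafeFileName and Save-ProfilePicture are hypothetical helper names, not part of the original script; the sketch assumes each input object has the DisplayName and ImageURL properties produced by Get-TwitterProfile, and it keeps the .png extension used in the commented-out line of the script.

```powershell
# Hypothetical helper: strip characters that are not allowed in file names
function Get-SafeFileName {
    param([string]$Name)
    $invalid = [System.IO.Path]::GetInvalidFileNameChars()
    $clean = -join ($Name.Trim().ToCharArray() | Where-Object { $invalid -notcontains $_ })
    if ([string]::IsNullOrWhiteSpace($clean)) { 'unknown' } else { $clean }
}

# Hypothetical helper: download one picture per profile object from the pipeline
function Save-ProfilePicture {
    param(
        [Parameter(ValueFromPipeline)]$UserProfile,
        [string]$Folder = '.\ProfilePictures'
    )
    begin {
        # Create the target folder if it does not exist yet
        New-Item -ItemType Directory -Path $Folder -Force | Out-Null
    }
    process {
        $target = Join-Path $Folder ((Get-SafeFileName $UserProfile.DisplayName) + '.png')
        Invoke-WebRequest -Uri $UserProfile.ImageURL -OutFile $target
    }
}

# Possible usage, assuming $Result holds the harvested profiles:
# $Result | Where-Object { $_.UserBIO -match 'MVP' } | Save-ProfilePicture
```

Cleaning the display name first matters because names on Twitter can contain characters such as "/" that would otherwise break the output path.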

Hope you’ll find it fun and fiddle around more with PowerShell! 😉 Thanks for stopping by.


8 thoughts on “PowerShell fiddling around Web scraping, Twitter – User Profiles, Images and much more”

    1. Maybe Twitter has changed the HTML tags since then; my script is 8 months old. I recently modified the script for my Twitter bot and it is now at a whole new level 🙂 I’ll write a blog post about it soon and make it public. Anyway, I appreciate that you modified and shared your script. Cheers.
