PowerShell fiddling around Web scraping, Twitter – User Profiles, Images and much more


INTRODUCTION

I’m a big fan of REST APIs: they are efficient, reliable and fun.

Recently I have been playing with the Twitter REST APIs and started wondering: is it possible to get the required information from Twitter without using the API, that is, without setting up the authentication model (OAuth tokens) or connecting to the right endpoint?

The point of this post is to see what we can achieve with shell scripting, not to suggest that APIs are bad in any way.

INITIAL HYPOTHESIS :

All information on a Twitter web page sits under specific HTML tags and can be extracted by parsing the HTML returned in response to a request for that page. Take https://twitter.com/followers as an example.

But I need a web control mechanism (a property or method) to keep scrolling the page, because the full list of user profiles is populated only as you scroll down to the bottom. We can’t use Invoke-WebRequest for this: it returns the HTML of the initial response and has no way to scroll the page it requests.

Once this is figured out and the page is fully populated, we can use web scraping techniques to get information from any such page (Followers, Following, Timeline etc.).

STEPS BREAKDOWN :

These are the steps I followed to harvest the user profiles of all my followers on Twitter:

  1. Create an InternetExplorer.Application COM (Component Object Model) object and navigate to the URL; wait until the URL has been properly loaded in IE.
  2. Programmatically scroll to the bottom of the page so that all user profiles are populated.
  3. Once all profiles are populated in the Internet Explorer window, use the Internet Explorer COM object to access the web page Document (the parsed HTML).
  4. Filter out the required data sitting in specific HTML tags.
  5. Convert the raw filtered data to PowerShell objects and generate presentable output on screen.
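The steps above can be sketched roughly like this. It is only a minimal illustration under assumptions, not the script from this post: the function names are made up for the example, the class names ('ProfileCard', 'fullname', 'username', 'bio', 'avatar') are placeholders you would replace after inspecting the live page, and the four properties are my guess at a useful set.

```powershell
# Sketch only. Function names and CSS class names are assumptions (placeholders).

function Open-ScrolledPage {
    param([string]$Url, [int]$ScrollSeconds = 30)

    # Step 1: create the IE COM object, navigate, and wait for the load to finish.
    $ie = New-Object -ComObject 'InternetExplorer.Application'
    $ie.Visible = $true
    $ie.Navigate($Url)
    while ($ie.Busy -or $ie.ReadyState -ne 4) { Start-Sleep -Milliseconds 200 }

    # Step 2: scroll to the bottom repeatedly until the time window closes.
    $deadline = (Get-Date).AddSeconds($ScrollSeconds)
    while ((Get-Date) -lt $deadline) {
        $ie.Document.parentWindow.scrollTo(0, $ie.Document.body.scrollHeight)
        Start-Sleep -Milliseconds 500
    }
    $ie
}

function Get-ChildByClass {
    # Returns the first descendant of $Element that carries the given class.
    param($Element, [string]$ClassName)
    $Element.getElementsByTagName('*') |
        Where-Object { ($_.className -split '\s+') -contains $ClassName } |
        Select-Object -First 1
}

function New-FollowerProfile {
    # Step 5: turn raw strings into a PowerShell object (no COM needed here).
    param([string]$Name, [string]$Handle, [string]$Bio, [string]$ImageUrl)
    [PSCustomObject]@{
        Name     = $Name.Trim()
        Handle   = $Handle.Trim()
        Bio      = ($Bio -replace '\s+', ' ').Trim()   # collapse line breaks
        ImageUrl = $ImageUrl.Trim()
    }
}

function Get-FollowerProfile {
    param([string]$Url = 'https://twitter.com/followers')
    $ie = Open-ScrolledPage -Url $Url
    try {
        # Steps 3 and 4: read the parsed document and keep only the profile cards.
        $cards = $ie.Document.getElementsByTagName('div') |
            Where-Object { ($_.className -split '\s+') -contains 'ProfileCard' }
        foreach ($card in $cards) {
            $raw = @{
                Name     = (Get-ChildByClass $card 'fullname').innerText
                Handle   = (Get-ChildByClass $card 'username').innerText
                Bio      = (Get-ChildByClass $card 'bio').innerText
                ImageUrl = (Get-ChildByClass $card 'avatar').src
            }
            New-FollowerProfile @raw
        }
    }
    finally { $ie.Quit() }   # always close the IE window
}
```

With those assumptions in place, something like `$followers = Get-FollowerProfile` followed by `$followers | Format-Table` would give the on-screen output. Keeping the object-building step in its own function means it can be tried out without opening IE at all.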

NOTE : 

  • As a prerequisite, log in to Twitter in Internet Explorer and check the ‘Remember me’ checkbox, so that IE opens your Twitter profile by default when the script runs.
  • The infinite scrolling has to be stopped once all data is populated. For me a window of at most 30 seconds worked, but this may change depending on the length of the page under your profile and the speed of your internet connection.
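One possible way to avoid guessing the time window is to stop early once the page height stops growing, and keep the 30 seconds only as an upper limit. The sketch below is an assumption of mine rather than what my script does; `Invoke-ScrollUntilStable` and its parameters are hypothetical names, and the two script blocks are supplied by the caller.

```powershell
function Invoke-ScrollUntilStable {
    param(
        [scriptblock]$GetHeight,          # returns the current page height
        [scriptblock]$Scroll,             # scrolls to the bottom once
        [int]$TimeoutSeconds = 30,        # hard upper limit
        [int]$StableRounds = 3,           # unchanged readings needed to stop
        [int]$PauseMilliseconds = 500     # time given to the page to load more
    )
    $deadline  = (Get-Date).AddSeconds($TimeoutSeconds)
    $last      = -1
    $unchanged = 0
    while ((Get-Date) -lt $deadline -and $unchanged -lt $StableRounds) {
        & $Scroll | Out-Null
        Start-Sleep -Milliseconds $PauseMilliseconds
        $height = & $GetHeight
        if ($height -eq $last) {
            $unchanged++                  # nothing new was loaded this round
        }
        else {
            $unchanged = 0                # page grew, start counting again
            $last = $height
        }
    }
    $last                                 # final height that was observed
}
```

With an IE COM object in a hypothetical `$ie` variable, the call could look like `Invoke-ScrollUntilStable -GetHeight { $ie.Document.body.scrollHeight } -Scroll { $ie.Document.parentWindow.scrollTo(0, $ie.Document.body.scrollHeight) }`. On a slow connection you may need a longer pause, otherwise the loop can mistake a slow load for the end of the page.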

SCRIPT :

HOW TO RUN :

Run the function as I did in the animation below and it will scrape the user profile information from your Twitter web page.

[Animation: running the function]

OK, let’s check how many user profiles I was able to harvest:

[Screenshot: count of harvested user profiles]

Perfect! That looks good 🙂 and is exactly the number of followers I have (only 286, what a shame 😀).

[Screenshot: followers]

Now let’s filter out the user profiles of Microsoft MVP (Most Valuable Professional) awardees from my followers. I guess most of them have the keyword “MVP” in their user bio, so a simple “where” filter does the work for us, as in the following screenshot.

[Screenshot: followers filtered on “MVP” in the bio]
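In text form, the count and the filter could look roughly like this. The sample data is made up, and the property names (`Name`, `Handle`, `Bio`) are assumptions, so adjust them to whatever your harvested objects actually carry.

```powershell
# Hypothetical sample data standing in for the harvested profiles.
$followers = @(
    [PSCustomObject]@{ Name = 'Alice'; Handle = '@alice'; Bio = 'Microsoft MVP, PowerShell' }
    [PSCustomObject]@{ Name = 'Bob';   Handle = '@bob';   Bio = 'Coffee and cats' }
    [PSCustomObject]@{ Name = 'Carol'; Handle = '@carol'; Bio = 'Cloud architect | MVP' }
)

# How many profiles were harvested?
$followers.Count

# Keep only profiles whose bio contains "MVP".
# \b restricts the match to a whole word; -cmatch makes it case-sensitive.
$mvps = @($followers | Where-Object { $_.Bio -cmatch '\bMVP\b' })
$mvps | Format-Table Name, Handle, Bio
```

A plain `-like '*MVP*'` would also work for a quick look, but it is case-insensitive and matches "MVP" inside longer words too.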

Though I’m harvesting only four properties from the web page, you can tweak the script as desired to get more information from the page’s HTML content.

Since I now know which MVPs follow me on Twitter, how about downloading their profile pictures from Twitter to my local drive using the user profile data we have harvested? To achieve this, follow the steps in the animation below.

[Animation: downloading the MVPs’ profile pictures]
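A download step along these lines might look like the sketch below. The function names, the `Handle` and `ImageUrl` property names and the default destination folder are all assumptions for illustration; Invoke-WebRequest is fine here because a single image needs no scrolling.

```powershell
function Get-AvatarFileName {
    # Builds a safe local file name from the handle and the image URL's extension.
    param([string]$Handle, [string]$ImageUrl)
    $ext = [System.IO.Path]::GetExtension(([uri]$ImageUrl).AbsolutePath)
    if (-not $ext) { $ext = '.jpg' }      # assumed fallback when the URL has no extension
    ($Handle -replace '[^\w-]', '') + $ext
}

function Save-ProfilePicture {
    param(
        [Parameter(ValueFromPipeline)]$Follower,
        [string]$Destination = (Join-Path $HOME 'TwitterAvatars')   # hypothetical folder
    )
    begin {
        # Create the target folder once; -Force makes this a no-op if it exists.
        New-Item -ItemType Directory -Path $Destination -Force | Out-Null
    }
    process {
        $name = Get-AvatarFileName -Handle $Follower.Handle -ImageUrl $Follower.ImageUrl
        $file = Join-Path $Destination $name
        Invoke-WebRequest -Uri $Follower.ImageUrl -OutFile $file
        $file                              # emit the saved path
    }
}
```

Given a collection of profile objects in a hypothetical `$mvps` variable, `$mvps | Save-ProfilePicture` would save one image per profile and print the paths.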

Hope you find it fun and fiddle around more with PowerShell! 😉 Thanks for stopping by.


5 thoughts on “PowerShell fiddling around Web scraping, Twitter – User Profiles, Images and much more”

    1. Maybe Twitter has changed the HTML tags since then, as my script is 8 months old. I recently modified the script for my Twitter bot and it is now at a whole new level 🙂 I’ll write a blog post about it soon and make it public. Anyway, I appreciate that you modified and shared your script. Cheers.
