How To Scrape Data From LinkedIn Using Python

Photograph by Alexander Shatov on Unsplash

Introduction

Hey there! Building Machine Learning algorithms to fetch and analyze LinkedIn data is a popular idea among ML enthusiasts. But one snag everyone comes across is the lack of data, given how tedious data collection from LinkedIn is.

In this article, we will be going over how you can build and automate your very own web scraping tool to extract information from any LinkedIn profile using just Selenium and Beautiful Soup. This is a step-by-step guide complete with code snippets at each step. I've added the GitHub repository link at the end of the article for those who want the complete code.

Requirements

  • Python (obviously; 3+ recommended)
  • Beautiful Soup (Beautiful Soup is a library that makes it easy to scrape information from web pages.)
  • Selenium (The selenium package is used to automate web browser interaction from Python.)
  • A WebDriver; I've used the Chrome WebDriver.
  • Additionally, you'll also need the pandas, time, and re libraries.
  • A code editor. I used Jupyter Notebook; you may use VS Code/Atom/Sublime or any editor of your choice.

Use pip install selenium to install the Selenium library.

Use pip install beautifulsoup4 to install the Beautiful Soup library.

You can download the Chrome WebDriver from here:

https://chromedriver.chromium.org/

Overview

Here's a complete list of all the topics covered in this article.

  • How to automate LinkedIn using Selenium.
  • How to use BeautifulSoup to extract posts and their authors from the given LinkedIn profiles.
  • Writing the data to a .csv or .xlsx file to be made available for later use.
  • How to automate the process to get posts from multiple authors in one go.

With that said, let's get started!

  • We'll start by importing everything we'll need.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs
import re
import time
import pandas as pd
  • In addition to Selenium and Beautiful Soup, we'll also be using the re (regex) library to get the author of the posts, the time library for timing functionality such as sleep, and pandas to handle large-scale data and write it into spreadsheets. More on that coming right up!
  • Selenium needs a few things to start the automation process: the location of the web driver on your system, a username, and a password to log in with. So let's start by getting those and storing them in the variables PATH, USERNAME, and PASSWORD respectively.
PATH = input("Enter the Webdriver path: ")
USERNAME = input("Enter the username: ")
PASSWORD = input("Enter the password: ")
print(PATH)
print(USERNAME)
print(PASSWORD)
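Optional tweak (not in the original article): since the password is typed and echoed in plain text here, you may prefer getpass from the standard library, which hides the input:

from getpass import getpass

PASSWORD = getpass("Enter the password: ")  # nothing is echoed to the terminal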
  • Now we'll initialize our web driver in a variable that Selenium will use to carry out all the operations. Let's call it driver. We tell it the location of the web driver, i.e., PATH.
              driver = webdriver.Chrome(PATH)            
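A quick note in case you're on Selenium 4 or newer: passing the path directly to webdriver.Chrome() is deprecated there, and the executable path goes through a Service object instead (a sketch):

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(PATH))  # Selenium 4+ style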
  • Next, we'll tell our driver the link it should fetch. In our case, it's LinkedIn's login page.
              driver.get("https://www.linkedin.com/uas/login")
time.sleep(3)

You'll notice I've used the sleep function in the above code snippet. You'll find it used a lot in this article. The sleep function halts whatever process it's called in (in our case, the automation process) for the specified number of seconds. You can freely use it to pause the process wherever you need to, for example in cases where you have to bypass a captcha verification.

  • We'll now tell the driver to log in with the credentials we've provided.
email = driver.find_element_by_id("username")
email.send_keys(USERNAME)
password = driver.find_element_by_id("password")
password.send_keys(PASSWORD)
time.sleep(3)
password.send_keys(Keys.RETURN)
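If you're on Selenium 4+, the find_element_by_* helpers have been removed; the equivalent calls use the By locators we imported earlier, for example:

email = driver.find_element(By.ID, "username")      # Selenium 4+ equivalent
password = driver.find_element(By.ID, "password")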
  • Now let's create a few lists to store data such as the profile links, the post content, and the author of each post. We'll call them post_links, post_texts, and post_names respectively.
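That snippet isn't reproduced in this excerpt; a minimal version is simply three empty lists (post_links gets filled with the profile URLs later):

post_links = []   # LinkedIn profile URLs to scrape
post_texts = []   # text content of each collected post
post_names = []   # author of each collected post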

Photo by inlytics | LinkedIn Analytics Tool on Unsplash
  • Once that's done, we'll start the actual web scraping process. Let's declare a function so we can reuse our web scraping code to fetch posts from multiple accounts. We'll call it Scrape_func.
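The function itself isn't reproduced in this excerpt, so here's a rough sketch of what it could look like based on the description that follows. The activity-page URL, the CSS class name, and the scrolling logic are my assumptions, not the author's exact code (the GitHub link at the end has the real thing), and LinkedIn changes its markup often, so adjust as needed.

def Scrape_func(a, b, c):
    """a: profile links, b: post_texts, c: post_names (all lists)."""
    for profile in a:
        # Slice the profile name out of the URL with a regex,
        # e.g. "https://www.linkedin.com/in/jane-doe/" -> "jane-doe"
        match = re.search(r"/in/([^/]+)", profile)
        name = match.group(1) if match else profile

        # Open the "posts" section of the profile (URL suffix is a guess)
        driver.get(profile.rstrip("/") + "/recent-activity/shares/")
        time.sleep(3)

        # Scroll for roughly 20 seconds so more posts get loaded
        start = time.time()
        while True:
            try:
                driver.execute_script(
                    "window.scrollTo(0, document.body.scrollHeight);")
                time.sleep(2)
                end = time.time()
                if round(end - start) > 20:
                    break
            except:
                pass

        # Parse the loaded page; each "container" holds one post's text
        # (the class name below is a guess)
        soup = bs(driver.page_source, "html.parser")
        containers = soup.find_all(
            "div", {"class": "feed-shared-update-v2__description-wrapper"})

        # Number of posts the user wants from this account
        nos = int(input("How many posts do you want from " + name + "? "))

        count = 0
        for container in containers:
            try:
                b.append(container.get_text(separator=" ", strip=True))
                c.append(name)
                count += 1
                if count == nos:
                    break
            except:
                pass

    return b, c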
  • Okay, that's quite a long function. Worry not! I'll explain it step by step. Let's first go over what our function does.
  • It takes three arguments, i.e., post_links, post_texts, and post_names as a, b, and c respectively.
  • Now we'll go into the internal workings of the function. It first takes the profile link and slices off the profile name.
  • Next, we use the driver to fetch the "posts" section of the user's profile. The driver scrolls through the posts, collecting their data with Beautiful Soup and storing it in "containers".
  • The timeout check shown below governs how long the driver gets to collect posts. In our case, it's 20 seconds, but you may change it to suit your data needs.
if round(end - start) > 20:
    break
except:
    pass
  • We also get the number of posts the user wants from each account and store it in a variable 'nos'.
  • Finally, we iterate through each "container", fetching the post data stored in it and appending it to our post_texts list along with post_names. We break the loop when the desired number of posts is reached.

You'll notice we've enclosed the container iteration loop in a try-except block. This is done as a safety measure against possible exceptions.

  • That's all about our function! Now, time to put it to use! We get a list of the profiles from the user and pass it to the function, which repeats the data collection process for all the accounts.
  • The function returns two lists: post_texts, containing all the posts' data, and post_names, containing the corresponding authors of the posts.
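The snippet that gathers the profile links and calls the function isn't shown in this excerpt; a minimal sketch, assuming the URLs are typed in as a comma-separated list, could look like this:

# Hypothetical driver code: read the profile URLs, then run the scraper
profile_input = input("Enter LinkedIn profile URLs, separated by commas: ")
post_links = [url.strip() for url in profile_input.split(",")]

post_texts, post_names = Scrape_func(post_links, post_texts, post_names)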
  • Now we've reached the most important part of our automation: saving the data!
data = {
    "Name": post_names,
    "Content": post_texts,
}
df = pd.DataFrame(data)
df.to_csv("gtesting2.csv", encoding='utf-8', index=False)
writer = pd.ExcelWriter("gtesting2.xlsx", engine='xlsxwriter')
df.to_excel(writer, index=False)
writer.save()
  • We create a dictionary with the lists returned by the function and convert it into a DataFrame 'df' using pandas.
  • You can choose to save the collected data as either a .csv file or an .xlsx file.

For csv:

              df.to_csv("test1.csv", encoding='utf-8', index=False)            

For xlsx:

writer = pd.ExcelWriter("test1.xlsx", engine='xlsxwriter')
df.to_excel(writer, index=False)
writer.save()

In the above code snippets, I've given a sample file name, 'test1'. You may give any file name of your choice!
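One caveat: recent pandas releases removed ExcelWriter.save(), so if that last line raises an error on your setup, a context manager does the same job:

# Equivalent export on newer pandas versions (writer.save() no longer exists)
with pd.ExcelWriter("test1.xlsx", engine="xlsxwriter") as writer:
    df.to_excel(writer, index=False)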

Conclusion:

Phew! That was long, but we've managed to successfully create a fully automated web scraper that'll get you any LinkedIn post's data in a jiffy! Hope that was helpful! Do keep an eye out for more such articles in the future! Here's the link to the complete code on GitHub: https://github.com/FabChris01/Linkedin-Web-Scraper

Thanks for stopping by! Happy learning!

Source: https://medium.com/featurepreneur/how-to-build-a-web-scraper-for-linkedin-using-selenium-and-beautifulsoup-94ab717d69a0
