Web scraping is a technique used to extract information from websites. It allows us to collect data from web pages and use it for various purposes, such as data analysis, research, or building applications.
In this article, we will explore a Python project called “GitHub Topics Scraper,” which uses web scraping to extract information from the GitHub topics page and retrieve repository names and details for each topic.
GitHub is a widely popular platform for hosting and collaborating on code repositories. It offers a feature called “topics” that allows users to categorize repositories based on specific subjects or themes. The GitHub Topics Scraper project automates the process of scraping these topics and retrieving the relevant repository information.
The GitHub Topics Scraper is implemented in Python and uses the following libraries:
requests: Used for making HTTP requests to retrieve the HTML content of web pages.
BeautifulSoup: A powerful library for parsing HTML and extracting data from it.
pandas: A versatile library for data manipulation and analysis, used for organizing the scraped data into a structured format.
Let’s dive into the code and understand how each part of the project works.
import requests
from bs4 import BeautifulSoup
import pandas as pd
The above code snippet imports three libraries: requests, BeautifulSoup, and pandas.
def topic_page_authentication(url):
    # Fetch the page at the given URL and parse its HTML
    topics_url = url
    response = requests.get(topics_url)
    page_content = response.text
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc
This defines a function called topic_page_authentication that takes a URL as an argument.
Here’s a breakdown of what the code does:
1. topics_url = url: This line assigns the provided URL to the variable topics_url. This URL identifies the web page whose content we want to retrieve.
2. response = requests.get(topics_url): This line uses the requests.get() function to send an HTTP GET request to topics_url and stores the response in the response variable. This request fetches the HTML content of the web page.
3. page_content = response.text: This line reads the body of the response, decoded as text, from the response.text attribute and assigns it to the page_content variable.
4. doc = BeautifulSoup(page_content, 'html.parser'): This line creates a BeautifulSoup object called doc by parsing page_content with the 'html.parser' parser. This allows us to navigate and extract information from the HTML structure of the web page.
5. return doc: This line returns the BeautifulSoup object doc from the function. When topic_page_authentication is called, it returns the parsed HTML content as a BeautifulSoup object.
The purpose of this function is to retrieve the HTML content of the web page at the provided URL. It uses the requests library to send an HTTP GET request, retrieves the response content, and then parses it with BeautifulSoup to create a navigable object representing the HTML structure. Despite the word “authentication” in its name, it does not log in anywhere or verify the response status.
Please note that this function handles only the initial steps of fetching and parsing the page; it does not perform any scraping or data extraction itself.
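Because the function above does not check whether the request succeeded, an error page would be parsed like any other page. The following is a minimal sketch of a more defensive fetch step, assuming we want a timeout and an exception on HTTP error statuses. The name fetch_page_text and the injectable get parameter are hypothetical additions for illustration and are not part of the original script.

```python
def fetch_page_text(url, get=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses.

    `get` is a hypothetical hook: any callable with the same interface as
    requests.get. It defaults to requests.get and exists only so that the
    sketch can be exercised without network access.
    """
    if get is None:
        import requests  # third-party; imported lazily so a stand-in can be used
        get = requests.get
    response = get(url, timeout=timeout)
    # raise_for_status() raises an exception for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

The returned text can then be passed to BeautifulSoup(page_text, 'html.parser') exactly as in topic_page_authentication.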
def topicSraper(doc):
    # Extract title
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})
    # Extract description
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': description_class})
    # Extract link
    link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_class})
    # Extract all the topic names
    topic_titles = []
    for tag in topic_title_tags:
        topic_titles.append(tag.text)
    # Extract the description text of each topic
    topic_description = []
    for tag in topic_desc_tags:
        topic_description.append(tag.text.strip())
    # Extract the URLs of the topics
    topic_urls = []
    base_url = "https://github.com"
    for tags in topic_link_tags:
        topic_urls.append(base_url + tags['href'])
    topics_dict = {
        'Title': topic_titles,
        'Description': topic_description,
        'URL': topic_urls
    }
    topics_df = pd.DataFrame(topics_dict)
    return topics_df
This defines a function called topicSraper that takes a BeautifulSoup object (doc) as an argument.
Here’s a breakdown of what the code does:
1. title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary': This line defines the CSS class names (title_class) of the HTML element that contains the topic titles on the web page.
2. topic_title_tags = doc.find_all('p', {'class': title_class}): This line uses the find_all() method of the BeautifulSoup object to find all <p> elements with the specified CSS class (title_class). It returns a list of BeautifulSoup Tag objects representing the topic title tags.
3. description_class = 'f5 color-fg-muted mb-0 mt-1': This line defines the CSS class names (description_class) of the HTML element that contains the topic descriptions on the web page.
4. topic_desc_tags = doc.find_all('p', {'class': description_class}): This line uses the find_all() method to find all <p> elements with the specified CSS class (description_class). It returns a list of BeautifulSoup Tag objects representing the topic description tags.
5. link_class = 'no-underline flex-1 d-flex flex-column': This line defines the CSS class names (link_class) of the HTML element that contains the topic links on the web page.
6. topic_link_tags = doc.find_all('a', {'class': link_class}): This line uses the find_all() method to find all <a> elements with the specified CSS class (link_class). It returns a list of BeautifulSoup Tag objects representing the topic link tags.
7. topic_titles = []: This line initializes an empty list to store the extracted topic titles.
8. for tag in topic_title_tags: ...: This loop iterates over the topic_title_tags list and appends the text content of each tag to the topic_titles list.
9. topic_description = []: This line initializes an empty list to store the extracted topic descriptions.
10. for tag in topic_desc_tags: ...: This loop iterates over the topic_desc_tags list and appends the stripped text content of each tag to the topic_description list.
11. topic_urls = []: This line initializes an empty list to store the extracted topic URLs.
12. base_url = "https://github.com": This line defines the base URL of the website.
13. for tags in topic_link_tags: ...: This loop iterates over the topic_link_tags list and appends the concatenated URL (base URL + href attribute) of each tag to the topic_urls list.
14. topics_dict = {...}: This block creates a dictionary (topics_dict) that contains the extracted data: topic titles, descriptions, and URLs.
15. topics_df = pd.DataFrame(topics_dict): This line converts the topics_dict dictionary into a pandas DataFrame, where each key becomes a column in the DataFrame.
16. return topics_df: This line returns the pandas DataFrame containing the extracted data.
The purpose of this function is to scrape and extract information from the provided BeautifulSoup object (doc). It retrieves the topic titles, descriptions, and URLs from specific HTML elements on the web page and stores them in a pandas DataFrame for further analysis or processing.
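The URL-building step joins the base URL and each href with +, which assumes that every href is a site-relative path starting with a slash. As a hedged alternative, the standard library's urllib.parse.urljoin resolves relative hrefs and leaves already-absolute ones untouched. The helper name to_absolute_urls and the sample hrefs are made up for this sketch.

```python
from urllib.parse import urljoin

BASE_URL = "https://github.com"

def to_absolute_urls(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]

# Illustrative hrefs, standing in for tags['href'] values from the page
sample_urls = to_absolute_urls(["/topics/3d", "https://github.com/topics/ajax"])
```

In topicSraper, this would replace the loop that appends base_url + tags['href'].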
def topic_url_extractor(dataframe):
    url_lst = []
    for i in range(len(dataframe)):
        topic_url = dataframe['URL'][i]
        url_lst.append(topic_url)
    return url_lst
This defines a function called topic_url_extractor that takes a pandas DataFrame (dataframe) as an argument.
Here’s a breakdown of what the code does:
1. url_lst = []: This line initializes an empty list (url_lst) to store the extracted URLs.
2. for i in range(len(dataframe)): ...: This loop iterates over the row positions of the DataFrame.
3. topic_url = dataframe['URL'][i]: This line retrieves the value of the ‘URL’ column for the current row index (i) in the DataFrame.
4. url_lst.append(topic_url): This line appends the retrieved URL to the url_lst list.
5. return url_lst: This line returns the url_lst list containing the extracted URLs.
The purpose of this function is to extract the URLs from the ‘URL’ column of the provided DataFrame.
It iterates over each row of the DataFrame, retrieves the URL value for that row, and adds it to a list. Finally, the function returns the list of extracted URLs.
This function is useful when you want to pull the URLs out of a DataFrame for further processing, such as visiting each URL or performing additional web scraping on the individual pages.
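The loop looks up each value by index label (dataframe['URL'][i]), which works here because a freshly built DataFrame has the default 0 to n-1 index. A shorter sketch, assuming only that the 'URL' column can be iterated, converts the column directly to a list. The function name topic_urls and the sample data are hypothetical.

```python
def topic_urls(table):
    """Return the values of the 'URL' column as a plain list.

    Works for a pandas DataFrame (iterating a column yields its values)
    and, for illustration, for a plain dict of lists.
    """
    return list(table["URL"])

# Illustrative data standing in for the scraped topics table
sample = {
    "Title": ["3D", "Ajax"],
    "URL": ["https://github.com/topics/3d", "https://github.com/topics/ajax"],
}
urls = topic_urls(sample)
```

With pandas, dataframe['URL'].tolist() gives the same result in a single call.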
def parse_star_count(stars_str):
    stars_str = stars_str.strip()[6:]
    if stars_str[-1] == 'k':
        stars_str = float(stars_str[:-1]) * 1000
    return int(stars_str)
This defines a function called parse_star_count that takes a string (stars_str) as an argument.
Here’s a breakdown of what the code does:
1. stars_str = stars_str.strip()[6:]: This line removes leading and trailing whitespace from stars_str using the strip() method. It then drops the first six characters (the slice starts at index 6, that is, the seventh character) and assigns the result back to stars_str. The purpose of this line is to remove the unwanted characters that precede the number.
2. if stars_str[-1] == 'k': ...: This line checks whether the last character of stars_str is ‘k’, which indicates that the star count is given in thousands.
3. stars_str = float(stars_str[:-1]) * 1000: This line converts the numeric part of the string (excluding the ‘k’) to a float and multiplies it by 1000 to obtain the actual star count.
4. return int(stars_str): This line converts stars_str to an integer and returns it.
The purpose of this function is to parse the star count from its string representation and convert it to an integer. It handles cases where the star count is given in thousands (‘k’) by multiplying the numeric part of the string by 1000. The function returns the parsed star count as an integer.
This function is useful when star counts are represented as strings, such as ‘1.2k’ for 1,200 stars, and you need to convert them to numerical values for further analysis or processing.
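The fixed [6:] slice ties the parser to one exact text layout. As a sketch under the assumption that the text contains a single number with an optional ‘k’ suffix, a regular expression can locate the number wherever it sits. The name parse_count is hypothetical, and the handling of thousands separators is an added assumption rather than something the original script does.

```python
import re
from decimal import Decimal

# A number with optional thousands separators and decimals, then an optional 'k'
_COUNT_PATTERN = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*([kK]?)")

def parse_count(text):
    """Parse strings such as '532', '12,345' or 'Star 1.2k' into an int."""
    match = _COUNT_PATTERN.search(text)
    if match is None:
        raise ValueError(f"no number found in {text!r}")
    number, suffix = match.groups()
    multiplier = 1000 if suffix else 1
    # Decimal avoids float rounding surprises when scaling by 1000
    return int(Decimal(number.replace(",", "")) * multiplier)
```

Because the number is found by pattern rather than by position, the sketch does not depend on how many characters precede it.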
def get_repo_info(h3_tags, star_tag):
    base_url = 'https://github.com'
    a_tags = h3_tags.find_all('a')
    username = a_tags[0].text.strip()
    repo_name = a_tags[1].text.strip()
    repo_url = base_url + a_tags[1]['href']
    stars = parse_star_count(star_tag.text.strip())
    return username, repo_name, stars, repo_url
This defines a function called get_repo_info that takes two arguments: h3_tags and star_tag.
Here’s a breakdown of what the code does:
1. base_url = 'https://github.com': This line defines the base URL of the GitHub website.
2. a_tags = h3_tags.find_all('a'): This line uses the find_all() method of the h3_tags object to find all <a> elements inside it. It returns a list of BeautifulSoup Tag objects representing the anchor tags.
3. username = a_tags[0].text.strip(): This line extracts the text content of the first anchor tag (a_tags[0]) and assigns it to the username variable. It also removes any leading or trailing whitespace using the strip() method.
4. repo_name = a_tags[1].text.strip(): This line extracts the text content of the second anchor tag (a_tags[1]) and assigns it to the repo_name variable. It also removes any leading or trailing whitespace using the strip() method.
5. repo_url = base_url + a_tags[1]['href']: This line retrieves the value of the ‘href’ attribute from the second anchor tag (a_tags[1]) and concatenates it with base_url to form the complete URL of the repository. The resulting URL is assigned to the repo_url variable.
6. stars = parse_star_count(star_tag.text.strip()): This line extracts the text content of the star_tag object, removes any leading or trailing whitespace, and passes it to the parse_star_count function. The function returns the parsed star count as an integer, which is assigned to the stars variable.
7. return username, repo_name, stars, repo_url: This line returns a tuple containing the extracted information: username, repo_name, stars, and repo_url.
The purpose of this function is to extract information about a GitHub repository from the provided h3_tags and star_tag objects. It retrieves the username, repository name, star count, and repository URL by navigating the HTML structure and extracting specific elements. The function then returns this information as a tuple.
This function is useful when you want to extract repository information from a web page that contains a list of repositories, as is the case when scraping GitHub topics.
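get_repo_info relies on the heading holding exactly two anchors, the first for the owner and the second for the repository. The sketch below shows an alternative that needs only the repository link's href, assuming that href has the form /owner/repo. The helper name repo_from_href is hypothetical.

```python
from urllib.parse import urljoin, urlparse

def repo_from_href(href, base_url="https://github.com"):
    """Split an href like '/owner/repo' into (owner, repo, absolute_url)."""
    parts = [part for part in urlparse(href).path.split("/") if part]
    if len(parts) != 2:
        raise ValueError(f"expected '/owner/repo', got {href!r}")
    owner, repo = parts
    return owner, repo, urljoin(base_url, href)
```

Raising on an unexpected shape makes a markup change visible immediately, instead of producing a misleading IndexError or a wrong username later on.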
def topic_information_scraper(topic_url):
    # Fetch and parse the topic page
    topic_doc = topic_page_authentication(topic_url)
    # Find the repository name tags
    h3_class = 'f3 color-fg-muted text-normal lh-condensed'
    repo_tags = topic_doc.find_all('h3', {'class': h3_class})
    # Find the star tags
    star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default'
    star_tags = topic_doc.find_all('a', {'class': star_class})
    # Collect information about each repository in the topic
    topic_repos_dict = {
        'username': [],
        'repo_name': [],
        'stars': [],
        'repo_url': []
    }
    for i in range(len(repo_tags)):
        repo_info = get_repo_info(repo_tags[i], star_tags[i])
        topic_repos_dict['username'].append(repo_info[0])
        topic_repos_dict['repo_name'].append(repo_info[1])
        topic_repos_dict['stars'].append(repo_info[2])
        topic_repos_dict['repo_url'].append(repo_info[3])
    return pd.DataFrame(topic_repos_dict)
This defines a function called topic_information_scraper that takes a topic_url as an argument.
Here’s a breakdown of what the code does:
1. topic_doc = topic_page_authentication(topic_url): This line calls the topic_page_authentication function to retrieve and parse the HTML content of topic_url. The parsed HTML content is assigned to the topic_doc variable.
2. h3_class = 'f3 color-fg-muted text-normal lh-condensed': This line defines the CSS class names (h3_class) of the HTML element that contains the repository names on the topic page.
3. repo_tags = topic_doc.find_all('h3', {'class': h3_class}): This line uses the find_all() method of the topic_doc object to find all <h3> elements with the specified CSS class (h3_class). It returns a list of BeautifulSoup Tag objects representing the repository name tags.
4. star_class = 'tooltipped tooltipped-s btn-sm btn BtnGroup-item color-bg-default': This line defines the CSS class names (star_class) of the HTML element that contains the star counts on the topic page.
5. star_tags = topic_doc.find_all('a', {'class': star_class}): This line uses the find_all() method to find all <a> elements with the specified CSS class (star_class). It returns a list of BeautifulSoup Tag objects representing the star count tags.
6. topic_repos_dict = {...}: This block creates a dictionary (topic_repos_dict) that will store the extracted repository information: username, repository name, star count, and repository URL.
7. for i in range(len(repo_tags)): ...: This loop iterates over the indices of the repo_tags list, assuming that it has the same length as the star_tags list.
8. repo_info = get_repo_info(repo_tags[i], star_tags[i]): This line calls the get_repo_info function to extract information about a specific repository. It passes the current repository name tag (repo_tags[i]) and star count tag (star_tags[i]) as arguments. The returned tuple is assigned to the repo_info variable.
9. topic_repos_dict['username'].append(repo_info[0]): This line appends the extracted username from repo_info to the ‘username’ list in topic_repos_dict.
10. topic_repos_dict['repo_name'].append(repo_info[1]): This line appends the extracted repository name from repo_info to the ‘repo_name’ list in topic_repos_dict.
11. topic_repos_dict['stars'].append(repo_info[2]): This line appends the extracted star count from repo_info to the ‘stars’ list in topic_repos_dict.
12. topic_repos_dict['repo_url'].append(repo_info[3]): This line appends the extracted repository URL from repo_info to the ‘repo_url’ list in topic_repos_dict.
13. return pd.DataFrame(topic_repos_dict): This line converts the topic_repos_dict dictionary into a pandas DataFrame, where each key becomes a column, and returns it. The resulting DataFrame contains the extracted repository information.
The purpose of this function is to scrape and extract information about the repositories listed under a specific topic on GitHub. It retrieves and parses the HTML content of the topic page, then locates the repository names and star counts using specific CSS class names.
It calls the get_repo_info function for each repository to retrieve the username, repository name, star count, and repository URL.
The extracted information is stored in a dictionary and then converted into a pandas DataFrame, which is returned by the function.
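The index-based loop depends on repo_tags and star_tags having the same length: with fewer star tags it stops with an IndexError part-way through, and with more the extras are silently ignored. A minimal sketch of a stricter pairing step follows, using plain Python only. The helper names pair_tags and collect_repo_columns are hypothetical.

```python
def pair_tags(repo_tags, star_tags):
    """Pair the two tag lists, failing loudly if their lengths differ."""
    if len(repo_tags) != len(star_tags):
        raise ValueError(
            f"found {len(repo_tags)} repository tags but {len(star_tags)} star tags"
        )
    return list(zip(repo_tags, star_tags))

def collect_repo_columns(repo_rows):
    """Turn (username, repo_name, stars, repo_url) tuples into a dict of
    equal-length lists, the shape that pd.DataFrame expects."""
    columns = {"username": [], "repo_name": [], "stars": [], "repo_url": []}
    for username, repo_name, stars, repo_url in repo_rows:
        columns["username"].append(username)
        columns["repo_name"].append(repo_name)
        columns["stars"].append(stars)
        columns["repo_url"].append(repo_url)
    return columns
```

In topic_information_scraper, the rows would come from calling get_repo_info on each pair returned by pair_tags, and the resulting dictionary would be passed to pd.DataFrame as before.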
if __name__ == "__main__":
    url = 'https://github.com/topics'
    topic_dataframe = topicSraper(topic_page_authentication(url))
    topic_dataframe.to_csv('GitHubtopics.csv', index=None)
    # Make other CSV files according to the topics
    url = topic_url_extractor(topic_dataframe)
    title = topic_dataframe['Title']
    for i in range(len(topic_dataframe)):
        new_df = topic_information_scraper(url[i])
        new_df.to_csv(f'GitHubTopic_CSV-Files/{title[i]}.csv', index=None)
The code snippet shows the main execution flow of the script.
Here’s a breakdown of what the code does:
1. if __name__ == "__main__": This conditional statement checks whether the script is being run directly (not imported as a module).
2. url = 'https://github.com/topics': This line defines the URL of the GitHub topics page.
3. topic_dataframe = topicSraper(topic_page_authentication(url)): This line retrieves and parses the topics page using topic_page_authentication, and then passes the parsed HTML (doc) to the topicSraper function. It assigns the resulting DataFrame to the topic_dataframe variable.
4. topic_dataframe.to_csv('GitHubtopics.csv', index=None): This line exports the topic_dataframe DataFrame to a CSV file named ‘GitHubtopics.csv’. The index=None argument ensures that the row indices are not included in the CSV file.
5. url = topic_url_extractor(topic_dataframe): This line calls the topic_url_extractor function, passing topic_dataframe as an argument. It stores the list of URLs extracted from the DataFrame in url.
6. title = topic_dataframe['Title']: This line retrieves the ‘Title’ column from topic_dataframe and assigns it to the title variable.
7. for i in range(len(topic_dataframe)): ...: This loop iterates over the row indices of the topic_dataframe DataFrame.
8. new_df = topic_information_scraper(url[i]): This line calls the topic_information_scraper function, passing the URL (url[i]) as an argument. It retrieves the repository information for that topic URL and assigns it to the new_df DataFrame.
9. new_df.to_csv(f'GitHubTopic_CSV-Files/{title[i]}.csv', index=None): This line exports the new_df DataFrame to a CSV file. The file name is generated dynamically with an f-string that incorporates the topic title (title[i]). The index=None argument ensures that the row indices are not included in the CSV file.
The purpose of this script is to scrape information from the GitHub topics page and create CSV files containing the extracted data. It first scrapes the main topics page, saves the extracted information in ‘GitHubtopics.csv’, and then scrapes the individual topic pages using the extracted URLs.
For each topic, it creates a new CSV file named after the topic and saves the repository information in it.
The script can be executed directly to perform the scraping and generate the CSV files.
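Two practical details in the per-topic export are easy to miss: to_csv does not create a missing output directory, and a topic title may contain characters that are awkward in file names. The sketch below, assuming we are free to normalise the file names, builds a safe path and creates the directory first. The helper name topic_csv_path and the replacement rule are hypothetical choices.

```python
import re
from pathlib import Path

def topic_csv_path(topic_name, directory="GitHubTopic_CSV-Files"):
    """Return a CSV path for topic_name inside directory.

    Creates the directory if needed and replaces runs of characters other
    than letters, digits, '.', '_' and '-' with a single underscore.
    """
    safe_name = re.sub(r"[^A-Za-z0-9._-]+", "_", topic_name.strip()) or "untitled"
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{safe_name}.csv"
```

In the main loop, the export would then read new_df.to_csv(topic_csv_path(title[i]), index=None).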
url = 'https://github.com/topics'
topic_dataframe = topicSraper(topic_page_authentication(url))
topic_dataframe.to_csv('GitHubtopics.csv', index=None)
Once this code runs, it generates a CSV file named ‘GitHubtopics.csv’ that contains all the topic names, their descriptions, and their URLs.
url = topic_url_extractor(topic_dataframe)
title = topic_dataframe['Title']
for i in range(len(topic_dataframe)):
    new_df = topic_information_scraper(url[i])
    new_df.to_csv(f'GitHubTopic_CSV-Files/{title[i]}.csv', index=None)
Then this code runs to create the individual CSV files for the topics saved in the earlier ‘GitHubtopics.csv’ file. These CSV files are saved in a directory called ‘GitHubTopic_CSV-Files’, each named after its topic.
Each topic CSV file stores information about the repositories under that topic: the username, the repository name, the repository's star count, and the repository URL.
Note: The tags and class names on the website may change, so before running this Python script, check them against the current page.
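One way to notice such a change early is to stop as soon as a selector matches nothing, rather than writing empty CSV files. This is a minimal sketch of that idea. The helper name require_matches is hypothetical and not part of the original script.

```python
def require_matches(tags, description):
    """Return tags unchanged, or raise if the selector matched nothing."""
    if not tags:
        raise RuntimeError(
            f"no elements matched {description}; the page markup may have changed"
        )
    return tags
```

For example, the title lookup could be wrapped as require_matches(doc.find_all('p', {'class': title_class}), 'topic titles').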
Access the full script >> https://github.com/PrajjwalSule21/GitHub-Topic-Scraper/blob/main/RepoScraper.py