Web Scraping with BeautifulSoup
How to do web scraping with requests and beautifulsoup
Introduction
requests allows sending HTTP requests.
beautifulsoup allows extracting tags from an HTML document.
Importing & Initializing
To import BeautifulSoup we use from bs4 import BeautifulSoup, then request the content of the desired URL with requests.get:
import requests
from bs4 import BeautifulSoup
# send an HTTP GET request to the desired URL
response = requests.get('URL')
content = response.content
We then pass content to the BeautifulSoup parser:
parser = BeautifulSoup(content, 'html.parser')
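The parsing step can be tried without any network access. The sketch below assumes a small hand-written HTML document (hypothetical, not from a real site) standing in for response.content, which is a bytes object:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for response.content (bytes returned by requests).
content = b"<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' selects the HTML parser that ships with Python's standard library.
parser = BeautifulSoup(content, 'html.parser')

# The parsed document can now be navigated by tag name.
print(parser.title.text)
print(parser.p.text)
```

BeautifulSoup accepts either bytes or a str, so the same call should work with response.text as well.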
Extracting Tags & Showing Text
To extract content we request it from the parser:
body = parser.body
print(body.text)
title = parser.title
print(title.text)
This extracts and prints the text of the body and title tags.
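As a rough illustration of what .text returns, the sketch below uses a hypothetical inline document. It also shows get_text, which takes a separator and a strip option, and the fact that accessing a tag that is not present gives None rather than raising an error:

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for illustration.
html = "<html><head><title>Demo</title></head><body><h1>Heading</h1><p>First paragraph</p></body></html>"
parser = BeautifulSoup(html, 'html.parser')

# .text concatenates all text inside the tag with no separator.
body_text = parser.body.text

# get_text can insert a separator between pieces and strip whitespace.
spaced_text = parser.body.get_text(separator=' ', strip=True)

# A tag that does not exist in the document is returned as None.
missing = parser.footer

print(body_text)
print(spaced_text)
print(missing)
```

Because a missing tag is None, it is worth checking the result before calling .text on it.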
Using find_all
We use the find_all method to extract all occurrences of a particular tag into a list, tag_nm_lst.
# create a list of all occurrences of a particular html tag
tag_nm_lst = parser.find_all('html_tag_name')
# print content from every occurrence
for tag_occur in tag_nm_lst:
    print(tag_occur.text)
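A minimal sketch of find_all on a hypothetical list, assuming li as the tag of interest. It collects the text of each match, caps the number of results with the limit argument, and shows that a tag with no matches yields an empty list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
parser = BeautifulSoup(html, 'html.parser')

# All <li> tags, in document order.
items = parser.find_all('li')
texts = [li.text for li in items]

# limit stops the search after the given number of matches.
first_two = parser.find_all('li', limit=2)

# No <table> tags exist here, so the result is empty.
tables = parser.find_all('table')

print(texts)
print(len(first_two), len(tables))
```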
Element IDs
To find an element identified by an id, we pass the id as an additional keyword argument to the find_all method:
tag_nm_lst = parser.find_all('html_tag_name', id='id_name')
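Since an id is normally unique within a page, find, which returns the first match or None, may be a more convenient choice than find_all. A short sketch with hypothetical ids:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids 'header' and 'main' are made up for this example.
html = '<div id="header">Top</div><div id="main">Content</div>'
parser = BeautifulSoup(html, 'html.parser')

# find_all always returns a list, even for a single match.
matches = parser.find_all('div', id='main')

# find returns the first matching tag directly, or None if nothing matches.
single = parser.find(id='main')
absent = parser.find(id='does-not-exist')

print(matches[0].text)
print(single.text)
print(absent)
```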
Element Classes
To find elements that share a common class, we pass the class_ keyword argument to find_all. The trailing underscore is needed because class is a reserved word in Python.
tag_nm_lst = parser.find_all('tag_name', class_='class_name')
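One detail worth seeing in practice: class_ matches a tag if the given name is any one of its classes. The sketch below assumes hypothetical class names note and warning:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second paragraph carries two classes.
html = '<p class="note">a</p><p class="note warning">b</p><p>c</p>'
parser = BeautifulSoup(html, 'html.parser')

# Matches every <p> whose class list contains 'note'.
notes = parser.find_all('p', class_='note')
note_texts = [p.text for p in notes]

# Equivalent lookup using the attrs dictionary instead of class_.
same_notes = parser.find_all('p', attrs={'class': 'note'})

print(note_texts)
print(len(same_notes))
```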
CSS Selectors
We can use BeautifulSoup’s .select() method to work with CSS selectors.
tag_nm_lst = parser.select('.class_name')
tag_nm_lst = parser.select('#id_name')
We can also combine tag names, classes and ids in one selector to match a nested structure:
nested_tags = 'body div .first #inner-text'
tag_nm_lst = parser.select(nested_tags)
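To make the nested selector concrete, here is a sketch against a hypothetical document built to match it. Spaces in the selector mean "descendant of", so the selector picks the element with id inner-text that sits inside an element of class first, inside a div, inside body:

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped to fit the selector used in this section.
html = (
    '<body><div><p class="first">intro '
    '<span id="inner-text">nested text</span></p></div>'
    '<p class="first">outside</p></body>'
)
parser = BeautifulSoup(html, 'html.parser')

nested_tags = 'body div .first #inner-text'

# select returns a list of every match.
nested = parser.select(nested_tags)
nested_texts = [tag.text for tag in nested]

# select_one returns only the first match, or None when nothing matches.
first_match = parser.select_one('.first')
no_match = parser.select_one('#missing')

print(nested_texts)
print(first_match.text)
print(no_match)
```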