Has anyone had any success using Beautiful Soup to fetch information from the Come Follow Me pages? I can't find a good set of tags to pass to soup.find(), even just to get the title of the lesson. Any advice would be appreciated! New scraper here so be gentle lol
I'm creating a little program for personal use (sort of a one-stop shop for my personal study with tools I want and such) and trying to understand how to pull the titles, texts, references, etc. from the Come Follow Me lessons.
Beautiful Soup on the CFM pages
-
- New Member
- Posts: 6
- Joined: Sun Sep 20, 2015 10:43 am
-
- Church Employee
- Posts: 2371
- Joined: Tue Jun 30, 2009 8:12 am
- Location: South Jordan, UT, USA
Re: Beautiful Soup on the CFM pages
I haven't used Beautiful Soup before, but the Come Follow Me pages have some good ids on the p tags for the main elements. I would look at the ids subtitle1 and title1. From a quick look at the docs, I would try soup.find(id="subtitle1")
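A minimal sketch of that lookup, run against a small inline HTML sample rather than the live page. The markup is made up for illustration; only the ids title1 and subtitle1 come from the post above, and the real page may structure things differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the ids title1 and subtitle1 come from the thread.
SAMPLE_HTML = """
<html><body>
  <p id="title1">Sample lesson title</p>
  <p id="subtitle1">Sample subtitle</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
title_tag = soup.find(id="title1")
subtitle_tag = soup.find(id="subtitle1")
missing_tag = soup.find(id="no-such-id")

title_text = title_tag.get_text(strip=True)
subtitle_text = subtitle_tag.get_text(strip=True)

print(title_text)
print(subtitle_text)
print(missing_tag)
```

Because find() returns None when there is no match, it is worth checking the result before calling get_text() on it.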
-
- New Member
- Posts: 6
- Joined: Sun Sep 20, 2015 10:43 am
Re: Beautiful Soup on the CFM pages
Thanks! I think I just need a little more experience identifying the right tags and such. Self-instruction isn't always easy lol
- sbradshaw
- Community Moderators
- Posts: 6259
- Joined: Mon Sep 26, 2011 9:42 pm
- Location: Utah
Re: Beautiful Soup on the CFM pages
Gospel Library Online loads a lot of content dynamically via JavaScript on the client side, but Beautiful Soup only parses the HTML you give it, and a plain HTTP request returns the raw server response (before any client-side JavaScript runs). So, in some cases the scraped content will differ from what you see in the browser's developer tools. It can help to print the response text before you parse it, to see what you actually got back.
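One way to check this, sketched under assumptions: parse the raw response text and report which selectors match nothing, before writing any extraction logic. missing_selectors is a hypothetical helper name, the inline HTML stands in for the response text of a real request, and the #loaded-by-js id is invented to represent content that only appears after JavaScript runs.

```python
from bs4 import BeautifulSoup


def missing_selectors(html, selectors):
    """Return the CSS selectors that match nothing in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [sel for sel in selectors if soup.select_one(sel) is None]


# Stand-in for the raw server response text (hypothetical markup).
RAW_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <p id="title1">Sample lesson title</p>
  <div id="app"></div>
</body></html>
"""

absent = missing_selectors(RAW_HTML, ["head title", "#title1", "#loaded-by-js"])
print(absent)
```

If a selector you copied from the browser shows up in the returned list, that element was probably added client-side and is not in the raw response.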
I haven't had a chance to play around too much with it yet, but you might have more consistent results by scraping Gospel Library Online "Basic" instead of regular Gospel Library Online – it doesn't rely as heavily on JavaScript, but seems to have the same content:
https://basic.churchofjesuschrist.org/study/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01?lang=eng
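A small sketch for switching between the two sites, assuming the Basic site uses the same /study path as the regular one. That holds for the URL above but is not verified for every page. build_study_url is a hypothetical helper, and the sketch leaves out any extra query parameters.

```python
from urllib.parse import urlencode

# Hosts taken from the URLs in this thread.
REGULAR_HOST = "www.churchofjesuschrist.org"
BASIC_HOST = "basic.churchofjesuschrist.org"


def build_study_url(path, lang="eng", basic=False):
    """Build a study URL for the regular site or the Basic site."""
    host = BASIC_HOST if basic else REGULAR_HOST
    query = urlencode({"lang": lang})
    return f"https://{host}/study{path}?{query}"


LESSON_PATH = "/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01"

basic_url = build_study_url(LESSON_PATH, basic=True)
regular_url = build_study_url(LESSON_PATH)
print(basic_url)
print(regular_url)
```

Keeping the host behind one flag makes it easy to fetch the same page from both sites and compare what comes back.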
I was able to get the title of a Come, Follow Me chapter using this code. I like using soup.select() and soup.select_one(), with a CSS selector, instead of the more verbose soup.find_all() and soup.find(), but that's up to preference.
Code:
import requests
from bs4 import BeautifulSoup

study_url_template = 'https://www.churchofjesuschrist.org/study{0}?lang={1}&mboxDisable=1'
study_url = study_url_template.format('/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01', 'eng')

# Get the page using requests library
r = requests.get(study_url)
r.encoding = 'utf-8'

if r and r.status_code == 200:
    # For debugging, to see what was scraped
    # print(r.text)

    # Parse HTML with BeautifulSoup
    soup = BeautifulSoup(r.text, 'html.parser')

    # Get metadata
    head_title = soup.select_one('head title').get_text()
    date = soup.select_one('#title_number1').get_text()
    title = soup.select_one('#title1').get_text()

    # Print metadata to the console
    print(head_title)
    print(date)
    print(title)
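One caveat with the code above: select_one() returns None when nothing matches, so calling .get_text() on the result raises an AttributeError if an id is missing from the response. A possible guard, as a sketch: text_or_default is a hypothetical helper, and the inline HTML is made up, with #title_number1 left out on purpose to show the fallback.

```python
from bs4 import BeautifulSoup


def text_or_default(soup, selector, default=None):
    """Return the stripped text of the first match, or default if none."""
    tag = soup.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else default


# Hypothetical markup; #title_number1 is deliberately absent.
SAMPLE_HTML = """
<html><head><title>Sample head title</title></head>
<body><p id="title1">Sample lesson title</p></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
head_title = text_or_default(soup, "head title")
date = text_or_default(soup, "#title_number1", default="(not found)")
title = text_or_default(soup, "#title1")

print(head_title)
print(date)
print(title)
```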
If you are loading a large number of pages, you may want to add a pause between each request to avoid overloading the Church's servers:
Code:
import time

SECONDS_TO_PAUSE_BETWEEN_REQUESTS = 1

for uri in pages_to_scrape:
    # … Scrape content here …
    time.sleep(SECONDS_TO_PAUSE_BETWEEN_REQUESTS)
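A slightly fuller sketch of the same idea, with the fetch step passed in as a function so the loop can be tried without touching the network. scrape_all and fake_fetch are hypothetical names. Unlike the loop above, this version pauses only between requests, not after the last one.

```python
import time


def scrape_all(uris, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call fetch(uri) for each URI, pausing between consecutive requests."""
    results = {}
    for index, uri in enumerate(uris):
        if index > 0:
            # Pause before every request except the first.
            sleep(pause_seconds)
        results[uri] = fetch(uri)
    return results


# Usage with a fake fetcher, so nothing is sent over the network.
def fake_fetch(uri):
    return f"<html>content of {uri}</html>"


pauses = []  # records each requested pause instead of actually sleeping
pages = scrape_all(["/a", "/b", "/c"], fake_fetch, pause_seconds=1.0, sleep=pauses.append)
print(len(pages), pauses)
```

In a real script, fetch would wrap requests.get() and the default sleep=time.sleep would be left in place.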
Samuel Bradshaw • If you desire to serve God, you are called to the work.