Beautiful Soup on the CFM pages

christopherbross · #1

Has anyone had any success using Beautiful Soup to fetch information from the Come Follow Me pages? I can't find a good set of tags to even get the title of the lesson in the soup.find() method. Any advice would be appreciated! New scraper here so be gentle lol

I'm creating a little program for personal use (sort of a one-stop shop for my personal study with tools I want and such) and trying to understand how to pull the titles, texts, references, etc. from the Come Follow Me lessons.

jonesrk · #2

I haven't used Beautiful Soup before, but the Come Follow Me pages have some good ids on the p tags for the main things. I would look at the ids subtitle1 and title1. From a quick look at the docs, I would try soup.find(id="subtitle1").
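
A minimal sketch of that lookup, run against made-up markup rather than the live page. Only the ids title1 and subtitle1 come from this thread; the sample HTML, its text, and the helper name text_by_id are assumptions for illustration. It also shows that find() returns None when nothing matches, which is worth checking before calling get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the ids title1 and subtitle1 come from the thread;
# the tag structure and the text are invented for this example.
SAMPLE_HTML = """
<html><body>
  <p id="title1">Sample lesson title</p>
  <p id="subtitle1">Sample lesson subtitle</p>
</body></html>
"""


def text_by_id(soup, element_id):
    """Return the stripped text of the element with this id, or None if absent."""
    element = soup.find(id=element_id)
    return element.get_text(strip=True) if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(text_by_id(soup, "title1"))
print(text_by_id(soup, "subtitle1"))
print(text_by_id(soup, "no_such_id"))  # find() found nothing, so this is None
```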

christopherbross · #3

Thanks! I think I just need a little more experience identifying the right tags and such. Self-instruction isn't always easy lol

sbradshaw · #4

Gospel Library Online loads a lot of content dynamically via JavaScript on the client side, but Beautiful Soup only parses the HTML you hand it, and a plain HTTP request returns just the raw server response (before any client-side JavaScript runs). So, in some cases the scraped content may differ from what you see in the browser's developer tools. It might be helpful to print the response text before you parse it, to see what you actually got back.
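
One way to check this without reading the whole response by eye is to ask which of the ids you expect are actually present in the raw HTML. This is only a sketch: the helper name missing_ids is hypothetical, and raw_response is a made-up stand-in for r.text so the example runs without a network call.

```python
from bs4 import BeautifulSoup


def missing_ids(html_text, expected_ids):
    """Return the ids from expected_ids that do not appear in the parsed HTML."""
    soup = BeautifulSoup(html_text, "html.parser")
    return [wanted for wanted in expected_ids if soup.find(id=wanted) is None]


# Stand-in for r.text (invented markup): title1 is in the server response,
# while subtitle1 is imagined to be filled in later by client-side JavaScript.
raw_response = (
    '<html><head><title>Page</title></head>'
    '<body><div id="app"></div><p id="title1">A title</p></body></html>'
)

print(missing_ids(raw_response, ["title1", "subtitle1"]))
```

If an id shows up in the browser but lands in this list, the element is probably being added after the page loads, and no choice of selector will find it in the raw response.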

I was able to get the title of a Come, Follow Me chapter using this code. I like using soup.select() and soup.select_one(), with a CSS selector, instead of the more verbose soup.find_all() and soup.find(), but that's up to preference.

Code:

import requests
from bs4 import BeautifulSoup

study_url_template = 'https://www.churchofjesuschrist.org/study{0}?lang={1}&mboxDisable=1'
study_url = study_url_template.format('/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01', 'eng')

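# Get the page using the requests library (the timeout stops a stalled request from hanging forever)
r = requests.get(study_url, timeout=30)
r.encoding = 'utf-8'
if r.status_code == 200:
  # For debugging, to see what was scraped
  # print(r.text)

  # Parse HTML with BeautifulSoup
  soup = BeautifulSoup(r.text, 'html.parser')

  # Get metadata (select_one() returns None when nothing matches)
  head_title = soup.select_one('head title')
  date = soup.select_one('#title_number1')
  title = soup.select_one('#title1')

  # Print metadata to the console, flagging anything that was not found
  for name, element in (('head title', head_title), ('date', date), ('title', title)):
    print(element.get_text(strip=True) if element is not None else f'{name}: not found')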
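
For comparison, here is a small sketch of the two styles side by side on made-up markup. The id title1 comes from this thread; the verse class and the sample text are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the "verse" class and all text are invented.
SAMPLE_HTML = """
<html><head><title>Sample page</title></head>
<body>
  <p id="title1">Sample lesson title</p>
  <p class="verse">First paragraph</p>
  <p class="verse">Second paragraph</p>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single element by id: keyword-argument style vs. CSS selector style
title_via_find = soup.find(id="title1").get_text()
title_via_select = soup.select_one("#title1").get_text()

# Many elements by tag and class: find_all() vs. select()
verses_via_find = [p.get_text() for p in soup.find_all("p", class_="verse")]
verses_via_select = [p.get_text() for p in soup.select("p.verse")]

print(title_via_find == title_via_select)
print(verses_via_find == verses_via_select)
```

Both styles return the same elements here, so the choice mostly comes down to whether you find CSS selectors or keyword arguments easier to read.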

I haven't had a chance to play around too much with it yet, but you might have more consistent results by scraping Gospel Library Online "Basic" instead of regular Gospel Library Online – it doesn't rely as heavily on JavaScript, but seems to have the same content:
https://basic.churchofjesuschrist.org/study/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01?lang=eng
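
To make it easy to switch between the two sites, the URL could be built by a small helper. This is a sketch, not a documented API: the function name build_study_url and its parameters are hypothetical, and the URL shapes are copied from the two examples in this post (the mboxDisable=1 parameter appears only on the regular site's URL above, so the sketch leaves it off for Basic).

```python
from urllib.parse import urlencode

STANDARD_HOST = "https://www.churchofjesuschrist.org"
BASIC_HOST = "https://basic.churchofjesuschrist.org"


def build_study_url(path, lang="eng", basic=False):
    """Build a study URL for the regular site or the Basic site."""
    host = BASIC_HOST if basic else STANDARD_HOST
    params = {"lang": lang}
    if not basic:
        # Mirrors the regular-site URL template used earlier in this post
        params["mboxDisable"] = "1"
    return f"{host}/study{path}?{urlencode(params)}"


chapter = "/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01"
print(build_study_url(chapter))
print(build_study_url(chapter, basic=True))
```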

If you are loading a large number of pages, you may want to add a pause between each request to avoid overloading the Church's servers:

Code:

import time

SECONDS_TO_PAUSE_BETWEEN_REQUESTS = 1

for uri in pages_to_scrape:
  # … Scrape content here …
  
  time.sleep(SECONDS_TO_PAUSE_BETWEEN_REQUESTS)
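
A slightly fuller sketch of the same idea, with the fetching step passed in as a function so the loop can be tried out without hitting the network. The names scrape_all and fake_fetch are hypothetical; in real use, fetch would wrap the requests.get() and parsing code from earlier in this post.

```python
import time


def scrape_all(uris, fetch, pause_seconds=1.0, sleep=time.sleep):
    """Call fetch(uri) for each uri, pausing between consecutive requests."""
    results = {}
    for index, uri in enumerate(uris):
        if index > 0:
            # Pause only between requests, not before the first one
            sleep(pause_seconds)
        results[uri] = fetch(uri)
    return results


def fake_fetch(uri):
    """Stand-in for a real request; returns placeholder HTML."""
    return f"<html>content for {uri}</html>"


# pause_seconds=0 keeps this demo instant; use 1 or more for real scraping
pages = scrape_all(["/01", "/02", "/03"], fake_fetch, pause_seconds=0)
print(len(pages))
```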
