Beautiful Soup on the CFM pages

Discussions about the Notes and Journal tool on LDS.org. This includes the Study Toolbar as well as the scriptures and other content on LDS.org that is integrated with Notes and Journal.
christopherbross
New Member
Posts: 6
Joined: Sun Sep 20, 2015 10:43 am

Beautiful Soup on the CFM pages

#1

Post by christopherbross »

Has anyone had any success using Beautiful Soup to fetch information from the Come Follow Me pages? I can't even find a good set of tags to get the title of the lesson with soup.find(). Any advice would be appreciated! New to scraping here, so be gentle lol

I'm creating a little program for personal use (sort of a one-stop shop for my personal study with tools I want and such) and trying to understand how to pull the titles, texts, references, etc. from the Come Follow Me lessons.
jonesrk
Church Employee
Posts: 2361
Joined: Tue Jun 30, 2009 8:12 am
Location: South Jordan, UT, USA

Re: Beautiful Soup on the CFM pages

#2

Post by jonesrk »

I haven't used Beautiful Soup before, but looking at the Come Follow Me pages, they have some useful ids on the p tags for the main pieces. I would look at the ids subtitle1 and title1. From a quick look at the docs, I would try soup.find(id="subtitle1").
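
A minimal sketch of what that might look like (the chapter URL is just an example, and I'm assuming those ids are present in the HTML the server returns):

Code: Select all

import requests
from bs4 import BeautifulSoup

# Example Come, Follow Me chapter
url = 'https://www.churchofjesuschrist.org/study/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01?lang=eng'

r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Look up the title and subtitle by their p tag ids
title = soup.find(id="title1")
subtitle = soup.find(id="subtitle1")

# Guard against ids that aren't in the served HTML
if title:
  print(title.get_text())
if subtitle:
  print(subtitle.get_text())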
christopherbross
New Member
Posts: 6
Joined: Sun Sep 20, 2015 10:43 am

Re: Beautiful Soup on the CFM pages

#3

Post by christopherbross »

Thanks! I think I just need a little more experience identifying the right tags and such. Self-instruction isn't always easy lol
sbradshaw
Community Moderators
Posts: 6251
Joined: Mon Sep 26, 2011 9:42 pm
Location: Utah

Re: Beautiful Soup on the CFM pages

#4

Post by sbradshaw »

Gospel Library Online loads a lot of content dynamically via JavaScript on the client side, but web scraping tools like Beautiful Soup can only see the raw server response (before client-side JavaScript runs). So, in some cases there might be differences between the scraped content and what you see in the browser's developer tools. It might be helpful to print the text response before you parse it, to see what you got back.

I was able to get the title of a Come, Follow Me chapter using this code. I like using soup.select() and soup.select_one(), with a CSS selector, instead of the more verbose soup.find_all() and soup.find(), but that's a matter of preference.

Code: Select all

import requests
from bs4 import BeautifulSoup

study_url_template = 'https://www.churchofjesuschrist.org/study{0}?lang={1}&mboxDisable=1'
study_url = study_url_template.format('/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01', 'eng')

# Get the page using requests library
r = requests.get(study_url)
r.encoding = 'utf-8'
if r and r.status_code == 200:
  # For debugging, to see what was scraped
  # print(r.text)
  
  # Parse HTML with BeautifulSoup
  soup = BeautifulSoup(r.text, 'html.parser')
  
  # Get metadata
  head_title = soup.select_one('head title').get_text()
  date = soup.select_one('#title_number1').get_text()
  title = soup.select_one('#title1').get_text()
  
  # Print metadata to the console
  print(head_title)
  print(date)
  print(title)
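
Since you also asked about pulling the lesson text and references, here's a rough follow-on that continues from the script above (it reuses the same soup object, and the .body-block selector is only a guess about the markup, so print r.text and adjust it to whatever you actually see):

Code: Select all

# Rough follow-on to the script above (uses the same soup object).
# The '.body-block p' selector is a guess about the page markup;
# inspect the printed r.text and adjust it to match what comes back.
for paragraph in soup.select('.body-block p'):
  print(paragraph.get_text())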
I haven't had a chance to play around too much with it yet, but you might have more consistent results by scraping Gospel Library Online "Basic" instead of regular Gospel Library Online – it doesn't rely as heavily on JavaScript, but seems to have the same content:
https://basic.churchofjesuschrist.org/study/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01?lang=eng
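
For example, the same approach should work with the base URL swapped out (a sketch, assuming the Basic pages use similar markup, which I haven't verified):

Code: Select all

import requests
from bs4 import BeautifulSoup

# Same idea as above, pointed at the Basic site instead
basic_url_template = 'https://basic.churchofjesuschrist.org/study{0}?lang={1}'
basic_url = basic_url_template.format('/manual/come-follow-me-for-individuals-and-families-new-testament-2023/01', 'eng')

r = requests.get(basic_url)
r.encoding = 'utf-8'
soup = BeautifulSoup(r.text, 'html.parser')

# The head title should exist either way; the p tag ids may or may not match
head_title = soup.select_one('head title')
if head_title:
  print(head_title.get_text())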

If you are loading a large number of pages, you may want to add a pause between each request to avoid overloading the Church's servers:

Code: Select all

import time

SECONDS_TO_PAUSE_BETWEEN_REQUESTS = 1

# pages_to_scrape stands in for your own list of page URIs
for uri in pages_to_scrape:
  # … Scrape content here …
  
  time.sleep(SECONDS_TO_PAUSE_BETWEEN_REQUESTS)
Samuel Bradshaw • If you desire to serve God, you are called to the work.
