Google Calendar Sync Issue (Robots.txt)

aebrown · Post by **aebrown** » Tue Sep 02, 2014 5:53 am

russellhltn wrote:I'm not so sure. When I look up the purpose of Robots.txt, it seems to be aimed at automated processes, not just search engine web crawlers.

What's your source? Everything I read talks about web crawlers. See here, or here.

russellhltn wrote:When I look for complaints about the robots.txt for Google calendar - I can find them, but they are from years ago. And thanks to the Wayback Machine we can see that there was a change to the church's robots.txt done on Aug 13. So while we can argue standards, it was the church that made the change that broke things.

It's true that the Church made a change, and it was an important change, but I don't think it's fair to say that it was that change "that broke things." That implies that the Church did something wrong in their change. The change they made on August 13 was to change the "disallow" directory from "/calendar" to "/church-calendar". I don't know why they ever had "/calendar" in the disallow category, since the calendar stuff has been under "/church-calendar" for years. I would think it's more accurate to say that the Church finally corrected their mistake.
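To make the effect of that one-line change concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The two rule sets are reduced to the single line under discussion, and the sync URL (host, path layout, and identifier) is a made-up placeholder; only the `/church-calendar` prefix comes from this thread. It shows how a client that chooses to honor robots.txt would treat the same URL before and after August 13, not how Google actually evaluates the file.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_lines, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Hypothetical sync URL: only the /church-calendar prefix is taken from the thread.
SYNC_URL = "https://example.org/church-calendar/sync/EXAMPLE-LONG-ID"

# Simplified stand-ins for the file before and after the Aug 13 change.
before = ["User-agent: *", "Disallow: /calendar"]
after = ["User-agent: *", "Disallow: /church-calendar"]

print(allowed(before, "Googlebot", SYNC_URL))  # True: path does not start with /calendar
print(allowed(after, "Googlebot", SYNC_URL))   # False: path starts with /church-calendar
```

Matching is by path prefix, so the old `/calendar` rule never applied to anything under `/church-calendar`.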

russellhltn wrote:The fact it hasn't been changed back makes me think there's a security issue involved.

It may or may not be a security issue. If the web crawlers for search engines somehow became aware of Sync URLs and started crawling those URLs (perhaps because they come across them in some other source where someone makes them available, such as sometimes happens here on LDSTech where someone posts their personal sync URL), then they would be affecting the throttle mechanism. It could be that the exclusion is simply to avoid such a problem.

russellhltn · Post by **russellhltn** » Tue Sep 02, 2014 12:23 pm

aebrown wrote:
russellhltn wrote:I'm not so sure. When I look up the purpose of Robots.txt, it seems to be aimed at automated processes, not just search engine web crawlers.
What's your source? Everything I read talks about web crawlers. See here, or here.

That second link says "and other web robots". It should be noted that the file is "Robots" not "Crawler" or "Spider".

The bottom line is that Google thinks their calendar import bot should obey Robots.txt. It's made worse by the fact that Googlebot appears to be the name of both the web indexing function and the calendar import function. (Some other Google indexing functions have different names so they can be specifically controlled by the text file.)

aebrown wrote:It's true that the Church made a change, and it was an important change, but I don't think it's fair to say that it was that change "that broke things."

We can debate fault, but it's that change that caused things to stop working, not a change implemented on Google's side. Had this text file been that way from the beginning, it would appear that Google Calendar sync would have never worked and perhaps never been offered.

So where do we go from here? Unless the church changes the text file, perhaps allowing just Googlebot to get files, this isn't going to get fixed anytime soon.

aebrown · Post by **aebrown** » Tue Sep 02, 2014 1:14 pm

russellhltn wrote:
aebrown wrote:
russellhltn wrote:I'm not so sure. When I look up the purpose of Robots.txt, it seems to be aimed at automated processes, not just search engine web crawlers.
What's your source? Everything I read talks about web crawlers. See here, or here.
That second link says "and other web robots". It should be noted that the file is "Robots" not "Crawler" or "Spider".

And what is a robot? You seem to be assuming that it might mean something other than a crawler or spider. But the definition given at http://www.robotstxt.org/ is:

Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are programs that traverse the Web automatically. Search engines such as Google use them to index the web content, spammers use them to scan for email addresses, and they have many other uses.

The key attribute is that robots "traverse the Web automatically." That is definitely NOT what is happening when Google Calendar accesses a Sync URL -- that is hitting a specific URL provided by a user.

russellhltn wrote:The bottom line is that Google thinks their calendar import bot should obey Robots.txt.

You're correct that Google seems to think that. But Google is wrong.

russellhltn wrote:
aebrown wrote:It's true that the Church made a change, and it was an important change, but I don't think it's fair to say that it was that change "that broke things."
We can debate fault, but it's that change that caused things to stop working, not a change implemented on Google's side. Had this text file been that way from the beginning, it would appear that Google Calendar sync would have never worked and perhaps never been offered.

That's rather speculative. You're assuming that Google's implementation now is the same that it was back when Calendar Sync was implemented. I personally have no idea whether robots.txt was considered in Google Calendar's sync at that point, or if it was changed in the meantime. I don't think we have enough information to draw a conclusion one way or the other.

russellhltn wrote:So where do we go from here? Unless the church changes the text file, perhaps allowing just Googlebot to get files, this isn't going to get fixed anytime soon.

Changing the robots.txt file to allow only Googlebot (or removing that disallow line entirely) would fix the calendar sync, but the Church might not think it would be acceptable. That would almost certainly subvert the reasons why the Church specifically chose to disallow the /church-calendar directory in robots.txt.
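As a rough illustration of the "allow only Googlebot" option, the sketch below feeds a hypothetical rule set to Python's standard `urllib.robotparser`. The rules and the sync URL are assumptions written for this example, not the Church's actual file, and other parsers (including Google's) may resolve Allow/Disallow conflicts differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot gets an explicit Allow, everyone else is excluded.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /church-calendar

User-agent: *
Disallow: /church-calendar
"""

# Placeholder sync URL; only the /church-calendar prefix comes from the thread.
SYNC_URL = "https://example.org/church-calendar/sync/EXAMPLE-LONG-ID"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

googlebot_ok = parser.can_fetch("Googlebot", SYNC_URL)     # matches the Googlebot group
other_bot_ok = parser.can_fetch("SomeOtherBot", SYNC_URL)  # falls through to the * group

print(googlebot_ok)   # True
print(other_bot_ok)   # False
```

Because the web indexer and the calendar importer apparently share the Googlebot name, a rule like this would also let Google's search crawler into that path.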

I certainly don't know why /church-calendar is in robots.txt. Much of what is presented on web pages under that path is protected by login. The main Calendar view, and all its related pages under Settings all appear under that path, but none of those would be available to a web crawler, so that's no reason to disallow that path via robots.txt. The only URLs I know of under that path that are publicly visible are for sync URLs, which contain a very long identifier as part of their path. A web crawler would have no way to visit such URLs unless a link to that URL were posted on some other publicly available page. One possible motive for disallowing that path in robots.txt is to avoid the effects of the throttle mechanism being activated as web crawlers follow such links that they found on a public page. If that's the main reason, I personally think the Church could remove their exclusion. Anyone who posts their Sync URL publicly deserves whatever pain comes from the throttle mechanism being activated by a web crawler, since they were specifically told "This URL is provided for your own personal use. It should not be shared with anyone else."

Another possibility is for the Church to try to get Google to change their implementation. That certainly seems like a long shot, but Google has on occasion been known to make changes based on user feedback.

ButtonGear · Post by **ButtonGear** » Tue Sep 02, 2014 1:47 pm

Regardless of the actual technical issues, causes and parties to blame, the Church's calendar won't talk to my Google Calendar, and that's not acceptable. The rest is just technical sabre-rattling.

From what I've read of this issue (people's comments who appear to be far more technologically "in the know" than am I), it seems like the Church needs to "suck it up," play nice with Google and let us sync up. Because in the real world, people use their GOOGLE calendars, and rarely look at the Church's proprietary online calendar, unless they are scheduling a building. And it sure would be nice to know what's going on in my ward and stake by looking at my phone, where (like most people today) I live and die.

russellhltn · Post by **russellhltn** » Tue Sep 02, 2014 2:49 pm

aebrown wrote:
russellhltn wrote:The bottom line is that Google thinks their calendar import bot should obey Robots.txt.
You're correct that Google seems to think that. But Google is wrong.

Ultimately, that's really up to a standards body.

aebrown wrote:That's rather speculative. You're assuming that Google's implementation now is the same that it was back when Calendar Sync was implemented. I personally have no idea whether robots.txt was considered in Google Calendar's sync at that point, or if it was changed in the meantime. I don't think we have enough information to draw a conclusion one way or the other.

This post from February and this post from 2013 show it's been that way at least that long. Someone even started a service to deal with the problem for iCloud users.

The good news is that things are moving behind the scenes on the church's side. We can expect a solution soon. Maybe even today.

aebrown · Post by **aebrown** » Tue Sep 02, 2014 4:50 pm

russellhltn wrote:
aebrown wrote:
russellhltn wrote:The bottom line is that Google thinks their calendar import bot should obey Robots.txt.
You're correct that Google seems to think that. But Google is wrong.
Ultimately, that's really up to a standards body.

Unfortunately, there's no standards body for robots.txt -- it's simply a de facto standard.

russellhltn wrote:The good news is that things are moving behind the scenes on the church's side. We can expect a solution soon. Maybe even today.

The robots.txt file for LDS.org no longer has a "Disallow" line for /church-calendar, which is good news.

However, when I try to add my calendar sync URL to my Google Calendar, I still get the error message about robots.txt. Perhaps Google is doing some caching, which will take some time to expire.

russellhltn · Post by **russellhltn** » Tue Sep 02, 2014 5:23 pm

aebrown wrote:The robots.txt file for LDS.org no longer has a "Disallow" line for /church-calendar, which is good news.

However, when I try to add my calendar sync URL to my Google Calendar, I still get the error message about robots.txt. Perhaps Google is doing some caching, which will take some time to expire.

I'm seeing the same thing. Your guess sounds as good as any.

aebrown · Post by **aebrown** » Tue Sep 02, 2014 7:08 pm

aebrown wrote:However, when I try to add my calendar sync URL to my Google Calendar, I still get the error message about robots.txt. Perhaps Google is doing some caching, which will take some time to expire.

I found this, where Google says: "First, the cache of the robots.txt file must be refreshed (we generally cache the contents for up to one day)."

So it may still take another 20 hours or so until Google's cache expires. Hopefully the sync process will start working once their cache of robots.txt is refreshed.
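For anyone wondering why a fixed robots.txt can still produce the old error, here is a minimal sketch of a time-based cache. It is not Google's implementation; the class name, the injected `fetch` function, and the timestamps are all invented for illustration. Only the "up to one day" lifetime comes from the quote above.

```python
from datetime import datetime, timedelta


class CachedRobots:
    """Keeps one copy of robots.txt and refetches it only after the TTL expires."""

    def __init__(self, fetch, ttl=timedelta(days=1)):
        self._fetch = fetch        # callable returning the current file contents
        self._ttl = ttl            # how long a cached copy is trusted
        self._body = None
        self._fetched_at = None

    def get(self, now):
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            self._body = self._fetch()
            self._fetched_at = now
        return self._body


# Simulated server-side file that changes after the first fetch.
live = {"body": "User-agent: *\nDisallow: /church-calendar"}
cache = CachedRobots(lambda: live["body"])

t0 = datetime(2014, 9, 2, 12, 0)               # arbitrary starting time
first = cache.get(t0)                          # cache filled with the old rules
live["body"] = "User-agent: *"                 # the Disallow line is removed
stale = cache.get(t0 + timedelta(hours=4))     # still the old copy
fresh = cache.get(t0 + timedelta(hours=25))    # TTL passed, new copy fetched

print("Disallow" in stale)  # True
print("Disallow" in fresh)  # False
```

Under this model the wait depends on when the cached copy was last fetched, not on when the file was changed.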

aebrown · Post by **aebrown** » Wed Sep 03, 2014 10:37 am

An update: sometime yesterday evening something changed. As I was doing the "Add by URL" option in Google Calendar, I ceased getting a message saying that the URL could not be retrieved because of robots.txt. So that would seem to indicate that the change posted yesterday afternoon to the LDS.org robots.txt was now being seen by Google Calendar.

However, nothing actually gets imported. The church calendar appears on my list of Other Calendars, but there are no events. And previously, I at least saw a "Loading..." popup right after I selected "Add by URL", which remained in place for a few seconds until the robots.txt error message was displayed. Now I don't see anything happen, except that the calendar appears in the list of Other Calendars. There are no events. And I'm making sure that I wait far more than 9 minutes between my attempts, so there should be no throttle issues.

aebrown · Post by **aebrown** » Wed Sep 03, 2014 10:47 am

aebrown wrote:An update:

An update to my update: Hallelujah!

I decided to generate a new Sync URL. When I used the new Sync URL in "Add by URL" in Google Calendar, I got the "Importing calendar by URL..." popup. So that was comforting.

Then that message went away, and no events appeared at first. But I went back and checked a few minutes later, and all the events from my stake calendar are truly there.

I have no way of knowing now whether generating a new Sync URL was an essential step; it may be that caches finally got completely cleared out, and so it would have worked now even with my old Sync URL. Perhaps someone else can try it with an existing Sync URL and report what happens.

In any case, it seems to work correctly now for an initial import. I'll edit an event on the stake calendar and see if and when that change comes across to my Google Calendar.
