russellhltn wrote:aebrown wrote:russellhltn wrote:I'm not so sure. When I look up the purpose of Robots.txt, it seems to be aimed at automated processes, not just search engine web crawlers.
What's your source? Everything I read talks about web crawlers. See here, or here.
That second link says "and other web robots". It should be noted that the file is "Robots" not "Crawler" or "Spider".
And what is a robot? You seem to be assuming that it might mean something other than a crawler or spider. But the definition given at http://www.robotstxt.org/ is:
Web Robots (also known as Web Wanderers, Crawlers, or Spiders), are programs that traverse the Web automatically. Search engines such as Google use them to index the web content, spammers use them to scan for email addresses, and they have many other uses.
The key attribute is that robots "traverse the Web automatically." That is definitely NOT what is happening when Google Calendar accesses a Sync URL -- that is hitting a specific URL provided by a user.
russellhltn wrote:The bottom line is that Google thinks their calendar import bot should obey Robots.txt.
You're correct that Google seems to think that. But Google is wrong.
russellhltn wrote:aebrown wrote:It's true that the Church made a change, and it was an important change, but I don't think it's fair to say that it was that change "that broke things."
We can debate fault, but it's that change that caused things to stop working. Not a change implemented on Google's side. Had this text file been that way from the beginning, it would appear that Google Calendar sync would never have worked and perhaps never been offered.
That's rather speculative. You're assuming that Google's implementation now is the same as it was back when Calendar Sync was implemented. I personally have no idea whether robots.txt was considered in Google Calendar's sync at that point, or whether that behavior was changed in the meantime. I don't think we have enough information to draw a conclusion one way or the other.
russellhltn wrote:So where do we go from here? Unless the church changes the text file, perhaps allowing just Googlebot to get files, this isn't going to get fixed anytime soon.
Changing the robots.txt file to allow only Googlebot (or removing that disallow line entirely) would fix the calendar sync, but the Church might not find that acceptable. It would almost certainly subvert whatever reasons the Church had for specifically disallowing the /church-calendar directory in robots.txt.
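To make the "allow only Googlebot" option concrete, here is a rough sketch using Python's standard urllib.robotparser. The rule sets are my own guesses at what the relevant lines might look like, not copies of the Church's actual robots.txt, and the sync URL and "SomeOtherBot" are made up for illustration. It also assumes the fetcher identifies itself as Googlebot, which I can't confirm for Google Calendar. All it shows is how a parser that follows the robots.txt convention would read each version.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# ASSUMPTION: roughly what the disallow rule looks like today.
CURRENT_RULES = """\
User-agent: *
Disallow: /church-calendar
"""

# ASSUMPTION: one way to exempt Googlebot while blocking everyone else.
# An empty Disallow line means "nothing is disallowed" for that agent.
GOOGLEBOT_ONLY_RULES = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /church-calendar
"""

# HYPOTHETICAL sync URL; the real host, path, and identifier will differ.
SYNC_URL = "https://example.org/church-calendar/sync/HYPOTHETICAL-LONG-ID"

current = build_parser(CURRENT_RULES)
proposed = build_parser(GOOGLEBOT_ONLY_RULES)

results = {
    "current/Googlebot": current.can_fetch("Googlebot", SYNC_URL),
    "current/SomeOtherBot": current.can_fetch("SomeOtherBot", SYNC_URL),
    "proposed/Googlebot": proposed.can_fetch("Googlebot", SYNC_URL),
    "proposed/SomeOtherBot": proposed.can_fetch("SomeOtherBot", SYNC_URL),
}

for label, allowed in results.items():
    print(f"{label}: {'allowed' if allowed else 'blocked'}")
```

Under these assumed rules, only the proposed/Googlebot case comes back allowed. That is the trade-off: the exemption fixes sync for that one agent, but it also opens the whole /church-calendar path to Google's regular crawler.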
I certainly don't know why /church-calendar is in robots.txt. Much of what is presented on web pages under that path is protected by login. The main Calendar view and all its related pages under Settings appear under that path, but none of those would be available to a web crawler, so they are no reason to disallow that path via robots.txt. The only URLs I know of under that path that are publicly visible are the sync URLs, which contain a very long identifier as part of their path. A web crawler would have no way to visit such a URL unless a link to it were posted on some other publicly available page. One possible motive for disallowing that path in robots.txt is to keep the throttle mechanism from being activated when web crawlers follow such links found on a public page. If that's the main reason, I personally think the Church could remove the exclusion. Anyone who posts their Sync URL publicly deserves whatever pain comes from the throttle mechanism being activated by a web crawler, since they were specifically told "This URL is provided for your own personal use. It should not be shared with anyone else."
Another possibility is for the Church to try to get Google to change their implementation. That certainly seems like a long shot, but Google has on occasion been known to make changes based on user feedback.