Here is a VERY rough prototype for auto-cross-referencing using the simple verse-matching method I mentioned in the other post.
It's still in development. There are 100 ways I can think of to improve it, so it may get better day to day. But as is, it's not bad.
Nothing fancy yet. You have to know the text of the verse ahead of time.
Basically, put in the exact verse reference written out fully (like: 1 Nephi 3:7) and then put in some phrase or word from the verse, like: commandments.
If you want the EXACT phrase, use quotes, like "go and do".
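As a sketch of how those two input styles might be told apart (a hypothetical helper, not the live app's code): a quoted input is treated as an exact phrase, anything else as plain words.

```python
def parse_query(raw):
    """Classify a user query: quoted input like '"go and do"' is an
    exact phrase; anything else is a plain word search."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return ('phrase', raw[1:-1])
    return ('words', raw)
```

The tuple tag could then select between a Lucene phrase query and an ordinary term query.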
http://menkefamily.no-ip.org/
[thread="181"]I posted it in the other post too[/thread]
I double-posted since I felt it was a new enough topic, being a specific instance of a stat-linking app.
Auto-cross-referencing Prototype
-
- New Member
- Posts: 24
- Joined: Fri Jan 26, 2007 10:25 am
- Location: San Jose, CA
-
- Senior Member
- Posts: 2085
- Joined: Wed Sep 06, 2006 8:51 am
- Location: Kaysville, UT, USA
This is cool stuff. The online scriptures also do some of this, but not to the extent you did. For example, if you use the online scriptures and search for "dwealt" (notice the misspelling), it will ask if you meant "dwelt". You can also search for "gather" and it will find related words like "gathering" and "gathered".
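For reference, matching "gather" against "gathering" and "gathered" is usually done with a stemmer. A crude suffix-stripping sketch (not the online scriptures' actual method; a real system would use something like the Porter stemmer that ships with Lucene's analyzers):

```python
def crude_stem(word):
    """Very rough suffix stripping: map inflected forms to a shared stem.
    The length guard avoids mangling short words like 'sing'."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word
```

Indexing stems instead of surface forms would let "gather" match all three variants with no extra work at query time.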
Is your code open source so we can look at it?
Tom
Hey, thanks.
Yeah, I don't do anything nearly as fancy as the scriptures.lds.org search yet. I mean, I don't handle alternate spellings, synonyms, word conjugations, or even misspellings. Those would all improve the results.
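A standard way to catch misspellings like the ones mentioned above is edit distance; a minimal dynamic-programming sketch (not part of the prototype):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

For example, "dwealt" is one edit away from "dwelt", so a distance threshold of 1 or 2 would catch that kind of typo.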
As for open source, I'll clean up the code a little and then post it either here or somewhere.
If you guys use it, I'd be interested in hearing "how it's going" and even participating if possible.
Even without cleaning it up, it's only 162 lines of code. Of course, that assumes you already have access to a full-text index of all the scriptures.
I'll post where the code is available in this thread and maybe the other.
--josh
OK, here is the source code.
It's also available here for now:
http://menkefamily.no-ip.org/MatchVerses.pys
You still need a Lucene index and PyLucene to use it.
Code:
#!/usr/bin/env python
from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory, Term, QueryTermExtractor
from PyLucene import VERSION, LUCENE_VERSION
import string
import optparse

"""
MatchVerses.py
Author: Josh Menke
Date created: Feb 12 2007
Last Modified: Feb 14 2007

Given a verse and a search for the verse, this will find verses that contain the
search and will rank them by how closely they match the original verse.
Meant as an "auto-cross-indexer".

This script is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.SearchFiles.
"""

def remove_punctuation(text):
    cleaned = ""
    for char in text:
        if not char in string.punctuation:
            cleaned += char
    return cleaned

def get_bigrams(words):
    bigram_set = set()
    for word_number in range(len(words)-1):
        bigram_set.add(words[word_number]+words[word_number+1])
    return bigram_set

def run(searcher, analyzer, verse=None, command=None):
    # output is handled differently if this was called from the console vs. interactively
    is_console = verse is not None
    if is_console:
        paragraph = "<p>"
    else:
        paragraph = ""
    while True:
        if not is_console:
            print paragraph
            verse = raw_input("Verse:")
            if verse == '':
                return
        print paragraph, "Verse:", verse
        verse_path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/./verses/"+verse+".txt"
        try:
            verse_raw = open(verse_path).read()
        except IOError:
            print paragraph, "ERROR: Verse not found. Please use full names for now"
            if is_console:
                return
            else:
                continue
        print paragraph, verse_raw
        # weight terms using API into my PyLucene-created full-text index
        term_iter = searcher.getIndexReader().terms()
        term_weight_map = {}
        while term_iter.next():
            term = term_iter.term()
            text = term.text()
            freq = term_iter.docFreq()
            term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()
        # remove punctuation
        words_in_verse = remove_punctuation(verse_raw).split()
        # only use words Lucene likes (removes stopwords)
        word_set = set()
        for word in words_in_verse:
            query = QueryParser("contents", analyzer).parse(word)
            hits = searcher.search(query)
            if hits.length():
                word_set.add(word.lower())
        # calculate highest possible score
        highest_possible = 0
        for word in word_set:
            try:
                highest_possible += term_weight_map[word]
            except KeyError:
                continue
        # add bigrams
        bigram_set = get_bigrams(words_in_verse)
        highest_possible += len(bigram_set)
        if not is_console:
            print "Hit enter with no input to quit."
            command = raw_input("Query:")
            if command == '':
                return
            print
        if is_console:
            print "\n"
        print paragraph, "Searching for:", command
        query = QueryParser("contents", analyzer).parse(command)
        hits = searcher.search(query)
        print paragraph, "%s total matching documents." % hits.length()
        matching_docs = []
        for doc in hits:
            path = doc.get("path")
            path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/"+path
            if path == verse_path:
                continue
            raw_verse = open(path).read()
            # remove punctuation
            verse_words = remove_punctuation(raw_verse).split()
            # score match
            matches = word_set.intersection(set(verse_words))
            match_score = 0
            for match in matches:
                try:
                    match_score += term_weight_map[match.lower()]
                except KeyError:
                    continue
            # add bigram bonus
            match_bigram_set = get_bigrams(verse_words)
            match_score += len(match_bigram_set.intersection(bigram_set))
            # store match
            matching_docs.append((match_score, doc.get("name"), raw_verse))
        matching_docs.sort()
        if is_console:
            matching_docs.reverse()
            matching_docs = matching_docs[0:10]
        else:
            matching_docs = matching_docs[-10:]
        if is_console:
            print "\n", "\n"
        for doc in matching_docs:
            print paragraph, doc[1][0:doc[1].find(".")], "Score:", round(float(doc[0])/highest_possible, 2)
            print paragraph, doc[2]
        if is_console:
            print "\n", "\n"
        if is_console:
            return

if __name__ == '__main__':
    usage = "usage: %prog [options] [\"verse\" \"word or phrase\"]"
    parser = optparse.OptionParser(usage=usage)
    parser.add_option("-d", "--directory", dest="directory", help="directory with index", default="index")
    (options, args) = parser.parse_args()
    if len(args) not in (0, 2):
        parser.error("Incorrect number of arguments")
    STORE_DIR = options.directory
    directory = FSDirectory.getDirectory(STORE_DIR, False)
    searcher = IndexSearcher(directory)
    analyzer = StandardAnalyzer()
    if len(args) == 2:
        run(searcher, analyzer, args[0], args[1])
    else:
        run(searcher, analyzer)
    searcher.close()
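To see the two helper functions above in isolation, here is a compact restatement (same logic, no Lucene index needed), showing how the bigram bonus rewards word *order* rather than just shared vocabulary:

```python
import string

def remove_punctuation(text):
    # Drop every ASCII punctuation character, keep everything else.
    return "".join(ch for ch in text if ch not in string.punctuation)

def get_bigrams(words):
    # Concatenate each adjacent word pair into one token.
    return set(words[i] + words[i + 1] for i in range(len(words) - 1))

# Two verses sharing the consecutive run "and do the" share two bigrams.
a = remove_punctuation("go and do the things").lower().split()
b = remove_punctuation("and do the work").lower().split()
shared = get_bigrams(a) & get_bigrams(b)   # {'anddo', 'dothe'}
```

Each shared bigram adds a flat 1.0 to the match score in the script above, which is why verses quoting a phrase verbatim outrank verses that merely reuse the same words.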
--josh
FYI, I slightly modified the term weights. I changed:
Code:
term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()
to
Code:
term_weight_map[text] = 1-freq*4.0/searcher.maxDoc()
The most common word (outside the normal English stop words) is "unto", which appears in over 11,000 verses. By quadrupling the commonality penalty, "unto" becomes worth almost 0.
--josh