Here is a VERY rough prototype for auto-cross-referencing using the simple verse-matching method I mentioned in the other post.
It's still in development. There are 100 ways I can think of to improve it, so it may get better day to day. But as is, it's not bad.
Nothing fancy yet. You have to know the text of the verse ahead of time.
Basically, put in the exact verse reference written out fully (like: 1 Nephi 3:7) and then put in some phrase or word from the verse, like: commandments.
If you want the EXACT phrase, use quotes, like "go and do".
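As a sketch of how those two input styles might be told apart (a hypothetical helper, not the live app's code): a quoted input is treated as an exact phrase, anything else as plain words.

```python
def parse_query(raw):
    """Classify a user query: quoted input like '"go and do"' is an
    exact phrase; anything else is a plain word search."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == '"' and raw[-1] == '"':
        return ('phrase', raw[1:-1])
    return ('words', raw)
```

The tuple tag could then select between a Lucene phrase query and an ordinary term query.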
http://menkefamily.no-ip.org/
[thread="181"]I posted it in the other post too[/thread]
I double-posted since I felt it was a new enough topic, being a specific instance of a stat-linking app.
Auto-cross-referencing Prototype
-
- New Member
- Posts: 24
- Joined: Fri Jan 26, 2007 10:25 am
- Location: San Jose, CA
-
- Senior Member
- Posts: 2085
- Joined: Wed Sep 06, 2006 8:51 am
- Location: Kaysville, UT, USA
This is cool stuff. The online scriptures also do some of this, but not to the extent you did. For example, if you use the online scriptures and search for "dwealt" (notice the misspelling), it will ask if you meant "dwelt". You can also search for "gather" and it will find related words like "gathering" and "gathered".
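For reference, matching "gather" against "gathering" and "gathered" is usually done with a stemmer. A crude suffix-stripping sketch (not the online scriptures' actual method; a real system would use something like the Porter stemmer that ships with Lucene's analyzers):

```python
def crude_stem(word):
    """Very rough suffix stripping: map inflected forms to a shared stem.
    The length guard avoids mangling short words like 'sing'."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word
```

Indexing stems instead of surface forms would let "gather" match all three variants with no extra work at query time.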
Is your code open source so we can look at it?
Tom
Hey, thanks.
Yeah, I don't do anything nearly as fancy as the scriptures.lds.org search yet. I mean, I don't handle alternate spellings, synonyms, word conjugations, or even misspellings. Those would all improve the results.
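A standard way to catch misspellings like the ones mentioned above is edit distance; a minimal dynamic-programming sketch (not part of the prototype):

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]
```

For example, "dwealt" is one edit away from "dwelt", so a distance threshold of 1 or 2 would catch that kind of typo.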
As for open source, I'll clean up the code a little and then post it either here or somewhere.
If you guys use it, I'd be interested in hearing "how it's going" and even participating if possible.
Even without cleaning it up, it's only 162 lines of code. Of course, that assumes you already have access to a full-text index of all the scriptures.
I'll post where the code is available in this thread and maybe the other.
--josh
OK, here is the source code.
It's also available here for now:
http://menkefamily.no-ip.org/MatchVerses.pys
You still need a Lucene index and PyLucene to use it.
Code:
#!/usr/bin/env python
from PyLucene import QueryParser, IndexSearcher, StandardAnalyzer, FSDirectory, Term, QueryTermExtractor
from PyLucene import VERSION, LUCENE_VERSION
import string
import optparse

"""
MatchVerses.py
Author: Josh Menke
Date created: Feb 12 2007
Last Modified: Feb 14 2007

Given a verse and a search for the verse, this will find verses that contain the
search and will rank them by how closely they match the original verse.
Meant as an "auto-cross-indexer".

This script is loosely based on the Lucene (java implementation) demo class
org.apache.lucene.demo.SearchFiles.
"""

def remove_punctuation(text):
    cleaned = ""
    for char in text:
        if not char in string.punctuation:
            cleaned += char
    return cleaned

def get_bigrams(words):
    bigram_set = set()
    for word_number in range(len(words)-1):
        bigram_set.add(words[word_number]+words[word_number+1])
    return bigram_set

def run(searcher, analyzer, verse=None, command=None):
    # output is handled differently if this was called from the console vs. interactively
    is_console = verse is not None
    if is_console:
        paragraph = "<p>"
    else:
        paragraph = ""
    while True:
        if not is_console:
            print paragraph
            verse = raw_input("Verse:")
            if verse == '':
                return
        print paragraph, "Verse:", verse
        verse_path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/./verses/"+verse+".txt"
        try:
            verse_raw = open(verse_path).read()
        except IOError:
            print paragraph, "ERROR: Verse not found. Please use full names for now"
            if is_console:
                return
            else:
                continue
        print paragraph, verse_raw
        # weight terms using API into my PyLucene-created full-text index
        term_iter = searcher.getIndexReader().terms()
        term_weight_map = {}
        while term_iter.next():
            term = term_iter.term()
            text = term.text()
            freq = term_iter.docFreq()
            term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()
        # remove punctuation
        words_in_verse = remove_punctuation(verse_raw).split()
        # only use words Lucene likes (removes stopwords)
        word_set = set()
        for word in words_in_verse:
            query = QueryParser("contents", analyzer).parse(word)
            hits = searcher.search(query)
            if hits.length():
                word_set.add(word.lower())
        # calculate highest possible score
        highest_possible = 0
        for word in word_set:
            try:
                highest_possible += term_weight_map[word]
            except KeyError:
                continue
        # add bigrams
        bigram_set = get_bigrams(words_in_verse)
        highest_possible += len(bigram_set)
        if not is_console:
            print "Hit enter with no input to quit."
            command = raw_input("Query:")
            if command == '':
                return
            print
        if is_console:
            print "\n"
        print paragraph, "Searching for:", command
        query = QueryParser("contents", analyzer).parse(command)
        hits = searcher.search(query)
        print paragraph, "%s total matching documents." % hits.length()
        matching_docs = []
        for doc in hits:
            path = doc.get("path")
            path = "/home/josh/code/statlink/lds-scriptures-2_5_0-csv/"+path
            if path == verse_path:
                continue
            raw_verse = open(path).read()
            # remove punctuation
            verse_words = remove_punctuation(raw_verse).split()
            # score match
            matches = word_set.intersection(set(verse_words))
            match_score = 0
            for match in matches:
                try:
                    match_score += term_weight_map[match.lower()]
                except KeyError:
                    continue
            # add bigram bonus
            match_bigram_set = get_bigrams(verse_words)
            match_score += len(match_bigram_set.intersection(bigram_set))
            # store match
            matching_docs.append((match_score, doc.get("name"), raw_verse))
        matching_docs.sort()
        if is_console:
            matching_docs.reverse()
            matching_docs = matching_docs[0:10]
        else:
            matching_docs = matching_docs[-10:]
        if is_console:
            print "\n", "\n"
        for doc in matching_docs:
            print paragraph, doc[1][0:doc[1].find(".")], "Score:", round(float(doc[0])/highest_possible, 2)
            print paragraph, doc[2]
        if is_console:
            print "\n", "\n"
        if is_console:
            return

if __name__ == '__main__':
    usage = "usage: %prog [options] [\"verse\" \"word or phrase\"]"
    parser = optparse.OptionParser(usage=usage)
    parser.add_option("-d", "--directory", dest="directory", help="directory with index", default="index")
    (options, args) = parser.parse_args()
    if len(args) not in (0, 2):
        parser.error("Incorrect number of arguments")
    STORE_DIR = options.directory
    directory = FSDirectory.getDirectory(STORE_DIR, False)
    searcher = IndexSearcher(directory)
    analyzer = StandardAnalyzer()
    if len(args) == 2:
        run(searcher, analyzer, args[0], args[1])
    else:
        run(searcher, analyzer)
    searcher.close()
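To see the two helper functions above in isolation, here is a compact restatement (same logic, no Lucene index needed), showing how the bigram bonus rewards word *order* rather than just shared vocabulary:

```python
import string

def remove_punctuation(text):
    # Drop every ASCII punctuation character, keep everything else.
    return "".join(ch for ch in text if ch not in string.punctuation)

def get_bigrams(words):
    # Concatenate each adjacent word pair into one token.
    return set(words[i] + words[i + 1] for i in range(len(words) - 1))

# Two verses sharing the consecutive run "and do the" share two bigrams.
a = remove_punctuation("go and do the things").lower().split()
b = remove_punctuation("and do the work").lower().split()
shared = get_bigrams(a) & get_bigrams(b)   # {'anddo', 'dothe'}
```

Each shared bigram adds a flat 1.0 to the match score in the script above, which is why verses quoting a phrase verbatim outrank verses that merely reuse the same words.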
--josh
FYI, I slightly modified the term weights. I changed:
Code:
term_weight_map[text] = 1-freq*1.0/searcher.maxDoc()
to
Code:
term_weight_map[text] = 1-freq*4.0/searcher.maxDoc()
The most common word (outside the normal English stop words) is "unto", which appears in over 11,000 verses. By quadrupling the commonality penalty, "unto" becomes worth almost 0.
--josh