SuperPAF/HyperPAF Idea

#31

Just to put in a plug for FamilySearch Indexing: it is a great way to relax for 10 minutes a day. You do not need to complete an entire packet in one day, but you can easily get 10-20 names done in 10 minutes.

You just sign up to be a volunteer. It takes about 2-3 days before you get your username and password but after that, 10 mins a day gets a LOT done.

huffkw · #32

Future Features (that are not movies)

Garys Turn:
“It is my understanding that when you log onto NewFamilySearch with your church password you will see yourself and your personal Church records and all your ancestors who were members while living and their personal church records all tied into the Family History pedigrees. . . . . .

It is my understanding that links to original documents and multimedia items will already be included in NewFamilySearch and many automatically linked as the images are indexed and brought online.”

I hope your description of the data-linking features of the NewFamilySearch is correct. You have identified a need and an expectation that I suspect and hope that many people have. We really ought to have that feature, it seems to me.

I have been suggesting something similar. I expect that it would be far simpler and more efficient for all concerned, and save some mind-boggling (and error-prone?) programming work, if the personalized data you describe were actually kept in a separate database structure, with links to other databases as needed.

As you suggest, the personalized portion needs to be able to select and consolidate data from the many separate NewFamilySearch databases, as well as from the even newer, and eventually much larger, Granite Mountain digitization project.

If the personal data feature is embedded in the NewFamilySearch system, it will probably tend to be limited to that data (50 million unique names in 1 billion records?). I expect that design choice would also make it many times harder to include the much larger (and less plowed) Granite Mountain material (1 billion unique names in 20 billion name entries?) as part of the personalized data, and so it might not happen at all. That would be a great loss.

The extra storage space needed would still be very tiny, certainly less than 1% of the current and planned size of all databases.

mikael.ronstrom · #33

huffkw wrote:mikron:
I like your more detailed link idea. I had not thought any more about it than just giving a line number as one might do with a book, page, and line. But it would be pretty cool to have a document image page come up with a highlighted rectangle around the data of interest. How would you describe the rectangle size and location in some generic way?

My thinking is that it requires the following data:
1) Page ID
2) X and Y position of upper left corner and length in X and Y directions.

Given this information it should be feasible to index on Page ID such that when you click
on a certain X and Y position on the page it should be possible to quickly deliver all link objects
covering that position.
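
As a rough sketch of the idea above (not an actual FamilySearch structure), a link object could carry the Page ID plus the rectangle, with the index keyed on Page ID. The names LinkObject, LinkIndex, links_at and the target field are made up for illustration, and the linear scan per page is a simplification; a real system would likely want a spatial index.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkObject:
    """A rectangle on a scanned page (hypothetical structure)."""
    page_id: str
    x: float       # X position of the upper left corner
    y: float       # Y position of the upper left corner
    width: float   # length in the X direction
    height: float  # length in the Y direction
    target: str    # what the link refers to (assumed field)

    def covers(self, px: float, py: float) -> bool:
        """True if the point (px, py) lies inside this rectangle."""
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


class LinkIndex:
    """Index of link objects keyed on Page ID."""

    def __init__(self):
        self._by_page = defaultdict(list)

    def add(self, link: LinkObject) -> None:
        self._by_page[link.page_id].append(link)

    def links_at(self, page_id: str, px: float, py: float) -> list:
        """Return all link objects on the page that cover the clicked point."""
        return [link for link in self._by_page.get(page_id, [])
                if link.covers(px, py)]


index = LinkIndex()
index.add(LinkObject("p1", 10, 20, 100, 15, "transcription:1"))
index.add(LinkObject("p1", 50, 25, 30, 30, "tree:42"))
index.add(LinkObject("p2", 10, 20, 100, 15, "transcription:2"))
hits = index.links_at("p1", 60, 30)
```

Here hits would hold both rectangles on page p1, since the clicked point falls inside each of them.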

I'm assuming that one should have quality ratings of links as well as what the links refer to. Thus after
clicking on a link one gets a pop-up window which provides links to:
1) Transcriptions of the page part
2) Links to family trees referring to the position
3) Links to other Link Objects (e.g when there are several recordings of the same basic information,
e.g. in Sweden we have church records of births and for the period 1860 and forward we have
transcriptions of these books that have also been microfilmed).

----------------------------
I read your 2001 FHT paper. I think it is great, especially your ability to estimate the number of servers needed to process the six object classes you describe: Application Objects, Catalog Objects, Link Objects, Scanned Objects, Translation Objects, and Name Translation Objects. I have wondered if it would be feasible to load the entire set of Application Objects and Link Objects (and perhaps other small objects) into server main memory, and it appears it is. That should deliver an enormous performance boost.

Yep, it's definitely getting there, I'm currently at a benchmark center and they have clusters with
2.4 TBytes of memory so it's definitely feasible to put all the family trees and transcriptions and
indexing information in memory.

On your link object type that could specify which part of an image page is referenced, I wonder if that would need to be one step more generalized by using percentages of distances between the four corners of the image, since there could be many sizes and compressions used in moving the image around and displaying it, and any absolute size or pixel count could be inappropriate in some cases.

I guess one has to specify the position in relative terms in percent of page size rather than in pixels.
Then the graphical software that translates the mouse clicks into a database query has to translate
the pixel position into a relative position used by the database.
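
A minimal sketch of that translation step, assuming the display software knows the size at which the page is currently drawn. The function names and the 0-100 percent scale are my own choices for illustration.

```python
def pixel_to_relative(px, py, displayed_width, displayed_height):
    """Convert a mouse click in pixels to percent of the displayed page size."""
    if displayed_width <= 0 or displayed_height <= 0:
        raise ValueError("displayed size must be positive")
    return (100.0 * px / displayed_width, 100.0 * py / displayed_height)


def relative_to_pixel(rx, ry, displayed_width, displayed_height):
    """Convert a stored relative position back to pixels for one rendering."""
    return (round(rx / 100.0 * displayed_width),
            round(ry / 100.0 * displayed_height))


# The same spot on the page, clicked in two renderings of different size,
# maps to the same relative position for the database query.
small = pixel_to_relative(200, 150, 800, 600)
large = pixel_to_relative(400, 300, 1600, 1200)
```

Only the relative position would be stored with the link object, so the image can be rescaled or recompressed without touching the database.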

I see most of my ideas as relating to yours by specifying a further set of rules for the formation and interaction of objects, mostly the Application Objects and Link Objects. I am happy to leave the underlying industrial strength memory and storage systems to somebody else.
---------------------------
Your telecom paper sounds like slightly familiar territory.
I should mention that I worked for a time (among a staff of 900 programmers) on the gigantic Verizon telecom Event Processing System covering most of the US east coast. We were testing for about 200 million transactions a day in 1999, but I assume it has gotten much bigger since then.
I have read only the first chapter of your paper, but I was fascinated by the idea of using a second main memory as a backup or commit space, plus all the bottleneck studies and solutions.
---------------------------
The Google video mentioned your major PhD work adopted for MySQL clusters. That was pretty interesting. It makes me wonder whether the Church guys have used any of your open source work on very large data storage systems. Seems like it would be a good thing for them to check out, if they haven’t already.

I'll be presenting some of this at the Family History Technology Workshop at BYU two weeks from now,
so it will be interesting to also hear where the church is on this right now.


mikael.ronstrom · #34

WaddleDee wrote:I heard talk that the records stored in the granite vaults are being recorded electronically. Do you think that PAF could connect to that database, so that when you enter someone PAF recognizes, it brings up a tree that already exists and asks if it is your family? If it is, then it imports it.

With this idea, making one big tree connecting all of us to Adam could be a little easier.

I think what we want is something pretty similar to source code version systems. There is the
major copy which is central and then each person has his own tree which contains parts
downloaded from the central database but also parts he is working on.

I think it is also important to be able to insert research ideas, so if a person has been identified
as a possible father to a person it should be possible to insert him into the database but where
the researcher can specify that the result is so far uncertain. Given indexing to real images
it should be possible for other researchers to assist by providing more solid evidence for
the family tree connections.
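
To make the uncertain-link idea a bit more concrete, here is a small sketch of a personal working tree in which a proposed father starts out as uncertain and can be confirmed once evidence is attached. All names (WorkingTree, ParentLink, Certainty) are hypothetical, and the rule that confirmation requires at least one piece of evidence is my own assumption, not something any existing system is known to enforce.

```python
from dataclasses import dataclass, field
from enum import Enum


class Certainty(Enum):
    CONFIRMED = "confirmed"
    UNCERTAIN = "uncertain"


@dataclass
class ParentLink:
    child_id: str
    parent_id: str
    certainty: Certainty
    # References to indexed images or other sources (assumed to be strings).
    evidence: list = field(default_factory=list)


@dataclass
class WorkingTree:
    """One researcher's own tree, similar to a working copy in version control."""
    links: list = field(default_factory=list)

    def propose_parent(self, child_id: str, parent_id: str) -> ParentLink:
        """Insert a research idea: a possible parent, marked uncertain."""
        link = ParentLink(child_id, parent_id, Certainty.UNCERTAIN)
        self.links.append(link)
        return link

    def add_evidence(self, link: ParentLink, source_ref: str) -> None:
        link.evidence.append(source_ref)

    def confirm(self, link: ParentLink) -> None:
        # Assumption: a link needs at least one source before it is confirmed.
        if not link.evidence:
            raise ValueError("cannot confirm a link without evidence")
        link.certainty = Certainty.CONFIRMED

    def uncertain_links(self) -> list:
        """Links that other researchers could help confirm or refute."""
        return [l for l in self.links if l.certainty is Certainty.UNCERTAIN]


tree = WorkingTree()
idea = tree.propose_parent("child-1", "father-candidate-7")
```

Merging such a working tree back into the central copy is the hard part and is left out here.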

JamesAnderson · #35

I see where you may be going with this.

So you are saying that when we have a tree up, and we know certain sources, and those sources have been indexed and scanned and are thus online, that we should be able to link to them?

If so that would be an extraordinary thing, to see exactly where the data in the tree was found and so forth.

Eventually, there will be other sources, including material the person who has the data has found, such as a picture of a headstone, and I think there ought to be a way one could attach things like that to the record to indicate a source record for a burial (the headstone photo), for example. Maybe not initially, but down the road hopefully sooner than later.

One question was asked about electronic images. They are taking digital pictures now: 100 cameras are all-digital, and 15 megapixel cameras are being used. In virtually every case no retakes are needed (compared to the problems that routinely crop up when doing microfilm shoots). A DVD or two is burned onsite and given to the party or archive owner at the location where the images were just shot, and the images are sent off via the net to the vault, where they are stored for later use at what to me is presently an undetermined point in the rollout of new FamilySearch.

jaocon · #36

Long thread so I didn't read all of it; I hope I can add my two cents in adequately and that it doesn't duplicate too much. I come at this topic using the following logic from my perspective as an amateur genealogist: I exist therefore my ancestors exist! My job is to prove they existed to the best of my ability and to provide source documentation. Now I may not be able to find source documentation - it may not exist - but how am I to know unless I have access to all the databases of all source records?

Although perfection may not be obtained, I think a "smart database" is definitely something that should be considered. Let me explain my non-technical view of it:

1. Digitizing a database and allowing it to be searched is great, BUT what if the database searched itself and linked data that was "similar" - uniquely identifies individuals and links the digital source record/image to that individual - that would be even better!

EX: Each database type, e.g. Census, is digitized and then grouped into Surname sub-categories. Since the Census is taken sequentially every 10 years, THEORETICALLY one could start at ANY Census year and begin linking individuals in previous/subsequent Census years. Of course I realize the Census is an imperfect and incomplete database: not everyone was counted, the Census taker recorded information incorrectly, and people didn't always go by the same name or give the same birthdate. HOWEVER, a "smart database" can check for inconsistencies and cross-check against the family group.

2. So if a "smart database" works on itself and also takes end-user data and compares the two, an accurate database MAY result, with the end goal that a user inputs one ancestor/family group and finds not only their ancestors but their descendants as well. So not only does the database search for and find individuals, but it links family groups generationally and displays original source images.

I believe the Census database would be a perfect test case/base for implementation of such a project because of its sequential nature. Once the algorithm was perfected, other database types could be added. Of course, a perfected Census database would depend on linking maiden names, requiring birth/marriage/death database "smartness" as well.

Wouldn't this result be great? The end-user inputs one individual/family group and ends up with not only the lineage-linked information they may be searching for but the original source documents!

russellhltn · #37

JAOCON wrote:"smart database"

I think my biggest fear would be the smart database would make some errors and match people who are similar but are not really the same person together. You could see that happening with common names. Then people would accept it as true without looking at the connections in depth.

#38

An automated system that would help find and link families will/would be great. I have wondered about a system that could use birth, marriage, and death information to (semi-) automatically build and link a pedigree. Census and other records could certainly figure into something like that.

However, I suspect it will be difficult, especially with census records, to automatically deal with the large number of inconsistencies in the raw data. I hope those working on such systems don't underestimate the level of inconsistencies in the data, as I think it would be very easy to do that.

For example, my wife found what appears quite strongly to be the same family in two adjacent decades (1920 and 1930, if I remember correctly). The earlier has a female child Harriet of age 3. The later one has a male child named Harry of age 13. She guesses Harriet used "Harry" as a nickname and the census taker got the gender wrong. She is exploring other resources to try to confirm (or refute) that guess.
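
Here is a small sketch of how a matcher might treat a case like that: rather than linking automatically, it reports how similar the names are and lists every discrepancy so a person can review it. The CensusEntry fields, the compare function, and the two-year age tolerance are all assumptions for illustration, and the years are the ones recalled above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class CensusEntry:
    year: int
    given_name: str
    age: int
    sex: str  # "M" or "F"


def compare(earlier: CensusEntry, later: CensusEntry, age_tolerance: int = 2):
    """Return (name similarity, list of discrepancies) for two census entries.

    A non-empty discrepancy list means the pair should go to a human
    reviewer instead of being linked automatically.
    """
    a = earlier.given_name.lower()
    b = later.given_name.lower()
    name_similarity = SequenceMatcher(None, a, b).ratio()

    discrepancies = []
    if a != b:
        discrepancies.append("given name differs")
    # Age in the later census should have advanced by the gap between censuses.
    expected_age = earlier.age + (later.year - earlier.year)
    if abs(expected_age - later.age) > age_tolerance:
        discrepancies.append("age inconsistent")
    if earlier.sex != later.sex:
        discrepancies.append("sex differs")
    return name_similarity, discrepancies


harriet = CensusEntry(1920, "Harriet", 3, "F")
harry = CensusEntry(1930, "Harry", 13, "M")
similarity, issues = compare(harriet, harry)
```

For this pair the ages line up, but the name and sex are flagged, which is the kind of case where a researcher still has to go and look at other records.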

garysturn · #39

newFamilySearch (NFS) is a smart database in some respects. It searches the databases in FamilySearch and offers the records it finds as possible links. All someone has to do is look at the possible links and decide if the records NFS has found should be linked. Check out some of the Screenshots that are linked to on this site. As more databases are added to NFS they will also be searched for possible links. This will include census and other records.
