new FamilySearch merging

garysturn · Post by **garysturn** » Sun Mar 18, 2007 4:33 pm

rmrichesjr wrote: I recall one of the training overviews saying that work-in-progress data should be put in the database, and that the whole database is or should be considered a work in progress. (My comment: After all, it's not going to be complete and flawless until probably somewhere close to a thousand years from now.)

This whole concept is different from what we have now. It will save a lot of effort. In the old system, people were downloading their line from Ancestral File, doing some research, making a few changes, and then uploading the entire file back to Pedigree Resource File. As a result, Pedigree Resource File contains a lot of duplicate information. If someone resubmits their files every few months, the PRF gets big and the new info becomes hard to find. But it was the only way we had of sharing our work with others. This new system will be so much better, with everyone working together on the same file. And there are controls in place so that information is not lost or changed by someone who did not submit it.

iknovate · Post by **iknovate** » Tue Mar 20, 2007 9:27 pm

I wasn't sure that I agreed with the 'public' concerns for posting our family files, but there is a very relevant point made in the post. There are 'states' to our research. When managing our own files, we are likely to have very cryptic notes. Has the design considered the various 'states' of an entry?

I'm also curious as to how duplicate entries are going to be handled at the variable level. Will there be a 'collaborative' element to resolve conflicts? If the collaboration identifies a need for a 'merge' of entries, how will this be facilitated?

Indeed, is there any way to share the design scenarios for review/comment?

The answers are in the questions...I'd like to see what questions were considered.

garysturn · Post by **garysturn** » Tue Mar 20, 2007 10:08 pm

A lot of these questions have been answered in this and other forums. Here is a link to a web site that links to many of the other sites discussing newFamilySearch, including blogs by beta testers, forums, and other resources: http://newfamilyhistory.googlepages.com/home

blackrg · Post by **blackrg** » Wed Mar 21, 2007 12:09 pm

Well since it's all "one tree" we're working on, and since we seem to just be submitting conflicting info about individuals if we differ, at what point does something constitute a different person vs corrections to a person's record? If the name is listed as Jack Frost, and I feel the name is Jack Frostland, and I have different event dates that differ by only a few days, does that constitute a different person? What if I show a branch with an entirely different set of parents?

How is temple work handled under this? Does someone get to do the work for Jack Frost and someone else for Jack Frostland? Or is Jack Frost's work considered valid for Jack Frostland?

Can temple work be reserved for this? I submit info on my grandfather and want to do his work - can I reserve it or can anyone do it? If I reserve it and fall over dead without doing the work, and the system is never notified, is there a mechanism to "unlock" it so someone else can do it? How about current work that is marked in progress (and has been that way for years) but is not done that is introduced into the system - will we have problems printing cards to do that work?

garysturn · Post by **garysturn** » Wed Mar 21, 2007 12:47 pm

The program will link the records it determines may be the same person, but a person still has to decide whether they really are the same person or different people. The system shows possible matches, and patrons decide whether they are to be combined.

If two submissions are combined, they can be uncombined if someone determines later that it is a different individual.

When you contribute names to the system, it gives you the option to save the names for yourself to do the temple work. You can then do the work yourself or release the names to other family or friends to do. They are working on a method for claiming submissions from people who have died. They are also working on how to handle names that were submitted and locked for work when nothing has been done for a specific period of time. I am not aware of what the solution is, but I have heard that they know a solution is needed and are working on a procedure.

BrainClay · Post by **BrainClay** » Tue Apr 03, 2007 1:09 pm

GarysTurn wrote: The only files that I know of that can be merged automatically in PAF are identical records with the same Ancestral File Number (AFN); if you use the merge by AFN option, PAF can auto-merge those records. When you download someone else's entire database, you can compare that file with yours side by side, record by record, and select the information you want to move to your record using "PAF Insight". This program is available to use at most FHCs and can be purchased online. Do a Google search for "PAF Insight" for more info.

This has probably already been discussed, but I've always wondered how hard it would be to create a more intelligent merging program. Say that Mary Jones, born in 1901 in Geneva, NY, married John Smith, born in 1895 in Topeka, Kansas, and had several children, one of whom was Stephen Smith, born in 1920 in Buffalo, NY. If multiple people upload this family, why couldn't a merging program figure out that these are the same people? The argument I've heard is 'there are hundreds of thousands of Mary Jones, etc.' Sure, but how many were born in 1901 in Geneva, NY, married John Smith, born in 1895 in Topeka, Kansas, and had a son named Stephen, born in 1920 in Buffalo, NY?

I would put the odds of two different Mary Jones matching all that at 1 in a million or less. I think you could have an accurate match, even if you didn't have all the dates and places, especially if you had more children's names.
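A rough sketch of how such a score could be computed, in Python. Everything here is an assumption made for illustration: the `Person` and `Family` records, the one-point-per-agreeing-field weighting, and the function names are hypothetical, and this is not how PAF or the new FamilySearch actually scores matches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Person:
    # Hypothetical minimal record; real genealogy records carry far more.
    name: str
    birth_year: Optional[int] = None
    birth_place: Optional[str] = None


@dataclass
class Family:
    wife: Person
    husband: Person
    children: List[Person] = field(default_factory=list)


def same_person_evidence(a: Person, b: Person) -> int:
    """One point per agreeing field; a name mismatch scores 0."""
    if a.name.lower() != b.name.lower():
        return 0
    points = 1
    # A missing date or place adds no points, but does not rule the pair out.
    if a.birth_year is not None and a.birth_year == b.birth_year:
        points += 1
    if a.birth_place is not None and a.birth_place == b.birth_place:
        points += 1
    return points


def family_match_score(f1: Family, f2: Family) -> int:
    """Add up the evidence for the couple plus each child's best counterpart."""
    score = same_person_evidence(f1.wife, f2.wife)
    score += same_person_evidence(f1.husband, f2.husband)
    for child in f1.children:
        score += max(
            (same_person_evidence(child, other) for other in f2.children),
            default=0,
        )
    return score
```

The point of scoring the whole family rather than one person is the one made above: a lone "Mary Jones" earns a single point, while a family that agrees on the couple and a child earns several, even if one submission is missing a place or a date.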

A second element of an intelligent match, even without the above, is that it should figure out that if I merged child A1 in Family A with child B1 in Family B, then it's a good guess that I can merge everyone in Family A with their matches in Family B, assuming they have the same first names and birth dates. In other words, once I identify a match, it could go on its merry way. Multiple generations could be merged very quickly with this approach, simply by making one match, as parents would be children in the next family, and so on.
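A minimal sketch of that second idea, assuming each family is just a list of hypothetical `Member` records (the record layout, the IDs, and the `propose_family_merges` name are invented for illustration). It only proposes pairs for one family; going up a generation would mean repeating the step with the parents treated as children in their own families.

```python
from typing import Dict, List, NamedTuple, Tuple


class Member(NamedTuple):
    # Hypothetical record: an ID plus the two fields the idea relies on.
    record_id: str
    first_name: str
    birth_year: int


def propose_family_merges(
    family_a: List[Member],
    family_b: List[Member],
    confirmed: Tuple[str, str],
) -> List[Tuple[str, str]]:
    """Given one human-confirmed pair, propose pairs for the other members."""
    a_id, b_id = confirmed
    if a_id not in {m.record_id for m in family_a}:
        raise ValueError("confirmed record is not in family A")
    if b_id not in {m.record_id for m in family_b}:
        raise ValueError("confirmed record is not in family B")

    # Index family B by (first name, birth year) for quick lookup.
    index: Dict[Tuple[str, int], List[Member]] = {}
    for m in family_b:
        if m.record_id != b_id:
            index.setdefault((m.first_name.lower(), m.birth_year), []).append(m)

    proposals = []
    for m in family_a:
        if m.record_id == a_id:
            continue
        candidates = index.get((m.first_name.lower(), m.birth_year), [])
        # Only propose when exactly one counterpart fits; skip ambiguous cases.
        if len(candidates) == 1:
            proposals.append((m.record_id, candidates[0].record_id))
    return proposals
```

Members whose first name or birth year disagree are simply left out of the proposals, so they fall back to a manual decision.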

Again, this sounds easy on paper, but how hard is that? I would imagine it taking a fair amount of CPU time, and probably couldn't be done in real time. The first suggestion could be like a web crawler, constantly looking for possible merges, and the second one would be a batch process that runs asynchronously.

Even better - the program that looks for possible matches could simply flag the matches to the two tree owners as possible matches, and let them decide.

mkmurray · Post by **mkmurray** » Tue Apr 03, 2007 2:14 pm

bcpalmer60 wrote:Sure, but how many were born in 1901 in Geneva, NY, married John Smith, born in 1895 in Topeka, Kansas, and had a son named Stephen, born in 1920 in Buffalo, NY?

I would put the odds of two different Mary Jones matching all that at 1 in a million or less.

I don't know how many people have lived on this Earth total from the beginning of time, but we can at least say 6 billion because that is the current number right now. So using the 6 billion number, 1 in a million odds would mean that you are banking on no more than 6,000 Mary Jones that meet the criteria you specified. Even that is overly safe; maybe 1 in 6 billion odds is better!
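The arithmetic above, written out as a small Python sketch. It assumes (very loosely) that every person independently has the same chance of fitting the description, which real populations certainly don't; the function name is made up for illustration.

```python
def expected_matches(population: int, odds_one_in: int) -> float:
    """Expected number of people fitting a description with 1-in-N odds."""
    return population / odds_one_in


population = 6_000_000_000  # the rough figure used in this post

# 1-in-a-million odds still leaves thousands of expected matches.
print(expected_matches(population, 1_000_000))      # 6000.0

# Odds of 1 in 6 billion bring the expectation down to a single person.
print(expected_matches(population, 6_000_000_000))  # 1.0
```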

garysturn · Post by **garysturn** » Tue Apr 03, 2007 2:29 pm

bcpalmer60 wrote: This has probably already been discussed, but I've always wondered how hard it would be to create a more intelligent merging program. Say that Mary Jones, born in 1901 in Geneva, NY, married John Smith, born in 1895 in Topeka, Kansas, and had several children, one of whom was Stephen Smith, born in 1920 in Buffalo, NY. If multiple people upload this family, why couldn't a merging program figure out that these are the same people? The argument I've heard is 'there are hundreds of thousands of Mary Jones, etc.' Sure, but how many were born in 1901 in Geneva, NY, married John Smith, born in 1895 in Topeka, Kansas, and had a son named Stephen, born in 1920 in Buffalo, NY?

I would put the odds of two different Mary Jones matching all that at 1 in a million or less. I think you could have an accurate match, even if you didn't have all the dates and places, especially if you had more children's names.

A second element of an intelligent match, even without the above, is that it should figure out that if I merged child A1 in Family A with child B1 in Family B, then it's a good guess that I can merge everyone in Family A with their matches in Family B, assuming they have the same first names and birth dates. In other words, once I identify a match, it could go on its merry way. Multiple generations could be merged very quickly with this approach, simply by making one match, as parents would be children in the next family, and so on.

Again, this sounds easy on paper, but how hard is that? I would imagine it taking a fair amount of CPU time, and probably couldn't be done in real time. The first suggestion could be like a web crawler, constantly looking for possible merges, and the second one would be a batch process that runs asynchronously.

Even better - the program that looks for possible matches could simply flag the matches to the two tree owners as possible matches, and let them decide.

A lot of the software out there does take into consideration the possible matches you mention. That is how match and merge programs work now. The final decision is still left up to a person. What you describe as a web crawler type program constantly searching for matches is more or less what will take place in newFamilySearch. newFamilySearch will link possible matches for you, and you will decide if they are real matches. With the structure of newFamilySearch, records are not truly merged; their information is combined, so that nothing is lost. Merging will not create as many problems as it does with current software, because things that are combined in error can be un-combined.
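To make the difference between merging and combining concrete, here is a minimal Python sketch. It is only an illustration of the idea: the `Submission` and `CombinedPerson` classes and their fields are hypothetical and say nothing about how newFamilySearch stores its data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Submission:
    # Hypothetical: one contributor's version of a person, never overwritten.
    submission_id: str
    contributor: str
    name: str
    birth_year: int


@dataclass
class CombinedPerson:
    """A person shown as the set of submissions believed to describe them."""
    submissions: List[Submission] = field(default_factory=list)

    def names(self) -> List[str]:
        # Conflicting values are all kept side by side, not collapsed into one.
        return sorted({s.name for s in self.submissions})

    def uncombine(self, submission_id: str) -> "CombinedPerson":
        """Pull one submission back out as a separate person."""
        for i, s in enumerate(self.submissions):
            if s.submission_id == submission_id:
                return CombinedPerson([self.submissions.pop(i)])
        raise KeyError(submission_id)


def combine(a: CombinedPerson, b: CombinedPerson) -> CombinedPerson:
    """Combining only groups submissions together; nothing is discarded."""
    return CombinedPerson(a.submissions + b.submissions)
```

Because each submission survives intact inside the combined record, undoing a mistake is a matter of taking that submission back out, instead of trying to reconstruct data that a true merge would have thrown away.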

rmrichesjr · Post by **rmrichesjr** » Tue Apr 03, 2007 2:45 pm

bcpalmer60 wrote:Again, this sounds easy on paper, but how hard is that? I would imagine it taking a fair amount of CPU time, and probably couldn't be done in real time. The first suggestion could be like a web crawler, constantly looking for possible merges, and the second one would be a batch process that runs asynchronously.

As I understand it, and as the current beta of the new FamilySearch is set up, the idea is to have human users manually decide which records to match. It sounds like the intent is to allow inspiration to guide the process.

The current beta system has some sort of heuristic that suggests whether the people are probable matches or not. Most of the time it works fairly well, but I have seen it make some decisions that ranged from somewhat to very obviously outlandish. I have reported some of these using the feedback system.

Have you seen the postings in the forum about the API that is planned to be available once the new FamilySearch goes live in production? The plan, as I have read and understood it, is for the API to support multiple desktop record managers. If we're lucky, the API will include the capability for a client system to make the type of merging decisions you wrote about. Maybe somebody will get the chance to try out your idea. Are you a programmer?

Having done a fair amount of merging and unmerging in the current beta system, I would suggest that any automated merging should be _very_ conservative and only merge automatically on the basis of overwhelming statistical likelihood. While there is a lot of merging to do (a few dozen records per real person in some cases), it is _massively_ more difficult to unmerge than to merge. In some cases, just wrapping your head around what needs to be unmerged and where is daunting.
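As a sketch of what "very conservative" could look like in a client program (Python; the thresholds and the `merge_decision` function are hypothetical numbers and names picked for illustration, not anything from the beta system):

```python
def merge_decision(
    probability: float,
    auto_threshold: float = 0.999,
    suggest_threshold: float = 0.5,
) -> str:
    """Turn an estimated match probability into one of three actions."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= auto_threshold:
        # Only overwhelming likelihood is merged without a human.
        return "auto-merge"
    if probability >= suggest_threshold:
        # Plausible matches are shown to the user, who decides.
        return "suggest to user"
    return "leave separate"
```

Setting the automatic threshold that high means most of the work stays with the user, which seems like the right trade given how much harder unmerging is than merging.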
