In-common-with, shared matches, and clusterings

There are a few tools out there that either these DNA testing companies will provide to help distinguish our matches from each other.  They are known as in-common-with (icw) or shared matches.  The idea is that a group of DNA matches on your match list who match each other indicates a common ancestor.  

Figuring out a paternal DNA match from a maternal match may or may not be as challenging for some, depending on how well of a tree you have.  It might be difficult to know if a DNA match is on your paternal grandfather vs. paternal grandmother’s side, or from a maternal grandfather vs. a maternal grandmother’s side.  Or even going back further, figuring out that a DNA match is on your maternal grandmother’s father’s or mother’s side, or that grandparent’s maternal grandfather vs. their maternal grandmother’s side.  That would also depend on how well your tree is built out, and the same would apply for your DNA matches.

This is where the shared matches or in-common-with features could help.  For Polynesians, because we match each other to some extent due to endogamy (just as other endogamous populations will experience this), it can be confusing, misleading and really not useful.

Clustering

Visually, there are a few tools to help make it easier for you to distinguish.  Clustering (auto-clustering) is another tool, something that MyHeritage offers or you could use a third-party site such as GeneticAffairs.com to visually show you groups of matches.

Here I show a few of my 1st cousins who have DNA tested, both on my father’s and mother’s side.

My paternal 1st cousins are represented in the green.  My maternal 1st cousins are in red.  Then there are my 2nd cousins on my maternal grandfather’s side represented by the orange.  Going further back on my grandfather’s side, specifically to his mother’s side I have two 2nd cousins once removed who have tested, they’re in blue.  Then on my grandfather’s paternal side, other distant cousins, they are in lavender.

A closer look at this shows how on my father’s side (green) my 1st cousins will match each other, defined by a line.  Since we are all 1st cousins to each other, cousin 1 will match cousins 2, 3, 4, 5, 6 & 7, plus me of course as these are my DNA matches.  Cousin 2 will match 1 (as already mentioned) plus 3, 4, 5, 6 & 7.  The same for 3, 4, and so forth.  

For my mother’s side, I started off with the color red, my grandparents’ grandchildren. We all match each other.  Then going to my 2nd cousins (orange), they come from two different sisters of my grandfather Joseph.  So they all match each other, plus match me and my 1st cousins.  Then going back further on Joseph’s mother’s side (blue), they match each other plus my 2nd cousins plus my 1st cousins as we are descended from my grandfather’s mother Elena’s ancestors. Then finally my grandfather Joseph’s father’s side (lavender).  So while those cousins will match my 2nd cousins and my 1st cousins, they will not match my grandfather Joseph’s mother’s side.  That is the basic concept of how this will visually work.

With endogamy, or with Polynesian matches, that same cluster would basically have all the dots connecting each other.  So imagine my grandfather’s father’s side (lavender) matching my grandfather’s mother’s side (blue).   See the grey lines connecting the two sides.

 

Example of every dot connecting to each other – what you would expect to see with endogamous matches.

In reality, that is what we will see because of how we all match each other.

Gephi

I finally took the time to try to use a network analysis software called Gephi to demonstrate what this interconnected group of DNA matches could look like.  Previously I used a website’s tools.  That website is RootsFinder.com, and used their Triangulation tool that produced nearly identical results as Gephi.  But for now, just demonstrating what Gephi has to offer.

This diagram consists of 196 nodes (dots) and 9,494 edges (lines).  To get that, I had to import a csv (spreadsheet) file, the icw file which has 9,494 lines of names into Gephi.

As I said earlier, while these clustering tools do not work due to the fact that we connect to each other and usually at a very high amount of shared DNA, I was able to extract some information from it.  I probably could have extracted and gathered all of this data manually but taking it directly from a spreadsheet is not as easy as it is just data that are organized by columns, rows, and/or categories.  This is why these tools are available in order to provide a more visual way of interpreting your matches.

What I did gather from this and thought was interesting was that the longest segment size showed 12cM for 24% of my matches.  I noticed this years ago that the size of the longest segment, largest segment, or longest block (depending on the DNA testing company) for many of these predicted 2nd – 3rd cousins would be between 12cM – 14cM.  Rarely would it go over 20cM.  In my previous blog entries, I mentioned the importance of the longest segment size in determining a true 2nd – 3rd cousin.

Looking at that same data, we see that only a single DNA match has the longest segment size of 64cM.  That DNA match is actually my 2C2R (2nd cousin twice removed). 

This next image is the same data except now it’s showing the number of shared segments.  Prior to  Ancestry providing us the longest segment size, we only had to go by the amount of total shared DNA and the number of segments.  So the top (28% of my matches) shows 28 segments.  They seem to range between 25 – 29 for the most part.

An important thing to notice about this particular data, unlike other people who could actually produce nice clusters, is that when I ran this icw file that took about 4hrs to do, I had to limit the amount of shared cM (centimorgans).  This particular diagram in which the icw file I finished running last night range from 185cM – 199cM.  Yet I had 98 matches that fell into that range.

Prior to this particular icw file, I ran one back in May 2022 where I went as low as 90cM.  So it is 90cM – 190cM.  This was the result of that older icw file.

Looking at the data, 13 segments seems to be at the top making up about 14% of these matches.  That particular file had 1,215 matches, which the icw file produced 2,049 nodes and 1,046,502 edges.  That is a lot of dots and lines.

A few people had suggested using Gephi as I could tweak the data. I have been tweaking it for about a week, and as I knew I would not be able to get anything unique from it.  

The problem with this, something that any endogamous group would encounter is running the icw file.  Imagine having only 10 DNA matches.  But for an endogamous person where you could match nearly all the other people even if you are not really closely related at all, that could be easily multiplied.  So match #1 would match all of the other 9 matches on that list.  Match #2 would have about the same matching all 9 other matches on that list.  And the same for match #3, match #4, etc.  So that icw file gets larger and larger.  Now complicate that issue as the less amount of DNA you share, you probably match more people or have a longer list of icw people to add.  This is why I initially ran it again since last May but going down only as low as 185cM from 199cM rather than 90cM – 190cM.  As I go lower, the number of matches, the number of nodes and edges will greatly increase.

For non-endogamous populations, expect to see something that would be more clear.  Utilizing Gephi you could easily attach names and whatever data you would like to the nodes and distinguish each cluster from each other easily.

Auto-Cluster

As I mentioned MyHeritage as one of the DNA testing sites that offers auto-clustering with your DNA  matches.  If you have tested at MyHeritage, you could run an auto-cluster as often as you would like.  Unlike GeneticAffairs.com where you could adjust the parameters, MyHeritage seems to do it automatically.  So depending on the number of matches that you have, or in my case could have a lot of icw, they (automatically) decide what would be best to produce a decent amount of matches.

First, an example of what you would see with autoclusters:

What are autoclusters

Image from MyHeritage’s FAQ page.

What you would get are colored blocks assigned randomly.  The grey square are DNA matches who happen to match someone in one cluster as well as in another cluster.  This could indicate that you have a DNA match who might not have enough shared DNA to match everyone in a particular cluster, something that you would see in a more distant relative like a 2nd cousin of yours not matching a lot of your common 3rd cousins. 

That is basically how clusters work.  They are to help you figure out how your DNA matches match each other.  Then of course it is up to you to figure out based on their trees how all of you connect.

This autocluster of mine I generated back in June.

I actually now have two clusters.  MyHeritage puts a limit as how the maximum amount of shared DNA to be used in autoclusters.  400cM, since that is about the level what you would share with 2nd cousins, not with 1st cousins, maybe a few 1C1R (1st cousins once removed).  My second cluster which reflects my paternal (Filipino) side actually does consist of two 1C1R, a 1/2 1C and a 2C (2nd cousin).  One of those 1C1R in my second cluster is also Kanaka Maoli like myself, so that cousin did produce a few grey squares with some of my other DNA matches in that larger cluster.

What I also did was extract the data which I put on the right-hand side.  I sorted it by the least amount of shared DNA and identified the person if I knew their ancestry. You can also see the size of the largest segment and the number of segments.

A reminder that with MyHeritage’s autoclusters they implement a maximum threshold of 400cM.  The minimum threshold will vary depending on the person’s DNA matches, how much they share with you as well as how much they share with each other.

In my case, there were 494 matches taken from my list who share less than 400cM with me but more than 95cM (actually 108.1cM was the lowest amount shared).   They also decided that in order to be considered a shared DNA match, my matches need to match at least 95cM with each other.

Conclusion

While these tools are great for separating your DNA matches and possibly help you figure out how each one is connected to you and to each other, Polynesians will not benefit from these at all.  They actually could be misleading if they one does not understand what they are looking at, which is a lot of closely predicted 2nd, 3rd and 4th cousin matches.

Largest Segment – Is it the best way to gauge the closeness of relationship?

In my earlier blog posts I have mentioned how significant the largest segment size is when determining a true 2nd to 3rd cousin relationship.  Polynesians can have a total shared amount that can easily exceed 100cM.   These totals tend to over-estimate the predicted relationships.

From the ISOGG Wiki’s page, you can see that the average for 2nd cousins once removed (2C1R) is 106cM while 2nd cousins are averaging around 212.50cM.

So we tend to get a lot of these 2nd – 3rd cousin matches, depending on the company you tested with.  This is why the largest segment size has become important.  Blaine Bettinger has a post entitled The Shared cM Project – Longest Shared Segment where people had submitted their longest segment size based on their known relationships.  You can compare 2nd and 3rd cousins there and see what the average is for the longest segment size for specific relationships.

A quick look at the type of numbers just by looking at my own ONE TO MANY from GEDmatch.com.

My cousin Allen who is a 2C1R to me (his maternal grandmother & my mother are 1st cousins) has a large segment of 35.9cM.  You can see more comparisons of the largest segment for 2nd to 3rd cousins from Blaine’s Shared cM Project but I also have been keeping my own numbers from my known relatives.


Only one of those 2nd cousins shared a large segment of 21.8cM, pretty small, and then it gets even lower as you go more distant.  But normally 2nd cousins will share a rather large segment, which is why more than 20cM has always been advocated and also among the Ashkenazi Jewish community.  In fact, I thought they used 25cM, but I could be wrong.  I even mentioned 30cM would be good.

But is it a requirement?  Absolutely not.  However, if you cannot find a connection, or the same geographical origin i.e. New Zealand or Hawaii, then that would be a strong indicator that you are not as closely related as it was predicted.

I have been noticing how I do have a few Hawaiians whose largest segment is more than 30cM but have not been able to find a connection.  I also notice that these matches will not have the same geographical origins as I do.  So could it be that these large segments remain in our population for many generations?

Here’s an example of how it actually has remained for centuries by comparing my Hawaiian mother and a Maori.

Taking my mother’s ONE TO MANY matches, I sorted them by the largest segment size.  I indicated the known relatives in blue and the unknown in red.  My mother has a Hawaiian match as 44.6cM for the largest segment.  I still have not been able to find a connection, although one of that match’s branch goes back to the area of a few of my ancestors.  But even for us, that was more than 3 generations ago from my mother.  Another at 39.9cM, not sure if that person is a Hawaiian or Maori.  And there is a Maori match with the largest segment of 25.9cM.   At FTDNA, there is a Maori match whose largest segment is 23cM.

Here is the largest segment sized match with a couple of Hawaiians from MyHeritage.

37.2cM and 33.6cM.  They have pretty good trees but their ancestry goes back to totally different islands from my own ancestors.  And I saw in their trees the origins of the different islands is further back while the more recent ones were born in Honolulu where some of my more recent ancestors were born.  I did trace many of my ancestors’ descendants who remained in Honolulu but none are those connect to these matches.

Here is a Maori match from MyHeritage.

Notice that the largest segment is 34.2cM.  The highest I’ve seen with a Maori.  How can a large segment last that long after many centuries?

And while the focus here is utilizing the largest segment to get a more accurately find a true 2nd to 3rd cousin match, we know how in one generation a large segment can quickly be reduced.

Comparing with the largest segment that my mother shares with her 1/2 3C.  This is how they connect.  I outlined in yellow all testees in this particular comparison.

The largest segment that my mother & her 1/2 3C share is 49.6cM (FTDNA indicated 52cM) according to GEDmatch.  But that particular segment was not inherited entirely by my mother’s sister and seemed to have been broken up thanks to recombination and turned into a 10.7cM and a 25.1cM segment.

My mother’s deceased brother seemed to have received that same segment or maybe even slightly larger.  And while he is not alive to test, his son did, and he shares 50.2cM with this 1/2 3C of our parents, or our 1/2 3C1R.

This is what the comparisons look like.

My younger brother got nearly the entire segment as my mother got it but I got a very small portion of it, just 14.3cM.  That’s a huge difference from 49cM.  Had my mother nor my younger brother got tested, I would not have been able to find this good match and would have concentrated on matches with large segments more than 20cM or even 30cM.  My older brother got DNA tested however he does not share any of this same matching segment.  In fact, he shares 0cM on this particular chromosome.

This 1/2 3C was key in finding my mother’s biological parents.  At the time I did not know how we were related but I did concentrate on this match because of the large segment size.

So how do we really filter all of these matches?  By solely concentrating on the largest segment?  You should definitely not spend too much time on large segments that are less than 20cM and whose shared total is way over 200cM.  With those particular matches, if you compare trees and notice no common geographic area, that would be a big indicator that it is a distant match.

Remember that with a 2nd cousin you would share a pair of great-grandparents.  With a 3rd cousin you would share a pair of 2x great-grandparents.  By that generation or even a generation further back or two if you find that you do not share the same geographic location, then the match is a distant match.  The same applies for large segments greater than 30cM.  If no common geographic location, then it is probably a distant match.