Author Archives: Amy Williams

Amy Williams is an associate professor of Computational Biology at Cornell University. Her research focuses on using DNA to help individuals uncover their genealogical relationships. The tools on this website include work she has co-authored with several students.

Cumulative distribution of a human genetic map

Minimal viable genetic maps

In our first blog post, What is a centiMorgan?, we talked about genetic maps. Many of the planned tools at HAPI-DNA (and all of the current ones) use genetic maps to calculate lengths of segments or to simulate segments. One of the most commonly used genetic maps (the HapMap map) contains nearly 3.4 million entries. For web-based analyses like those we feature here, it helps to reduce the number of map entries. This post talks about how to drop sites from a genetic map without dramatically reducing its usefulness. The tool we produced for this reduces the HapMap map to just over 32,000 entries (a >100-fold reduction!). We might call this a minimal genetic map. (The title of this post is inspired by minimal viable genomes.)

Readers primarily interested in genetic genealogy may find this post a bit less useful than others. More posts about segment sharing among relatives are in the works.

Genetic maps only give a limited number of entries—not one for all >3 billion base-pairs in the human genome. Therefore, finding a genetic position for a physical position that’s not directly included in a map typically involves interpolation. A map entry lists a physical position and its corresponding genetic position, usually in centiMorgans (cM). Most IBD segments won’t have physical start and end positions listed in the map, so the standard approach is to linearly interpolate using the map positions before and after to find the genetic positions. For example, if the genetic map lists physical position 1,000,000 as being at genetic position 1.0 cM, and if the next physical position in the map is 1,200,000 at 1.2 cM, we could linearly interpolate to get the location of physical position 1,100,000. That physical position is halfway between the two flanking physical positions in the map, so its genetic position would be 1.1 cM—halfway between the two flanking genetic positions.
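The interpolation described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used by the HAPI-DNA tools: the function name interpolate_cm and the choice to clamp positions outside the map to the first or last entry are assumptions made for the example.

```python
from bisect import bisect_right

def interpolate_cm(phys_positions, cm_positions, query):
    """Linearly interpolate the cM position of a physical position (in bp).

    phys_positions must be sorted in increasing order, and cm_positions[i]
    is the genetic position of phys_positions[i].
    """
    # Assumption for this sketch: clamp queries that fall outside the map.
    if query <= phys_positions[0]:
        return cm_positions[0]
    if query >= phys_positions[-1]:
        return cm_positions[-1]

    hi = bisect_right(phys_positions, query)  # first map entry after the query
    lo = hi - 1                               # last map entry at or before it
    frac = (query - phys_positions[lo]) / (phys_positions[hi] - phys_positions[lo])
    return cm_positions[lo] + frac * (cm_positions[hi] - cm_positions[lo])

# The example from the text: halfway between 1.0 cM and 1.2 cM.
position_cm = interpolate_cm([1_000_000, 1_200_000], [1.0, 1.2], 1_100_000)
print(round(position_cm, 6))  # 1.1
```

The binary search keeps each lookup fast even on a map with millions of entries, which matters when many segment endpoints need to be converted.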

HapMap genetic map for human chromosome 10.

Genetic maps do not always change in a linear way between positions. This means that, if we drop entries in our genetic map arbitrarily, linear interpolation could end up giving cM positions that are far off from their true values. The image here plots the HapMap map for chromosome 10, with physical positions on the x-axis and the corresponding genetic (cM) positions on the y-axis. The relationship is not linear (a zoomed-in view of a smaller region would make this even more obvious), so we can’t drop positions without some care.

Instead of arbitrarily dropping map entries, the (non-web-based) tool for reducing genetic maps only drops a position if linearly interpolating that location from its flanking locations would differ from the original map by less than 0.05 cM. This is a tiny difference and should not meaningfully affect nearly any analysis we might want to do with IBD segments. The details of how this tool works are beyond our scope, but in general it scans a set of positions until it finds the one with the minimum linear interpolation error (below 0.05 cM) and drops that entry. The tool then restarts its scan to find the next entry with minimal error and drops it if it can, repeating the process until every remaining entry would produce more than 0.05 cM of error if it were linearly interpolated from its two flanking positions.
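As a rough illustration of the general idea, here is a simplified greedy sketch in Python. It is not the actual tool: the names interpolation_error and reduce_map are hypothetical, the scan covers the whole map on every pass (which is slow for millions of entries), and the error is measured only for the entry being dropped against its current neighbors, so it does not re-check entries that were dropped earlier against the original map.

```python
def interpolation_error(left, mid, right):
    """cM error if `mid` were linearly interpolated from `left` and `right`.

    Each entry is a (physical_position_bp, genetic_position_cm) tuple.
    """
    (p0, c0), (p1, c1), (p2, c2) = left, mid, right
    predicted = c0 + (p1 - p0) / (p2 - p0) * (c2 - c0)
    return abs(predicted - c1)

def reduce_map(entries, max_error=0.05):
    """Greedily drop interior entries whose interpolation error is below max_error."""
    kept = list(entries)
    while len(kept) > 2:
        # Error for every interior entry, given its current flanking entries.
        errors = [
            interpolation_error(kept[i - 1], kept[i], kept[i + 1])
            for i in range(1, len(kept) - 1)
        ]
        best = min(range(len(errors)), key=errors.__getitem__)
        if errors[best] >= max_error:
            break  # every remaining entry would introduce too much error
        del kept[best + 1]  # errors[j] belongs to kept[j + 1]
    return kept

# Toy data: the first stretch is linear, then the recombination rate jumps.
toy_map = [(0, 0.0), (100, 1.0), (200, 2.0), (300, 3.0), (400, 10.0)]
print(reduce_map(toy_map))  # [(0, 0.0), (300, 3.0), (400, 10.0)]
```

The first and last entries are never dropped, so the reduced map still spans the same physical range as the input.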

A map with only 32,143 entries, as this minimal map has, is great for the WebAssembly tools hosted on HAPI-DNA. WebAssembly tools run on the viewer’s computer, that is, on your computer. The map is stored in only roughly 251 KB (with a four-byte integer for the physical position and a four-byte floating point number for the cM position for each entry). The original sex-specific genetic map used for web-based simulation contains 833,777 entries, but after removing positions that can be interpolated with ≤ 0.05 cM error, it contains only 43,128 entries. With two cM positions per site (one male and one female), this map fits in 518 KB of memory.
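The storage figures can be checked with a little arithmetic. The sketch below assumes tightly packed entries with no padding or header, which the post does not state explicitly. Under that assumption, the first figure matches 1024-byte units (about 251 KiB) and the second matches 1000-byte units (about 518 kB, or roughly 505 KiB).

```python
import struct

# Assumed packed layouts: a 4-byte integer position plus one or two 4-byte floats.
SEX_AVERAGED_ENTRY_BYTES = struct.calcsize("<if")   # 4 + 4 = 8 bytes
SEX_SPECIFIC_ENTRY_BYTES = struct.calcsize("<iff")  # 4 + 4 + 4 = 12 bytes

def map_size_bytes(num_entries, entry_bytes):
    """Total bytes needed to store num_entries packed map entries."""
    return num_entries * entry_bytes

minimal_map = map_size_bytes(32_143, SEX_AVERAGED_ENTRY_BYTES)
sex_specific_map = map_size_bytes(43_128, SEX_SPECIFIC_ENTRY_BYTES)

print(minimal_map, round(minimal_map / 1024, 1))            # 257144 251.1
print(sex_specific_map, round(sex_specific_map / 1000, 1))  # 517536 517.5
```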

If there is interest in the comments or on Twitter, we’ll post the WebAssembly code for the segment length tool.

Thanks to Jonny Perl for collaborating on the idea of having a web-based cM calculator.

Posted in Genetic maps
Rates of half relatives sharing ≥ 7 cM segments

How often do two half relatives share DNA?

Following up on the last post about full relatives, the plot below shows the rates at which half relatives share ≥ 7 cM IBD segments. This uses the same abbreviations as in the previous post (1C means first cousins, 3C1R stands for third cousins once removed), but all relationships are prefixed with an ‘h’ for half. The first relationship, hAV, is half-aunt/uncle-niece/nephew. Papers often refer to aunt/uncle and niece/nephew relatives as “avuncular,” so the plot uses AV as an abbreviation.

(Interactive plot: rates of half relatives sharing ≥ 7 cM segments.)

The rates are very similar to those of the “roughly” equivalent full relatives from before (see the table of equivalent relationships), but are a bit lower here. For example, 71.1% of half-third cousin (h3C) pairs share at least one ≥ 7 cM segment, versus 72.7% of third cousins once removed (3C1R, a roughly equivalent full relationship).

Details of the simulation are the same as in the last post. This includes using the same number of pairs (100,000) for each relationship type.

Some have asked about rates for different minimum segment lengths. This is perhaps best represented in a tool, and we’ll work on getting one up in the coming weeks.

Posted in Identity-by-descent
Rates of relatives sharing ≥ 7 cM segments

How often do two relatives share DNA?

Close relatives like two full siblings, an aunt and nephew, or a grandparent and grandchild always share IBD segments, so they show up in testing companies’ relative matches. However, more distant relatives may not share any IBD segments. In fact, the chance that two people share DNA decreases with the distance of their relationship. This is important to remember when doing genetic genealogy: if you don’t share segments with someone, that doesn’t necessarily mean you’re not related to them. As the numbers below show, even a small fraction of second cousins (0.02% in this analysis) may not have any detected IBD segments.

To find out the rates at which relatives share segments, one option is to simulate. We did this previously (Figure 3, the SS+intf bars), but that work counted all segments, regardless of their length. Unfortunately, reliably detecting segments shorter than 6-7 cM is hard, and most companies only look for segments that are 6 or 7 cM or longer.

Considering only 7 cM or longer segments changes the rates at which relatives share DNA, as shown in the plot below. The numbers above each bar give the percent of each relative type that share at least one ≥ 7 cM segment. (Here 1C represents first cousins, 2C second cousins, etc., and NC1R represents Nth cousins once removed.) From this, we see that first cousins share five or more ≥ 7 cM segments 100% of the time, while only 0.286% of eighth cousins share such a segment (and nearly all share only one). (See below for details on how we simulated.)

You can hover over the bars to see the percentage breakdowns across segment counts.

(Interactive plot: rates of relatives sharing ≥ 7 cM segments.)

These numbers are from simulated relatives: 100,000 pairs for each type. If a segment is present, the simulator always reports it. A caveat therefore is that, while companies report many of the ≥ 7 cM segments, they sometimes miss some. (They also sometimes report a segment that is not real, unfortunately, though in most cases a ≥ 7 cM segment will be real.) Therefore, these numbers should be used as a guide. We could update the numbers based on probabilities of detecting segments (and a future blog post may), but a challenge is that detection rates depend on many factors, including how many SNPs were tested in the two relatives and the method the companies use to detect the segments.
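To make the calculation concrete, here is a minimal sketch of how sharing rates like these could be tallied from simulated segments. The input format (one list of segment lengths in cM per simulated pair), the function names, and the toy values are all assumptions made for illustration; they are not Ped-sim output.

```python
from collections import Counter

def sharing_rate(pairs, min_cm=7.0):
    """Percent of pairs with at least one segment of length >= min_cm."""
    sharing = sum(
        1 for segments in pairs if any(length >= min_cm for length in segments)
    )
    return 100.0 * sharing / len(pairs)

def segment_count_breakdown(pairs, min_cm=7.0):
    """Percent of pairs sharing 0, 1, 2, ... segments of length >= min_cm."""
    counts = Counter(
        sum(1 for length in segments if length >= min_cm) for segments in pairs
    )
    return {k: 100.0 * n / len(pairs) for k, n in sorted(counts.items())}

# Toy input (made-up values): each inner list holds one pair's segment lengths in cM.
toy_pairs = [[12.3, 8.1], [], [6.5], [7.0, 3.2]]

print(sharing_rate(toy_pairs))             # 50.0
print(segment_count_breakdown(toy_pairs))  # {0: 50.0, 1: 25.0, 2: 25.0}
```

Changing min_cm gives the rates for other minimum segment lengths from the same simulated pairs.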

Other relative types

The simulations considered a range of full cousins and full cousins once removed. It turns out, a full Nth cousin has the same shared segment properties as a full (N-1)th cousin twice removed, so the sharing rates here apply to many more types of relatives. Specific examples of equivalent relatives are shown below along with general cases. (This table doesn’t list all relative types.)

Relationship  Equivalent                        Roughly equivalent
1C            great-aunt/uncle                  half-aunt/uncle
1C1R          2nd great-aunt/uncle              half-1C
2C            1C2R                              half-1C1R
2C1R          1C3R                              half-2C, half-1C2R
3C            2C2R, 1C4R                        half-2C1R, half-1C3R
3C1R          2C3R, 1C5R                        half-3C, half-2C2R, …
4C            3C2R, 2C4R, 1C6R                  half-3C1R, half-2C3R, …
4C1R          3C3R, 2C5R, 1C7R                  half-4C, half-3C2R, …
NC            (N-1)C2R, (N-2)C4R, (N-3)C6R, …   half-(N-1)C1R, half-(N-2)C3R, …
NC1R          (N-1)C3R, (N-2)C5R, (N-3)C7R, …   half-NC, half-(N-1)C2R, …
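One way to see why these relationships line up is to count the meioses (parent-to-child transmissions) that separate the two relatives. The sketch below uses the standard pedigree calculation, not anything taken from Ped-sim, and the function names are hypothetical. It assumes cousins connected through two common ancestors (full) or one (half). Equivalent relationships have the same number of meioses and the same number of common ancestors. Roughly equivalent half relationships have one fewer meiosis but only one common ancestor, which gives the same expected fraction of shared DNA.

```python
from fractions import Fraction

def meioses(n, removed=0):
    """Meioses separating Nth cousins `removed` times removed.

    Each cousin is n + 1 generations below the common ancestor(s), and each
    removal adds one more generation on one side.
    """
    return 2 * (n + 1) + removed

def expected_relatedness(n, removed=0, half=False):
    """Expected fraction of DNA shared: (common ancestors) * (1/2) ** meioses."""
    common_ancestors = 1 if half else 2
    return common_ancestors * Fraction(1, 2) ** meioses(n, removed)

# Equivalent: 2C and 1C2R are separated by the same number of meioses.
print(meioses(2), meioses(1, removed=2))            # 6 6

# Roughly equivalent: 3C1R, half-3C, and half-2C2R share the same expected fraction.
print(expected_relatedness(3, removed=1))           # 1/256
print(expected_relatedness(3, half=True))           # 1/256
print(expected_relatedness(2, removed=2, half=True))  # 1/256
```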

Half relatives such as half-first cousins (who share one common grandparent instead of two as in full first cousins) have very slightly lower rates of sharing segments than full relatives of the roughly equivalent type. If there’s enough interest (on Twitter or in the comments), we can put up another post on half-relatives. Update: See the next post for rates in half relatives.

Simulation details

The numbers in the plot above are based on output from the Ped-sim program, where we used a sex-specific genetic map and modeled crossover interference. We found that Ped-sim very accurately captures the total segment length that real relatives share, so the numbers in the plot should be very reliable in a scenario where a company detects all ≥ 7 cM segments with no false segments. You can run Ped-sim with sex-specific maps and interference here.

Thanks to Jonny Perl for asking about sharing rates of 4C2R, which helped motivate this post.

Posted in Identity-by-descent
Transmission of colored DNA across three generations

What is a centiMorgan?

Genetic testing companies and geneticists in general use centiMorgans (cM) to measure lengths of DNA that relatives share. You may have heard that DNA contains sequences of nucleotides—adenine, cytosine, guanine, and thymine, which are abbreviated as A, C, G, and T. One natural way to measure lengths of DNA is in terms of the number of nucleotides a segment of DNA contains. This is used in many contexts and is known as a sequence’s physical length. Physical lengths are measured in units of base pairs (bp) and give the number of nucleotides a sequence contains. So, for example, “GATTACA” is 7 bp.

46 human chromosomes

The human genome: Chromosomes 1 through 22 and X and Y.

To understand centiMorgans, it’s useful to have a bit of background. We all have 23 pairs of chromosomes, having inherited one set of 23 from our father and another set of 23 from our mother. These chromosomes are physically small, with all 46 contained in our bodies’ cells, but they contain all of our DNA. The length of human chromosome 1 is roughly 249 million bp, whereas chromosome 22 is about 50.8 million bp.

When it comes to heredity, perhaps the most important cell types are the germ cells: sperm and eggs. While most human cells carry 23 pairs of chromosomes, germ cells contain only one copy of each chromosome. This is so that, once these cells fuse, the resulting fertilized egg will have 23 pairs of chromosomes.

The chromosomes in germ cells are not simply exact copies of one chromosome from each of a person’s 23 pairs, but are formed by recombination. A visualization helps capture this. The image with squares and circles shows how DNA from a couple might be transmitted to two children and three grandchildren. Here, circles represent females, squares represent males, and the vertical bars below these shapes give a colored representation of that person’s pair of chromosomes. (For simplicity, we will talk about recombination on only one chromosome. The same principles apply to all of them, such as chromosome 10 or 2, except the X and Y chromosomes in fathers.)

At the top, the man has a dark and a light blue chromosome, and the woman has a red and a pink chromosome. Just below them are their two children, both of whom inherited one chromosome from each parent. Because of recombination, the children’s chromosomes are multi-colored, containing copies of DNA from their mother’s two chromosomes and from their father’s two chromosomes. In this case, both children received a copy of their dad’s dark blue chromosome at the top and both also received some amount of the light blue chromosome. Similarly, the mom transmitted a chromosome to each child containing some portions from her red chromosome and some from the pink chromosome. The bars get even more colorful in the next generation (the shapes at the bottom) because these grandchildren inherited a chromosome that is recombined from their parents’ chromosomes. This means their chromosomes can contain pieces of all four of their grandparents’ chromosomes, and indeed, copies of DNA from all four chromosomes were transmitted to at least one grandchild.

Considering all the chromosomes, a germ cell contains an average of 36.4 recombinations. (Technically, we should use the word crossover here. Strictly speaking, recombinations include both crossovers and another very small form of recombination that spans tens to hundreds of bp. We will follow the more typical use and say “recombination.”) Said differently, there is an average of 36.4 recombinations per generation. In fact, this number is the Morgan length of all the chromosomes. That is, the length of a piece of DNA in Morgans is the average number of recombinations that occur in it in one generation. Of course, 36.4 Morgans is equal to 3640 cM: as its name implies, a centiMorgan is 1/100th of a Morgan. (Morgans are named for Thomas Hunt Morgan, who led pioneering work in the study of recombination.)

Researchers have analyzed DNA from many parents and children to measure how likely a region of DNA is to recombine in one generation. They have counted not just the average number of recombinations across the full genome (i.e., 3640 cM for all the chromosomes collectively) but also in specific regions, like the average on chromosome 1, or on some small section of chromosome 17. A 100 cM long section of DNA (which is the same as 1 Morgan long) will have, on average, 1 recombination per generation, so a parent will typically transmit about one recombination in such a section. A piece of DNA with a length of 10 cM = 0.1 Morgans has a recombination in about 1 out of 10 transmissions (10%). The parent-child DNA transmission data allow researchers to produce genetic maps that anyone can use to calculate the cM length of any physical span of DNA. Genetic testing companies use these maps to calculate the length of shared segments for relatives. Perhaps the most widely used genetic map measures chromosome 1 as 286 cM and chromosome 22 as 74.1 cM. It also shows that the distance from chromosome 10 physical position 34,726,104 to 83,988,506 is 49.1 cM.
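The unit conversions in this post are simple enough to check directly. The short sketch below uses hypothetical helper names. The average rate in cM per million bp for the chromosome 10 example is a derived illustration and is not a figure quoted from the genetic map itself.

```python
CM_PER_MORGAN = 100

def morgans_to_cm(morgans):
    """Convert a genetic length in Morgans to centiMorgans."""
    return morgans * CM_PER_MORGAN

def expected_recombinations(length_cm):
    """Average number of recombinations per generation in a span of length_cm."""
    return length_cm / CM_PER_MORGAN

def average_cm_per_mb(start_bp, end_bp, length_cm):
    """Average recombination rate over a physical span, in cM per million bp."""
    return length_cm / ((end_bp - start_bp) / 1_000_000)

print(round(morgans_to_cm(36.4)))    # 3640
print(expected_recombinations(100))  # 1.0
print(expected_recombinations(10))   # 0.1

# The chromosome 10 span from the text: 49.1 cM over about 49.3 million bp.
print(round(average_cm_per_mb(34_726_104, 83_988_506, 49.1), 3))  # 0.997
```

In this particular span, 1 cM corresponds to roughly 1 million bp, but as the genetic map plots show, that ratio varies considerably from region to region.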

In an upcoming post, we’ll talk more about cM lengths of DNA and how recombination leads more distant relatives to share fewer segments that are also on average smaller than those that close relatives share.

Posted in Genetic maps