A gentle introduction to the fastLS algorithm¶
This notebook summarizes and illustrates a series of lectures given by Nick Patterson on his take on the fastLS algorithm for fast haplotype matching in large cohorts. The original paper by Lunter can be found here, and slides provided by Nick Patterson here.
Instead of the Burrows-Wheeler transform used in the original paper, the current approach relies on the radix-sort algorithm.
The objective¶
The objective of the fastLS algorithm is to find, for a given haplotype, the best matching haplotypes in a reference panel in query time that is independent of the number of haplotypes in the panel (after a one-off preprocessing pass), which allows it to be used on very large reference panels.
We will proceed in two stages:
- We will first describe the data structure used to represent the reference panel, which is key to reaching this runtime.
- We will describe the algorithm itself, which relies on the data structure described in the first stage.
Radix-Sort¶
We aim to find a clever way to represent our reference panel $H$, which contains $m$ haplotypes and $n$ SNPs. The reason why this representation is useful will become clear in the next section, when we describe the algorithm itself. For now, we just aim to understand how to build this data structure and what information is available in it.
Disclaimer: I used AI to help with the visualization of the data structure.
Let's start by visualizing our reference panel:
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
We next perform the radix-sort algorithm on the columns of $H$, starting from the last column and moving to the first one (we will later describe some other quantities that we need to keep track of during the sorting process, but for now let's just focus on how the sorting works).
The idea is simple: we look at the last column of $H$ and split the rows into two piles. The first pile contains the rows that have a 0, the second pile contains the rows that have a 1. We then concatenate the two piles and repeat the process for the column to the left, until we reach the first column. Crucially, the split is stable: within each pile the previous order is preserved, so after processing column $j$ the rows are sorted by their values at SNPs $j$ through $n-1$, and rows that agree on those SNPs end up in one contiguous block.
The process is illustrated below:
**Part 1: Preprocessing (radix sort).** We process columns from the rightmost (SNP $n-1$) to the leftmost (SNP 0).
--- Processing SNP3 ---
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 |
--- Processing SNP2 ---
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 |
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
--- Processing SNP1 ---
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
--- Processing SNP0 ---
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| Hap_ID | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 1 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
With this method (radix-sort), we managed to sort all rows of $H$ without comparing pairs of rows, so we built the new array in $\mathcal{O}(mn)$ time, which is the best we can do since it corresponds to the complexity of reading the reference panel.
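As a sanity check, the whole sorting pass can be written in a few lines of NumPy. This is a sketch with my own variable names, reproducing the panel from the tables above; the stable split per column is exactly the two-pile step described earlier:

```python
import numpy as np

# The reference panel from the tables above: m = 8 haplotypes, n = 4 SNPs.
H = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
])
m, n = H.shape

order = np.arange(m)  # current order of the haplotype IDs ("the deck")
for j in range(n - 1, -1, -1):  # columns from right to left
    col = H[order, j]
    # Stable split: rows with a 0 first, then rows with a 1.
    order = np.concatenate([order[col == 0], order[col == 1]])

print(order.tolist())  # [4, 5, 3, 6, 7, 2, 0, 1], the final sorted deck
```

The final order matches the last table of the walkthrough: haplotypes 4, 5, 3, 6, 7, 2, 0, 1.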
🤨 Why did we do this?¶
That's a valid question I was asking myself when following the sorting process. Let's illustrate that with a first example: finding the longest match for a given haplotype $w$ in the reference panel $H$.
Example: Haplotype matching¶
Before diving into the haplotype matching algorithm, let's introduce a few variables that keen readers might have noticed in the previous code. These variables do not add to the runtime complexity of the preprocessing, and will allow us to perform the haplotype matching in $\mathcal{O}(n)$ time, which does not depend on the size of the reference panel!
Our first variable is $lf$. At each step, $lf$ is a vector of length $m$ that tells us where each haplotype lands after re-merging the two piles of 0s and 1s. In the example below, we see that during the first step $lf[0]=0$ (the haplotype in position 0 stays in position 0), but $lf[7]=5$. Indeed, the haplotype in position 7 has a 0 at SNP3, so it moves up into the 0-pile and lands in position 5 after sorting. While running the radix-sort algorithm, we keep track of $lf$ as a matrix, where each column corresponds to a step of the sorting process.
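For the first step of our example, $lf$ can be computed directly from the column of bits; a sketch (variable names are mine; `col` holds the SNP3 column in the original haplotype order):

```python
import numpy as np

# Bits at SNP3 in the current deck (haplotypes 0..7, original order).
col = np.array([0, 1, 1, 0, 0, 0, 0, 0])
m = len(col)
n_zeros = int((col == 0).sum())

lf = np.empty(m, dtype=int)
lf[col == 0] = np.arange(n_zeros)                 # the 0-pile stacks on top...
lf[col == 1] = n_zeros + np.arange(m - n_zeros)   # ...the 1-pile goes below

print(lf.tolist())  # [0, 6, 7, 1, 2, 3, 4, 5]: lf[0] == 0 and lf[7] == 5
```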
Our second and third variables are $U$ and $V$. At each stage, for the current column, we define them as:
- $U[k, x]$ gives the next position (at or after $k$) where allele $x$ appears, or the sentinel value $m$ if there is none.
- $V[k, x]$ gives the previous position (at or before $k$) where allele $x$ appears, or the sentinel value $-1$ if there is none.
I was confused when reading these definitions, so it is worth looking at an example:
| k | bit at SNP2 | U(k,0) | U(k,1) | V(k,0) | V(k,1) |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | -1 |
| 1 | 1 | 2 | 1 | 0 | 1 |
| 2 | 0 | 2 | 6 | 2 | 1 |
| 3 | 0 | 3 | 6 | 3 | 1 |
| 4 | 0 | 4 | 6 | 4 | 1 |
| 5 | 0 | 5 | 6 | 5 | 1 |
| 6 | 1 | 7 | 6 | 5 | 6 |
| 7 | 0 | 7 | 8 | 7 | 6 |
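These tables can be filled in a single pass over the column in each direction; a minimal sketch with the sentinel conventions of the table above (variable names are mine):

```python
import numpy as np

# Bits at SNP2 in the deck obtained after sorting SNP3 (the table above).
col = np.array([0, 1, 0, 0, 0, 0, 1, 0])
m = len(col)

U = np.empty((m, 2), dtype=int)  # next position at or after k with allele x
V = np.empty((m, 2), dtype=int)  # previous position at or before k with allele x
for x in (0, 1):
    nxt = m   # sentinel: allele x does not appear at or after k
    for k in range(m - 1, -1, -1):
        if col[k] == x:
            nxt = k
        U[k, x] = nxt
    prv = -1  # sentinel: allele x does not appear at or before k
    for k in range(m):
        if col[k] == x:
            prv = k
        V[k, x] = prv

print(U[0].tolist(), V[0].tolist())  # [0, 1] [0, -1], the table's first row
```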
With these definitions at hand, we can derive the haplotype matching algorithm, which is illustrated below.
We start by finding the first and last haplotypes matching the target haplotype at the last SNP. In our example, where the last SNP of $w$ is 0, this corresponds to first index $= U[0, 0]$ and last index $= V[m-1, 0]$. We then trace these haplotypes through the second stage of the radix-sort: the first index becomes $lf[U[0, 0]]$ and the last index becomes $lf[V[m-1, 0]]$. Due to the nature of the radix-sort algorithm, all positions between these two new indices match the target haplotype at this SNP (they all went into the same pile).
We then repeat this process: we look up the $U$ and $V$ values of the current first and last indices, and trace them through the $lf$ matrix. The resulting interval contains all haplotypes matching the target haplotype at the last two SNPs.
We repeat this process until we find an empty interval, which means that there is no longer any match between the target haplotype and the reference panel, or until we reach the first SNP, in which case we have found a perfect match for the target haplotype in the reference panel.
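Putting the pieces together, the loop just described can be sketched end to end. For simplicity this sketch scans the current column directly instead of using the precomputed $U$/$V$ tables, and all names are my own:

```python
import numpy as np

H = np.array([
    [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0],
    [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
])
m, n = H.shape

# Preprocessing: for each SNP (rightmost first) record the deck's column and lf.
order = np.arange(m)
steps = []
for j in range(n - 1, -1, -1):
    col = H[order, j]
    n_zeros = int((col == 0).sum())
    lf = np.empty(m, dtype=int)
    lf[col == 0] = np.arange(n_zeros)
    lf[col == 1] = n_zeros + np.arange(m - n_zeros)
    steps.append((col, lf))
    order = np.concatenate([order[col == 0], order[col == 1]])

def match(w):
    """Return (first SNP of the match, interval of matching positions)."""
    first, last = 0, m - 1
    for (col, lf), j in zip(steps, range(n - 1, -1, -1)):
        # Direct scan standing in for the U/V lookups of the text.
        hits = [k for k in range(first, last + 1) if col[k] == w[j]]
        if not hits:
            return j + 1, (first, last)  # the match breaks at SNP j
        first, last = lf[hits[0]], lf[hits[-1]]
    return 0, (first, last)

start, (lo, hi) = match([1, 0, 0, 0])
print(start, order[lo:hi + 1].tolist())  # 0 [6, 7]: haplotypes 6 and 7 match fully
```

On the example query $w = 1000$, the surviving interval in the final deck contains haplotypes 6 and 7, exactly the two perfect matches visible in the panel.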
Let's see how this unfolds in an example, where we are looking for the reference haplotypes matching the target haplotype $w = 1000$.
**Part 2: Matching algorithm.** We search for the longest match for the query haplotype $w = [1\ 0\ 0\ 0]$.
--- Checking SNP 3 (Target Bit: 0) ---
| Pos(Deck_4) | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 0 | 1 | 1 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 | 0 |
| 7 | 1 | 0 | 0 | 0 |
Mathematically: $U[0, 0] \to 0$, $V[7, 0] \to 7$. Mapping to Deck(3): $lf[0] \to 0$, $lf[7] \to 5$.
--- Checking SNP 2 (Target Bit: 0) ---
| Pos(Deck_3) | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 0 |
| 6 | 1 | 1 | 1 | 1 |
| 7 | 1 | 0 | 0 | 1 |
Mathematically: $U[0, 0] \to 0$, $V[5, 0] \to 5$. Mapping to Deck(2): $lf[0] \to 0$, $lf[5] \to 4$.
--- Checking SNP 1 (Target Bit: 0) ---
| Pos(Deck_2) | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 |
| 5 | 1 | 0 | 0 | 1 |
| 6 | 0 | 1 | 1 | 0 |
| 7 | 1 | 1 | 1 | 1 |
Mathematically: $U[0, 0] \to 3$, $V[4, 0] \to 4$. Mapping to Deck(1): $lf[3] \to 0$, $lf[4] \to 1$.
--- Checking SNP 0 (Target Bit: 1) ---
| Pos(Deck_1) | SNP0 | SNP1 | SNP2 | SNP3 |
|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 1 | 1 | 0 |
| 7 | 1 | 1 | 1 | 1 |
Mathematically: $U[0, 1] \to 0$, $V[1, 1] \to 1$. Mapping to Deck(0): $lf[0] \to 3$, $lf[1] \to 4$.
This algorithm illustrates that, once we have built our data structure in $\mathcal{O}(mn)$ time, we can find the longest match for a specific query in $\mathcal{O}(n)$ time.
FastLS algorithm¶
Next, we want to find the "best" path through the reference panel, which is the path that minimizes the number of mismatches between the target haplotype and the reference panel, as well as minimizing the number of switches between haplotypes in the reference panel.
We can weight these two objectives differently, with the total loss being defined as:
$$ S = J \rho + G \mu $$
Where $J$ is the number of jumps between haplotypes, $G$ is the number of mismatches, and $\rho$ and $\mu$ are the weights associated with these two objectives.
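As a quick sanity check of this loss, here is a hypothetical helper (the function and its name are my own) that scores an explicit copying path against the example panel:

```python
import numpy as np

H = np.array([
    [1, 1, 0, 0], [1, 1, 1, 1], [1, 0, 0, 1], [0, 1, 1, 0],
    [0, 1, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0],
])

def path_score(path, w, rho, mu):
    # path[j] = haplotype we copy from at SNP j
    J = sum(path[j] != path[j + 1] for j in range(len(path) - 1))  # jumps
    G = sum(H[path[j], j] != w[j] for j in range(len(path)))       # mismatches
    return J * rho + G * mu                                        # S = J*rho + G*mu

w = [1, 0, 0, 0]
print(path_score([6, 6, 6, 6], w, rho=1.0, mu=2.0))  # 0.0: haplotype 6 matches w
print(path_score([0, 0, 0, 0], w, rho=1.0, mu=2.0))  # 2.0: one mismatch at SNP1
```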
The idea is to build a tree of paths from right to left, where each node summarizes the cost of reaching the rightmost SNP of the reference panel from that node.
Initialization of FastLS¶
We will walk from right to left, and at each step we will keep track of the score required to reach the rightmost SNP of the reference panel, as well as the path taken to reach it. Of course, we cannot keep track of all possible paths, as this number grows exponentially with the number of SNPs. Instead, we will do two clever tricks:
- Merge paths, or part of paths, that are similar into a single state, thanks to our data structure.
- Prune regions that cannot be part of the optimal path.
To make this work, we will define some nodes. A node contains the following information:
- An interval of haplotypes in the reference panel for this SNP
- The score associated with this node, which is the optimal cost of reaching the rightmost SNP of the reference panel following the optimal path through one of the haplotypes in the interval.
- A back pointer to a previous node
For reasons that will become clear later, we start with a dummy root node, which covers more haplotypes than there are in the reference panel and has the highest score possible.
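The node just described can be sketched as a small Python dataclass (field names are my own):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    lo: int                        # first position of the interval (inclusive)
    hi: int                        # last position of the interval (inclusive)
    score: float                   # optimal cost to reach the rightmost SNP
    back: Optional["Node"] = None  # back pointer to the previous node

# Dummy root: wider than the panel (m = 8 rows) and with the worst score.
root = Node(lo=-1, hi=8, score=float("inf"))
print(root)
```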
**Step 1: Initial tree setup.**

Initial tree state:
- Interval: [-1, 8], Score: ∞
Building a tree¶
Now, we look at the rightmost SNP. Due to the radix-sort, this SNP is covered by two intervals of haplotypes: one interval for the haplotypes that have a 0 at this SNP, and one for the haplotypes that have a 1. We create two nodes for these two intervals and compute their scores. Assuming the target haplotype has a 0 at this SNP, the score of the node corresponding to the interval of haplotypes with a 0 will be 0 (no mismatch, no jump), while the score of the node corresponding to the interval of haplotypes with a 1 will be $\mu$ (one mismatch, no jump).

We interpret these nodes in the following way: "if I am copying from one of the haplotypes in this interval, I need to pay a score of X to reach the rightmost SNP of the reference panel". For this interpretation to hold, we also need to add the possibility of jumping: "whoever I am copying from, I can always jump (paying a cost of $\rho$) and then copy from one of the haplotypes yielding the lowest score". Therefore, we add a third node, which covers the whole reference panel and has a score of $\rho + 0$ (the jump cost plus the minimum score, which is 0). This third node's back pointer points to the node with score 0; the two other nodes' back pointers point to the root node.
If we look at the three nodes we obtained, we see that they form a nice nested structure:
Tree state after processing this SNP (with jump):
- Interval: [-1, M], Score: ∞
  - Interval: [0, M-1], Score: ρ
    - Interval: [0, m], Score: 0
    - Interval: [m+1, M-1], Score: μ
🤨 The tree does not correspond to the backpointer you mentioned earlier!
💡 True: the tree is built by looking at the intervals each node contains. The backpointers are simply a hidden variable we will use in the end to know who we are copying from.
Now, we immediately notice something interesting. If we are copying from the interval [m+1, M-1], we will either always want to jump to the interval [0, m] (if $\rho < \mu$) or never want to jump (if $\rho > \mu$). This means that one of the two nodes will never be part of the optimal path. Let's assume for now that $\rho < \mu$; then we can prune the node corresponding to the interval [m+1, M-1], and we are left with only two nodes (besides the root):
Tree state after pruning (assuming ρ < μ):
- Interval: [-1, M], Score: ∞
  - Interval: [0, M-1], Score: ρ
    - Interval: [0, m], Score: 0
Now, we can move one SNP to the left, and we repeat the process. Using our $U$ and $V$ variables, we can split the intervals of our two surviving nodes into two sub-intervals each, depending on whether the haplotypes have a 0 or a 1 at this new SNP. Again, due to the radix-sort, these sub-intervals will be either disjoint or nested! The node containing each sub-interval will have as back pointer the node that contained the original interval.
We will have 4 new nodes (fictional example, where the target haplotype has a 0 at this new SNP):
- 2.0 [0, a] (score 0 + 0, backpointer to node 1.0)
- 2.1 [b, c] (score 0 + $\mu$, backpointer to node 1.0)
- 2.2 [a+1, b-1] (score 0 + $\rho$, backpointer to node 1.1)
- 2.3 [c+1, m] (score $\mu + \rho$, backpointer to node 1.1)
🤨 **What are $a$, $b$, $c$, etc.?**
💡 So, the haplotypes at the next position will be sorted with 0s on top, from 0 to $b-1$, and 1s at the bottom, from $b$ to $m$. Part of the interval of node 1.0 has a 0 at this new SNP ([0, a]) and the other part a 1 ([b, c]). Likewise, part of the interval of node 1.1 has a 0 at this new SNP ([a+1, b-1]) and the other part a 1 ([c+1, m]).
We can, as before, also allow jumping to the best state (2.0), which creates a new node 2.4 [0, m] (score $\rho + 0$, backpointer to node 2.0).
Building a tree out of these nodes, we get the following structure:
Tree state for the new SNP:
- Interval: [-1, M], Score: ∞
  - Interval: [0, m], Score: ρ
    - Interval: [0, a], Score: 0
    - Interval: [a+1, b-1], Score: ρ
    - Interval: [b, c], Score: μ
    - Interval: [c+1, m], Score: μ + ρ
Now, we can look at these nodes and prune the ones that cannot be part of the optimal path. In our example, with $\rho < \mu$, we can prune nodes 2.1, 2.2 and 2.3. Indeed, for a similar cost or a lower one, we can be in any state and jump to node 2.0!
Before deriving the general algorithm, let's see what would happen if instead we had $\rho > \mu$. We restart the entire process from stage 1. Last time, we said that we would always prefer jumping to paying the price of a mismatch at the last SNP; this time, it is the opposite:
Tree state after pruning (assuming μ < ρ):
- Interval: [-1, M], Score: ∞
  - Interval: [0, m], Score: 0
  - Interval: [m+1, M-1], Score: μ
Tree state for the new SNP after pruning (assuming μ < ρ; the sub-intervals coming from node [m+1, M-1] now inherit its score μ):
- Interval: [-1, M], Score: ∞
  - Interval: [0, m], Score: ρ
    - Interval: [0, a], Score: 0
    - Interval: [a+1, b-1], Score: μ
    - Interval: [b, c], Score: μ
    - Interval: [c+1, m], Score: 2μ
These examples illustrate that as $\rho$ increases, we will need to keep track of more and more paths.
Anyway, now that we have worked through some examples, let's define the two pruning rules:
The two rules of pruning¶
Rule 1: The "Umbrella" Rule
If the parent interval of a node has a score at least as low and includes the node's interval, then we can prune the node. Indeed, we can always use the parent node instead: it offers more options at no higher cost than the child node.
Rule 2: The "Redundant Parent" Rule
Now, after applying rule 1, we might end up with a parent node whose children together cover the full interval of the parent node. In this case, we also know that the parent node has a higher score than each of its children (otherwise rule 1 would have pruned them). The parent node then offers no better option than any of its children, so we can prune it. In this case, we attach the child nodes directly to the grandparent node.
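The two rules can be sketched as small predicates on interval nodes (names and the tiny `Node` tuple are mine; the numeric example assumes $\rho < \mu$):

```python
from collections import namedtuple

Node = namedtuple("Node", "lo hi score")

def umbrella_prunable(child, parent):
    # Rule 1: the parent's interval contains the child's
    # and the parent's score is at least as low.
    return (parent.lo <= child.lo and child.hi <= parent.hi
            and parent.score <= child.score)

def redundant_parent(parent, children):
    # Rule 2: the children together tile the parent's full interval,
    # so the (more expensive) parent offers nothing extra.
    pos = parent.lo
    for lo, hi in sorted((c.lo, c.hi) for c in children):
        if lo > pos:
            return False  # gap in the coverage
        pos = max(pos, hi + 1)
    return pos > parent.hi

rho, mu = 1.0, 2.0
jump = Node(0, 7, rho)       # the "always can jump" node covering the panel
mismatch = Node(4, 7, mu)    # interval of haplotypes mismatching this SNP
print(umbrella_prunable(mismatch, jump))  # True, since rho < mu
```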
Tracing back the best path¶
We keep iterating this process until we reach the first SNP. That is, for each surviving node, we use $U$ and $V$ to split its interval into two sub-intervals, we compute the scores of these sub-intervals (adding 0 for the sub-interval matching the target's allele at the next SNP, $\mu$ for the other one), and we set a back pointer to the previous node. We then add the jumping node, which has a score of $\rho$ plus the minimum score among the nodes we just created; its back pointer points to this best node. We then apply the two pruning rules, and we move to the next SNP on the left.
When we reach the first SNP, we look at the surviving nodes, and we find the one with the lowest score. This node will be part of the optimal path. We can then use the backpointers to trace back the sequence of nodes that led us to this node, which will give us the sequence of intervals in the reference panel that we are copying from at each SNP.
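The traceback over back pointers can be sketched on a hypothetical two-SNP chain (the `Node` fields mirror the description above; the intervals in the example are made up for illustration):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    lo: int
    hi: int
    score: float
    back: Optional["Node"] = None

def traceback(best):
    # Follow back pointers from the winning node back to the dummy root,
    # collecting the copied interval at each SNP (leftmost SNP first).
    intervals = []
    node = best
    while node.back is not None:  # stop at the dummy root
        intervals.append((node.lo, node.hi))
        node = node.back
    return intervals

# Hypothetical chain: we copy from [3, 4] at SNP 0 and from [0, 5] at SNP 1.
root = Node(-1, 8, float("inf"))
n1 = Node(0, 5, 0.0, back=root)
n0 = Node(3, 4, 0.0, back=n1)
print(traceback(n0))  # [(3, 4), (0, 5)]
```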
Complexity of the algorithm¶
We saw in our earlier example that the pruning was more efficient when $\rho$ was low compared to $\mu$. In general, the efficiency of pruning depends on the ratio $\rho/\mu$. Intuitively, if jumping becomes expensive, we need to track more paths.