Genome-wide off-target cleavage prediction of CRISPR/Cas nucleases

In this project, we predicted the off-target cleavage (collateral effects) of CRISPR/Cas nucleases. The project responded to the growing interest in genome editing with RNA-guided nucleases (RGNs), particularly Cas9. We estimated the off-target cleavage sites with a model from statistical physics.

Model explanation

In our model, we assume that R-loop formation occurs in a linear fashion from seed to end: the first nucleotide binds (R1), then the second (R2), then the third (R3), and so on. The probability of advancing to the next state depends on the forward rate k_f. The R-loop can also unwind by one step (from R3 to R2, for instance); the probability of that depends on the backward rate k_b.

Transition Rate Matrix

The above model can be represented by a set of first-order, homogeneous, linear differential equations with constant coefficients, usually referred to as the master equation. Please see this document (chapter 2) for the precise definition of the transition rate matrix M and its eigenbasis expansion.
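As a rough illustration of what such a transition rate matrix looks like, the sketch below builds M for a linear chain so that dP/dt = M·P. The state numbering, the function name rate_matrix, the optional irreversible cleavage rate k_clv out of the last state, and the example values are assumptions made for this sketch; the precise definition used in the project is the one given in the linked document.

```python
from typing import List


def rate_matrix(k_f: List[float], k_b: List[float], k_clv: float = 0.0) -> List[List[float]]:
    """Tridiagonal rate matrix M for dP/dt = M P on a linear chain.

    k_f[i] is the rate from state i to state i + 1,
    k_b[i] is the rate from state i + 1 back to state i,
    k_clv is an (assumed) irreversible exit rate from the last state.
    M[i][j] holds the rate of flow from state j into state i.
    """
    if len(k_f) != len(k_b):
        raise ValueError("k_f and k_b must have the same length")
    n = len(k_f) + 1
    M = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        M[i + 1][i] += k_f[i]      # state i feeds state i + 1
        M[i][i] -= k_f[i]          # ...and loses the same probability flux
        M[i][i + 1] += k_b[i]      # state i + 1 feeds state i
        M[i + 1][i + 1] -= k_b[i]
    M[n - 1][n - 1] -= k_clv       # probability leaving the chain by cleavage
    return M


# Three states (R1, R2, R3) with made-up rates.
M = rate_matrix(k_f=[2.0, 3.0], k_b=[1.0, 0.5])
```

Without cleavage (k_clv = 0), every column of M sums to zero, which expresses conservation of probability.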

The transition rates were fitted with the Metropolis-Hastings algorithm using high-throughput biochemical profiling data from other labs.
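To give an idea of how such a fit works, here is a minimal sketch of a random-walk Metropolis sampler (the special case of Metropolis-Hastings with a symmetric proposal). It fits a single rate k to a handful of made-up exponential waiting times. The toy likelihood, the data, the flat prior on log k, and all function names are assumptions for illustration only; the actual fit used the biochemical profiling data and the full set of transition rates.

```python
import math
import random
from typing import Callable, List


def log_posterior(log_k: float, waiting_times: List[float]) -> float:
    """Log-posterior of log(k) for exponential waiting times, flat prior on log(k)."""
    k = math.exp(log_k)
    return len(waiting_times) * log_k - k * sum(waiting_times)


def metropolis(log_post: Callable[[float], float], start: float,
               n_steps: int, step_sd: float = 0.5, seed: int = 0) -> List[float]:
    """Random-walk Metropolis sampler in one dimension."""
    rng = random.Random(seed)
    x, lp = start, log_post(start)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_sd)   # symmetric Gaussian proposal
        lp_new = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
        samples.append(x)                        # a rejected step repeats the old value
    return samples


toy_times = [0.5, 1.0, 1.5, 2.0, 0.8, 1.2]       # fictional data
chain = metropolis(lambda u: log_posterior(u, toy_times), start=0.0, n_steps=20000)
rates = [math.exp(u) for u in chain[2000:]]      # drop burn-in, convert back to k
print(sum(rates) / len(rates))                   # posterior mean estimate of k
```

Sampling in log k keeps the rate positive without needing a boundary check in the proposal.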

Supercomputer Code Structure

Using high-performance computing clusters (HPCs, ‘supercomputers’), we applied this model to whole-genome prediction. More precisely, we calculated the average cleavage time for every possible target site of length 20. The genomic data were stored in HDF5 files.
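The calculation can be sketched on a small scale as follows. The code slides a window over a sequence and computes a mean first-passage (cleavage) time for each candidate site, using the standard recursion for a linear chain. Everything specific here is an assumption for illustration: the function names, the placeholder rates for matched and mismatched positions, the position-by-position comparison in string order, and the omission of PAM checks, the reverse strand and HDF5 input. The fitted rates from the project are not reproduced.

```python
from typing import Iterator, List, Tuple


def mean_cleavage_time(k_f: List[float], k_b: List[float]) -> float:
    """Mean time to leave a linear chain through its last forward step.

    k_f[i] is the rate from state i to i + 1 (the final entry is cleavage),
    k_b[i] is the rate from state i to i - 1 (k_b[0] has no effect).
    Uses t_i = (1 + k_b[i] * t_{i-1}) / k_f[i], the mean time to first
    advance from state i to i + 1, and sums over all states.
    """
    total, t_prev = 0.0, 0.0
    for f, b in zip(k_f, k_b):
        t_step = (1.0 + b * t_prev) / f
        total += t_step
        t_prev = t_step
    return total


def candidate_sites(sequence: str, length: int = 20) -> Iterator[Tuple[int, str]]:
    """Yield (start, site) for every window of the given length."""
    for start in range(len(sequence) - length + 1):
        yield start, sequence[start:start + length]


def site_time(guide: str, site: str, k_f: float = 1.0, k_clv: float = 1.0,
              k_b_match: float = 0.1, k_b_mismatch: float = 10.0) -> float:
    """Mean cleavage time for one site; all rate values are placeholders."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    forward = [k_f] * len(guide) + [k_clv]
    # A mismatched position is assumed to unwind faster than a matched one.
    backward = [0.0] + [k_b_match if g == s else k_b_mismatch
                        for g, s in zip(guide, site)]
    return mean_cleavage_time(forward, backward)


guide = "ACGTACGTACGTACGTACGT"                    # fictional 20-nt guide
genome = "TT" + guide + "GGACGTACGTACCTACGTACGTAA"  # fictional sequence
times = {start: site_time(guide, site) for start, site in candidate_sites(genome)}
```

Because each window is scored independently, the loop over candidate sites parallelises trivially, which is what makes a cluster useful for a whole genome.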

Model evaluation

Since the biochemical validation data from other labs was binary (cleavage or not), we had to ‘downgrade’ our model to a binary classifier by picking a threshold for the predicted cleavage time.
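One common way to pick such a threshold is to scan the candidate values and keep the one with the highest F₁ score. The sketch below does this on fictional data; the function names and the convention that a site counts as "cleaved" when its predicted time is at or below the threshold are assumptions of this example.

```python
from typing import List


def f1_at(times: List[float], cleaved: List[bool], threshold: float) -> float:
    """F1 score when sites with predicted time <= threshold are called cleaved."""
    tp = sum(1 for t, y in zip(times, cleaved) if t <= threshold and y)
    fp = sum(1 for t, y in zip(times, cleaved) if t <= threshold and not y)
    fn = sum(1 for t, y in zip(times, cleaved) if t > threshold and y)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def best_threshold(times: List[float], cleaved: List[bool]) -> float:
    """Return the predicted time that maximises F1 when used as threshold."""
    candidates = sorted(set(times))   # F1 only changes at observed values
    return max(candidates, key=lambda th: f1_at(times, cleaved, th))


# Fictional predicted cleavage times and binary validation labels.
predicted_times = [1.0, 2.0, 3.0, 10.0, 20.0]
validated = [True, True, False, True, False]
threshold = best_threshold(predicted_times, validated)
```

Only the observed predicted times need to be tried, because the confusion table cannot change between two neighbouring values.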

The figure below compares the in vitro validation set (inner ring) with the in vitro predicted set. The confusion table in the middle is calculated at the threshold that optimised the F₁ score. The figure also displays the data before the binary conversion. For a complete explanation, see this document, chapter 4.

Sawtooth PR curves

Further evaluation was performed using precision-recall (PR) curves. Interestingly, they all displayed a sawtooth pattern. We can explain this with the fictional dataset displayed below.

Pick three arbitrary consecutive cleavage rates θ₁ < θ₂ < θ₃ from the test set. In other words, each corresponds to an off-target site, and no off-target has a k_clv between them. A site is predicted to be cleaved when its k_clv exceeds the threshold θ. While θ increases within the range [θ₁;θ₂), the recall RCL(θ) cannot change because the threshold does not pass any off-targets. Only false positives can be converted to true negatives, which increases the precision PRC(θ). When θ = θ₂, both RCL(θ) and PRC(θ) drop at once, because a true positive is changed into a false negative. If θ grows further within the range [θ₂;θ₃), again only PRC(θ) is affected. The result is a sawtooth.
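The argument above can be checked numerically with a small fictional dataset. In this sketch, nine sites have predicted rates 1 to 9, and the sites with rates 3, 6 and 9 are the validated off-targets (playing the roles of θ₁, θ₂ and θ₃). The data, the function name, and the convention that a site is predicted positive when its rate exceeds θ are assumptions of the example.

```python
from typing import List, Tuple


def precision_recall(rates: List[float], off_target: List[bool],
                     theta: float) -> Tuple[float, float]:
    """Precision and recall when sites with rate > theta are predicted cleaved."""
    tp = sum(1 for r, y in zip(rates, off_target) if r > theta and y)
    fp = sum(1 for r, y in zip(rates, off_target) if r > theta and not y)
    fn = sum(1 for r, y in zip(rates, off_target) if r <= theta and y)
    precision = tp / (tp + fp) if tp + fp else 1.0   # convention for an empty prediction set
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


rates = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
off_target = [r in (3.0, 6.0, 9.0) for r in rates]

for theta in range(0, 9):
    prc, rcl = precision_recall(rates, off_target, float(theta))
    print(f"theta={theta}  PRC={prc:.3f}  RCL={rcl:.3f}")
```

For θ = 3, 4, 5 the recall stays at 2/3 while the precision climbs from 1/3 to 1/2; at θ = 6 both fall to 1/3. Plotting precision against recall therefore gives vertical rises followed by diagonal drops, which is the sawtooth.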

Want to learn more?

Please feel free to contact me with questions. I am also happy to give presentations. The complete project report can be found here.
