24 Aug 2021

Computation is the Backstage Enabler in Gene Editing

Anika Gupta, correspondent, Timmerman Report

Gene editing technologies have stirred the imaginations of scientists for close to a decade.

Many companies are aspiring to disrupt chronic care models with single-dose, curative treatments for monogenic diseases. Others see gene editing becoming an increasingly important tool for rapidly recognizing novel pathogens for pandemic response.

Emboldened by the latest clinical data from Cambridge, Mass.-based Intellia Therapeutics — which delivered a first-of-its-kind successful gene editing trial in six humans with transthyretin amyloidosis — there is increasing confidence in the scientific community that gene editing is on its way to becoming a potent and enduring treatment option for many more patients.

The rhetoric can get lofty at times. But to understand the enthusiasm requires going back to first principles.

Designing and developing a gene editing system for either diagnostic or therapeutic use involves a series of sequential, iterative steps. Computation plays an integral role at various stages along the way. Increasingly, creating an effective product relies on first amassing large amounts of data, picking up on patterns, and making decisions in the context of what needs to be optimized for.

Genomic discoveries lay the groundwork for target identification

In the post-genomic era starting in the early 2000s, when next-generation sequencing instruments made it possible to collect vast amounts of genomic data, computational biology began to provide better analyses of the emerging data sets. Software and computation have aided in the discovery of thousands of genetic variants associated with both rare and common diseases. These findings set the stage for identifying high-risk genes contributing to disease that could possibly serve as potent therapeutic targets.

Sekar Kathiresan, co-founder and CEO, Verve Therapeutics

For over 15 years, researchers such as Sekar Kathiresan, then at Massachusetts General Hospital and the Broad Institute (now at Verve Therapeutics), have tried to understand the inherited risk of, and resistance to, coronary artery disease.

Typical workflows included isolating and statistically quantifying patient DNA variation, particularly between cases and controls.

Computational approaches have now moved beyond standard correlations of genotype with phenotype to include methods that distinguish cause from correlation (e.g., Mendelian randomization), separate polygenicity from confounding (e.g., LD score regression), quantify how specific cell types or functional regions of the genome contribute to heritability (e.g., stratified LD score regression), and incorporate millions of DNA sequence variants into polygenic scoring (e.g., genome-wide polygenic scores).
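To illustrate the last of these, a genome-wide polygenic score is, at its core, a weighted sum of an individual’s risk-allele counts. The sketch below is a minimal, hypothetical example; the variant IDs, weights, and genotypes are invented, and real scores sum over millions of variants with published effect sizes:

```python
# Minimal sketch of a genome-wide polygenic score calculation.
# Variant IDs, effect weights, and genotypes below are hypothetical;
# real scores sum over millions of variants with published weights.

# Effect-size weights (e.g., log odds ratios for a disease of interest)
weights = {
    "rs0000001": 0.12,
    "rs0000002": -0.05,
    "rs0000003": 0.08,
}

# One individual's genotypes: count of risk alleles (0, 1, or 2) per variant
genotypes = {
    "rs0000001": 2,
    "rs0000002": 0,
    "rs0000003": 1,
}

def polygenic_score(weights, genotypes):
    """Weighted sum of risk-allele counts; missing genotypes are skipped."""
    return sum(w * genotypes.get(rsid, 0) for rsid, w in weights.items())

print(f"Polygenic score: {polygenic_score(weights, genotypes):.3f}")
```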

Kathiresan and others asked the question: why do some people seem to be naturally resistant to heart attack? Critically, they and others discovered that mutations in any of eight genes, all involved in the control of blood lipids, can confer resistance to heart attack. Such a resistance mutation turns off its gene in the liver, leading to lifelong low levels of one of three blood lipids (low-density lipoprotein (LDL) cholesterol, triglycerides, or lipoprotein(a)) and thereby conferring protection from heart attack.

These observations led to a therapeutic hypothesis that a medicine that mimicked these natural resistance mutations could be an effective treatment for heart attack.

Now at Cambridge, Mass.-based Verve Therapeutics, Kathiresan and team are testing that hypothesis by developing an in vivo liver gene editing medicine which would mimic the protective effect of a resistance mutation.

Verve’s first program is designed to target the PCSK9 gene in the liver. With a one-time treatment, Verve seeks to permanently turn off this cholesterol-raising gene, durably lowering blood LDL cholesterol with an ultimate goal of reducing the risk of heart attack, stroke, and death from cardiovascular disease.

Towards finding editing machinery with exquisite specificity

In minimizing the “cumulative exposure” to LDL, which begins at birth, the Verve team is aided by extensive prior pharmacology around statins that has validated the benefits of a sustained lowering of LDL in avoiding heart attacks.

However, whereas lowering LDL levels by 39 mg/dl for five years with a statin medicine reduces heart attack risk by 22%, the same reduction sustained over a lifetime (through a DNA resistance mutation) can lower heart attack risk by 88%, approaching complete protection by going after the root cause in the DNA itself.

These data highlight that lowering cumulative exposure to LDL cholesterol is a key to averting heart attack.

In their proof-of-concept study in cynomolgus monkeys, published earlier this year in Nature, the team observed near-complete knockdown of PCSK9 in the liver after a single infusion of lipid nanoparticles carrying their base editing machinery (Musunuru et al., Nature, 2021). Key to their efforts were both the on-target editing efficiency and the minimal off-target mutagenesis.

Every step in finding and testing the guide-editor pair was aided by computation. Selecting a guide involved evaluating candidate sequences within the gene that would be orthogonal to the rest of the genome (i.e., with minimal sequence overlap elsewhere) and ensuring the sequence was identical between monkeys and humans, to give higher confidence that findings from this study would translate to humans.
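As a toy illustration of the conservation filter, the sketch below scans hypothetical human and cynomolgus versions of a target region for 20-mers that are identical in both; the sequences are invented placeholders, not the actual PCSK9 locus:

```python
# Toy sketch of one guide-selection filter: keep 20-mers that appear
# identically in both the human and cynomolgus versions of the target region.
# Sequences below are invented placeholders, not the actual PCSK9 locus.

K = 20  # protospacer length

human_target = "ATGCCGTTAGCCATGGACCTGACGTTAACGGATCCATGCATTGACC"
cyno_target  = "ATGCCGTTAGCCATGGACCTGACGTTAACGGATCCATGCATTGGCC"

def kmers(seq, k=K):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Candidate guides: 20-mers present in both species' target sequences
conserved_guides = kmers(human_target) & kmers(cyno_target)

# A further (simplified) orthogonality check would count near-matches of each
# candidate elsewhere in the genome and keep only those with none.
for guide in sorted(conserved_guides):
    print(guide)
```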

In order to find a guide-editor pair with the “exquisite specificity” of interest, the team systematically evaluated all possible guides that could turn off the gene; for a single gene, there is only a finite number of changes that can be made, which keeps the search tractable.

The readout included the fraction of sequencing reads with the A-to-G change at the target site, as well as the number of genomic sequences with any level of similarity (i.e., only a few mismatches) to the candidate guide’s 20-base-pair protospacer sequence. In evaluating different pairs, they ranked the edits made by each editor on these criteria and eventually arrived at VERVE-101: a pair that showed no editing at more than 100 potential off-target sites in primary hepatocytes.
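To make that readout concrete, here is a minimal, hypothetical sketch of how candidates might be ranked: an on-target score from the fraction of reads carrying the A-to-G edit, and an off-target score counting genomic 20-mers within a couple of mismatches of the protospacer. All names, counts, and sequences are invented:

```python
# Minimal, hypothetical sketch of ranking guide-editor candidates by
# (1) on-target A-to-G editing fraction and (2) the number of genomic
# sequences within a small mismatch distance of the 20-bp protospacer.
# All numbers and sequences are invented.

def mismatches(a, b):
    return sum(x != y for x, y in zip(a, b))

def count_near_matches(protospacer, genome_kmers, max_mm=2):
    """How many genomic 20-mers lie within max_mm mismatches of the guide."""
    return sum(mismatches(protospacer, k) <= max_mm for k in genome_kmers)

# Hypothetical candidates: (name, protospacer, edited_reads, total_reads)
candidates = [
    ("guide_A", "GACGTTAACGGATCCATGCA", 620, 1000),
    ("guide_B", "TTGACCGGTACCAGTTAGCA", 700, 1000),
]

# Stand-in for a pre-computed index of genomic 20-mers (normally genome-wide)
genome_kmers = [
    "GACGTTAACGGATCCATGCA",
    "TTGACCGGTACCAGTTACCA",
    "TTGACCGGTACCAGTTAGCA",
]

ranked = []
for name, spacer, edited, total in candidates:
    on_target = edited / total
    # Subtract 1 to exclude the intended on-target site itself
    off_target_hits = count_near_matches(spacer, genome_kmers) - 1
    ranked.append((name, on_target, off_target_hits))

# Prefer high on-target efficiency, then few near-match sites
ranked.sort(key=lambda r: (-r[1], r[2]))
for name, eff, hits in ranked:
    print(f"{name}: on-target {eff:.0%}, near-match sites {hits}")
```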

Fussy editors and combinatorial challenges

Base editing relies on fusing a DNA deaminase enzyme to a Cas9 to create a single-nucleotide change in a target region of the genome. It takes advantage of Cas9’s programmability to target a specific location, but hijacks the repair process to make a focused edit on one DNA strand: inflicting a larger lesion on the complementary strand prompts our cells to “correct” that strand using the edited base as the template.

Mandana Arbab, postdoctoral fellow, David Liu Lab, Harvard University

Intuitively, the technology aims to “trick the DNA repair system into thinking the single base-edited strand is the correct one,” says Mandana Arbab, a postdoctoral fellow in David Liu’s lab at Harvard University, a pioneering group in base editing.

Key parameters researchers track to evaluate base editing machinery are purity (the fidelity of converting one base to the desired base, without unwanted byproducts), efficiency (the relative frequency of modified genotypes after cells are targeted), and bystander editing (editing of bases near the guide RNA’s target site, which is distinct from off-target editing).
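As a minimal sketch, these three metrics might be tallied from sequencing outcomes at a single target site as follows; the outcome categories and counts are hypothetical, not from any real experiment:

```python
# Minimal sketch: tallying base-editing outcome metrics from hypothetical
# sequencing read counts at one target site. Categories and counts are
# illustrative only.

reads = {
    "unedited": 620,            # wild-type reads
    "desired_A_to_G": 310,      # clean A-to-G edit at the target base only
    "bystander_edit": 50,       # additional edits at nearby bases in the window
    "other_byproduct": 20,      # indels or non-A-to-G conversions
}

total = sum(reads.values())
edited = total - reads["unedited"]

efficiency = edited / total                       # fraction of reads modified at all
purity = reads["desired_A_to_G"] / edited         # fraction of edited reads that are the clean, desired edit
bystander_rate = reads["bystander_edit"] / total  # fraction of reads with edits at neighboring bases

print(f"Efficiency:     {efficiency:.1%}")
print(f"Purity:         {purity:.1%}")
print(f"Bystander rate: {bystander_rate:.1%}")
```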

In designing and selecting which combination of deaminase and guide RNA to use, it is critical to decide what the objective is: do only a fraction of cells need to be corrected? Is sensitivity more important than specificity?

With >10 deaminases and >15 Cas proteins to choose from, trying every combination empirically is extremely resource-intensive; thus, being able to sift through the noise computationally can be invaluable. Ideally, it can cut down the time and expense of an otherwise tedious trial-and-error process and increase the probability of success in preclinical and clinical development.

Base editing clearly works, but not always as desired or expected, Arbab states. Deaminases differ in how active or processive they are (making many edits in one run) and in the motifs and/or contexts they prefer (some are very sequence-dependent while others are more agnostic), and the target sequence also affects Cas protein kinetics.

There is a logic to base editor behavior, but there are so many of them, and their higher order interactions are so complicated, that it’s usually not easy to tease apart this logic.

Design and selection of editing repertoire becomes systematic

Addressing this challenge led Arbab and collaborators to develop a machine learning-based “BE-HIVE” model (Arbab*, Shen* et al., Cell, 2020) that predicts, for one base editor at a time, which guide RNAs and target sequences it is most well-suited to edit.

Inputs to the regression-tree-based model are data from a massive, diverse library of cells, each carrying a paired guide RNA and target sequence, treated with one editor at a time. To predict the best deaminase-guide pairs for a given task, the model takes in features such as guide RNA GC content (sequence-based), guide RNA melting temperature, cell type, nucleotide characteristics (i.e., where nucleotides fall within the enzyme’s editing window and their total counts within the window), and sequence composition with motifs (which reflects where enzymes have an affinity to deaminate).
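For readers who want a feel for the general approach, here is a minimal sketch, not the BE-HIVE implementation itself, of fitting a regression-tree ensemble that maps guide and target features to observed editing efficiency; the features and data are entirely synthetic:

```python
# Minimal sketch of the general approach (not BE-HIVE itself): fit a
# regression-tree ensemble mapping guide/target features to editing
# efficiency. Features and data below are synthetic.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the kinds of features described above
X = np.column_stack([
    rng.uniform(0.2, 0.8, n),   # guide GC content
    rng.uniform(55, 75, n),     # guide melting temperature (degrees C)
    rng.integers(1, 10, n),     # position of target base within editing window
    rng.integers(0, 5, n),      # count of editable bases within the window
])

# Synthetic "observed" editing efficiency with some feature dependence + noise
y = 0.5 * X[:, 0] + 0.05 * (5 - abs(X[:, 2] - 5)) + rng.normal(0, 0.05, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
print(f"Held-out R^2: {model.score(X_test, y_test):.2f}")
```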

This breadth of data on guide-editor properties and effects provides a fertile ground on which ideal editors for a given task can be chosen. Referencing this “bible” of which sites are best targeted by which guide-editor pairs can save tremendous amounts of time and resources—rather than empirically discovering good matches, researchers can rely on trends that machine learning has picked up on.

Interestingly, Arbab and her colleague Max Shen found that they achieved similar model performance with about half of the initial sequences (6,000 instead of 12,000), indicating some redundancy in the patterns found and suggesting such approaches could scale with even less data.

Both the web tool from the team, which allows inputting features to optimize for and returns the ideal guide-editor pair, and the dataset itself are valuable resources now available to the public.

Leveraging natural patterns to optimize engineered toolkits

Spun out of Jennifer Doudna’s lab in 2017, Brisbane, Calif.-based Mammoth Biosciences is another group pioneering gene editing, using their expanded CRISPR toolkit of Cas proteins for both diagnostics and therapeutics.

However, they have a parallel effort that enables them to lean on nature’s engineering before diving into lab-based engineering. Starting with metagenomic data collected over decades from microbes, the team uses homology-based hidden Markov models to expand CRISPR diversity and unbiased methods to look for completely novel CRISPR systems in a pool of proteins.

Their initial discovery of the Cas14 protein (Harrington et al., Science, 2018) involved looking for genes located close to CRISPR arrays and clustering proteins with related features. In this way, they were able to identify potential new editing systems that are extremely compact compared to previously used systems.
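As a rough illustration of the proximity step, the sketch below flags predicted proteins whose genes sit near a CRISPR repeat array on the same contig. The coordinates and identifiers are made up, and real pipelines pair this kind of filter with gene calling, dedicated CRISPR-array detection, and HMM-based clustering:

```python
# Rough sketch of proximity-based candidate mining: flag predicted proteins
# whose genes lie near a CRISPR repeat array on the same contig. Coordinates
# and identifiers are invented for illustration.

MAX_DISTANCE = 10_000  # bp window around an array considered "CRISPR-adjacent"

# (contig, start, end) of detected CRISPR repeat arrays
crispr_arrays = [("contig_1", 14_500, 15_200), ("contig_7", 2_000, 2_600)]

# (protein_id, contig, start, end) of predicted open reading frames
orfs = [
    ("protA", "contig_1", 15_900, 17_400),
    ("protB", "contig_1", 60_000, 61_200),
    ("protC", "contig_7", 900, 1_800),
]

def near_array(orf, arrays, max_dist=MAX_DISTANCE):
    """True if the ORF lies within max_dist bp of any array on its contig."""
    _, contig, start, end = orf
    for a_contig, a_start, a_end in arrays:
        if contig != a_contig:
            continue
        gap = max(a_start - end, start - a_end, 0)  # 0 if they overlap
        if gap <= max_dist:
            return True
    return False

candidates = [orf[0] for orf in orfs if near_array(orf, crispr_arrays)]
print("CRISPR-adjacent candidate proteins:", candidates)  # ['protA', 'protC']
```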

Starting with nature’s repertoire strengthens the search process by filtering to the systems that were robust enough to make it through the funneling forces of natural selection—these are thus more likely to function well when adapted for other use cases, says Mammoth co-founder and CSO Lucas Harrington.

Computationally-guided diagnostic and therapeutic design

In parallel with its therapeutics work, the Mammoth team mobilized this past year to develop diagnostics to help combat the COVID-19 pandemic.

Lucas Harrington, co-founder and chief scientific officer, Mammoth Biosciences

When designing CRISPR editors for both therapeutics and diagnostics, data science comes in handy for learning intrinsic properties of CRISPR proteins, as well as how sensitive different positions within the ~20-base-pair guides are to mismatches.

Testing every possible sequence (4^20, roughly a trillion possibilities for 20 positions, each occupied by one of four bases) is impractical, says Harrington. However, pulling trends out of large datasets, such as how guide efficacy changes with target sequence, can prioritize optimal guides for a given editing task.

When selecting for an optimal guide RNA, the team breaks up the guide sequence, giving each section precise weights based on its effect on a variety of parameters. For diagnostics, the efficacy (how fast the guides can identify target sequences) and accuracy (detecting multiple strains of pathogens but not related sequences) of guides are key.
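A toy illustration of such position-weighted scoring appears below; the guide, target sequences, and weights are invented, and weighting the 3' half of the guide more heavily is purely an assumption for the demo:

```python
# Toy illustration of position-weighted guide scoring: mismatches in some
# regions of the guide are penalized more than others. Weights, guide, and
# target sequences are invented.

GUIDE = "GACGTTAACGGATCCATGCA"  # 20-nt guide (hypothetical)

# Per-position mismatch penalties: here, the 3' half of the guide is weighted
# more heavily than the 5' half (an assumption for this demo).
WEIGHTS = [0.5] * 10 + [1.5] * 10

def match_score(guide, target, weights):
    """Higher score = better match; each mismatch subtracts its position weight."""
    score = sum(weights)
    for g, t, w in zip(guide, target, weights):
        if g != t:
            score -= w
    return score

on_target   = "GACGTTAACGGATCCATGCA"  # intended sequence (perfect match)
related_seq = "GACGTTAACGGATCCATGGA"  # e.g., a related strain with one 3' mismatch

print("On-target score:", match_score(GUIDE, on_target, WEIGHTS))
print("Related-sequence score:", match_score(GUIDE, related_seq, WEIGHTS))
```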

The team’s goal here has been: “how can we, within a matter of weeks, spin up a new test?” says Harrington. Having both the manufacturing kits and the software to prioritize guide RNAs, given the target sequences to detect and the co-infecting pathogens to avoid cross-reacting with, has allowed for a fast response by eliminating the need to test every possible guide.

For each new variant that appears in the population, the team simply switches out the guide they use in their test.

Where therapeutics and diagnostics diverge

As they are biochemical assays, CRISPR-based diagnostics can be developed in a controlled setting: starting with an amplicon of DNA alone, with no chromatin, means minimal interference from competing reagents. As a result, researchers can build very clean training datasets by assaying thousands of guides, datasets that capture the intrinsic properties of the proteins themselves rather than confounders.

Whereas diagnostics prioritize inclusivity (guides that enable detection of all SARS-CoV-2 variants) and rapid development, safety is the foremost goal in therapeutics. Thus, guide exclusivity is key, and the team tests every possible sequence to ensure the best one is chosen. The stakes are much higher when artifacts might not only limit therapeutic efficacy but also compromise safety.

Looking ahead: the role of computation in gene editing

Gene editing as a therapy in some ways is paradigm-changing and in others fits a traditional mold. Just like the permanence of surgeries, “in some sense, making a single base pair change in the liver to improve someone’s health is molecular surgery,” says Kathiresan.

Computation is important in every key step of the journey: from identifying new possible proteins and guide sequences as components of the editing machinery to evaluating the medicines both in the lab and in the clinic. Every new technology that generates its own kind of data will require clever analyses from which to extract key design insights to maximize the efficiency of both diagnostic and drug development.

Harrington says that the biggest challenge today remains in the data itself—specifically, extending beyond small or “dirty” datasets so that biological conclusions are reliable and not artifact-driven.

We are still at the point where scientists must ask the right questions and feed in the variables that are likely to be most informative to pattern recognition models. However, there is a consensus that patterns exist in biology, and that computers will be better at extracting them due to their ability to hold more information at once than humans (who can typically hold 4-5 sequences in their mind simultaneously).

Tools like Arbab’s that serve as community resources will be increasingly useful as each team builds off shared learnings for their disease area(s) of focus. Bridging discovery with applications will allow closing the loop, enabling more informed development of gene editing tools. This will require crews of software engineers to build out databases in a way that will be “mine-able,” as well as constant communication with wet lab scientists who can flag and curate learnings and report them upstream.

There’s a palpable optimism in the field of gene editing. The potential for one-shot, curative treatments provides clear focus and ample motivation for the long term. The advanced computational tools, shared datasets, and collaborative spirit from multiple disciplines are all there. It’s an exciting time for the field, with potential benefits for human health that could arrive faster than people would have forecasted just a few years ago.
