How to Model Splice Variation in Animal Models

I frequently get asked:

How do you model splice variations in your animal model systems?

Splice variation is an important consideration in genomic analysis of patient variations and it is often overlooked (PMID: 29680930).  It is estimate that 15%–60% of human disease mutations are due to splicing defect ( PMID: 29304370). So, with close to 40% of disease causing variation likely being attributable to splicing defects, it becomes an important variation to be able to model in functional studies to determine if the variant is pathogenic.

But let’s first look at the process of splicing and what is known.

image credit (Abramowicz and Monika, J Appl Genet. 2018; 59(3): 253–268. PMID: 29680930)

This complex process is managed in a complex way.  Certain cell types will favor one form of splicing, while other tissues will select other forms. This is the natural splice isoforms variation that gives us more than just the number of genes in the genome to control biological output. In fact, this is part of the explanation for why a C. elegans nematode or a zebrafish, with roughly the same number of genes as a human, have such different levels of output complexity. Currently the number of functional isoforms in humans may be an order of magnitude more than what occurs in the nematode. Furthermore, the ways splice variation can take place gets bewildering quick.

Image credit: (Park et al., Am J Hum Genet. 2018 Jan 4; 102(1): 11–26. PMID: 29304370)

How is splicing observed in the patient?

Layer on top of this the aberrant spice variations that can cause disease, and we have a tough interpretation problem. Thankfully RNAseq is providing a huge amount of diagnostic discovery for splice variation. We can compare the splicing patterns in healthy populations with a patient suspected of a genetic disease and visualize where the splicing is going wrong (PMID:28424332).

image credit: (Cummings et al., Sci Transl Med. 2017 Apr 19;9(386):eaal5209 PMID: 28424332)

Modeling Splice Variation in Animal Models

To reduce the complexity of biology and yet bring more comparative biology relevance, often we can take a human cDNA sequence and use it to rescue the function of the animal’s version of the gene.  To do this, we use CRISPR to remove the animal’s version of the gene (a gene “knock out”).  Next we take a human cDNA sequence optimized for expression in the animal and either replace the deleted locus or express the sequence in trans (at a safe harbor site using the promoter that is either endogenous to the removed gene, or a promoter well established for appropriate tissue expression).  In C. elegans, we have been pleasantly surprised that more than half the time for orthologs of at least 30% identity, we can get significant rescue of the loss of function seen in the knock out.  In zebrafish, we have started applying the same techniques of gene replacement. The result, a set of gene humanized animals where the conservation of biology means we are looking at functional outputs that are highly similar

Missense variations are conceptually easy to model.  An amino acid change that is pathogenic (ex R235Q in STXBP1) is is installed with CRISPR using a simple donor homology that instructs the cell’s HDR to alter the DNA coding for Q (glutamine) into a code for R (Arginine) in our “wildtype” humanized locus.  

But how do we mimic a splice variation?

It is actually quite simple. We create a donor homology that makes any splice form of interest.  We are not interested in the mechanism to answer “if” it occurs – RNAseq already answers that. We are after functional consequence.  We want to answer “does a particular splice form in question have a measurable defect compared to the normal splicing.” 

Let’s look at one of the patient examples in detail.

image credit: (Cummings et al., Sci Transl Med. 2017 Apr 19;9(386):eaal5209 PMID: 28424332)

In the red we have 4 patients with a collagen gene splice defect suspected of involvement in their diagnosis for Ullrich Congenital Muscular Dystrophy. Since all persons have two copies of the COL6A1 gene, we can see that one copy is splicing normally while the other copy is defective and its splicing brings in a pseudoexon. “The resulting inclusion of 24 amino acids occurs within the N-terminal triple-helical collagenous G-X-Y repeat region of the COL6A1 gene, the disruption of which has been well established to cause dominant-negative pathogenicity in a variety of collagen disorders” (PMID: 28424332)

Creating Knock-in for Animal Model of Disease

In regards to disease modeling of splice variations, we use a cDNA rescue approach.  The variation seen in the patient is made as a plasmid coding for expression of a modified cDNA.  This cDNA contains the human gene code that is suspected of creating an aberrant spice variation. Using CRISPR techniques, the segment coding for human DNA is inserted into the genome, typically at the orthologous locus of the animal.

Modeling in the C. elegans nematode. 

To model the COL6A1, we would first seek to understand the phenotype from loss of function of the animal’s ortholog version of the human gene. For COL6A1, this is the C16E9.1 gene in C. elegans. This gene is not well studied in the nematode, but does show high expression in the alternative life state of dauer.

The first step is to make a gene knock-out to remove the C16E9.1 gene from the worm genome. Next, a series of functional assays are run to determine if a functional defect can be detected for the C16E9.1 knock-out as a loss-of-function allele. For essential genes, the ultimate manifestation of loss of function is lethality as a homozygote. In other genes critical genes will often manifest with functional defect after a battery of functional screens are performed. Once a defect in activity is observed, human cDNA can be introduced to see if rescue of function can be obtained. When rescue is obtained with human cDNA, we know we are looking at conserved biology for gene function between the animal and humans.

Once we have rescue of function, the fun begins. We can use CRISPR to put in the exact content that RNAseq indicates is occurring in the human gene. The pseudoexon seen in one copy of the patient’s chromosome pair can be made in the animal. Often if the patient variant is problematic from a loss of function perspective where haploinsufficiency drives disease. When a defect is made as a homozygote in the animal, the effect is usually a severe phentoype (often lethal) and is similar to what is seen in the gene knockout. Yet in the specific case from above with the pseudoexon in COL6A1, we are dealing with a dominant negative effect, so the defective splice not only disrupts this protein, it also causes the good copy to fail to function properly. Animals homozygous for the pseudoexon defect may actually have a less strong defect phenotype than when the animals are made as heterozygotes. Creation of the patient’s heterozygous condition is achieved by crossing the splice-variant-containing humanized animal model into the wild type humanized animal model and examining the cross progeny for defects in activity.

Modeling in the Zebrafish. 

We can do similar modeling in zebrafish using the Tol2 system. In zebrafish there is one ortholog for the COL6A1. The col6a1 zebrafish gene has 55% sequence identity and 70% sequence similarity to humans. Like the work in the C. elegans nematode, we can remove the native gene and look for functional consequences. CRISPR techniques are used to create a knockout by inserting a stop codon early in the gene. If designed right, this results in loss of all expression for col6a1. Next we can measure the functional consequence of the gene knock-out by first trying to see if the animal can be made homozygous. If it is not lethal, the animal can be screened by a battery of assays to determine if a functional defect exists. Finding either lethality as homozygote, or observing a functional defect, allows testing for capacity of human COL6A1 cDNA to rescue function. A gene insertion approach using Tol2 is used to bring in the cDNA with an appropriate tissue-specific promoter.  Rescue of function in specific tissues, for instance with the use of the 195 bp unc45b promoter for skeletal muscle expression (PMID: 27295336), will help elucidate the important roles of COL6A1 in dystrophy diseases.

The pseudoexon insertion defect seen in COL6A1 is a dominant negative variation. So, when a single copy of this gene is brought into the animal, it will have the capacity to suppress the activity of the unmodified copy of the gene.  By inserting the cDNA with the patient into a safe harbor site we create a pseudo heterozygote.  The dosage of the cDNA comes from two chromosomal positions while the wildtype locus provides expression of two copies of the normal gene.  If the cDNA is dominant negative on its effect on the zebrafish gene, then defect of gene function will manifest.

Recap of Splice defect Modeling in Animal Models

In summary, the ability to model splice variants is done from a cDNA level. A modified cDNA rescue construct containing the human gene of interest is designed in three forms:

Positive Control (blue):  The humanized wildtype cDNA provides a reference of the normal gene seen in healthy individuals.  

Negative Control (red): A knockout deletion of the animal’s gene provides reference for full loss of function of the gene. 

Test (yellow): A variant is tested for its functional activity. A range of activities is expected and depends on the pathogenic variant’s mechanistic role in disease pathology.  It may be a dominant negative that creates a pathology worse than the loss of function allele because it binds to and causes bad behavior from the remaining good copy of the gene.  Alternatively, the variant may cause loss of function. This will be either recessive and manifest as a homozygous, or it will be dominant and manifest by haploinsufficiency as a heterozygote.  Finally, the variant of interest may cause a gain of function, which is typically manifest only the heterozygote.

Arg…What up with that?!! Arginine is Enriched in Pathogenic Variants

You know when that hunch seems to get reinforced over and over again, then your mind starts speculating it as a fact.

!Danger! Will Robinson… it’s time for a serious fact check.

My hunch was that the amino acid arginine (Aka: “Arg” or “R) seems to be showing frequent association with pathogenicity. It started with the observation that many of the established pathogenic variants in the coding sequence of STXBP1 seem to involve a preference for arginine. Extracting from ClinVar for missense that are pathogenic and likely pathogenic gives the following table:

Indeed arginine (R) is disproportionately represented. Assuming all amino acids as equals, then there should be 4.3 for each amino acid. Disproportionally low are things that make sense. Like methionine (M), only one codon (ATG) instructs for insertion of this amino acid in a sequence. Similarly tryptophan (W) also has only one codon (TGG). These two amino acids should be represented below the average. A little bit oddly, we have similar low levels from lysine (K), phenylalanine (F) and glutamate (Q) who each have two codons. If codon dosage was key to variant proportioning, then these should have been seen at least 2x more than M and W, so perhaps something more than codon dosage mediates amino acid choice in creating pathogenic variations.

Arginine has 6 codons which still could drive its outsized proportion in the graph. Yet Serine (S) and Leucine (L) also have 6 codons. But respectively they are at 7 and 3 for being involved in pathogenicity. Only mighty arginine accounts for 13 of the 43 pathogenic variants in STXBP1 (30%). Tempering my enthusiasm is the observation that for 3 amino acid positions R292, R406 and R451, we have multiple changes being called pathogenic. Yet no other amino acid in the STXBP1 pathogenics has this changling capacity, so why is it that arginine is at high proportion in the assigned pathogenics – perhaps it is just a consequence of a biased investigator focus specific to STXBP1 and they fixed their gaze onto the repeating de novo clinical variants at positions 292, 406 and 451.

Is arginine involved in fragility elsewhere in the genome?

To normalize for possible investigator bias and find a method that can be applied to other portions of the genome, I took advantage of the Ensembl database to list and rank a gene’s codon sequence variants by bioinformatics analysis. Ranking on CADD was used to list protein coding sequence variations by their severity.

Ensembl allows us to identify which variations are theoretically likely to be disruptive of protein function. The choice to rank by CADD (stand for Combined Annotation-Dependent Depletion) allows us to use a sophisticated algorithm that avoids investigator bias because it intentionally avoids using “known” pathogenicity databases when it creates it ranking. A key test is to see if CADD can independently observe the pathogenicity known to exist in STXBP1. To construct the test, we compare the top scoring CADD variants with the lowest scoring CADD variants.

With CADD, we get an independent call for possible pathogenicity that still picks up what you might expect. Nearly half the calls in the Top-30 CADD pull up known pathogenicity and no benign calls are found. In the Bottom-30 CADD we get one known benign call and no pathogenics.

Healthy population data also is consistent. STXBP1 is autosomal dominant. That means you only need one of your two chromosomal copies to be defective and disease will occur. Selection pressure has been very tight on autosomal dominant genes. Variants in healthy population cannot occur at higher than the known frequency of the disease in the population. Published frequency in STXBP1 for causing early-infantile epileptic encephalopathy is 1/90,000. The largest healthy population database is in GnomAD. At 141,456 individuals, and the fact that STXBP1 needs to distribute across at least 43 pathogenic alleles, the likeliness of even one pathogenic variant being in healthy populations is pretty close to zero. Some of our Top-30 CADD have 1x or more frequency in healthy populations. Most of these are unassigned. For these unassigned that are seen at 1x or more, the disease frequency argument strongly implicates that they are benign variants.

So the CADD is not perfect, the top scoring hits are a mix of known pathogenic and probably benign. But the bottom scoring CADD seems to be more efficient at pulling out benign. In the Bottom-30 CADD, only one variant, I271V, is labeled Likely Benign by ClinVar, yet nearly everyone of these alleles (27 of 30) is seen in healthy populations, so they too are probably benign.

At this point in the analysis, we can pinpoint an anomaly. Y264C is labeled in ClinVar as a Likely Pathogenic. But from the population frequency argument, this assignment is highly unlikely. Y264C has been observed to occur in healthy populations. So a a bare minimum, it should be downgraded to a VUS, but probably be called a Likely Benign for causing early-infantile epileptic encephalopathy.

Finding Arginine-associated Fragility Throughout the Genome

This top-30 / bottom-30 approach was applied to a large set of genes. As a form of internal control, we add isoleucine (I) in the screen. With less conviction, I have felt this amino acid was associating with benign variants. If true, it should show an enrichment in the Bottom 30 CADD scores. So in my gene set experiment, I measured 4 bins. 2 bins for how many arginine and isoleucine in the Top 30 and 2 bins for how many arginine and isoleucine in the Bottom 30.

30% of top 30 CADD scoring variants contain arginine???!!!

An assumption of even distribution of amino acids, combined with an even more absurd assumption of an average 3.05 codons per amino acid, gives us 4.3% as average amino acid fraction per each 30 (dashed line). Arginine is 7.2x more than this average number. Yet, we need to account for the fact arginine uses about 2x more than the average codon usage. A a result Arginine bias in the Top 30 is about 3.5x more than expected. For isoleucine, the enrichment in the bottom 30 appears to be about 2x more than expected.

Test dataset – 30% arginine in Top-30 CADD prevails

The noisiest data in the Top-30 CADD appears to be the Arginine data. A cumulative trending plot was used to see how many genes were need before the trend to 30% becomes apparent. After assessing 7 genes the trend starts to stabilize. A new set of 7 genes were chosen. This time the genes were chosen from the Undiagnosed Disease Network (UDN). The UDN recently listed 54 genes as in desperate need for animal modeling to provide gene function studies. A sub-selection of these were identified as having good sequence similarity to genes in the animal models which we hold dear to our heart and expertise (zebrafish and C. elegans). The Top-30 / Bottom-30 CADD selection was applied to these genes and plotted for Arg and Leu enrichment. 30% prevails for arginine – it occurs at least 3.5x more than expected for being the top CADD variants as hypersensitive to substitution.

This all assumes that the representation of amino acids is uniform across all proteins. But the that is a reach. Louis Gross at University of Tennessee, Knoxville, has observed the amino acid distribution in vertebrates has some anomalies.

Most notable anomaly is arginine. 6 codons are use by arginine, but the observed frequency is low at 4.2%. To illustrate how low, they calculated the expected frequency for each amino acid biasing only for the GC richness of vertebrate genomes.

The expected frequency for arginine is quite high at about 10.5% due to its GC richness in its codons. Yet the actual observed frequency is quite low at about 4%. Based on this observed frequency, we bounce back – we now assess that we are observing arginine in the top 30 at 8x more than expected. No explanation for the anomaly and it just became more pronounced!

Taking a different approach, we can ask what percentage of ALL known pathogenic and likely pathogenic variants in a gene involve arginine substitution. 7 genes analyzed and we get the same 30% for arginine. Yet the calculations are that it should be below 4%. 8x more than expected prevails.

Are your arginines special too?

This analysis has uncovered a unique phenomenon. It appear everyone’s arginines are special. Exactly why arginine has this special status is not entirely clear. It is highly likely arginine has been strongly selected against its random incorporation during evolution. As a result of this strong negative selection (much more than what is happening for all other amino acids), arginine’s frequency in all proteins is much lower than predicted. The observed pathogenic sensitivity may be a read out of this hyperselectivity of evolution. Basically, arginine’s use in any given protein is very particular. A possible driver for this is arginine’s amazing capacity to bring high order to neighboring side chains in most protein structures. When it is gone, chaos reigns. When it is introduced where it should not be, chaos still reigns.

Arginine is special. I suggest we need to ditch Douglas Adam’s “42”.

Instead, we make like a pirate and just say:


VUS at 44% in ClinVar Assessments and Growing

How prevalent are Variants of Uncertain Significance?

ClinVar database for variant interpretation was analyzed for its levels of ACMG-AMP assessments. With help from the data dumps from ClinVar Miner, the yearly distribution of assessments was plotted. Since 2016 and shortly after the ACMG-AMP guidelines came out in 2015, the number of assessments assigned to the VUS category has grown rapidly. These are the variants that clinical genetics researchers have examined, but cannot decide if they are pathogenic or not.

How big will the VUS problem get?

To estimate how large the VUS problem will become, we must first understand how big is the human genome. Controversy abounds, but current estimate are there are 21,306 protein coding genes and 21,856 non-coding genes. To be conservative, and for simplicity sake, let us use 20,000 genes as the number. The next question is how many of these are disease associated. When we look to ClinVar the number of “genes with variants specific to one protein-coding gene” we get 7221 genes. More conservatively, we can look to ClinVar’s “gene_condition_source_id” which list 4242 genes as being associated with a diagnostic condition. This lower number is reinforced by OMIM in which the “Total number of genes with phenotype-causing mutation” is 4162 genes. These list have been growing rather steady at 5% per year, so in a few years the likely number of gene-disease associations will probably approach 5000 genes, or roughly 1/4 the human genome.

VUS problem may eventually approach 7 Million variants

A recent attempt to preload the human genome with pathogenicity assessment potential has been made. InterVar database applied ACMG-AMP guidelines to ~80,000,000 amino acid positions in the genome to provide a database for easier variant interpretation. Since at least 20% of these positions are likely to be in genes with known disease association, there are roughly 16,000,000 variants that will eventually occur in patient-derived genome sequencing. If the current trend of 44% VUS translates across that number, then there will be close to 7,000,000 variants in need of functional studies to resolve their pathogenicity.

A novel animal model systems for rapid variant interpretation

The team at Nemametrix just produced a wonderful set of preliminary data that we showed at the recent American Society of Human Genetics. It shows it is possible to use a training set of known benign and pathogenic alleles in a gene to “teach” a ML algorithm to determine if pathogenicity is present in a VUS. When applied to the STXBP1 gene, a set of 5 benign and 5 pathogenic was sufficient to train for segregation in an LDA plot and the Y75C was assessed as pathogenic.

Once this type of system is trained with a set of known pathogenic and benign variants, the assessment of pathogenicity can be achieved in a soon as 10 days from start of a VUS transgenesis project.

Total Domination – Uncovering the Phenomenon of 1:2 Dominant vs Recessive ratio for Variation in the Genome

In a prior blog post, the presence of dominant alleles in my genome gave me pause when trying to interpret the data from sequencing my DNA. Dominant alleles can be the cause disease when only one pathogenic variation occurs in only one gene copy of the chromosome pair. Contrast this to a recessive allele where you must get a defect in both chromosome copies of the gene to cause disease. In the recessive condition, if you only have one defective copy, you can expect to remain healthy, but you are a carrier of a disease allele. With the lack of immediate consequence to being a carrier status, many more individuals should be walking around with variations that are recessive towards disease. In fact, the CFTR gene variation (p.Arg117His) for Cystic Fibrosis that was highlighted for me in my Veritas Genomic sequencing report is quite common. It occurs globally at 1 per 2,500 persons, and that increases close 1 per 1,000 for northern europeans, which is a dominant portion of my ancestral genomic composition. In contrast, the CACNA1S variant (p.Arg419His) that most concerns me in my genome, has a prevalence of 1 in 25,000. Thats of low enough to be Rare Disease in Europe, but still probably way to high for disease manifestation rates.

Rare domination in CACNA1S needs to be rare enough to cause Hypokalemic Periodic Paralysis.

Dominant disease causality with the Arg419His variation in CACNA1S is unlikely because it is too frequent for the 1 per 100,000 population frequency for the disease of Hypokalemic Periodic Paralysis. Yet there are two variations known to be causative in CACNA1S, Arg528His and Arg1239His. Arg528His occurs at close to 1/100,000, while Arg1239His has yet to be detected in healthy populations. Clearly the Arg11239His is low enough population frequency to be causative for Hypokalemic Periodic Paralysis. Yet for my Arg419His, the frequency is too high for it to be causative. A variant effect that is Autosomal Dominant (AD) is extremely unlikely for my lone Arg419His allele.

If dominant alleles need to be rare in the population, how frequent is dominant status for variants of a disease?

The frequency of Autosomal Dominance (AD) for any given disease gene appears to be quite high. It is estimated that there are about 7000 Rare Diseases. If we assume the On-line Mendelian Inheritance in Man (OMIM) already represents most of these genes, then rare disease variants will map to the 4346 gene entries in OMIM with published allelic variations. Next, I listed these variations in blocks of 100 to reveals the number of genes for which they are known to exclusively Autosomal Dominant (AD) or Autosomal Recessive (AR), or some kind of hybrid.

When one runs down the inheritance pattern and tabulates them per gene, the first 100 variants have about twice as many genes in the AR category when compared to the AD category.

Running thru the another 400 more variants in the 100 variant blocks shows the trend continues – Dominance of a genetic conditions occurs for about 1/3rd of the disease genome.

Axiom for the individual : “I am not very dominating, but there are lots out there who are.”

So at the individual basis, it appears the AD status of pathogenic or likely pathogenic variants in your genome is very rare. Yet, at a population level, a large proportion of Rare Disease is caused by Autosomal Dominant variation. Rare disease calculate to occur at about 1 per 15 persons. So, for about 1 in 50 (150 million persons), their disease casing variation is likely to be Autosomal Dominant.

What is hot – or not??? – Looking at Hotspots and Coldspots of Pathogenicity in the STXBP1 Gene

In today genomic medicine era, it remains challenging to understand the functional consequence of a gene variant’s contribution towards disease. Guilt by association is one of the criteria upon which a new variant is judged. We can look at healthy populations data and compare it to established Pathogenic and Likely Pathogenic variants.  This helps us understand if a new variant may have a propensity to cause disease. The thought is that if a new variant is occurring at a region previously established as causing pathogenicity, then the new variant may be pathogenic too (ACMG guideline: PM1 “moderate” assessment criteria).

Is my variant guilty of pathogenicity because of its proximity to a pathogenicity hotspot?

In the image above, we see that there are hotspots (red) and coldspots (blue) for pathogenicity in STXBP1.  The hotspot values were generated from the known Pathogenic and Likely-Pathogenic listed in ClinVar.  The coldspot values (highMAF) come from variants seen in healthy populations. In yellow we have Variants of Uncertain Significance (VUS). Intensity of the peak is a measure of both how many times different variations are seen at an amino acid position and if their nearest neighbors have the same assignment. This plot suggest there are spots in STXBP1 that can tolerate sequence diversity (blue bars) and spots where a hit leads to pathogenic behavior (red bars).  Further, the VUS are landing in both red bar and blue bar regions. Perhaps we can consider VUS to be either pathogenic or benign by this association? Yet, there is a critical assumption that leads to a question: How legitimate is it that every variant in healthy populations (“highMAF”) is ASSUMED to be benign?

2,504 healthy population genomes – Calculating the rare variants in each person

To dig into the validity (or invalidity) of this assumption, we can look to a large population study and ask how many times do we see variation and what are their types. The 1000 Genomes Project Consortium shows an average person has about 4,500,000 million variations. Of these, about 100,000 are somewhat rare because they are seen in less than 1/200 persons (<0.005 MAF). The even more rare “singletons” of the study occur at a frequency of 1 per 2504 persons. This restriction gives us about 10,000 more rare variations to think about per each person. Yet, to get even more rare and be able to ask the question how many variants per person meet the 1 per 200,000 USA definition for Rare Disease frequency, the study size would need to be 100x bigger. Nevertheless, we have interesting data reported in the 1000 Genomes study on healthy population variants that are also seen as pathogenic in Human Gene Mutation Database (HGMD) and ClinVar datasets. Filtering the observed path in healthy population as frequency per individual, every person can expect to harbor 20-25 variants of established pathogenicity.

A larger study by Karczewski et al. 2019 is approaching the scale need for assessing Rare Disease.  A dataset of 141,456 human genomes (125,748 exomes and 15,708 genomes) was harvested from the wildtype controls used in various disease studies. The exomes observe variation mostly in the coding sequence of a gene, while the genomes record variant information across the gene (coding + upstream/downstream/introns). The result is a deeper measure of the frequency of missense variation that approaches the 1 in 200,000 genomes needed for Rare Disease designation.  Currently the National Organization for Rare Disease (NORD) list 1258 disease in their database. STXBP1 cross references to two of these (Dravet and West Syndromes).  Both of these syndromes each have a support group, which are two of the 283 total family foundation groups that are listed in the NORD member list.

Yet the situation for Rare Disease is larger.  In the NIH’s Genetics and Rare Disease (GARD), there are 6264 unique genetic diseases listed. This suggest there are thousands of genes for which we can expect to have gene variant issues leading to disease. ClinVar currently list 7046 is the number of “Genes with variants specific to one protein-coding gene.”  Basically it appears that a third of your 20,000 protein coding genes could take a hit that increases your risk or likeliness of coming down with genetic disease symptoms. 

The GARD lists an intriguing statistics that 20-25 Americans are living with Rare Disease.  The USA’s current population is 327.2 Million, so roughly 1 in 15 individuals world wide are probably living with rare disease. Assuming monogenic cause, then at least 51 million pathogenic might residing in the human population.  Add polygenic burden and the number may be a multiple (100, 150, 200, 250….??) for variants associated with disease currently being experienced today. Guilt by association to hotspots and coldspots might provide some answer, but functional studies are the more definitive proof, and +50 million is a lot of animal models to build!!

Rare and Not-So-Rare – Finding 100 Impactful Targets for Modeling Disease-Gene Associations in Alternative Animals

What genes are good candidates for alternative animal modeling?

I set out to determine which important disease genes are good candidates for creating animal models in C. elegans. The first step was to turn to a database that has a comprehensive listing of human genes and their disease association. The DisGeNet database has nearly every human gene annotated for its level of disease association (17,549 genes as of June 2019). They provide a curated list that has 8400 genes with Gene-Disease Association (GDA) score of 0.1 or higher. For the top 1000 genes the GDA scores are 0.69 or higher, which indicates they scored high for having a significant disease association. These top 1000 were selected for examination of their ortholog status in C. elegans using the Diopt database. 749 othologies were detected, of which 411 had clear reciprocal nature (back-blast gives starting gene for the ortholog as best hit). The top 100 of these genes for high homology and detectable loss-of-function consequence were selected.

Tabulation of disease-associated genes with properties favorable for C. elegans humanization

The top 100 are tabulated in gene-alphabetical format below. These 100 genes have 8360 variants as known to be as problematic (Path, Likely Path, or VUS).

Use a search tool to quickly find out if your favorite gene occurs below.

(Note: gene knock out for 58% of these genes results in lethality.)

Human geneDisease associationNematode gene (LOF)Problem variants
ABCB1Colchicine resistance; Inflammatory bowel diseasepgp-9 (development)3
ABCB6Dyschromatosis universalis hereditaria; Microphthalmia; Pseudohyperkalemiahmt-1 (development)11
ABCC1Peripheral Neuropathymrp-1 (development)38
ACTG1Baraitser-Winter syndrome 2; Deafness, autosomal dominant 20/26act-4 (lethal)57
ADAAdenosine deaminase deficiency; Severe combined immunodeficiency C06G3.5 (development)81
ADAM10Reticulate acropigmentation of Kitamura; Alzheimer diseasesup-17 (lethal)5
AGO2Alcoholic Intoxication, Chronicalg-1 (lethal)1
AHRRetinitis pigmentosa; Malignant neoplasmahr-1 (development)2
ALDH2Alcoholic Intoxication, Chronicalh-1 (lethal)3
APEX1Malignant neoplasmexo-3 (development)2
ATG5Spinocerebellar ataxiaatg-5 (morphology)2
BMS1Aplasia cutis congenitaY61A9LA.10 (lethal)1
CALRschizoaffective disordercrt-1 (lethal)1
CATtype 2 diabetes mellitusctl-1 (development)2
CDC42Takenouchi-Kosaki syndromecdc-42 (lethal)10
CHEK1Malignant neoplasmchk-1 (lethal)3
CIB2Deafness, autosomal recessive; Usher syndromecalm-1 (morphology)11
CTSBKeratolytic winter erythemaF57F5.1 (lethal)0
CTSDCeroid lipofuscinosis, neuronal, 10asp-4 (morphology)103
DECR1SchizophreniaF53C11.3 (morphology)1
EIF4EAutismife-3 (lethal)0
ENO1Enolase deficiencyenol-1 (lethal)0
EPHX1Hypercholanemia; Malignant neoplasm; W01A11.1 (development)3
ERCC1Cerebrooculofacioskeletal syndromeercc-1 (lethal)17
ERCC2Cerebrooculofacioskeletal syndrome 2; Trichothiodystrophy 1, photosensitive; Xeroderma pigmentosum, group Dxpd-1 (lethal)72
ERCC3Trichothiodystrophy 2, photosensitive; Xeroderma pigmentosum, group Bxpb-1 (lethal)33
FASNObesity diseasefasn-1 (lethal)83
FHFumarase deficiency; Leiomyomatosis and renal cell cancerfum-1 (lethal)1830
G6PDHemolytic anemia, G6PD deficient (favism); Resistance to malaria due to G6PD deficiencygspd-1 (lethal)91
GAD1Cerebral palsy, spastic quadriplegic, 1 unc-25 (movement)35
GAPDHhepatocellular carcinomagpd-2 (lethal)0
GCH1Dystonia, DOPA-responsive, with or without hyperphenylalaninemia; Hyperphenylalaninemia, BH4-deficient, Bcat-4 (movement)76
GGT1Glutathioninuria; chronic hepatitis BH14N18.4 (development)1
GNA12ulcerative colitisgpa-12 (lethal)2
GPIHemolytic anemia, nonspherocytic, due to glucose phosphate isomerase deficiency gpi-1 (development)11
GPTnon-alcoholic fatty liver diseaseC32F10.8 (lethal)46
GSK3Bbipolar disordergsk-3 (lethal)0
GSRHemolytic anemia due to glutathione reductase deficiency gsr-1 (lethal)0
HCCSMicrophthalmoscchl-1 (lethal)8
HDAC2chronic obstructive Airway Diseasehda-1 (lethal)1
HPRT1 HPRT-related gout; Lesch-Nyhan syndromehprt-1 (morphology)55
HRASBladder cancer, somatic; Congenital myopathy with excess of muscle spindles; Costello syndrome; Nevus sebaceous or woolly hair nevus, somatic; Schimmelpenning-Feuerstein-Mims syndrome, somatic mosaic; Spitz nevus or nevus spilus, somatic; Thyroid carcinoma, follicular, somaticlet-60 (lethal)81
HSP90AA1Breast Cancerhsp-90 (lethal)0
HSPA4bipolar disorderhsp-110 (lethal)0
HSPA5hepatocellular carcinomahsp-3 (lethal)0
HSPA9Anemia, sideroblastic, 4; Even-plus syndromehsp-6 (lethal)6
HSPD1Leukodystrophy, hypomyelinating, 4; Spastic paraplegia 13, autosomal dominanthsp-60 (lethal)35
IDH1Glioma, susceptibility to, somaticidh-1 (development)3
ILKcardiomyopathypat-4 (lethal)33
ISYNA1Malignant neoplasminos-1 (development)0
ITPR1Gillespie syndrome; Spinocerebellar ataxiaitr-1 (lethal)139
MAP2K1Cardiofaciocutaneous syndrome 3mek-2 (lethal)93
MAP2K7Malignant neoplasmmek-1 (development)2
MAPK1gastric carcinogenesismpk-1 (lethal)2
MAPK14schizophreniapmk-1 (lethal)1
MFN2Charcot-Marie-Tooth disease; Hereditary motor and sensory neuropathyfzo-1 (development)202
MRE11Ataxia-telangiectasia-like disordermre-11 (development)515
MSH2Colorectal cancer; Mismatch repair cancer syndrome; Muir-Torre syndromemsh-2 (development)1905
MTHFRHomocystinuria; Neural tube defects; Schizophrenia; Thromboembolism; Vascular diseasemthf-1 (development)212
MTORFocal cortical dysplasia; Smith-Kingsmore syndromelet-363 (lethal)65
MTRHomocystinuria-megaloblastic anemia, cblG complementation type; Neural tube defects, folate-sensitive, susceptibility tometr-1 (development)200
NME1Neuroblastomandk-1 (lethal)144
NT5C2Spastic paraplegiaY71H10B.1 (lethal)14
ODC1Colonic adenoma recurrenceodc-1 (fecundity)5
P4HBCole-Carpenter syndrome 1 pdi-2 (lethal)2
PCPyruvate carboxylase deficiencypyc-1 (development)329
PCNAAtaxia-telangiectasia-like disorder 2 pcn-1 (lethal)1
PCYT1ASpondylometaphyseal dysplasiapcyt-1 (development)15
PEPDProlidase deficiencyK12C11.1 (lethal)60
PGK1Phosphoglycerate kinase 1 deficiencypgk-1 (development)28
PHGDHNeu-Laxova syndrome; Phosphoglycerate dehydrogenase deficiencyC31C9.2 (development)42
PLK1Neoplasmsplk-1 (lethal)0
PNPImmunodeficiencyK02D7.1 (movement)46
PNPLA2Neutral lipid storage disease with myopathyatgl-1 (lethal)66
PPP3CAArthrogryposis, cleft palate, craniosynostosis, and impaired intellectual development; Epileptic encephalopathy, infantile or early childhood, 1tax-6 (movement)8
PSEN1Acne inversa; Alzheimer disease; Cardiomyopathy, dilated; Dementia, frontotempora; Pick diseasesel-12 (development)134
PTDSS1Lenz-Majewski hyperostotic dwarfismpssy-1 (morphology)6
RAD51Fanconi anemia, complementation group R; Mirror movements 2; Breast cancer, susceptibility torad-51 (lethal)11
RAP1AKabuki syndromerap-1 (lethal)0
RPS19Diamond-Blackfan anemia 1rps-19 (lethal)37
SDHBGastrointestinal stromal tumor; Paraganglioma and gastric stromal sarcoma; Paragangliomas 4; Pheochromocytomasdhb-1 (lethal)276
SDHCGastrointestinal stromal tumor; Paragangliomasmev-1 (lethal)134
SDHDMitochondrial complex II deficiency; Paragangliomas; Pheochromocytomasdhd-1 (development)146
SFRP1Narcolepsysfrp-1 (morphology)1
SLC6A2Orthostatic intolerancedat-1 (movement)50
SMARCA1brain malformationisw-1 (lethal)1
SMARCA2Nicolaides-Baraitser syndromeswsn-4 (lethal)90
SMARCB1Coffin-Siris syndrome; Rhabdoid tumors; Schwannomatosissnfc-5 (lethal)93
SMC1ACornelia de Lange syndrom3him-1 (lethal)110
SMC3Cornelia de Lange syndromesmc-3 (lethal)75
SOD1Amyotrophic lateral sclerosis 1 sod-1 (development)43
SOD2Cardiomyopathy, Dilatedsod-3 (development)2
TATTyrosinemiatatn-1 (lethal)59
TBPSpinocerebellar ataxia; Parkinson diseasetbp-1 (lethal)3
TIMM8AMohr-Tranebjaerg syndromeddp-1 (development)19
TYMSColorectal Carcinomatyms-1 (lethal)2
USO1Malignant neoplasmuso-1 (morphology)0
VCPAmyotrophic lateral sclerosis 14, with or without frontotemporal dementia; Charcot-Marie-Tooth disease, type 2Y; Inclusion body myopathy with early-onset Paget disease and frontotemporal dementia 1cdc-48.2 (lethal)77
WLSBone densitymig-14 (lethal)31
XPR1Basal ganglia calcificationY39A1A.22 (development)5

Dealing with Uncertain Significance of the Genome

“Risky Business”

Most of us are wired to be risk averse. Yet, I have been giving serious contemplation to the “risky business” of having my whole genome be sequenced as a preventative medicine approach to my healthcare.

What will happen if go there?

I find myself staring into the murky abyss from the edge of the data cliff. A creepy feeling urges in my belly from the depths of my ski-bum days… Should I….


photocredit: Bradly J. Boner for Jackson Hole Magazine

Upon landing, I will need “spoon-my-tracks” to get the data to be interpretable and informative as possible.

photocredit: James Fagedes at Foothill freak

Mental Health

There is plenty of reason to be cautious. A recent publication by the Hasting Center report urges a high level of caution (Johnston 2018).  Although we can be somewhat dismissive of their dismissiveness –  the cost of whole genome sequencing is dramatically dropping and phenotyping technology is rapidly improving – one observation is likely to hold true for a while:

“Given the psychosocial costs of predicting one’s own or one’s child’s future life plans based on uncertain [Genomic] testing results, we think the hope and optimism deserve to be tempered.”

So it is clear there will be quite a bit of uncertainty when one opens the Pandora’s box of the genome, but hope and optimism will remain. Whether there is clearly actionable results, or frustrating uncertainty, the knowledge gained means there are things to be done, platforms to build, and cures to be discovered.

Not If, but When

Caution keeps us safe. But safe for how long? By not “going there” we might be just deluding our selves from the inevitable.  At some point, we will have a deep understanding of the consequence of genome variation.  The first to fall into line will be variants delivering functional consequence on the monogenic side of the spectrum.  These will be the easiest to model and uncover biological consequence because the variant will have clear and sometimes deterministic output on life quality and healthspan. More challenging will be the variants whose effects predominate in polygenic contexts.  These are the more subtle “risk factor” effects where the other variants in one’s genome are the influencers that either enhance or suppress the capacity of the risk factor variant to manifest.  Adding to the challenge of understanding a risk factor is the influence of external factors, such as diet and exercise. Or the more internal factors such as genomic imprinting and gene methylation status.

Yet it is clear where we are heading. Much of the uncertainty will be resolved and we will soon be living in the genome-actionable era where medicine becomes highly personalized to the individual’s variant profile.  For a glimpse of what the future holds, and if you can make the time for an amazing Rob Reid interview of Dr. Robert Green from Harvard, put the headphones on for the following podcast:


1. Johnston J et al. Sequencing Newborns: A Call for Nuanced Use of Genomic Technologies. Hastings Cent Rep. 2018 Jul-Aug;48(2):S2-6.

C. elegans as Fast and Affordable System for Variant Phenotyping

Systems for Functional Studies

A variety of modeling systems can be used to explore variant function. Initially, many researchers turn to a computational approach to aid variant assessments (Eilbeck 2017). A recent bioinformatics study was used to refine the variant classification of voltage-gated sodium channels (KCNQs) for their contributions to epilepsy (Hol 2017). Yet many of the KCNQ variants remained VUS alleles after the bioinformatic refinement, so alternative functional assays are needed to capture full functional assessment. Various techniques applied for obtaining functional data from clinical variants range from bacterial and yeast expression systems to mammalian cell models (Rodríguez-Escudero 2015, Woods 2016). In one example, bacterial expression of recombinant USP6 was used to detect pathological enzymatic activity in one of 18 clinical variants (liu 2016). Recombinant protein assays can be effective for enzymatic genes, but for disease genes with complex interaction phenotypes, expression in bacteria or yeast removes the gene from its native context and prevents the exploration of pathologies that involve these complex interactions. A mouse or rat animal model is the “gold standard” for finding clear phenotypes from a clinical variant, yet the cost and time spent are more than 10x relative to C. elegans animal models. Further, humanized rodent models are expensive to deploy early in drug development (McGonigle 2014). Disease modeling with Induced Pluripotent Stem Cells (iPSC) offers an exciting platform to study clinical variants (Csöbönyeiová 2016), but the removal of the cells from their native context of the intact animal regrettably removes the important effect of a tissue-based environment. 3D cell culturing techniques and organs-on-a-chip can be useful in restoring proper microenvironment context (Breslin 2013), but the ease of use for routine analysis clinical variants has yet to evolve.

Alternative Animal Models

An emerging approach used by the Undiagnosed Disease Network (Wang 2017), a variant observed in humans is homology modeled by installing the same amino acid change at a conserved position in the disease gene homologs. For instance, clinical variants in CACNA1A were installed in zebrafish and Drosophila and pathogenic activities were observed (Luo 2017). In a recent publication using C. elegans, CRISPR was used to install a patient sequence variant suspect of having CLIFAHDD syndrome (Congenital Contractures of the Limbs and Face with Hypotonia and Developmental Delay) due to defects in the NALCN gene (Bend 2016). These authors demonstrated a gain-of-function pathogenicity assignment could be made for a patient V637F sequence variation. Similar results have been observed for variant installs in other C. elegans disease gene orthologs (Sorkaç 2016, Bulger 2017, Prior 2017, Canning 2018, Pierce 2018, Troulinaki 2018). These results reinforce the C. elegans animal model as a good platform for modeling gene function of pathological variants. The fast life-cycle and ease of genome engineering allow direct modeling of pathogenic alleles to occur within a one-month timeline needed to create and analyze the transgenic animals.

1. Eilbeck K et al. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017 Oct;18(10):599-612. doi: 10.1038/nrg.2017.52.

2. Hol et al. Comparison and optimization of in silico algorithms for predicting the pathogenicity of sodium channel variants in epilepsy. Epilepsia. 2017 Jul;58(7):1190-1198. doi: 10.1111/epi.13798. Epub 2017 May 18.

3. Rodríguez-Escudero I et al. Yeast-based methods to assess PTEN phosphoinositide phosphatase activity in vivo. Methods. 2015 May;77-78:172-9. doi: 10.1016/j.ymeth.2014.10.020. Epub 2014 Oct 25.

4. Woods NT et al. Functional assays provide a robust tool for the clinical annotation of genetic variants of uncertain significance. NPJ Genom Med. 2016;1. pii: 16001. doi: 10.1038/npjgenmed.2016.1. Epub 2016 Mar 2.

5. Liu YL et al. The impacts of nineteen mutations on the enzymatic activity of USP26. Gene. 2018 Jan 30;641:292-296. doi: 10.1016/j.gene.2017.10.074. Epub 2017 Oct 27.

6. McGonigle P. Animal Models of CNS Disorders. Biochemical Pharmacology 87, no. 1 (January 1, 2014): 140–49. doi:10.1016/j.bcp.2013.06.016.

7. Csöbönyeiová M et al. Recent Advances in iPSC Technologies Involving Cardiovascular and Neurodegenerative Disease Modeling. General Physiology and Biophysics 35, no. 1 (January 2016): 1–12. doi:10.4149/gpb_2015023.

8. Breslin S and O’Driscoll L. Three-Dimensional Cell Culture: The Missing Link in Drug Discovery. Drug Discovery Today 18, no. 5–6 (March 2013): 240–49. doi:10.1016/j.drudis.2012.10.003.

9. Wang J et al. MARRVEL: Integration of Human and Model Organism Genetic Resources to Facilitate Functional Annotation of the Human Genome. Am J Hum Genet. 2017 Jun 1;100(6):843-853. doi: 10.1016/j.ajhg.2017.04.010. Epub 2017 May 11.

10. Luo X et al. Clinically severe CACNA1A alleles affect synaptic function and neurodegeneration differentially. PLoS Genet. 2017 Jul 24;13(7):e1006905. doi: 10.1371/journal.pgen.1006905. eCollection 2017 Jul.

11. Bend EG et al. NALCN channelopathies: Distinguishing gain-of-function and loss-of-function mutations. Neurology. 2016 Sep 13;87(11):1131-9. doi: 10.1212/WNL.0000000000003095. Epub 2016 Aug 24.

12. Sorkaç A et al. In Vivo Modelling of ATP1A3 G316S-Induced Ataxia in C. elegans Using CRISPR/Cas9-Mediated Homologous Recombination Reveals Dominant Loss of Function Defects. PLoS One. 2016 Dec 9;11(12):e0167963. doi: 10.1371/journal.pone.0167963.

13. Bulger DA et al. Caenorhabditis elegans DAF-2 as a Model for Human Insulin Receptoropathies.G3 (Bethesda). 2017 Jan 5;7(1):257-268. doi: 10.1534/g3.116.037184.. Available at:

14. Prior H et al. Highly Efficient, Rapid and Co-CRISPR-Independent Genome Editing in Caenorhabditis elegans. G3 (Bethesda). 2017 Nov 6;7(11):3693-3698. doi: 10.1534/g3.117.300216.

15. Canning P et al. CDKL Family Kinases Have Evolved Distinct Structural Features and Ciliary Function. Cell Rep. 2018 Jan 23;22(4):885-894. doi: 10.1016/j.celrep.2017.12.083.

16. Pierce SB et al. De novo mutation in RING1 with epigenetic effects on neurodevelopment. Proc Natl Acad Sci U S A. 2018 Feb 13;115(7):1558-1563. doi: 10.1073/pnas.1721290115.

17. Troulinaki K et al. WAH-1/AIF regulates mitochondrial oxidative phosphorylation in the nematode Caenorhabditis elegans. Cell Death Discov. 2018 Jan 29;4:2. doi: 10.1038/s41420-017-0005-6.

Explosive Growth of Gene Variant Numbers

Genetic Testing

Clinical geneticists have an acute need to understand pathogenicity in genomes of their patients (Figure 1). Cost per human genome has now approached $1000 each (Wetterstrand 2018). This affordable cost is allowing clinicians to start incorporating next-generation sequencing (NGS) technology into the patient diagnosis.

Variant Diversity

The American College of Medical Genetics and Genomics and the Association for Molecular Pathology recommend that variants be classified in five groups (Pathogenic, Likely Pathogenic, Uncertain Significance, Likely Benign, Benign) (Richards 2015). Further, they suggest greater efforts are needed to resolve variants of Uncertain Significance (VUS) into either Pathogenic or Benign status. The re-classification of a VUS is no small task. In a recent publication on the analysis of the clinical variant database (ClinVar), the number of known genetic variants in human disease was reported on 9/9/2016 to be 72,472 (Manolio 2017). One year later (9/21/2017), a survey of ClinVar using an online database viewer reported 359,938 known variants (Henrie 2018). This 4.9-fold increase in the number of known clinical variants over one year reflects the explosive application of whole genome sequencing in clinical diagnostics (Stavropoulos 2015, Ellingford 2016, Volk 2017).

Need for Innovation

The daunting task now is to determine the significance of each new variant. Bioinformatics can provide some insight into variant pathogenicity (Oliver 2015). Unfortunately, the number of VUS alleles has remained close to 40% year-on-year since 2015. As of the July 2018, there are 192,843 identified VUS alleles. With such high numbers, there is a pressing need to quickly correlate genotype to phenotype and determine if VUS alleles are benign or pathogenic (Cox 2015). Model systems that reconstitute mutations in a physiological context are a robust method to demonstrate variant pathogenicity (Eilbeck 2017). Traditionally, mouse models have been used to characterize defective function in VUS alleles, but the expanding universe of clinical variants is overwhelming the current capacity. Higher throughput animal modeling is needed to address the growing demand.

1. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Accessed July 12, 2018.

2. Richards S et al. Standards and guidelines for the interpretation of sequence variants: a joint consensus recommendation of the American College of Medical Genetics and Genomics and the Association for Molecular Pathology. Genet Med. 2015 May;17(5):405-24. doi: 10.1038/gim.2015.30. Epub 2015 Mar 5.

3. Manolio TA et al. Bedside Back to Bench: Building Bridges between Basic and Clinical Genomic Research. Cell. 2017 Mar 23;169(1):6-12. doi: 10.1016/j.cell.2017.03.005.

4. Henrie A. et al. ClinVar Miner: Demonstrating utility of a Web-based tool for viewing and filtering ClinVar data. Hum Mutat. 2018 May 23. doi: 10.1002/humu.23555.

5. Stavropoulos DJ et al. Whole Genome Sequencing Expands Diagnostic Utility and Improves Clinical Management in Pediatric Medicine. NPJ Genom Med. 2016 Jan 13;1. pii: 15012. doi: 10.1038/npjgenmed.2015.12.

6. Ellingford JM et al. Whole Genome Sequencing Increases Molecular Diagnostic Yield Compared with Current Diagnostic Testing for Inherited Retinal Disease. Ophthalmology. 2016 May;123(5):1143-50. doi: 10.1016/j.ophtha.2016.01.009.

7. Volk AE and Kubisch C. The rapid evolution of molecular genetic diagnostics in neuromuscular diseases. Curr Opin Neurol. 2017 Oct;30(5):523-528. doi: 10.1097/WCO.0000000000000478.

8. Oliver GR et al. Bioinformatics for clinical next generation sequencing. Clin Chem. 2015 Jan;61(1):124-35. doi: 10.1373/clinchem.2014.224360.

9. Cox TC. Utility and limitations of animal models for the functional validation of human sequence variants. Mol Genet Genomic Med. 2015 Sep;3(5):375-82. doi: 10.1002/mgg3.167.

10. Eilbeck K et al. Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet. 2017 Oct;18(10):599-612. doi: 10.1038/nrg.2017.52.