Let’s have a COVersation about SARS-CoV-2

I’ve become a fan of sequence data, so a few weeks ago, when the initial collections of SARS-CoV-2 sequences became available (a virus with which I’m sure we are all familiar), I thought I’d do some rudimentary analyses on them.

I was then asked by a friend from Zimbabwe (Tapfumeni Mashe) if I wanted to present to some of his colleagues, where they host WhatsApp based, interactive presentations. I promptly agreed as I thought it would be a nice challenge.

Then the numbers started piling up in the group, and it filled up. The maximum number of participants in a WhatsApp group is apparently 358. This was a slightly larger audience than I anticipated, and then I was informed that policy makers from Zimbabwe were present, as well as health officials and scientists.

*Gulp*

Despite not expecting this, I did the best I could in detailing the pure, unadulterated, sublime power of genomics & sequencing, especially in light of inferences made for SARS-CoV-2, the causative agent of Covid-19.

Since it was all done in a group, preparation was easy, as I could write the whole core talk before just posting it slowly. I thought in case anyone was interested, I would post my script here:

“Discussing methods of analysis and visualisation for SARS-CoV-2 sequence data – The power of genomics in epidemiology”

Good afternoon everyone.

My name is Oliver Charity, and I work at the Quadram Institute in the United Kingdom. I am just finishing my PhD looking at genomic evolution of Salmonella Typhimurium. In my spare time I have been undertaking some SARS-CoV-2 analyses, in order to try and play my part in this terrible crisis.

This is me:

https://quadram.ac.uk/people/oliver-charity/

Today I thought that I would use this opportunity as a double-edged sword: as well as talking about SARS-CoV-2, I will try to convince you that genome sequencing is a powerful and necessary tool in biological sciences, especially in the face of a pandemic such as this.

Doing this over WhatsApp is an interesting experiment for me, so I will try and post a section, and then if anyone wants to discuss something then please say so, and we can continue afterward.

Introduction to the topic –

I think we all know that the current pandemic virus, SARS-CoV-2, which originated in Wuhan, China, has since spread globally, and official figures stand at

2,004,989 cases

126,830 deaths

But thankfully

485,362 recovered.

As far as we know from epidemiologists, the true number of cases is probably much higher, as the count is heavily biased by the number of tests carried out in each country.

Similarly, the recorded deaths are probably skewed. For example, in the UK we only count patients who died in hospital after testing positive for the Coronavirus. Other countries are measuring their deaths either from Coronavirus (Covid-19), or those comorbid with Covid-19, etc.

The disease appears to primarily affect the respiratory tract, the most common symptoms being a dry cough and fever. It seems many patients die from secondary pneumonia, both viral and bacterial, producing fluid in the lungs, or a cytokine storm due to the pneumonia which can damage multiple organs, but the mechanism is unknown.

Here is a paper on Covid-19 pathogenesis for those interested, but that is not why I am here today:

https://www.sciencedirect.com/science/article/pii/S2095177920302045

Today I am here to sell to you the power of sequencing, explain its applications, and how we are using it to fight SARS-CoV-2 (this is how I will refer to the virus, where Covid-19 is the name of the disease it causes).

SARS-CoV-2 is a type of Coronavirus. Coronaviruses have single stranded RNA (ssRNA) genomes encased in a protein and lipid capsule.

The RNA genome is positive sense, so the same orientation as a coding strand of DNA.

(As a side note I’ve always found ssRNA viruses to be interesting because they need to replicate in order to survive, and in cases such as Corona and Ebola viruses, this replication has to be done by an RNA-dependent RNA polymerase, which human cells do not have. I believe these enzymes are already widely studied as drug targets.)

Even though it contains an RNA genome, when sequencing we use reverse transcriptase to produce a DNA polymer for use in sequencing technologies. Sometimes this is not as easy as it sounds, and sequencing RNA viruses is a field of its own. Here is a paper for anyone interested:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3708773/pdf/1471-2164-14-444.pdf

So for sequencing, the RNA is usually first reverse transcribed into DNA, so it is still ‘DNA’ sequencing.
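
For anyone who likes to see ideas as code, here is a minimal Python sketch of what reverse transcription does at the level of letters. It is a toy model only: the sequence is made up, and the function names (first_strand_cdna, second_strand) are my own for illustration, not from any sequencing library.

```python
# Toy model of reverse transcription: an RNA template becomes complementary DNA (cDNA).
# The sequence below is made up for illustration; it is not from SARS-CoV-2.

RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand_cdna(rna):
    """Return the DNA strand complementary to the RNA, written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(rna.upper()))


def second_strand(cdna):
    """Return the reverse complement of a DNA strand, written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(cdna.upper()))


rna = "AUGGCUUAA"
cdna = first_strand_cdna(rna)
coding_dna = second_strand(cdna)

print(cdna)        # TTAAGCCAT
print(coding_dna)  # ATGGCTTAA (the RNA sequence with U written as T)
```

The second strand reads the same as the original positive-sense RNA with T in place of U, which is why the downstream data look like ordinary DNA sequence.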

Advances in DNA sequencing methods have made genome sequencing extremely cheap, and in fact now it costs only a few pounds for us to sequence a Salmonella genome, for example.

This is a widely used graphic showing the cost of sequencing a million base pairs of a human genome (the Y axis is in US dollars):

money_sequencing

Next generation sequencing technologies use slightly different methods to decide which nucleotide (A, T, G, or C) is at each position in the sequence, but usually it is done by fluorophores, light emitting molecules, that flash when being added to a single strand of DNA. Typically this is done in small fragments of the genome (amplicons), about 100-250 base pairs, producing ‘reads’. This is not my speciality, so for anyone interested in the chemistry of DNA sequencing, here is an article:

https://www.nature.com/articles/nrg.2016.49.pdf

But more recently we have Oxford Nanopore technology, which uses small pores bound to a membrane that give a different signal depending on the nucleotide passing through them. This enables the machine to be very small, and to sequence very long molecules, producing very long ‘reads’:

https://link.springer.com/article/10.1186/s13059-016-1103-0

What I am more interested in is what we can do with these types of DNA data, short read and long read, and the information we can infer from them. For sequencing runs, the DNA is prepared (a library), and various quality control steps are required when sequencing; again this is a field by itself. If you have any questions, our resident sequencing expert is Dave Baker. This is him:

https://quadram.ac.uk/people/dave-baker/

I’m sure if you have any questions about sequencing he will try his best to accommodate you.

Once you have the data, the reads, you can use them to assemble a genome. This is because you have multiple copies of the genome separated into your sequencing run.

Imagine that you have 1000 copies of a newspaper, and you shoot them with a shotgun, cutting them all up into thousands of pieces. You would probably be able to reconstruct the original newspaper by finding fragments which overlap at certain points. This is how some software assembles short read sequence data; this is ‘de novo’, or ‘from new’, assembly.
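
The newspaper idea can be sketched in a few lines of Python. This is a deliberately naive greedy overlap merge on made-up reads, assuming error-free reads all from the same strand; real assemblers use far more sophisticated methods, and the function names here are my own.

```python
# Toy greedy assembly: repeatedly merge the two reads with the longest overlap.
# Assumes error-free reads from one strand; the reads are made up for illustration.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge reads until no pair overlaps by at least min_len bases."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge; return separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
contigs = greedy_assemble(reads)
print(contigs)  # ['ATGGCGTACGTTAGC']
```

With three overlapping fragments the original 15 base sequence is recovered as a single contig. With repeats or sequencing errors this naive approach breaks down quickly, which is part of why assembly is a field of its own.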

The overlapping reads then pile up on top of each other, and you get a picture of data that looks a bit like this:

coverage_depth_bredth

You can also use long reads for assemblies, and these are useful – sometimes genomes have many repeats, like tracts of AAAAAAAAAAAAAA, and it is difficult to piece together an overlap across these with short reads, as you can’t easily identify how many A’s there are. In fact the best way to assemble is probably through ‘hybrid assembly’. This is where we use both long read and short read data to get a better picture of the sequence.

Here is a paper explaining this in bacteria:

https://www.nature.com/articles/nbt.2288

but it has also been done for human genomes:

https://www.nature.com/articles/nmeth.3865

But assembly is not always required, as sometimes the reads can provide valuable information. For example, when sequencing from a colony of bacteria, an area which has very low amounts of read depth might be in low frequency in the population. It also might just be low coverage from your sequencing run. But it can be worth investigating in certain cases.
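
As a rough illustration of read depth, here is a small Python sketch that counts how many reads cover each position and reports stretches below a threshold. The genome length, read positions, and threshold are all made up, and the function names are my own; in practice you would get alignments from a read mapper rather than typing them in.

```python
# Toy read-depth calculation from aligned reads.
# Each alignment is (start, length) with a 0-based start; all values are made up.

def depth_per_position(genome_length, read_alignments):
    """Count how many reads cover each position of the genome."""
    depth = [0] * genome_length
    for start, length in read_alignments:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def low_depth_regions(depth, threshold):
    """Return half-open (start, end) intervals where depth is below threshold."""
    regions = []
    start = None
    for pos, d in enumerate(depth):
        if d < threshold and start is None:
            start = pos
        elif d >= threshold and start is not None:
            regions.append((start, pos))
            start = None
    if start is not None:
        regions.append((start, len(depth)))
    return regions


alignments = [(0, 8), (2, 8), (4, 8), (12, 8), (14, 6)]
depth = depth_per_position(20, alignments)
print(depth)
print(low_depth_regions(depth, threshold=2))  # [(0, 2), (10, 14)]
```

Whether a low-depth region reflects real biology or just an uneven sequencing run cannot be decided from the numbers alone, so this only tells you where to look.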

Back to Covid-19 and SARS-CoV-2

And as you might expect, people all over the world have been sequencing the genome of SARS-CoV-2.

This is where one of my interests particularly comes in: what you can do with the sequence data.

One of the obvious and useful applications of sequence data is establishing the structure of the genome, and the organisation and orientation of genes.

These can then easily be aligned to each other, and you can see how different isolates compare to each other.

For genomes such as viral genomes and bacterial genomes, alignments are important as you can see events such as recombination. This is where a large chunk of the genome has been inserted into a different place.

Here is a genomic diagram which I have constructed using the R package genoPlotR, using 13 downloaded sequences of SARS-CoV-2, and at the bottom including an isolate of SARS-CoV-1 from China, 2003.

genoplotR_and_tree

Here we see a phylogenetic tree of 13 representative sequences of SARS-CoV-2 from different locations in the world (left), with each of their genomes aligned on top of each other (right), with a scale bar of 5 kilobases, that’s 5,000 base pairs. The bottom sequence is from the 2003 China outbreak of SARS-CoV-1, and I was interested in how similar these viruses were. Where there is grey matching in the top genomes, this means that they are very similar (>95% similarity), and as you can see, the bottom genome has 3 main sections of difference. It also seems there are two clear lineages of SARS-CoV-2, and an isolated virus from Minnesota appears to be slightly more divergent than the other SARS-CoV-2 sequences, in the 4th line up from the bottom.

So just from the assemblies we can see that it has a fairly small genome, 30kb, which, for example, is more than 100 times smaller than a Salmonella genome (4Mb), or 200,000 times smaller than the human genome (6Gb). Similarly we can assess how different this virus is from that which was epidemic in China in 2003.

This suggests that a section of the first coding sequence, which is a large coding sequence encoding multiple proteins, is quite different in SARS-CoV-2, as is the start of the surface glycoprotein. This is interesting as the surface glycoproteins are usually used for mechanisms such as attachment to the outside of human cells, so we can infer that perhaps these two viruses have different capabilities (phenotypes) in this aspect. Also we can see a protein toward the right end of SARS-CoV-2 that is not present in SARS-CoV-1.

We can similarly infer the distance of these different genomes by looking at how many single nucleotide changes (known as SNPs, for Single Nucleotide Polymorphisms) there are, so here is a distance matrix:

distance matrix

The left shows a squashed version of the phylogenetic tree, to put them in an order of relatedness. Here, the darker the colour in each square, the more distantly related the genomes are. So we can see that SARS-CoV-1 is very different to SARS-CoV-2, over 5000 SNPs different, which means that 1/6th of its genome has a different DNA sequence. Again, we see this isolate from Minnesota seems to be slightly more divergent than the other isolates of SARS-CoV-2.

These are fairly simple types of analysis which can show you what changes have occurred in your genomes, and how similar they are, giving you hypotheses to work with in future experiments and analyses.
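
To make the idea of a SNP distance matrix concrete, here is a minimal Python sketch that counts nucleotide differences between every pair of aligned sequences. The three short sequences and their names are made up, and it assumes the sequences are already aligned to the same length; this is not the pipeline I used for the figure.

```python
# Toy pairwise SNP distance matrix from an existing alignment.
# Sequences and isolate names are made up for illustration.
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ, ignoring gaps and Ns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in VALID_BASES and b in VALID_BASES
    )


def distance_matrix(alignment):
    """Return a nested dict of pairwise SNP distances keyed by isolate name."""
    names = list(alignment)
    matrix = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = matrix[b][a] = d
    return matrix


alignment = {
    "isolate_A": "ATGGCGTACG",
    "isolate_B": "ATGGCGTACA",
    "isolate_C": "ATGACGTTCA",
}
matrix = distance_matrix(alignment)
for name, row in matrix.items():
    print(name, [row[other] for other in alignment])
```

Here isolate_A and isolate_B differ by 1 SNP, while isolate_C is 3 SNPs from A and 2 from B. Colouring those numbers by size is all the heatmap does.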

One excellent application of this was understanding the origin of SARS-CoV-2, which is shown in this paper:

https://www.nature.com/articles/s41591-020-0820-9

in which they have this figure:

recombination_of_SARS_Cov_2

Here they are showing that the spike protein, or surface glycoprotein, has likely undergone recombination, which is probably why in my alignment figure previously, this protein had a section different to SARS-CoV-1. It appears that recombination, genetic material from different sources combining, has occurred between a bat Coronavirus (RaTG13), and one isolated from a pangolin. This would make sense when thinking that in China they have ‘wet’ markets where they sell wild animals to be eaten. It also may have occurred in nature, and been transferred to the market. But if these animals are in close proximity, then a recombination event between related RNA viruses is far more likely.

This event has occurred specifically in the receptor binding domain, and the receptor which SARS-CoV-2 uses is an outer layer (epithelial) surface protein called Angiotensin Converting Enzyme 2 (ACE2). So from the sequence data, we can hypothesise that this may have caused an altered ability to bind to this receptor, allowing better attachment, and possibly transmission between humans. So, again, we can infer useful hypotheses from this type of information.

Although this seems like a neat explanation, other papers suggest natural selection and amino acid changes might have been more fundamental to the development of the virus. Here is a second SARS-CoV-2 genomics paper:

https://academic.oup.com/nsr/article/doi/10.1093/nsr/nwaa036/5775463

there are similar inferences in their analysis:

phylogeny_and_origin_picture

Here in B and C you can see a protein alignment inferred from sequence data, showing the specific changes in protein sequence which have occurred in SARS-CoV-2 development. The top shows a phylogenetic tree, how related each different virus is, so we can infer its closest relative that has been sequenced is a Coronavirus from a bat, the previously mentioned RaTG13. It also describes that the Coronaviruses from pangolins are closely related, and mirrors the analysis in the paper above in suggesting that perhaps a recombination event has occurred in the development of SARS-CoV-2.

Alignments are useful, but when you accrue enough data (a few hundred sequences for example) you can start to extrapolate more complex pictures.

A key utility of sequencing the genome is improved data for seeing how similar viruses are to each other, inferring their ‘genetic relatedness’ – phylogenetics. Using the whole genome as a template for identifying relatedness gives you a much more in depth picture of the relatedness of isolated viral genomes.

For example, I did this easily from my home computer, again using the 13 representative isolates of SARS-CoV-2 and one isolate of SARS-CoV-1, from the left side of the previous figure:

just_phylogeny_real

The scale bar at the top shows how many SNPs, nucleotide differences, occur in that length of tree branch. The root, or common ancestor, of the viruses is toward the left, and the branches of the trees show further changes and evolution of the isolates. Most of these isolates are within 1-5 SNPs of each other, but, as mentioned, surprisingly an isolate from Minnesota was 12 SNPs away from the rest. I have been wanting to find out where the SNPs are in this isolate, and maybe it can tell us something about the evolution. (So far I haven’t had time.) I have manually added the SARS-CoV-1 isolate as a dotted line, as when left in the phylogeny during construction of the tree it confounds the picture, because it is so different compared to SARS-CoV-2 sequences.
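
Finding where the SNPs sit in one divergent isolate could be sketched like this in Python: compare the isolate against the most common base among the other sequences at each aligned position. The sequences and names are made up (the "divergent" entry is not the real Minnesota isolate), the function name is my own, and this is one simple way to do it rather than an analysis I have run.

```python
# Toy search for SNPs that set one isolate apart from the rest of an alignment.
# All sequences are made up; positions are reported 1-based.
from collections import Counter


def private_snps(alignment, isolate):
    """Return (position, consensus_base, isolate_base) where the isolate differs
    from the most common base among the other sequences."""
    target = alignment[isolate]
    others = [seq for name, seq in alignment.items() if name != isolate]
    snps = []
    for pos, base in enumerate(target):
        column = [seq[pos] for seq in others]
        consensus, _count = Counter(column).most_common(1)[0]
        if base != consensus:
            snps.append((pos + 1, consensus, base))
    return snps


alignment = {
    "isolate_A": "ATGGCGTACG",
    "isolate_B": "ATGGCGTACG",
    "isolate_C": "ATGGCGTACG",
    "divergent": "ATGACGTTCG",
}
print(private_snps(alignment, "divergent"))  # [(4, 'G', 'A'), (8, 'A', 'T')]
```

Once you have the positions, the next step would be to see which genes they fall in and whether they change the protein sequence.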

But in this analysis you can see two clear lineages. So I wondered if anyone else had seen this, and found this. Here, as the amount of data has grown, the population structure of the pathogen – the evolutionary relationship – starts to become apparent:

sars_cov_2-phylogeny

https://academic.oup.com/nsr/article/doi/10.1093/nsr/nwaa036/5775463

So, in this analysis, where the data is starting to mount up, we have 2 clear lineages, just as I had seen with my 13 representatives. This paper refers to them as L and S. As you can see this phylogeny is unrooted, but they mention that it is likely the S lineage emerged first and is possibly less aggressive.

This in itself can be useful to clinicians and scientists, as these lineages may need to be taken into consideration when developing a vaccine. It’s possible it would be effective against one lineage, but not the other – but such a thing obviously requires molecular testing. Again, we can see how we develop very useful hypotheses from this kind of data.

The most recent phylogenetic network for SARS-CoV-2 in the literature that I can find is from March:

SARS-CoV-2_march-phylogeny

https://www.pnas.org/content/pnas/early/2020/04/07/2004999117.full.pdf

This is actually a phylogenetic network rather than a phylogenetic tree, and it shows the origin from China toward the middle, and then describes how the lineages branch into different countries. This is technically unrooted, and shows how related each isolate is.

I can imagine that from March to now there have been many changes in the number of sequences, and I will be looking to construct a phylogeny with new data, time permitting, and will be offering my help when UK isolates are sequenced in our institute.

Similarly to sequencing technology, phylogenetic inference is a science in itself, but there are a lot of tools that can be used to start creating your own phylogenies:

https://en.wikipedia.org/wiki/List_of_phylogenetics_software

I personally very much enjoy studying the statistical aspects of phylogenetics, such as deciding which model of nucleotide substitutions to use. For example, some phylogenetic trees need to take into account different rates of nucleotide substitution along different branches, as you might expect when a certain lineage is under positive selection.
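
To give a flavour of what a substitution model does, here is a minimal Python sketch of the Jukes-Cantor (JC69) correction, the simplest standard model. It assumes all substitutions are equally likely and that rates are the same everywhere, so it does not capture the branch-specific rates mentioned above; I am using it only as an illustration, not as the model behind any figure here. Given an observed proportion of differing sites p, the estimated number of substitutions per site is d = -(3/4) ln(1 - (4/3) p).

```python
# Jukes-Cantor (JC69) distance: corrects the observed proportion of differences
# for multiple substitutions at the same site. Simplest model, equal rates assumed.
import math


def jukes_cantor(p):
    """Estimated substitutions per site from an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("JC69 is only defined for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Example using the rough figures from this talk: about 5000 differences in 30,000 bases.
p_observed = 5000 / 30000
d_corrected = jukes_cantor(p_observed)
print(round(p_observed, 4), round(d_corrected, 4))  # 0.1667 0.1885
```

The corrected distance is larger than the raw proportion because some sites will have changed more than once. For very similar genomes a few SNPs apart the correction is negligible, but it matters for distant comparisons.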

The power of genomics is already being used in this pandemic. But much more is possible with genomics – genomics data analysis can identify outbreaks, map the spread of an outbreak, estimate which lineage is going where, and even identify the source of an outbreak.

A great example of this came during the Ebola outbreak which began in 2013, where Josh Quick from the University of Birmingham set up a mobile sequencing lab. Using a Nanopore MinION, which is the size of a mobile phone, he managed to sequence and track the Ebola virus in real time.

Here is a picture of how big a MinION is:

minion_size

And here is the Nature paper where his research was published:

https://www.nature.com/articles/nature16996/figures/

In the United Kingdom, Public Health England have now almost exclusively switched to genome sequencing for some of their reference laboratories. From what I gather it has been a good transformation, as they can obtain a lot of information from each genome sequence, which becomes even cheaper once it is automated, and then make all the data publicly available.

Here is an example of PHE utilising genomics and DNA sequencing:

https://www.sciencedirect.com/science/article/pii/S0740002016308796

These techniques can be applied to any pathogen, and understanding pathogen microevolution over time and in certain niches gives you a good understanding of both what genetic material it requires for persistence, and also how the pathogen is changing over time. This is more specifically what we study in our research group. For example, my supervisor previously published a paper where, using sequence data, he could infer what changes were required for a lineage of Salmonella Typhimurium to become a dominant clone of Salmonella in the UK, and monitor its small changes (microevolution) during the epidemic. This clone has since become pandemic as well (albeit of different degree):

https://wwwnc.cdc.gov/eid/article/22/4/15-0531_article

and similarly to SARS-CoV-2, it seems a recombination event (acquisition of a heavy metal resistance island) was at the beginning of the evolution of the Salmonella lineage:

https://www.frontiersin.org/articles/10.3389/fmicb.2019.01118/full

So this is similar to the Covid-19 pandemic, where as soon as sequence data was available from China and Hong Kong, they could search sequence databases and identify the probable origins and changes that were required for the pathogen to cause an outbreak, and eventually become pandemic.

When you have extrapolated and combined sequences, collaborative efforts can give excellent insights into the specific dynamics of the pandemic, how the virus is spreading, moving, and mutating.

Specifically for SARS-CoV-2, it might be useful to look at changes in SNP density – that is, changes in certain genes that happen faster while the virus is being transmitted between human hosts. This can be a sign of positive selection, a form of ‘microevolution’ that improves the organism’s ability to continue persisting in a certain niche. In this case, probably human to human spread and respiratory infection.

Identifying the regions where SNPs occur can also influence vaccine development – targeting variable regions would be less effective than targeting stable regions which encode proteins that seem essential for the virus to keep occupying its evolutionary niche – a human host.

As a final note, I will point out some excellent resources already being developed for SARS-CoV-2 sequence data.

Genomic sequences are being uploaded daily to GenBank (NCBI). Currently they have 927:
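
As a rough sketch of how you might look at SNP density, the Python below counts SNPs in fixed, non-overlapping windows along a genome. The SNP positions are made up for illustration (they are not real SARS-CoV-2 variant sites), and the window size and function name are my own choices.

```python
# Toy SNP density: count SNPs in fixed, non-overlapping windows along a genome.
# Positions are 1-based and made up for illustration.

def snp_density(snp_positions, genome_length, window=1000):
    """Return a list with the number of SNPs falling in each window."""
    n_windows = -(-genome_length // window)  # ceiling division
    counts = [0] * n_windows
    for pos in snp_positions:
        if not 1 <= pos <= genome_length:
            raise ValueError(f"position {pos} is outside the genome")
        counts[(pos - 1) // window] += 1
    return counts


positions = [120, 4500, 21600, 21950, 22400, 23100, 28800]
counts = snp_density(positions, genome_length=30000, window=5000)
for i, n in enumerate(counts):
    start, end = i * 5000 + 1, min((i + 1) * 5000, 30000)
    print(f"{start}-{end}: {n}")
```

In this made-up example the window from 20,001 to 25,000 holds 4 of the 7 SNPs, so it would stand out as a variable region. A high count on its own does not prove selection, as it could also reflect chance or sequencing artefacts, so it is a starting hypothesis rather than a result.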

https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/

Currently for SARS-CoV-2, I think GISAID is one of the best resources. It was initially set up to share influenza data, but they currently have >9,000 sequences, and also run infographics based on all the data that is submitted:

https://www.gisaid.org/

So their phylogenetic tree is much more up to date, but quite hard to visualise properly:

gisaid_phylogeny

Here you can also see the two clear lineages, L and S. Purple indicates those from China, red the USA, and orange those from Africa. This also shows a possible third sub-lineage emerging at the top of the tree, and suggests that the USA has both lineages circulating within the country.

GISAID are also tracking the locations where nucleotide changes are occurring, so here you can see 14 main spots where the SNPs are occurring:

gisaid_genome

Once again the spike protein (surface glycoprotein) has a specific region of comparatively high SNP density, perhaps suggesting selection for this to change, or at least suggesting variation within the gene.

They also have a pretty cool infographic where you can see the spread happening across the world:

gisaid_world

and by pressing play on the link here you can see how the sequence data can infer the spread of the virus:

https://www.gisaid.org/epiflu-applications/next-hcov-19-app/

In my opinion, this is all interesting data visualisation, but a bit of a gimmick. Looking at the spread across the entire world is certainly useful, but at current levels of the virus it is so widespread that it is difficult to infer something meaningful from this infographic.

This is why local consortiums and collaboration are required to sequence and track the local isolates of SARS-CoV-2. As previously mentioned, if you are trying to develop a vaccine, for example, the genome sequences can give you an idea of which regions are variable, and which regions are stable, possibly guiding a more effective vaccination. In the UK this kind of data is used to try to develop a yearly vaccine for influenza, which mutates to continue its seasonal spread:

https://www.gov.uk/government/publications/national-flu-immunisation-programme-plan

And this consortium for sequencing local isolates of SARS-CoV-2 is precisely what has been set up in the United Kingdom. We have gathered a consortium of institutes who will all collaborate to sequence isolates of SARS-CoV-2, share all the data, track the virus, understand its microevolution, and use this data to help beat the virus as quickly as we can. The Quadram Institute are playing a fundamental role in this process:

https://quadram.ac.uk/quadram-institute-whole-genome-sequence-map-spread-of-coronavirus/

£20 million has been allocated for the sequencing of SARS-CoV-2, and a large collaborative effort will no doubt significantly aid the fight against this pathogen.

If you would like to enquire to any of these institutes about sequencing SARS-CoV-2, or any other query, I’m sure they would be very happy to collaborate, as we at Quadram have been doing with Tapfumeni Mashe (Moderator 1) from Zimbabwe on a project studying Salmonella Typhi:

Quadram Institute Bioscience
https://quadram.ac.uk/
The Earlham Institute
http://www.earlham.ac.uk/
Public Health England
https://www.gov.uk/government/organisations/public-health-england
The Wellcome Sanger Institute
https://www.sanger.ac.uk/
University of Birmingham
https://www.birmingham.ac.uk/index.aspx

There are other institutes involved, but I cannot find a complete list at this time.

Thank you everyone, and I hope this was a good insight into the power and increasing necessity of genome sequencing, its application in sequence data analysis, and crucially how it helps us monitor and understand SARS-CoV-2 and other pathogens on a global scale.

Thank you for your attention, and I’m happy to answer any other questions you may have.
