Performing Mothur pipeline with Ezbiocloud database

[GUEST ARTICLE]

This guest article by Sein Park, a student at Seoul National Univ., contains his own work and opinion which is not of or related to ChunLab, Inc.

seinpark Sein Park, Department of Biological Sciences, Seoul National Univ.

Email: psi103706@gmail.com

Introduction

Mothur is one of the representative pipelines widely used for bioinformatics analysis of microbial ecosystems. Its functions include phylogenetic analysis of microbial communities based on 16S rRNA data generated via 454 pyrosequencing, and so on. The Silva and RDP databases are basically used in this process. However, these data are restricted to genus level analysis and detailed analysis of species level is known to be difficult.

This article introduces the process of implementing the Mothur pipeline using the Ezbiocloud database to support detailed analysis of the species level rather than the above reference database, which is basically used in Mothur. In addition, I looked at the differences from the classification result obtained by using the QIIME pipeline having the similar function. I also examined the advantages of Ezbiocloud DB and considered how to improve the Ezbiocloud DB in terms of the species level classification.

Materials & Methods

Preparation

At first, you need to install the Mothur program. It is distributed as open source at https://www.mothur.org/wiki/Main_Page and can be installed based on the manual posted on the site. Sequence data (input data file) is required for analysis. The format of the file must be fasta. The following manual assumes that quality control of input file is already completed.

In addition, you need the Ezbiocloud DB fasta file and tax file, which are necessary for classification and alignment. However, the form and conditions of the file are somewhat different from those of the QIIME pipeline, and the differences will be discussed later. In this article, the fasta file and tax file are stated as eztaxon_full.fasta and eztaxon_id_taxonomy.tax, respectively, the aligned fasta file as eztaxon_full.align, and the sample file analyzed in manual as example.fasta.

Performance Manual

figure-1
Figure 1. Performance flow of Mothur pipeline (Compared to QIIME). 1)

Figure 1 shows the execution process of Mothur, and the process of QIIME is also attached for comparison. In this article, I will explain from “Align sequence” stage to the “Classify OTU” stage where the Ezbiocloud database is used. For the rest of the process, please refer to https: //www.mothur.org/wiki/454_SOP. Commands described below can be adjusted by using various parameters. I will briefly explain the main parameters affecting execution speed.

unique.seqs(fasta=example.fasta)

The unique.seqs command removes duplicate sequences from the fasta file. As a result, example.unique.fasta is created.

align.seqs(fasta=example.unique.fasta, reference=eztaxon_full.align)

           The align.seqs command aligns the input file based on the reference file (the reference file should be aligned). Mothur provides Silva and Greengenes reference data at the official page. If there is aligned data of the Ezbiocloud database, you can write reference parameters like the above command. Having enough memory, you can run faster using “processors” parameters. (Eg, processors = 4) The output is the example.unique.align file.

screen.seqs(fasta=example.unique.align)

The screen.seqs command deletes the rest of the sequence, leaving only a portion of the sequence matching the condition inserted in the parameters. For example, if you insert the “end=n” parameter (where n is a natural number), nucleotides after n-th position are removed and the “maxambig=n” parameter deletes all sequences with n or more ambiguous bases. Available parameters are listed at https://www.mothur.org/wiki/Screen.seqs. You can get example.unique.good.align as an execution result.

filter.seqs(fasta=example.unique.good.align)

The filter.seqs command looks at the alignment pattern of the screened data and filters the sequence based on the condition inserted in the parameter. For example, setting “vertical=T” deletes the sequence in which all characters are – (gap character). Also, inserting “trump=-” will delete the whole column with – (gap character) in all arrays. Specific parameters can be checked at https: //www.mothur.org/wiki/Filter.seqs. When this command is executed, example.unique.good.filter.fasta is output.

unique.seqs(fasta=example.unique.good.filter.fasta)

It deletes the sequence of duplicates among the sequence after screen.seqs and filter.seqs commands. As an execution result, example.unique.good.filter.unique.fasta file is generated.

pre.cluster(fasta=example.unique.good.filter.unique.fasta)

The pre.cluster command removes the sequence that may cause an error. Based on the idea that abundant sequences are more likely to generate erroneous sequences than rare ones, Mothur ranks abundance and remove abundant sequences over a certain threshold value. When you are done, the example.unique.good.filter.unique.precluster.fasta file is generated.

classify.seqs(fasta=example.unique.good.filter.unique.precluster.fasta, template=eztaxon_full.fasta, taxonomy=eztaxon_id_taxonomy.tax, method=knn, numwanted=1)

The classify.seqs command classifies each sequence of input files based on template and taxonomy files. Based on criteria, there are various parameters, which can be used selectively. According to the experimental results, however,  you should add the parameter “method=knn, numwanted=1” in order to classify the species level. The reason for this will be stated in the discussion. Once this command completes, you can obtain classification information for each sequence.

remove.lineage(fasta=example.unique.good.filter.unique.precluster.fasta, taxonomy=example.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.taxonomy, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)

The remove.lineage command removes data from the classification file that meet the conditions set in the taxon parameter. The “Mitochondria-Chloroplast-Archaea-Eukaryota-unknown” condition deletes data classified as mitochondrial and chloroplast DNA, archaea and eukaryotes, and deletes data classified as unknown above the kingdom level.

dist.seqs(fasta=example.unique.good.filter.unique.precluster.pick.fasta)

           The dist.seqs command calculates the distance of each aligned sequence pair. This distance increases if there are much gaps or mismatchs, and as the distance increases, it can infer that they belong to another OTU. You can get the dist file as the execution result.

cluster(column=example.unique.good.filter.unique.precluster.pick.dist, name=example.unique.good.filter.unique.precluster.pick.names)

           The cluster command assigns each sequence to the OTU according to the calculated distance above. You can set cutoff of the distance to be classified as OTU based on the parameters.

classify.otu(taxonomy=example.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.pick.taxonomy, list=example.unique.good.filter.unique.precluster.pick.an.list)

The classify.otu command classifies each OTU based on the classification data of the sequence and the list of clusters of each OTU.

Result

The following data show the result of the mock data used in Schloss et al. (2011), executed according to the manual above. The control group used the Silva database for align.seqs and used the RDP database for classify.seqs. You can download the dataset from Mothur manual page, https://www.mothur.org/wiki/454_SOP.

Commands with specific parameters are as follow:

unique.seqs(fasta=GQY1XT001.shhh.trim.fasta, name=GQY1XT001.shhh.trim.names)

align.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, reference=eztaxon_full.align, processors=16)

  • Control Group: reference= silva.nr_v123.align

screen.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.unique.names, group=GQY1XT001.shhh.groups, optimize=start, criteria=95, processors=16)

filter.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, vertical=T, trump=., processors=16)

unique.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.fasta, name=GQY1XT001.shhh.trim.unique.good.names)

pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, group=GQY1XT001.shhh.good.groups, diffs=2)

classify.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, template=eztaxon_full.align, taxonomy=eztaxon_id_taxonomy.tax, method=knn, numwanted=1, processors=16)

  • Control Group: template=silva.nr_v123.align, taxonomy=silva.nr_v123.tax

remove.lineage(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.taxonomy, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)

  • Control Group: taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.nr_v123.knn.taxonomy

dist.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.15, processors=16)

cluster(column=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.dist, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.names)

classify.otu(taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.pick.taxonomy, list=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.an.list)

  • Control Group: taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.nr_v123.knn.pick.taxonomy

The results were organized and analyzed in a unified form. I investigated the number of each taxonomy level of classification succeeded for each database. (Table 1) In addition, I compared both result files and quantified which classification level was in coincidence with each other and what proportion it was. (Figure 2)

table-1
Table 1. Taxonomy level in which the pipeline succeeded to classify for each database.
figure-2
Figure 2. Ratio of coincidence in each taxonomy level between Ezbiocloud & Silva / RDP database.

Table 2 and Figure 3 show the modified result of the above, where the Ezbiocloud data were aligned by their own align data, and mock data were classified using the new version of Silva database.

table-2
Table 2. The Modified result of Table 1.
figure-3
Figure 3. Modified result of Figure 2.

As shown in Table 1, it was possible to classify all data using Ezbiocloud at the species level, and genus level classification was possible in all cases using Silva / RDP. In Figure 2, comparing the results using both databases showed that the match rate was highest at the order level. In other words, the order was the same, but the family did not match. The final concordance rate in family and genus level followed that in order level.

On the other hand, when using the 123 version of the Silva database, as shown in Table 2, not all classifications were succeeded at the genus level, and family level classification was almost confirmed. Also, as shown in Figure 3, about 92% of the data was consistent with family and genus levels, but much higher at family level. Such trend would be explained in detail in the discussion section.

Discussion

Differences from Performing QIIME

The file format required to run the Ezbiocloud database in QIIME and Mothur was somewhat different. In Mothur, a semicolon (;) should be added at the end of each line of the taxonomy file. Also, unlike QIIME, Mothur required an aligned template of the Ezbiocloud fasta file. Since I could not get aligned data in this article at first, I used aligned Ezbiocloud fasta files based on Silva’s aligned data.

After complementary work, I used align data of Ezbiocloud DB made by itself. Initially Mothur did not recognize that this data was aligned. When I looked directly at the source code, Mothur recognized that they were not aligned unless all the aligned sequences in the file had the same length. Thus, in order to match the length of the longest sequence, you just need to add some gap characters (-) to the end of every sequence and then mother would successfully recognize it as aligned data.

Difficulties in Performing

Since the information in the Ezbiocloud database is too large, when classifying and aligning, memory overflow may occur. In fact, you can include wang or knn in the method parameter when you run the classify.seqs command. In the case of wang method, it is necessary to calculate the matching probability of each classification. The problem is that when you inspect a huge amount of databases, the memory may be insufficient and the program could be dead.

The knn method is the method of selecting the nearest k data and classifying the sequence with them. The default value of k = 10 would be impossible to classify as species level. Perhaps there could be a lot of ambiguity to speculate from the ten candidates, since the species level is very detailed. Therefore, only one candidate with the highest degree of coincidence was selected and “numwanted=1” was inserted to strictly classify the species level. (However, you should concern the problem of overfitting. You may select other value like 3.) For this reason, all classification results classified by Ezbiocloud and RDP databases are determined at the species level and genus level.

Idea of Improvement & Future Suggestion

Each database was able to successfully classify species level and genus level – and family level in version 123 Silva DB – as expected. However, when comparing both results, the family or genus match rate was lower than the match rate of order level. Nevertheless, the consistent tendency was observed in the unmatched results. For example, the species that Ezbiocloud expected to be included in S24-7 at the order levels were classified as Porphyromonadaceae by Silva / RDP, and if Ezbiocloud predicted as Eisenbergiella at the family level, it is expected to be Clostridium_XlVa or Blautia by SILVA / RDP. Considering why this consistent incorrectness occurs would be an unobtrusive problem.

The microbial species level has very subtle differences. In particular, it is not easy to determine significant differences in 16S rRNA sequences of microbes that do not cause mutations well over time. Thus, it is required to find the maximum value of “numwanted” parameter in classify.seqs which can be successful to categorize into species level, and also has statistically significant without overfitting. Finally, if you build a database that can efficiently identify species-levels using information on other core genes, you will get more reliable results.

Supplements & Improvements

Unlike previous results (Table 1 & Figure 2), sequences from the new mock data with Silva DB from version 123 were not all classified in genus level. About 60-70% were able to classify only at the family level, but this is perhaps because of the characteristic of the data used in the experiment. Of the approximately 6,000 data of the community used for the experiment, I found approximately 2400 family of S24-7 group. Unlike earlier version, in which this family was never able to be classified, 123 version of Silver DB could classify the S24-7 family, but information on the sub-family level was not included. Therefore, the ratio of family level was much higher than the genus-level classification. Genus-level concordance rate, when the Silver database can classify the sub-S24-7 family level, is expected to reach 80-90%.

I would also be able to analyze unknown community data via OTU clustering and OTU classification process. Furthermore, by adjusting various parameters, you can classify the unknown community data with high precision.

Conclusion

I tried to describe how to run Mothur using the Ezbiocloud database. Based on the mock data used in Schloss et al. (2011), I confirmed that the Ezbiocloud database is useful for species-level classification. Also, by comparing to control data obtained by using SILVA and RDP database, I was able to investigate the coincidence of the Ezbiocloud database numerically. Finally, I discussed the ways to improve the accuracy. By applying these improvements, it could be possible to classify microbes more accurately.

References

  1. Plummer, Erica, et al. “A Comparison of Three Bioinformatics Pipelines for the Analysis of Preterm Gut Microbiota using 16S rRNA Gene Sequencing Data.”Journal of Proteomics & Bioinformatics 2015 (2015).
  2. Schloss PD, Gevers D, Westcott SL. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PloS ONE. 6:e27310.

Last edited January 13, 2017

Bacterial species concept explained

한국어 사이트로 이동하기 中文网站链接 日本語サイトへ移動します btn_sp

Some people may think that science is ruled by a body or something similar to the “government” or “United Nations”. This is partly true for bacterial taxonomy. There is a body called “International Committee on Systematics of Prokaryotes (ICSP)” which plays a similar role like the United Nations for the international politics. Even though ICSP provides recommendation reports from time to time, it cannot formally set up the definition of bacterial species. It is rather decided by community efforts or consensus among scientists. At present, most widely accepted species concept is called “Phylo-phenetic species concept” (Rosselló-Mora & Amann, 2001).

“A monophyletic and genomically coherent cluster of individual organisms that show a high degree of overall similarity in many independent characteristics, and is diagnosable by a discriminative phenotypic property.”

There are several terms that I will explain further:

  • “A monophyletic cluster” means that a species contains a group of bacterial strains that shared a most recent common ancestor.
  • Bacterial strains that form a “genomically coherent cluster” show similar genome sequences, thereby likely similar phenotypes (like you and your brother look more similar than to your friends, as you and your brother share more genome sequences).
  • “Overall similarity in many independent characteristics” means that two strains share phenotypic properties that are not related or interconnected. For example, the height and weight of human are correlated each other, therefore not independent.
  • “Diagnosable by a discriminative phenotypic property” means that two species should be differentiated by phenotypic characteristics (such as biochemical tests, morphology and physiology). If two species can only be differentiated by means of genome or 16S rRNA sequences, this two should not be recognized as different species. This condition ensures that people who do not have access to molecular techniques or facility can still carry out identification. However, I think, as more people, if not most, are moving towards molecular diagnostic methods (PCR, DNA sequencing, Microarray, and even genome sequencing) over traditional biochemical tests, this condition can be omitted in near future (I hope). Furthermore, microbiologists, other than taxonomists, are already using solely molecular concept of bacterial species as molecular OTU (operational taxonomic unit).

 

The practical bacterial species concept

“Phylo-phenetic species concept” sounds very solid. However, its application can be very tricky. Again, let me explain by examples.

4

In the above figure, you can see 3 clearly differentiated clusters that can be confidently called species A, B and C. How about the below case?

4.4

To me, clear divisions of clusters are not apparent. However, we still need to classify and name the species. To achieve this, bacterial taxonomists have introduced a concept of “type strain”. A type strain is a live strain that can serve the center of a species and regarded as the representative of a species. When multiple strains are discovered for a single species, we can choose a likely representative strain as the type strain. In practice, most of bacteria species are described with only one or few strains, the type strain of a species is often designated to the strain which was first discovered. What I am trying to say is that “type strain” may not be very “typical” for a given species! For example, the type strain of Escherichia coli does not kill you, but other strains of E. coli , such as O157 strains, can kill you easily.

Once the type strain is decided by a taxonomist who are willing to introduce new species, it should be deposited to the banks of bacteria, called “culture collections”. The list of culture collections can be found in the website of the World Federation for Culture Collections (WFCC). The type strain of a bacterial species cannot be easily changed to other strain, as the stability is very important for taxonomy!

Let’s assume that a team of taxonomists carried out research to classify the strains in the previous figure and come up with the following result:

5

Here, the team found 3 species and designated 3 type strains for each (red circles). As a species is a coherent group of bacterial strains, we should employ the same measure/criteria of “coherence” or “similarity” for all 3 species. We need to define the followings to make classification process objective.

  • How the similarity can be defined? (A method of measuring similarity/distance)
  • What degree of similar should be used to decide if a bacterial strain belongs to species (similarity between the type strain of species A and a strain; this is called the species boundary).

 

DNA-DNA hybridization as indirect genomic comparison

Defining the above two criteria has been major challenge for modern bacterial taxonomy. In 1987, major players in the field of bacterial taxonomy have gathered in Paris to try to come up with an objective and stable criterion for future classification and identification of Bacteria. They foresaw that genome data (genotypic) are superior to phenotypic data (morphology, physiology and biochemisty), but sequencing of genome was not readily available until 1995. However, at that time, there were other molecular methods, called DNA-DNA hybridization (DDH), to measure the degree of hybridization between a pair of genomes in solution. The more two genomes hybridize, the more similar their genome sequences are.

DDH provides overall, albeit indirect, measure of genomic similarity between two strains, and serves well as a surrogate for genome sequence comparison. In a seminal paper, Wayne and other taxonomists recommend DDH as the method for defining bacterial species and 70% relatedness as a cutoff for the species boundary (Wayne et al., 1987). This Wyane et al. paper has been cited over 4,000 times which means that this proposal was well received. In conclusion, if a strain belongs to a species, it should show 70% or higher DDH relatedness value to the type strain of that species.

 

Genome sequence-based species concept

Thanks to the introduction of next generation sequencing (NGS), bacterial sequencing is now cheap enough to become readily available to many researchers. I believe that genome sequence information is the best you can get for any taxonomic work, and it can eliminate the needs for many tedious and unreliable experimental taxonomic methods. Of course, it can replace notoriously erroneous DDH methods in the definition of bacterial species. “Overall Genome Related Index (OGRI)” is a term for any computational method to calculate similarity between two genome sequences, first coined by Fred Rainey and myself in 2014. There are many different algorithms that can be used for comparing two strains, Average Nucleotide Identity (ANI) has been most widely accepted. The generally accepted cutoff value for the species boundary is about 95% ANI. Here I recommend the OrthoANI algorithm, an improved version of ANI, instead of the original ANI. (See more details here).

For both ANI and OrthoANI, about 95% is the cutoff. Does this mean this cutoff is really a clear and sharp one that can be used without exception? Let’s consider the following case:

qq%e6%88%aa%e5%9b%be20160921153700

Two strains show 95.1 and 94.9% OrthoANI, respectively, to the type strain of “species X”. Does this mean that strain A belongs to “species X” and stain B does not? You may think that I made this case up and it is not a probable case? The below is the real case of Vibrio vulnificus, a notorious pathogen from sea water.

The below is a chart in which 31 V. vulnificus strains were examined for OrthoANI values against to the type strain of the species. Many strains show OrthoANI values around 95% which is the proposed cutoff.

7

6.PNG

When we look at the above dendrogram explaining overall taxonomic structure within the V. vulnificus species, a group of strains (=outlier group) that show about 95% OrthoANI values to the type strain of V. vulnificus may belong to the different species (notV. vulnificus). However, OrthoANI values between the authentic V. vulnificus group (containing type strain) and this outlier group are around the proposed cutoff, i.e. 95% OrthoANI, therefore the decision is not a straight-forward one. In my opinion, two groups can be classified as either different species or at least different subspecies. Anyhow, it is up to taxonomists who will work on the further evidence and make the final conclusion (as a form of publication). Meanwhile, I can only tell you that V. vulnificus is not a really one genomically coherent group.

Take home messages:

  1. Bacterial species is defined as “genomically coherent group of organisms”.
  2. A species must have a type strain that is live. Anyone can obtain this strain for taxonomic study if he/she has a suitable microbiological facility.
  3. If a strain belong to a species,Overall Genome Related Index (OGRI) should be within the cutoff of species boundary. If you use OrthoANI , it will be about 95%.
  4. If your isolate is a new species by the condition (3) and you want to describe it as new species, you need to carry out additional taxonomic research (mostly phenotypic characterizations such as morphological observation, biochemical tests and physiological characterization). Even though we all know that you have new species by genome sequence comparison, this allows us to understand more about the biology of species. Genome sequence tells us a lot, but not sufficient enough for us to understand their physiology and life style (not yet!).

 

Practice by yourself

One of the classical taxonomic debates is the taxonomic status of Escherichia coli and Shigella species. At present, they are classified in the difference species and genus. However, comparison of genome sequences shows differently.

  • Go to http://cg.ezbiocloud.net/ and check out “Escherichia coli representative set”. Examine OrthoANI values between Escherichia coli, Shigella species and other Escherichia species.

Questions to ask:

  • Do Escherichia coli and Shigella species belong to the same species?
  • Why are they classified as the different species? Why don’t they follow the general rule of bacterial classification?

 

Further Readings

  1. Chun, J. & Rainey, F. A. Integrating genomics into the taxonomy and systematics of the Bacteria and Archaea. Int J Syst Evol Microbiol 64, 316-324.

 

References

  1. Rossello-Mora, R. & Amann, R. The species concept for prokaryotes. FEMS Microbiol Rev 25, 39-67 (2001).
  2. Wayne, L.G. et al. Report of the ad hoc committee on reconciliation of approaches to bacterial systematics. Int J Syst Bacteriol 37, 463-464 (1987).

 

By Jon Jongsik Chun (CEO of ChunLab, Inc. & Professor at Seoul National Univ.)

Erroneous DNA G+C content data determined by HPLC and thermal melting experiments are no longer required for species descriptions

In the recent paper (Kim et al., 2015), we evaluated DNA G+C content data obtained by HPLC and thermal melting experiments (Tm) with more accurate experiments directly calculated from whole genome sequences. The range of DNA G+C ratio is different by species, making it a unique characteristic for each species. For example, more variations in DNA G+C content are found among species found in nature (e.g. Pseudomonas) versus species that are pathogenic to humans (e.g. Salmonella). In any case, the range of DNA G+C ratio among species (as defined by Average Nucleotide Identity) is 1%. However, experimentally determined DNA G+C ratio values, reported in past publications, showed high levels of errors when compared with genome-derived values. In fact, about half of these values showed over a 1% difference. The DNA G+C ratio can only be meaningful when the errors are reasonably small. I believe that these levels of errors are not useful and should be rectified using genome sequence data. In the current NGS era, why bother publishing values that are soon to be corrected (likely within a year or two)?

The following figure shows how DNA G+C ratios vary among species, but mostly within a 1% range.

gc_ratio

1. Kim, M., Park, S.C., Baek, I. & Chun, J. Large-scale evaluation of experimentally determined DNA G+C contents with whole genome sequences of prokaryotes. Syst Appl Microbiol 38, 79-83 (2015).

By Jon Jongsik Chun (CEO of ChunLab, Inc. & Professor at Seoul National Univ.)

Updated on April 4th 2016

How to calculate 16S rRNA sequence similarity values for bacterial taxonomy: Why BLAST should be avoided

Nucleotide sequence similarity values are widely used for identification and description of novel species among bacterial taxonomists. There are many different algorithms available for calculating similarity between two gene sequences, and often times it is easy to misinterpret the results. Below, is the method for obtaining nucleotide sequence similarity values for taxonomic purposes.

The calculation of sequence similarity between two genes consists of two steps:
(i) pairwise sequence alignment and (ii) calculation of the similarity value. Pairwise sequence alignment can be achieved either by using the global or local alignment algorithms. It is recommended to use the global alignment algorithm and avoid using the local alignment algorithm (Please see here for details). That’s why a BLAST-series program should NOT be used for calculating similarities. Eventhough, BLAST is still the best tool for identifying the most similar sequences within a large database of sequences.

In the EzTaxon/EzBioCloud server, the closest neighboring taxa are first identified using the BLASTN program, and then a rigorous pairwise sequence alignment algorithm (Myers & Miller, 1988) is used to calculate sequence similarity. When sequence similarity is calculated, gaps are not considered. Using pairwise sequence alignment instead of multiple sequence similarity ensures the reproducibility of the similarity calculation. For example, if you obtain the sequence similarity between A and B from a pairwise sequence alignment, the value will always be the same. However, the values between A and B calculated using multiple sequence alignments among A, B, and C and A, B, and D respectively, may be different as the multiple sequence alignment algorithm tries to find the optimal solution among all sequences, not just between A and B.

These recommendations are also given in the following publications:

1. Kim, M., Oh, H.S., Park, S.C. & Chun, J. Towards a taxonomic coherence between average nucleotide identity and 16S rRNA gene sequence similarity for species demarcation of prokaryotes. Int J Syst Evol Microbiol 64, 346-51 (2014).
2. Kim, O.S. et al. Introducing EzTaxon-e: a prokaryotic 16S rRNA gene sequence database with phylotypes that represent uncultured species. Int J Syst Evol Microbiol 62, 716-21 (2012).
3. Tindall, B.J., Rossello-Mora, R., Busse, H.J., Ludwig, W. & Kampfer, P. Notes on the characterization of prokaryote strains for taxonomic purposes. Int J Syst Evol Microbiol 60, 249-66 (2010).

By Jon Jongsik Chun (CEO of ChunLab, Inc. & Professor at Seoul National Univ.)

Updated on April 4th 2016