This guest article by Sein Park, a student at Seoul National Univ., contains his own work and opinion which is not of or related to ChunLab, Inc.
|Sein Park, Department of Biological Sciences, Seoul National Univ.
Mothur is one of the representative pipelines widely used for bioinformatics analysis of microbial ecosystems. Its functions include phylogenetic analysis of microbial communities based on 16S rRNA data generated via 454 pyrosequencing, and so on. The Silva and RDP databases are basically used in this process. However, these data are restricted to genus level analysis and detailed analysis of species level is known to be difficult.
This article introduces the process of implementing the Mothur pipeline using the Ezbiocloud database to support detailed analysis of the species level rather than the above reference database, which is basically used in Mothur. In addition, I looked at the differences from the classification result obtained by using the QIIME pipeline having the similar function. I also examined the advantages of Ezbiocloud DB and considered how to improve the Ezbiocloud DB in terms of the species level classification.
Materials & Methods
At first, you need to install the Mothur program. It is distributed as open source at https://www.mothur.org/wiki/Main_Page and can be installed based on the manual posted on the site. Sequence data (input data file) is required for analysis. The format of the file must be fasta. The following manual assumes that quality control of input file is already completed.
In addition, you need the Ezbiocloud DB fasta file and tax file, which are necessary for classification and alignment. However, the form and conditions of the file are somewhat different from those of the QIIME pipeline, and the differences will be discussed later. In this article, the fasta file and tax file are stated as eztaxon_full.fasta and eztaxon_id_taxonomy.tax, respectively, the aligned fasta file as eztaxon_full.align, and the sample file analyzed in manual as example.fasta.
Figure 1 shows the execution process of Mothur, and the process of QIIME is also attached for comparison. In this article, I will explain from “Align sequence” stage to the “Classify OTU” stage where the Ezbiocloud database is used. For the rest of the process, please refer to https: //www.mothur.org/wiki/454_SOP. Commands described below can be adjusted by using various parameters. I will briefly explain the main parameters affecting execution speed.
The unique.seqs command removes duplicate sequences from the fasta file. As a result, example.unique.fasta is created.
The align.seqs command aligns the input file based on the reference file (the reference file should be aligned). Mothur provides Silva and Greengenes reference data at the official page. If there is aligned data of the Ezbiocloud database, you can write reference parameters like the above command. Having enough memory, you can run faster using “processors” parameters. (Eg, processors = 4) The output is the example.unique.align file.
The screen.seqs command deletes the rest of the sequence, leaving only a portion of the sequence matching the condition inserted in the parameters. For example, if you insert the “end=n” parameter (where n is a natural number), nucleotides after n-th position are removed and the “maxambig=n” parameter deletes all sequences with n or more ambiguous bases. Available parameters are listed at https://www.mothur.org/wiki/Screen.seqs. You can get example.unique.good.align as an execution result.
The filter.seqs command looks at the alignment pattern of the screened data and filters the sequence based on the condition inserted in the parameter. For example, setting “vertical=T” deletes the sequence in which all characters are – (gap character). Also, inserting “trump=-” will delete the whole column with – (gap character) in all arrays. Specific parameters can be checked at https: //www.mothur.org/wiki/Filter.seqs. When this command is executed, example.unique.good.filter.fasta is output.
It deletes the sequence of duplicates among the sequence after screen.seqs and filter.seqs commands. As an execution result, example.unique.good.filter.unique.fasta file is generated.
The pre.cluster command removes the sequence that may cause an error. Based on the idea that abundant sequences are more likely to generate erroneous sequences than rare ones, Mothur ranks abundance and remove abundant sequences over a certain threshold value. When you are done, the example.unique.good.filter.unique.precluster.fasta file is generated.
classify.seqs(fasta=example.unique.good.filter.unique.precluster.fasta, template=eztaxon_full.fasta, taxonomy=eztaxon_id_taxonomy.tax, method=knn, numwanted=1)
The classify.seqs command classifies each sequence of input files based on template and taxonomy files. Based on criteria, there are various parameters, which can be used selectively. According to the experimental results, however, you should add the parameter “method=knn, numwanted=1” in order to classify the species level. The reason for this will be stated in the discussion. Once this command completes, you can obtain classification information for each sequence.
remove.lineage(fasta=example.unique.good.filter.unique.precluster.fasta, taxonomy=example.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.taxonomy, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)
The remove.lineage command removes data from the classification file that meet the conditions set in the taxon parameter. The “Mitochondria-Chloroplast-Archaea-Eukaryota-unknown” condition deletes data classified as mitochondrial and chloroplast DNA, archaea and eukaryotes, and deletes data classified as unknown above the kingdom level.
The dist.seqs command calculates the distance of each aligned sequence pair. This distance increases if there are much gaps or mismatchs, and as the distance increases, it can infer that they belong to another OTU. You can get the dist file as the execution result.
The cluster command assigns each sequence to the OTU according to the calculated distance above. You can set cutoff of the distance to be classified as OTU based on the parameters.
The classify.otu command classifies each OTU based on the classification data of the sequence and the list of clusters of each OTU.
The following data show the result of the mock data used in Schloss et al. (2011), executed according to the manual above. The control group used the Silva database for align.seqs and used the RDP database for classify.seqs. You can download the dataset from Mothur manual page, https://www.mothur.org/wiki/454_SOP.
Commands with specific parameters are as follow:
align.seqs(fasta=GQY1XT001.shhh.trim.unique.fasta, reference=eztaxon_full.align, processors=16)
- Control Group: reference= silva.nr_v123.align
screen.seqs(fasta=GQY1XT001.shhh.trim.unique.align, name=GQY1XT001.shhh.trim.unique.names, group=GQY1XT001.shhh.groups, optimize=start, criteria=95, processors=16)
filter.seqs(fasta=GQY1XT001.shhh.trim.unique.good.align, vertical=T, trump=., processors=16)
pre.cluster(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.names, group=GQY1XT001.shhh.good.groups, diffs=2)
classify.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, template=eztaxon_full.align, taxonomy=eztaxon_id_taxonomy.tax, method=knn, numwanted=1, processors=16)
- Control Group: template=silva.nr_v123.align, taxonomy=silva.nr_v123.tax
remove.lineage(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.fasta, name=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.names, group=GQY1XT001.shhh.good.groups, taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.eztaxon_id_taxonomy.knn.taxonomy, taxon=Mitochondria-Chloroplast-Archaea-Eukaryota-unknown)
- Control Group: taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.nr_v123.knn.taxonomy
dist.seqs(fasta=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.pick.fasta, cutoff=0.15, processors=16)
- Control Group: taxonomy=GQY1XT001.shhh.trim.unique.good.filter.unique.precluster.nr_v123.knn.pick.taxonomy
The results were organized and analyzed in a unified form. I investigated the number of each taxonomy level of classification succeeded for each database. (Table 1) In addition, I compared both result files and quantified which classification level was in coincidence with each other and what proportion it was. (Figure 2)
Table 2 and Figure 3 show the modified result of the above, where the Ezbiocloud data were aligned by their own align data, and mock data were classified using the new version of Silva database.
As shown in Table 1, it was possible to classify all data using Ezbiocloud at the species level, and genus level classification was possible in all cases using Silva / RDP. In Figure 2, comparing the results using both databases showed that the match rate was highest at the order level. In other words, the order was the same, but the family did not match. The final concordance rate in family and genus level followed that in order level.
On the other hand, when using the 123 version of the Silva database, as shown in Table 2, not all classifications were succeeded at the genus level, and family level classification was almost confirmed. Also, as shown in Figure 3, about 92% of the data was consistent with family and genus levels, but much higher at family level. Such trend would be explained in detail in the discussion section.
Differences from Performing QIIME
The file format required to run the Ezbiocloud database in QIIME and Mothur was somewhat different. In Mothur, a semicolon (;) should be added at the end of each line of the taxonomy file. Also, unlike QIIME, Mothur required an aligned template of the Ezbiocloud fasta file. Since I could not get aligned data in this article at first, I used aligned Ezbiocloud fasta files based on Silva’s aligned data.
After complementary work, I used align data of Ezbiocloud DB made by itself. Initially Mothur did not recognize that this data was aligned. When I looked directly at the source code, Mothur recognized that they were not aligned unless all the aligned sequences in the file had the same length. Thus, in order to match the length of the longest sequence, you just need to add some gap characters (-) to the end of every sequence and then mother would successfully recognize it as aligned data.
Difficulties in Performing
Since the information in the Ezbiocloud database is too large, when classifying and aligning, memory overflow may occur. In fact, you can include wang or knn in the method parameter when you run the classify.seqs command. In the case of wang method, it is necessary to calculate the matching probability of each classification. The problem is that when you inspect a huge amount of databases, the memory may be insufficient and the program could be dead.
The knn method is the method of selecting the nearest k data and classifying the sequence with them. The default value of k = 10 would be impossible to classify as species level. Perhaps there could be a lot of ambiguity to speculate from the ten candidates, since the species level is very detailed. Therefore, only one candidate with the highest degree of coincidence was selected and “numwanted=1” was inserted to strictly classify the species level. (However, you should concern the problem of overfitting. You may select other value like 3.) For this reason, all classification results classified by Ezbiocloud and RDP databases are determined at the species level and genus level.
Idea of Improvement & Future Suggestion
Each database was able to successfully classify species level and genus level – and family level in version 123 Silva DB – as expected. However, when comparing both results, the family or genus match rate was lower than the match rate of order level. Nevertheless, the consistent tendency was observed in the unmatched results. For example, the species that Ezbiocloud expected to be included in S24-7 at the order levels were classified as Porphyromonadaceae by Silva / RDP, and if Ezbiocloud predicted as Eisenbergiella at the family level, it is expected to be Clostridium_XlVa or Blautia by SILVA / RDP. Considering why this consistent incorrectness occurs would be an unobtrusive problem.
The microbial species level has very subtle differences. In particular, it is not easy to determine significant differences in 16S rRNA sequences of microbes that do not cause mutations well over time. Thus, it is required to find the maximum value of “numwanted” parameter in classify.seqs which can be successful to categorize into species level, and also has statistically significant without overfitting. Finally, if you build a database that can efficiently identify species-levels using information on other core genes, you will get more reliable results.
Supplements & Improvements
Unlike previous results (Table 1 & Figure 2), sequences from the new mock data with Silva DB from version 123 were not all classified in genus level. About 60-70% were able to classify only at the family level, but this is perhaps because of the characteristic of the data used in the experiment. Of the approximately 6,000 data of the community used for the experiment, I found approximately 2400 family of S24-7 group. Unlike earlier version, in which this family was never able to be classified, 123 version of Silver DB could classify the S24-7 family, but information on the sub-family level was not included. Therefore, the ratio of family level was much higher than the genus-level classification. Genus-level concordance rate, when the Silver database can classify the sub-S24-7 family level, is expected to reach 80-90%.
I would also be able to analyze unknown community data via OTU clustering and OTU classification process. Furthermore, by adjusting various parameters, you can classify the unknown community data with high precision.
I tried to describe how to run Mothur using the Ezbiocloud database. Based on the mock data used in Schloss et al. (2011), I confirmed that the Ezbiocloud database is useful for species-level classification. Also, by comparing to control data obtained by using SILVA and RDP database, I was able to investigate the coincidence of the Ezbiocloud database numerically. Finally, I discussed the ways to improve the accuracy. By applying these improvements, it could be possible to classify microbes more accurately.
- Plummer, Erica, et al. “A Comparison of Three Bioinformatics Pipelines for the Analysis of Preterm Gut Microbiota using 16S rRNA Gene Sequencing Data.”Journal of Proteomics & Bioinformatics 2015 (2015).
- Schloss PD, Gevers D, Westcott SL. (2011). Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PloS ONE. 6:e27310.
Last edited January 13, 2017