OTU Clustering: A Window to Analyse Uncultured Microbial World

Abstract: Clustering is a technique for handling large amounts of data by partitioning the data into groups based on shared attributes. It has many applications across science and technology. In genomics and metagenomics it is an important tool for taxonomic profiling of the microbial world, grouping 16S rDNA amplicon reads into clusters called Operational Taxonomic Units (OTUs). With Next Generation Sequencing (NGS) tools and clustering, it has become easy for scientists to assess microbial diversity in different environments without culturing the microbes. Assigning 16S rDNA sequences to OTUs is the main task of metagenomics algorithms and is also the main bottleneck in analysing microbial communities. Taxonomic profiling of 16S rDNA is an important step in metagenomic analysis pipelines. Several OTU clustering algorithms cluster 16S rDNA amplicon reads into OTUs, each using a specific clustering technique. Some of the most widely used are UCLUST, Swarm, SUMACLUST, SortMeRNA and USEARCH. In this paper, we first give a brief overview of the major clustering techniques and their types. We then provide a comprehensive overview of OTU clustering algorithms.


I. INTRODUCTION
Clustering, also called unsupervised learning, is a meta-learning tool that finds natural structure, based on some metric, in a pool of unlabelled data. A cluster is therefore a group of objects that show similar patterns among themselves but dissimilar patterns to objects belonging to other clusters. In different fields, clustering is referred to by different names, such as cluster analysis, automatic classification, numerical taxonomy and topological analysis. A good clustering yields high-quality clusters in which intra-cluster similarity is high and inter-cluster similarity is low. The applications of clustering are very wide. It has been widely used in biological systems to gain insight into large-scale biological data, such as gene expression data [1], microbiome data for studying microorganisms in different environments, histone modifications [2] and nucleosome positioning [4], [5], and it has wide application in big data analytics [3].
Next Generation Sequencing (NGS) has changed the way microbial communities are studied. Metagenomics, the study of uncultured microbes from their environment, has evolved so much with the help of pyrosequencing that it is now racing in parallel with other big data sciences. Taxonomic profiling, using hyper-variable regions of 16S rDNA, is one of the important parts of metagenomics, and Operational Taxonomic Unit (OTU) clustering algorithms are the important tools for performing it by grouping 16S rDNA reads into OTU clusters. Several OTU clustering algorithms exist for this task. Existing OTU clustering tools can be grouped into three approaches: the closed-reference approach, the de novo approach and the open-reference approach. The closed-reference approach matches input sequences against a reference database to perform OTU clustering. The de novo approach clusters without a reference database; instead it takes a sequence as a seed and searches it against the remaining sequences. The open-reference approach is a hybrid of the two: it first uses the closed-reference approach and then applies the de novo approach to those sequences that do not hit any reference sequence. The remaining part of the paper is organised as follows. Section II discusses the various categories of clustering. Section III discusses OTU clustering approaches. Section IV discusses various OTU clustering algorithms.

II. CATEGORIES OF CLUSTERING
In this section, different types of clustering methods are discussed. There is no standard scale that cleanly separates the various clustering algorithms, because the different classes of algorithms sometimes overlap. All types of algorithms divide the data into clusters based on some characteristic threshold. In general, clustering algorithms can be broadly classified as follows:

A. Hierarchical-based Clustering:
In hierarchical-based clustering algorithms, data are organized hierarchically by combining data items into clusters, these clusters into bigger clusters, and so on. This creates a tree-like structure called a dendrogram. The dendrogram represents the whole dataset: each leaf node represents an individual data item and the interior nodes are non-empty clusters. There are two types of hierarchical clustering methods: the agglomerative (bottom-up) approach and the divisive (top-down) approach.
Agglomerative clustering is a bottom-up approach that starts with one object per cluster and then recursively merges the two or more most appropriate clusters. Divisive clustering, on the other hand, is a top-down approach that starts with the whole dataset as one cluster and then recursively splits it into the most appropriate clusters. The process continues until a threshold condition is satisfied (e.g. k clusters are reached). The issue with the hierarchical clustering approach is that once a step (merge or split) is performed, it cannot be undone.
The main examples of this method are BIRCH, CURE, ROCK and Chameleon.
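The bottom-up procedure described above can be illustrated with a minimal sketch. It assumes one-dimensional numeric data, single linkage (the distance between two clusters is the distance between their closest members) and a stopping condition of k clusters. The function name `single_linkage` and the sample values are illustrative only and do not correspond to BIRCH, CURE, ROCK or Chameleon.

```python
from itertools import combinations


def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]  # start with one object per cluster

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > k:
        # Pick the pair of clusters (i < j) with the smallest gap.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # a merge is never revisited
        del clusters[j]
    return [sorted(c) for c in clusters]


result = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], k=3)
print(result)
```

Because each merge is final, an early poor merge propagates to every later level of the dendrogram, which is the limitation noted above.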

B. Partitioning Relocation Clustering:
Partition-based algorithms divide the data objects into a number of partitions, where each partition represents a cluster. Unlike the hierarchical methods, where a cluster is not revisited once it is created, iterative optimization is used to relocate data items between clusters to improve cluster quality. Each group must contain at least one data item, and each data item must belong to exactly one group. The main classes of this type are:
i. Probabilistic clustering: In this approach, the dataset is assumed to be a sample drawn independently from a mixture model of several probability distributions. Let the randomly picked model j have probability t_j, j = 1, ..., k, and let point x be drawn from the corresponding distribution. Point x is believed to belong to only one cluster, and the probability of point x is then estimated from the mixture.
iii. K-Means Methods: K-means is the simplest and most used clustering algorithm. The centre of each cluster is the arithmetic mean of the coordinates of all its points. The objective function is the sum of distances between the elements of each cluster and its centroid, expressed through an appropriate distance function.
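As a rough illustration of the k-means description, the sketch below alternates an assignment step and a mean-update step on two-dimensional points. It assumes caller-supplied starting centroids and Euclidean distance. The names `kmeans` and `objective` are illustrative, and many formulations minimise squared distances rather than the plain distances used here.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Lloyd-style k-means on 2-D points with caller-supplied start centroids."""
    groups = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]),
            )
            groups[nearest].append(p)
        # Update step: each centroid becomes the arithmetic mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids, groups


def objective(centroids, groups):
    """Sum of distances between each element and its cluster centroid."""
    return sum(math.dist(p, c) for c, g in zip(centroids, groups) for p in g)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids, groups = kmeans(points, centroids=[(1.0, 1.0), (9.0, 1.0)])
print(centroids, objective(centroids, groups))
```

Unlike the hierarchical case, points may move between clusters from one iteration to the next, which is the relocation behaviour that distinguishes partitioning methods.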

C. Density-based Methods:
In density-based clustering methods, density, connectivity and boundary are used to separate the data items into clusters based on their regions. The concept is closely related to a point's nearest neighbours. Depending on the density, a cluster can grow in any direction that density leads to. These methods locate regions of high density that are separated by regions of low density. For this reason, density-based algorithms can form clusters of different or irregular shapes, which provides natural protection against outliers. Well-known density-based algorithms that filter out noise (outliers) and discover clusters of arbitrary shape are DBSCAN, OPTICS, DBCLASD and DENCLUE.
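The idea of growing a cluster wherever density leads can be sketched in the style of DBSCAN. This is a simplified illustration rather than a reference implementation. The parameters `eps` (neighbourhood radius) and `min_pts` (minimum neighbourhood size for a dense point) follow common usage, and the brute-force neighbour search is chosen for clarity, not speed.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [
            j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps
        ]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a dense point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:  # j is dense too, so keep growing
                queue.extend(more)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this toy dataset the isolated point at (20, 20) has no dense neighbourhood and is labelled as noise, while the two dense squares form separate clusters.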

D. Grid-based Methods:
Methods that partition the space are frequently called grid-based methods: the space of the data items is divided into grids, and each grid is treated as a cluster. Grid-based methods have fast processing times because they go through the whole dataset only once to compute statistical values for the grids, and the subsequent work is independent of the number of data items. They employ a uniform grid to collect regional statistical data and then perform the clustering on the grid instead of on the database directly. The performance depends on the size of the grid, which is smaller than the size of the database.
ii. Graph-Based Partitioning: Graph-based clustering is done by simply deleting some of the edges from the main graph to obtain sub-partitions. It is desirable to cut as few edges as possible, but this tends to produce unbalanced clusters. Exact optimization of a minimum cut under balance constraints is NP-hard. Some graph partitioning approaches use the idea of graph flows. The most important application of graph partitioning is VLSI design.
iii. Artificial Neural Networks: The neural network approach uses a set of connected input/output units, where each connection has an associated weight. Neural networks have several properties that make them popular for clustering: they are inherently parallel and distributed processing architectures, and they learn by adjusting their interconnection weights so as to best fit the data. Neural networks process numerical vectors and require object patterns to be represented by quantitative features only. Many clustering tasks handle only numerical data or can transform their data into quantitative features if needed. The neural network approach to clustering tends to represent each cluster as an exemplar. An exemplar acts as a prototype of the cluster and does not necessarily have to correspond to a particular object. New objects can be assigned to the cluster whose exemplar is the most similar, based on some distance measure.
iv. Evolutionary Methods: The two important concepts used in evolutionary methods are simulated annealing and genetic algorithms. The perturbation operator in simulated annealing techniques is used to relocate points from their current cluster to a new, randomly chosen cluster. These methods are mostly used in surveillance monitoring. Genetic algorithms are used for cluster analysis, for example for fuzzy and hard k-means and for clustering categorical data. The limitation of evolutionary methods is their high computational cost, so they are rarely used in data mining.

III. OTU CLUSTERING
Next generation sequencing (NGS) comprises sequencing tools that produce a tremendous amount of data in little time. After sequencing, the data are pre-processed before they go to the clustering process. Clustering tools group the 16S rDNA sequences into clusters called Operational Taxonomic Units (OTUs). There are different types of OTU clustering tools or algorithms, which can be grouped into three approaches: the closed-reference approach, the de novo approach and the open-reference approach (a hybrid of the closed-reference and de novo approaches). In the closed-reference approach, the input sequences are searched against a reference database such as Greengenes to identify the known microbes present in the dataset. Although known microbes can be classified efficiently, this approach cannot find novel species. According to the "rare biosphere" theory [6], [7], there are still many microbes that are not represented in existing reference databases. Grouping unknown microbes is therefore an important task, for which the de novo approach is used. The de novo approach performs microbial profiling by grouping the 16S rDNA sequences of the input dataset into OTU clusters. The open-reference approach combines the two: input sequences are first searched against a database (closed-referencing), and the sequences that fail to cluster there are passed to a de novo algorithm. Most existing studies and tools use similarity thresholds of 97 and 95 percent for grouping at the species and genus levels, respectively. Depending on how clusters are formed, most existing de novo algorithms can be further divided into two categories: greedy heuristic clustering (GHC) and agglomerative hierarchical clustering (AHC).
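The closed-reference step, and the hand-off that turns it into an open-reference workflow, can be sketched as follows. This is a toy illustration under strong assumptions. `identity` is a hypothetical stand-in that counts matching positions in equal-length sequences, whereas real tools use alignment-based identity. The reference names and sequences are invented for the example and are not Greengenes entries.

```python
def identity(a, b):
    """Fraction of matching positions; a stand-in for alignment-based identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, reference, threshold=0.97):
    """Assign each read to its best-matching reference sequence.

    Returns (hits, failures): hits maps a reference name to its reads,
    failures lists the reads that matched no reference at the threshold.
    """
    hits, failures = {}, []
    for read in reads:
        best = max(reference, key=lambda name: identity(read, reference[name]))
        if identity(read, reference[best]) >= threshold:
            hits.setdefault(best, []).append(read)
        else:
            failures.append(read)
    return hits, failures


# Hypothetical 40-nt reference sequences and reads.
reference = {"ref_A": "ACGT" * 10, "ref_B": "TTGCA" * 8}
reads = ["ACGT" * 10, "ACGT" * 9 + "ACGA", "GGGGCCCC" * 5]

hits, failures = closed_reference(reads, reference)
print(hits)
print(failures)
```

In a purely closed-reference analysis the `failures` list would be discarded, which is why novel species are missed. In the open-reference approach that same list is what a de novo algorithm would receive next.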
A. Greedy heuristic clustering is a partition-based clustering method that works at one specific distance level at a time. Greedy clustering works by first choosing an input sequence as a seed; each subsequent input read is then compared against the existing set of seeds. If the read matches one of the seeds within a predefined level, typically 97 percent sequence similarity, it is added to the cluster represented by that seed. Otherwise, it becomes a new seed. Examples in this category are UCLUST [9], USEARCH6, UPARSE [10], CD-HIT-OTU [17], and QIIME's pick_otus [8]. UCLUST selects the seed of the cluster based on the percentage identity between a sequence and a seed. USEARCH and UPARSE perform a similar seed choice to UCLUST, with additional filtering of low-abundance clusters, i.e. small cluster sizes. CD-HIT-OTU groups similar sequences above the 97 percent identity threshold and keeps the longest sequences as seeds. QIIME's pick_otus implements many reference-based and de novo OTU algorithms, but UCLUST is the default method in QIIME. All GHC methods have linear time and space complexities.
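The seed-based procedure just described can be written as a single pass over the reads. The sketch below is illustrative only and is not the implementation of UCLUST or any other named tool. It assumes a simple position-wise `identity` function on equal-length sequences in place of a real alignment, and it assigns a read to the first seed that passes the threshold rather than to the best one.

```python
def identity(a, b):
    """Fraction of matching positions; a simple stand-in for alignment identity."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Single pass over the reads: join the first matching seed or become a seed."""
    seeds = []    # one representative sequence per OTU
    members = []  # members[i] holds the reads assigned to seeds[i]
    for read in reads:
        for i, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                members[i].append(read)
                break
        else:
            # No existing seed is similar enough: the read starts a new OTU.
            seeds.append(read)
            members.append([read])
    return seeds, members


base = "ACGT" * 10        # hypothetical 40-nt read
near = base[:-1] + "A"    # one mismatch, identity 39/40
far = "TGCA" * 10         # no matching positions

seeds, members = greedy_cluster([base, near, far, base])
print(len(seeds), [len(m) for m in members])
```

Each read is compared only against the current seeds and never against other members, so the result depends on the input order. This is one reason several tools sort reads, for example by abundance or length, before clustering.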

B. Agglomerative hierarchical clustering (AHC) is a clustering method that works in a bottom-up manner on a pairwise genetic distance matrix derived from an all-against-all read comparison. Examples in this category include Mothur [11], ESPRIT [12] and ESPRIT-Tree [13]. ESPRIT employs the traditional hierarchical approach of first computing an alignment-based all-against-all distance matrix and then performing either average-linkage or complete-linkage clustering on that matrix. ESPRIT reduces computational complexity by generating only the lower part of a dendrogram. The approach of Mothur is similar, but instead of the pairwise global alignment used by ESPRIT, Mothur uses a multiple sequence alignment tool such as MUSCLE [14] to compute the pairwise distance matrix. Pairwise alignment has been observed to produce better clustering outcomes than multiple sequence alignment [7], [15]. Unlike ESPRIT and Mothur, ESPRIT-Tree uses both greedy and hierarchical strategies. Instead of seeds, it uses "probabilistic sequences" to represent a group of similar sequences and then applies a BIRCH-like [16] clustering method to build and refine a "pseudo-metric based partition tree" of probabilistic sequences. ESPRIT-Tree has quasilinear space and time complexity [13]. In general, GHC approaches are often faster than AHC approaches, whereas AHC tools produce higher-quality clusters than GHC tools [15]. The main drawback of the AHC approach is its high computational complexity, which makes it unsuitable for large datasets. Most existing OTU clustering methods use a threshold cutoff of 97 percent sequence similarity. This de facto choice is based on the assumption that the pairwise genetic distance between a pair of 16S rDNA short reads from the same full-length 16S rDNA (hence from the same species) is less than 0.03. This assumption holds, and the cutoff is therefore applicable, only for datasets in which the pairwise distances between reads from the same species are less than 0.03 and the distances between reads from different species are larger than 0.03. When the distance distribution does not follow this assumption, a more flexible approach to determining the final OTU grouping is preferred.
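Whether a dataset satisfies the 0.03 assumption can be checked directly when species labels are known, for example on a mock community. The sketch below is a hypothetical helper, not part of any cited tool. It assumes a dictionary `distance` keyed by alphabetically ordered read-name pairs and a dictionary `species` mapping each read name to its species.

```python
from itertools import combinations


def cutoff_is_valid(distance, species, cutoff=0.03):
    """True if every same-species pair is closer than the cutoff and
    every different-species pair is farther apart than the cutoff."""
    for a, b in combinations(sorted(species), 2):
        d = distance[(a, b)]
        same = species[a] == species[b]
        if same and d >= cutoff:
            return False  # reads of one species would be split
        if not same and d <= cutoff:
            return False  # reads of two species would be merged
    return True


# Hypothetical reads r1..r3 with invented pairwise distances.
species = {"r1": "X", "r2": "X", "r3": "Y"}
separable = {("r1", "r2"): 0.01, ("r1", "r3"): 0.08, ("r2", "r3"): 0.07}
overlapping = {("r1", "r2"): 0.05, ("r1", "r3"): 0.08, ("r2", "r3"): 0.07}

print(cutoff_is_valid(separable, species))
print(cutoff_is_valid(overlapping, species))
```

In the second case two reads of the same species lie 0.05 apart, so a fixed 0.03 cutoff would split them into separate OTUs. This is the situation in which a more flexible grouping criterion is preferable.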

IV. OTU CLUSTERING ALGORITHMS
There are many OTU clustering algorithms: some are closed source and some are open source, some work as stand-alone tools and some are embedded in metagenomic sequence analysis pipelines. QIIME is one such metagenomic software pipeline; it employs many OTU clustering algorithms, but its default is UCLUST. The OTU clustering algorithms most commonly embedded in the QIIME pipeline are as follows:
A. Swarm [20], [21] is a de novo clustering algorithm that uses an unsupervised agglomerative hierarchical single-linkage clustering method. Swarm has two steps: (i) the set of OTUs is constructed from the similarity of sequence reads by agglomerative clustering; (ii) abundance values are calculated and then used to divide the OTUs into sub-OTUs if needed.
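The single-linkage construction in step (i) can be illustrated with a small sketch. It is not Swarm's actual implementation. It assumes equal-length sequences compared by mismatch count, an illustrative local threshold `d`, and growth that starts from the most abundant unassigned sequence. Step (ii), the abundance-based splitting into sub-OTUs, is left out.

```python
def differences(a, b):
    """Number of mismatching positions (equal-length sequences only)."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def single_linkage_otus(abundance, d=1):
    """Grow each OTU outwards: a sequence joins if it is within d differences
    of ANY sequence already in the OTU (single linkage)."""
    # Most abundant first; ties broken alphabetically for reproducibility.
    pool = sorted(abundance, key=lambda s: (-abundance[s], s))
    otus = []
    while pool:
        otu = [pool.pop(0)]  # most abundant unassigned sequence
        frontier = list(otu)
        while frontier:
            current = frontier.pop()
            close = [s for s in pool if differences(current, s) <= d]
            for s in close:
                pool.remove(s)
            otu.extend(close)
            frontier.extend(close)
        otus.append(otu)
    return otus


# Hypothetical amplicons with invented read counts.
abundance = {"AAAA": 10, "AAAT": 4, "AATT": 2, "GGGG": 5}
otus = single_linkage_otus(abundance)
print(otus)
```

Here "AATT" differs from "AAAA" at two positions but still joins the same OTU through the intermediate "AAAT". This chaining is characteristic of single linkage and is the kind of effect that an abundance-based refinement such as step (ii) can counteract.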
B. OTUCLUST [19] and SUMACLUST use the de novo clustering approach, so no reference database is needed. Both algorithms use a greedy heuristic strategy that compares an abundance-ordered list of input sequences against a representative set of already-chosen sequences, which is initially empty, so that clusters are built incrementally [24].
C. UCLUST and CD-HIT function like OTUCLUST and SUMACLUST, but CD-HIT performs exact sequence alignment rather than depending on fast heuristics. UCLUST is the default clustering algorithm of QIIME, while OTUCLUST performs its own sequence dereplication and chimera removal with the help of UCHIME [25]. UCLUST is also used in all three approaches, i.e. closed-reference, de novo and open-reference.
UPARSE [10] is a de novo amplicon analysis pipeline. UPARSE has built-in stringent quality filtering, length trimming to remove erroneous reads and parallel chimera removal, and it also implements a novel greedy algorithm that performs OTU clustering.

V. CLASSIFICATION OF OTU ALGORITHMS
There are many OTU clustering algorithms; some implement a hierarchical clustering technique and some use a partition-based greedy heuristic approach, so they can be classified in different ways. Here they are classified on the basis of whether they are open source or closed source (proprietary software). The classification is given in Figure 2:

VI. CONCLUSION
Clustering is one of the essential tasks in data mining and needs more improvement now than ever before to help data analysts extract knowledge from terabytes and petabytes of data in less time. This paper provides a view of various clustering approaches and their main algorithms. There are now many Next Generation Sequencing methods that produce tremendous amounts of data in very little time. To handle such huge amounts of data, there should be methods to mine this kind of big sequence data. Clustering is one of the techniques that can not only help to handle such amounts of data but also help to mine the underlying information. The main focus of the paper has been the application of clustering in the metagenomics and genomics fields of biology. A comprehensive account of various OTU clustering algorithms and their classification has been given so that researchers from the computational sciences can get an idea of how to use OTU clustering algorithms in their research work. As future work, the existing algorithms need to be analysed properly for their pros and cons, so that new methods can be developed that are efficient and scalable and can handle the massive amounts of data coming from Next Generation Sequencing (NGS) platforms.