Sentiment Analysis on Large-Scale Amazon Product Reviews

The world is becoming increasingly digitalized, and in this digitalized world e-commerce is in the ascendancy, putting products within customers' reach without their having to leave the house. As people increasingly rely on buying products online, the importance of reviews keeps growing. To select a product, a customer may need to go through thousands of reviews to understand it. In this prospering age of machine learning, sifting through thousands of reviews becomes much easier if a model is used to polarize those reviews and learn from them. We used a supervised learning method on a large-scale Amazon dataset to polarize it and obtained satisfactory accuracy.


I. INTRODUCTION
As commerce worldwide has moved largely onto online platforms, people trade products through different e-commerce websites, and reviewing products before buying has become a common scenario. Nowadays customers increasingly rely on reviews when deciding whether to buy a product, so analyzing the data in those customer reviews is an essential field. Reading thousands of reviews to understand a product is time-consuming, whereas in this age of machine learning an algorithm can polarize the reviews in a particular category and reveal a product's popularity among buyers all over the world.
In this paper, we categorize the positive and negative feedback of customers on different products and build a supervised learning model to polarize a large amount of reviews. A study on Amazon last year revealed that over 88% of online shoppers trust reviews as much as personal recommendations. Any online item with a large number of positive reviews carries a powerful endorsement of its legitimacy; conversely, a book or any other online item without reviews puts potential prospects in a state of distrust. Quite simply, more reviews look more convincing. People value the consensus and experience of others, and the reviews on an item are the only way to understand others' impressions of the product. Opinions collected from users' experiences regarding specific products or topics directly influence future customer purchase decisions; similarly, negative reviews often cause sales loss. Understanding customer feedback and polarizing it accordingly over a large amount of data is therefore the main aim of this paper. There are some similar works done on Amazon datasets.
In this model, both manual labeling and an active learning approach are used to label the datasets. In the active learning process, different classifiers are applied until the accuracy reaches a satisfactory level. The labeled datasets are then preprocessed, and features extracted from the processed dataset are classified by different classifiers.
Two kinds of approaches are used to extract features: the bag-of-words approach, and a TF-IDF with chi-square approach for obtaining higher accuracy.

1.1 Problem statement
On an online shopping website a product may have thousands of reviews, and it is hard for a person to go through all of them. The aim of this paper is to develop a supervised learning model that polarizes a large, initially unlabeled product review dataset and can give a statistical report on the number of reviews in which the customer is not satisfied with a specific feature of an Amazon product. A further aim is to compare different classification methods for sentiment analysis on the Amazon dataset, applying sentiment analysis with different machine learning algorithms to see whether some work better in particular aspects.

Objective
A prototype is built to demonstrate that text reviews can be mined to extract feature-based feedback for any product. Text reviews from any online shopping website such as eBay, Target or Walmart can be used in this system with minor changes in the implementation. For the implementation of the system, Amazon review and product data are used as the dataset. Natural language processing and text mining techniques are used to identify major features of the product, sentiment analysis is used to identify the polarity (positive or negative) of each review, and machine learning algorithms are used to generate the result.
The paper is organized as follows: section 1 gives the introduction, section 2 the literature survey, section 3 the system design, section 4 the design and implementation, section 5 the results and snapshots, and finally section 6 the conclusion and future work.

II. LITERATURE SURVEY
The literature survey is an important part of a report, as it gives direction to the area of research. It helps to set a goal for the analysis and thus shapes the problem statement. It is also a systematic and thorough search of all types of published literature as well as other sources. The following are some of the research papers related to this work.
Amandeep Kaur and Deepesh Khaneja [1], in "Sentiment analysis on twitter using apache spark", applied and extended the current work in the field of Apache Spark. They used the fast, in-memory computation framework Apache Spark to extract live tweets and perform sentiment analysis. The primary aim is to provide a method for analyzing sentiment scores in noisy Twitter streams. The paper reports on the design of a sentiment analysis system that extracts a vast number of tweets, and the results classify users' perceptions via tweets into positive and negative.
Sentiment analysis is the prediction of emotions in a word, sentence or corpus of documents. It is intended to serve as an application for understanding the attitudes, opinions and emotions expressed within an online mention.
Apache Spark is an open source, lightning-fast cluster computing platform that can retrieve streaming data and forward it to a storage system such as a database server. It is an in-memory, fast processing system used for large-scale data processing.
Kuat Yessenov and Sasa Misailovic [2], in "Sentiment Analysis of Movie Review Comments", extended the current work in the field of machine learning techniques and applied sentiment analysis to movie review comments from the popular social network Digg.
They chose the domain of social website comment messages, obtaining the comments from articles posted on Digg. Digg is a social networking website that enables its users to submit votes and comments: its voting system allows users to vote for (+1) or against (-1) posted items and to leave comments on posts. The total sum of diggs, that is, the difference between thumbs-up and thumbs-down votes, represents the popularity of the post. Besides popularity, which is assigned by other users, there is no clue about the sentiment of the author of the messages.
The focus of that paper is the analysis of sentiment in short website comments, which are expected to express the author's opinion on a certain topic succinctly and directly. They focus on two important properties of text: (1) subjectivity, whether the style of the sentence is subjective or objective; and (2) polarity, whether the author expresses a positive or negative opinion. They use statistical methods, applied at the sentence level, to capture the elements of subjective style and the sentence polarity, and apply machine learning techniques to classify sets of messages.
Han-xiao Shi and Xiao-jun Li [3], in "Sentiment Analysis in Hotel Reviews Based on Supervised Learning", extracted sentiment from reviews and proposed a supervised machine learning approach. A classification task usually involves training and testing data consisting of data instances; each instance in the training set contains one target value (class label) and several attributes/features. The goal of a support vector machine (SVM) is to produce a model which predicts the target value of data instances in the testing set, given only the attributes. SVMs have been shown to be highly effective at traditional text categorization, generally outperforming Naive Bayes; they are large-margin rather than probabilistic classifiers, in contrast to Naive Bayes. The authors pay attention to online hotel reviews and propose a supervised machine learning approach using unigram features and TF-IDF to realize polarity classification of documents.
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus.
In future work, they want to explore semi-supervised machine learning to increase the size of the training set and thus improve the effectiveness of the experiment, and to introduce natural language processing (NLP) techniques to further improve the performance of sentiment analysis.
According to a recent study, "Unfair Reviews Detection on Amazon Reviews using Sentiment Analysis with Supervised Learning Techniques", carried out by Diekmann et al. [4] (2014), vendors with the best reputation have an increased number of sales. However, promoting trustworthy participation also creates an incentive for malicious actors to push their reputation unfairly to gain more benefit, and dishonest reviews or ratings have already become a serious problem in practice. Thus, the primary goal of that research is detecting unfair reviews on Amazon through sentiment analysis using supervised learning techniques in an e-commerce environment.
The experiments conducted with sentiment classification algorithms report the performance measures of precision, recall and accuracy. They applied NB and SVM classifiers, which provide a useful perspective for understanding and evaluating many learning algorithms.
Their research fundamentally focuses on the document level of sentiment analysis, specifically on datasets of Amazon reviews. Sentiment analysis methods will have a fundamental positive effect on reputation systems, especially in unfair-review detection processes in an e-commerce environment and other domains. Feedback reviews in e-commerce are an important source of information for customers, reducing product uncertainty when making purchasing decisions; however, with the increasing volume of feedback reviews, customers sometimes make buying decisions based on unfair or fake reviews. The sentiment classification algorithms were applied with stop-word removal on three different Amazon review datasets, and the authors observed that using stop-word removal is both more effective than not using it and more efficient for detecting unfair reviews.
Shankar Setty and Rajendra Jadi [5], in "Classification of Facebook News Feeds and Sentiment Analysis", attempted to classify users' news feeds into various categories using classifiers, to provide a better representation of the data on a user's wall. They presented a system for classifying Facebook news feeds and developed a model that classifies posts appearing on a user's Facebook wall to find the most important news feeds and automatically detect the sentiments of the user. An automated training dataset automatically collects live news feeds from Facebook, and various machine learning classification techniques are used to build the classifier. Pre-processing of the news feeds is done by removing special characters and stemming words. The various classes considered are as follows:
• Liked pages posts: Most users like various companies' pages. These companies periodically update their status, which is subsequently pushed to the user's wall.
• Friends posts: Posts or status updates contributed mainly by a user's friends or followers, which are subsequently pushed to the user's news feed. Friends posts are further classified into life events posts and entertainment posts.
• Life events posts: User posts about events occurring in their lives, like an engagement or marriage, or about their present status, like traveling or having fun.
• Entertainment posts: Updates like poems or reviews about movies, technologies and products, which are not of primary importance, are labeled as entertainment posts.
Here, the system developed for sentiment analysis splits the input text into words: tokenization is performed on each Facebook post to extract each term, and using a POS tagger the part of speech of each term is detected. The polarity of each word is found using the SentiWordNet dictionary. To filter the terms for scoring, stop words are removed, and the SentiWordNet dictionary is used to determine the sentiment orientation value of each word. A sentiment score is then calculated for each word: each word has a sentiment orientation value such as +1 (strongly positive), +0.5 (weakly positive), -1 (strongly negative) or -0.5 (weakly negative).
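The word-level scoring scheme above can be sketched in a few lines; the lexicon here is a tiny hypothetical sample standing in for SentiWordNet, and the function name is our own:

```python
# Sketch of word-level sentiment scoring with a tiny made-up lexicon
# standing in for SentiWordNet.
LEXICON = {"excellent": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

def sentiment_score(tokens):
    # sum the orientation value of each known word; unknown words score 0
    return sum(LEXICON.get(t, 0.0) for t in tokens)

print(sentiment_score(["good", "camera", "terrible", "battery"]))  # 0.5 - 1.0 = -0.5
```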
A sentiment analysis work on YouTube comment scraping by Kuat Yessenov [6], discussed in "Sentiment Analysis on YouTube Movie Trailer comments to determine the impact on Box-Office Earning", analyzes channel satisfaction using machine learning algorithms like Naive Bayes. The study examines and categorizes the types of comments made by YouTube users on popular Hollywood movie trailers to understand how the sentiments of these users can impact first-day revenue, and also shows the trend of box office earnings based on the sentiments after the movie is released. This helps distributors and moviemakers to estimate the response rate for a movie in advance by understanding the comments on its trailers; once the movie is released, the next day's earnings can be predicted by looking at the present-day sentiments.
Callen Rain [7], in "Sentiment Analysis in Amazon Reviews Using Probabilistic Machine Learning", applied and extended the current work in natural language processing and sentiment analysis to data from Amazon review datasets. Bag-of-words feature extraction was used, which could support systems that analyze more diverse sets of data but may be most useful on smaller datasets. The systems performed reasonably well on small datasets even when trained and tested on completely different products; this could be applied not only to testing different products but also to testing different features of a product. Naive Bayes and decision list classifiers were used to tag a given review as positive or negative, with reviews selected from the books and Kindle sections of Amazon.
The reported accuracy of Naive Bayes is very high for both the book and Kindle reviews. This is because of the simplicity of Naive Bayes: the algorithm classifies text purely according to maximum probability, while a decision list does not. As the number of samples increases, a decision list takes ever longer to complete the classification process, and in some cases it cannot finish at all.

This research work by Hatzivassiloglou & McKeown [8] on "Opinion Annotation in On-line Chinese Product Reviews" focused on subjective word extraction and opinion classification at the document level. In particular, a practical opinion analysis system for product reviews is expected to provide not only how positive or negative a comment is, but also the attribute it targets.
In that study, a new scheme for annotating opinions in Chinese product reviews is proposed, where an opinion is defined as a person's ideas and thoughts towards a product. The paper includes a methodology for the annotation scheme, which determines how opinion expressions are annotated: for each opinion expression, its expression segment, opinion keyword, polarity and degree are annotated. Furthermore, the relevant negations and modifiers can be extracted, and the status of the opinion keywords observed.
Su Su Htay and Khin Thidar Lynn [9] proposed a novel idea for sentiment analysis and sentiment classification of Amazon product reviews, efficiently finding opinion words or phrases for each feature in customer reviews. In their paper they derive patterns of opinion words/phrases about product features from the review text through adjectives, adverbs, verbs and nouns. The extracted features and opinions are useful for generating a meaningful summary that provides a significant informative resource helping both users and merchants to track the most suitable choice of product.
The work consists of first collecting reviews from the Amazon dataset. The dataset used is divided into different categories, each consisting of positive, negative and unlabeled reviews. For training, the system is given known positive and negative reviews. A preprocessing step removes the stop words, and features are then extracted using phrase-level, single-word and multiword methods. After feature selection/extraction is completed, a vector is generated and used for training the system. In this model no dictionary was used; instead the vector of extracted features was generated and used as a dictionary to classify the unlabeled reviews.
The Naive Bayes algorithm is then used for classification, with phrase-level, single-word and multiword feature selection techniques. The workflow is divided into A) feature selection/extraction and B) sentiment classification.
Elli Maria and Yi-Fan [10], in "Amazon Reviews, Business Analytics with Sentiment Analysis", extracted sentiment from reviews and analyzed the results to build a business model. The aim of that paper is to extract sentiment from more than 2.7 million reviews in the Amazon product data dataset and analyze the implications for the business area. They claimed the demonstrated tools were robust enough to give high accuracy, and the use of business analytics made their decisions more appropriate. They also worked on detecting emotions in reviews, inferring gender from names, and detecting fake reviews. The programming languages used were mainly Python and R, and the main classifiers were Multinomial Naive Bayes (MNB) and support vector machines (SVM).
For the classification problem, they built a Multinomial Naive Bayes (MNB) and a Support Vector Machine (SVM) classifier (Joachims, 1998; Wu et al., 2004) using Python packages. They trained both classifiers on 50% of the data and tested them on the other 50% to calculate the accuracy.
The accuracy in both cases is very high. It is worth mentioning that the processing time of the two algorithms is very different, because of the simplicity of Naive Bayes: that algorithm uses only simple arithmetic operations, while SVM does not. As the number of samples increases, the SVM takes ever longer to complete the classification process, and in some cases it cannot finish at all.

III. SYSTEM ARCHITECTURE
Amazon is one of the largest e-commerce sites, so a large number of reviews is available there. The data used here is the Amazon product data. The dataset was unlabeled, and to use it in a supervised learning model we had to label the data first. The system architecture is shown in the block diagram below:

Data Acquisition
Here the dataset is acquired and labeled. With such a large number of reviews, manually labeling everything was practically impossible for us; therefore we preprocessed the data and used an active learner to label the datasets. Amazon reviews come with 5-star ratings, and 3-star ratings are generally considered neutral reviews, neither positive nor negative. We therefore discard any review with a 3-star rating from the dataset, take the remaining reviews, and proceed to the next step, labeling the dataset.

Pool-Based Active Learning
Active learning is a special case of semi-supervised learning. As manually labeling the whole dataset is practically impossible, to reduce the labeling effort we use a particular kind of semi-supervised approach known as pool-based active learning. The process needs some pre-labeled data as training and testing sets alongside the unlabeled dataset, so we provide some manually labeled reviews as the training-testing sets. From the pool of unlabeled data, the learning method then asks an oracle (a user) to label a few items, and runs classifiers to calculate the accuracy. The accuracy shows whether the decision boundary separates most of the values into the two classes; the higher the accuracy, the better the data is being labeled. If the accuracy is greater than or equal to 90%, we take those data and combine them with the already pre-labeled data to get our labeled dataset; if not, we again ask the oracle to label some more data. Once the accuracy reaches 90%, we consider the data labeled.
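As a rough illustration of this loop, the sketch below uses a stand-in oracle function in place of the human annotator and a deliberately tiny word-overlap classifier in place of the real classifiers; the data, function names and batch size are all illustrative assumptions, not the paper's actual components.

```python
# Sketch of the pool-based active-learning loop (illustrative only).
def oracle(review):
    # stand-in for the human annotator
    return "neg" if "bad" in review or "poor" in review else "pos"

def classify(review, labelled):
    # toy classifier: vote by word overlap with already-labelled reviews
    votes = {"pos": 0, "neg": 0}
    words = set(review.split())
    for text, label in labelled:
        votes[label] += len(words & set(text.split()))
    return "pos" if votes["pos"] >= votes["neg"] else "neg"

def active_learn(labelled, pool, test, target=0.9, batch=2):
    def accuracy():
        return sum(classify(r, labelled) == l for r, l in test) / len(test)
    while pool and accuracy() < target:
        for review in pool[:batch]:      # ask the oracle to label a few more
            labelled.append((review, oracle(review)))
        pool = pool[batch:]
    return labelled, accuracy()

seed = [("good phone", "pos"), ("bad phone", "neg")]
test = [("good case", "pos"), ("bad case", "neg"),
        ("poor cable", "neg"), ("great cable", "pos")]
pool = ["poor speaker", "great speaker", "good charger", "bad charger"]
labelled, acc = active_learn(seed, pool, test)   # stops once acc >= 0.9
```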

Data Pre-Processing
Tokenization: Tokenization is the process of separating a sequence of strings into individual elements such as words, keywords, phrases and symbols, known as tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters such as punctuation marks are discarded. The tokens serve as input for further processes like parsing and text mining.
Removing Stop Words: Stop words are words in a sentence that are not necessary for any text mining task, so they are generally ignored to enhance the accuracy of the analysis. Stop word lists differ across countries and languages; English has several well-known stop words.
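The two preprocessing steps above might be sketched as follows; the stop-word list is a small illustrative sample, not a complete English list:

```python
import re

# small illustrative sample of English stop words
STOP_WORDS = {"a", "an", "the", "is", "it", "and", "this", "of"}

def tokenize(text):
    # lower-case and keep only alphabetic runs, discarding punctuation
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("This is an amazing product, and it works!")
filtered = remove_stop_words(tokens)
# tokens:   ['this', 'is', 'an', 'amazing', 'product', 'and', 'it', 'works']
# filtered: ['amazing', 'product', 'works']
```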

POS tagging:
The process of assigning a part of speech to a given word is called parts-of-speech tagging, generally referred to as POS tagging. Parts of speech include nouns, verbs, adverbs, adjectives, pronouns, conjunctions and their sub-categories. A parts-of-speech tagger, or POS tagger, is a program that does this job.
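A real POS tagger is a trained program, but the idea can be illustrated with a toy lexicon-plus-suffix-rule sketch; the lexicon and rules below are made-up assumptions, not a usable tagger:

```python
# Toy POS tagger: tiny hand-made lexicon plus a crude suffix heuristic.
LEXICON = {"battery": "NOUN", "screen": "NOUN", "works": "VERB",
           "great": "ADJ", "poor": "ADJ", "the": "DET"}

def pos_tag(tokens):
    tags = []
    for tok in tokens:
        if tok in LEXICON:
            tags.append((tok, LEXICON[tok]))
        elif tok.endswith("ly"):
            tags.append((tok, "ADV"))   # crude suffix rule
        else:
            tags.append((tok, "NOUN"))  # default guess
    return tags

print(pos_tag(["the", "battery", "works", "great", "surprisingly"]))
# [('the', 'DET'), ('battery', 'NOUN'), ('works', 'VERB'),
#  ('great', 'ADJ'), ('surprisingly', 'ADV')]
```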

Feature Extraction
Bag of Words: Bag of words is a feature extraction process that represents text in a simplified form, used in natural language processing and information retrieval. In this model, a text or a document is represented as the bag (multiset) of its words; in sentiment analysis, the bag of words is simply a list of useful words. We used the bag-of-words approach to extract our feature sets: after preprocessing the dataset we applied POS tagging to separate the different parts of speech, selected the nouns and adjectives, and used those to create a bag of words. We then ran it through supervised learning to obtain our results, as well as the top words used in the review dataset.
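A minimal sketch of the bag-of-words representation, assuming the reviews have already been tokenized and filtered down to the selected words:

```python
# Build a vocabulary from a small corpus and turn each review into a
# vector of word counts (the bag-of-words representation).
def build_vocab(docs):
    vocab = sorted({word for doc in docs for word in doc})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(doc, vocab):
    vec = [0] * len(vocab)
    for word in doc:
        if word in vocab:          # ignore out-of-vocabulary words
            vec[vocab[word]] += 1
    return vec

docs = [["great", "battery", "great", "screen"],
        ["poor", "battery"]]
vocab = build_vocab(docs)          # {'battery': 0, 'great': 1, 'poor': 2, 'screen': 3}
print(vectorize(docs[0], vocab))   # [1, 2, 0, 1]
```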

TF-IDF: Term frequency
Term frequency increases the weight of terms (words) that occur more frequently in the document, and can be defined as tf(t,d) = F(t,d), where F(t,d) is the number of occurrences of term 't' in document 'd'. Practically, however, it seems unlikely that thirty occurrences of a term in a document truly carry thirty times the significance of a single occurrence. To make it more pragmatic, tf is logarithmically scaled, so that as the frequency of a term increases exponentially, its weight increases only additively:

tf(t,d) = log(F(t,d))
Inverse document frequency diminishes the weight of terms that occur in all the documents of a corpus and correspondingly increases the weight of terms that occur in rare documents across the corpus.
Basically, rare keywords get special treatment and stop words/non-distinguishing words get punished. It is defined as idf(t,D) = log(N / N_t), where 'N' is the total number of documents in the corpus 'D' and 'N_t' is the number of documents in which term 't' is present. So tf is an intra-document factor that depends on the individual document, while idf is a per-corpus factor that is constant for a corpus. Finally, tf-idf is calculated as:

tf-idf(t,d,D) = tf(t,d) · idf(t,D)
TF-IDF is an information retrieval technique that weighs a term's frequency (TF) together with its inverse document frequency (IDF). Each word or term has its own TF and IDF score, and the product of a term's TF and IDF scores is referred to as the TF*IDF weight of that term. The higher the TF*IDF weight, the more distinctive the term: it is frequent in the document but rare across the corpus. TF of a word is the frequency of the word, and IDF of a word is a measure of how significant that term is throughout the corpus. When words have high TF*IDF weight in content, the content tends to be among the top search results, so one can stop worrying about stop words and successfully find words with higher search volumes and lower competition.
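The definitions above can be computed directly. Note that with the purely log-scaled tf used here, a term occurring once gets tf = log 1 = 0, which is why some implementations use 1 + log F instead; the two-document corpus below is a made-up example:

```python
import math

def tf(term, doc):
    # tf(t,d) = log F(t,d); zero when the term is absent
    return math.log(doc.count(term)) if term in doc else 0.0

def idf(term, corpus):
    # idf(t,D) = log(N / N_t)
    n_t = sum(term in doc for doc in corpus)
    return math.log(len(corpus) / n_t) if n_t else 0.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [["good", "good", "battery"], ["poor", "battery"]]
w = tf_idf("good", corpus[0], corpus)   # (log 2) * (log 2/1) ≈ 0.4805
```

Note that "battery" occurs in every document, so its idf (and hence its tf-idf weight) is zero, exactly the punishment of non-distinguishing words described above.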

Supervised Learning
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.
In order to solve a given problem of supervised learning, one has to perform the steps enumerated after Figure 1 at the end of the paper.

Evaluating Measures
Evaluation metrics play an important role in measuring classification performance. The accuracy measure is the most common for this purpose: the accuracy of a classifier on a given test dataset is the percentage of the dataset that is correctly classified by the classifier. For text mining approaches, however, the accuracy measure alone is not always enough to support a proper decision, so we also took some other metrics to evaluate classifier performance; three important measures commonly used are precision, recall and F-measure. Before discussing the different measures, there are some terms we need to get comfortable with:
• TP (True Positive) is the number of positive instances correctly classified as positive
• FP (False Positive) is the number of negative instances incorrectly classified as positive
• FN (False Negative) is the number of positive instances incorrectly classified as negative
• TN (True Negative) is the number of negative instances correctly classified as negative
Precision: Precision measures the exactness of a classifier, i.e. how many of the returned documents are correct. Higher precision means fewer false positives, while lower precision means more false positives. Precision (P) is the ratio of instances correctly classified as positive to all instances classified as positive: P = TP / (TP + FP).
Recall: Recall measures the sensitivity of a classifier, i.e. how many of the positive instances it returns. Higher recall means fewer false negatives. Recall (R) is the ratio of correctly classified positive instances to the total number of actual positive instances: R = TP / (TP + FN).
F-Measure: Combining precision and recall produces a single metric known as the F-measure, the weighted harmonic mean of precision and recall: F = 2PR / (P + R).
Accuracy: Accuracy describes how often the classifier makes the correct prediction; it is the ratio between the number of correct predictions and the total number of predictions: Accuracy = (TP + TN) / (TP + FP + FN + TN).
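The four measures can be written out directly from the TP/FP/FN/TN counts; the counts below are made-up example values:

```python
# The four evaluation measures computed from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

tp, fp, fn, tn = 8, 2, 1, 9         # made-up example counts
p, r = precision(tp, fp), recall(tp, fn)
# p = 0.8, r ≈ 0.8889, F ≈ 0.8421, accuracy = 0.85
```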
Several machine learning algorithms were used in our experiment, such as Naive Bayes and the support vector machine (SVM) classifier. We conducted cross-validation to get the best accuracy, ran the best classifiers on three categories of product reviews, and examined the results according to the evaluation measures. The classifiers were applied with the different feature selection processes on all the datasets.

Sentiment analysis
Sentiment analysis is the process of identifying people's opinions or emotions about an entity. There are two approaches to identifying people's sentiment about an entity: the lexicon-based approach and the machine learning approach.
In the machine learning approach the data are divided into two parts: before text can be analyzed for sentiment, the classifier must first be trained. During training, the data are labeled with polarity values and stored in the training set. Two datasets are thus used: a training set, on which the algorithm is trained before the sentiment analysis, and a test set, used to check the polarity of new text. Each approach has its advantages and disadvantages. With the lexical approach there is no need for labeled data, since the classification decisions come from a sentiment lexicon; on the other side, with machine learning there is no need for an emotional dictionary, nor any need to check the polarity of each individual word.

4.1 Classification Process
The classification process plays an important role in sentiment analysis.

Fig 2: The classification process in sentiment analysis.

Naive Bayes
The Bayesian Classification represents a supervised learning method as well as a statistical method for classification.

Algorithm:
The steps in the algorithm are as follows:
1. Each data sample is represented by an n-dimensional feature vector, X = (x1, x2, ..., xn), depicting n measurements made on the sample for the n attributes A1, A2, ..., An.
2. Suppose that there are m classes, C1, C2, ..., Cm. Given an unknown data sample X (i.e., one having no class label), the classifier predicts that X belongs to the class with the highest posterior probability, i.e. to Ci if and only if P(Ci|X) > P(Cj|X) for all 1 <= j <= m, j != i. Thus P(Ci|X) is maximized, and the class Ci for which P(Ci|X) is maximal is called the maximum posteriori hypothesis. By Bayes' theorem, P(Ci|X) = P(X|Ci)P(Ci) / P(X).
3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, it is commonly assumed that the classes are equally likely, i.e. P(C1) = P(C2) = ... = P(Cm), and we would therefore maximize P(X|Ci); otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = si / s, where si is the number of training samples of class Ci and s is the total number of training samples. That is, the naive Bayes classifier assigns an unknown sample X to the class Ci that maximizes this quantity.
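A from-scratch sketch of this classification rule on a toy corpus follows; the Laplace smoothing and log-probabilities are standard additions (assumptions on our part) to avoid zero probabilities and numerical underflow:

```python
# Multinomial Naive Bayes from scratch: pick the class Ci maximizing
# P(X|Ci)P(Ci), using word counts with Laplace smoothing and logs.
import math
from collections import Counter

def train_nb(docs, labels):
    classes = set(labels)
    priors = {c: labels.count(c) / len(labels) for c in classes}  # P(Ci)
    counts = {c: Counter() for c in classes}                      # word counts per class
    vocab = set()
    for doc, label in zip(docs, labels):
        counts[label].update(doc)
        vocab.update(doc)
    return priors, counts, vocab

def predict_nb(doc, priors, counts, vocab):
    best, best_score = None, float("-inf")
    for c, prior in priors.items():
        total = sum(counts[c].values())
        score = math.log(prior)
        for word in doc:
            # Laplace-smoothed estimate of P(word|Ci)
            score += math.log((counts[c][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

docs = [["good", "great"], ["good", "battery"], ["poor", "bad"], ["bad", "battery"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(["good", "great", "battery"], *model))   # pos
```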

Support Vector Machine
In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. In addition to performing linear classification, SVMs can efficiently perform non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
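As an illustration, a linear SVM can be trained on a tiny linearly separable dataset with sub-gradient descent on the hinge loss; this is a Pegasos-style sketch under made-up data and hyper-parameters, not a production solver:

```python
# Pegasos-style linear SVM: sub-gradient descent on the hinge loss.
def train_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:   # inside the margin: hinge-loss + regularizer step
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:            # outside the margin: regularizer step only
                w = [wj * (1 - lr * lam) for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[2, 2], [3, 3], [-2, -1], [-3, -2]]   # made-up, linearly separable data
y = [1, 1, -1, -1]
w, b = train_svm(X, y)
```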

V. RESULT AND SNAPSHOTS
The snapshots of the results of executing the program in different cases are shown below. The snapshot in figure 3 shows the initial page when the application is executed for the first time. Figure 4 shows how a user can upload a dataset. In figure 5, the dataset has been successfully uploaded and the results for the corresponding dataset are shown.

VI. CONCLUSION AND FUTURE SCOPE
The system is accurate enough across the test cases for all the product review categories on Amazon. We have designed our own methodology that integrates existing sentiment analysis approaches. Classifying reviews along with sentiment analysis increased the accuracy of the system, which in turn provides accurate reviews to the user. Amazon star ratings alone are not enough for customers to make their decisions; one should go through the text reviews to know specifically which feature of the product lacks customer satisfaction. Hence this system helps a customer to buy a product and also helps a manufacturer or seller to know the pros and cons of their product.

Future enhancement
The system can be implemented as a web or mobile application with an interactive front end where the user can choose a product and see its feature-based rating. Along with the rating for each feature, sample review texts for each major feature can be displayed, which can help the user understand the reasons for customer dissatisfaction with that feature.

Figure 1: Block Diagram of System Architecture

1. Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.
2. Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
3. Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality, but should contain enough information to accurately predict the output.
4. Determine the structure of the learned function and the corresponding learning algorithm. For example, the engineer may choose to use support vector machines or naive Bayes.
5. Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
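The six steps above can be sketched end-to-end on a toy sentiment task. All the data here is made up, and the "learned function" is a deliberately simple nearest-centroid classifier standing in for the SVM or naive Bayes named in step 4:

```python
# Steps 1-2: gather labelled training examples (a made-up toy corpus).
train = [("great product works well", "pos"), ("love this great item", "pos"),
         ("terrible waste of money", "neg"), ("broken and terrible quality", "neg")]
test = [("great item works", "pos"), ("terrible broken product", "neg")]

# Step 3: feature representation -- binary bag-of-words over the training vocabulary.
vocab = sorted({w for text, _ in train for w in text.split()})
def features(text):
    present = set(text.split())
    return [1 if w in present else 0 for w in vocab]

# Steps 4-5: choose and "train" the learned function -- a nearest-centroid
# classifier, i.e. each class is summarized by its mean feature vector.
def centroid(cls):
    vecs = [features(t) for t, y in train if y == cls]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
centroids = {c: centroid(c) for c in ("pos", "neg")}

def predict(text):
    x = features(text)
    return max(centroids, key=lambda c: sum(a * b for a, b in zip(x, centroids[c])))

# Step 6: evaluate on a test set kept separate from the training data.
correct = sum(predict(t) == y for t, y in test)
print(f"{correct}/{len(test)} test examples correct")  # -> 2/2
```

In a real run of the pipeline, step 5 would additionally tune control parameters (e.g. smoothing or regularization strength) on a validation split before the final step-6 evaluation.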
Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.
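This gap-maximizing idea can be roughly sketched with Pegasos-style sub-gradient descent on the regularized hinge loss over synthetic 2D data. This is an illustration of the objective only, not the exact solver that production SVM libraries use:

```python
import random

# Synthetic 2D data: label +1 above the line x1 + x2 = 0, -1 below,
# with points near the boundary removed so a clear gap exists.
random.seed(1)
X = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(200)]
X = [x for x in X if abs(x[0] + x[1]) > 0.2]
y = [1 if x[0] + x[1] > 0 else -1 for x in X]

# Sub-gradient descent on the regularized hinge loss (Pegasos-style):
# the shrink step widens the margin, while margin-violating points
# pull the separating hyperplane toward them.
w, b, lam = [0.0, 0.0], 0.0, 0.01
for t in range(1, 2001):
    i = random.randrange(len(X))
    eta = 1.0 / (lam * t)
    margin = y[i] * (w[0] * X[i][0] + w[1] * X[i][1] + b)
    w = [wj * (1 - eta * lam) for wj in w]
    if margin < 1:
        w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
        b += eta * y[i]

def predict(x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

acc = sum(predict(xi) == yi for xi, yi in zip(X, y)) / len(X)
print(f"training accuracy: {acc:.2f}")
```

Because the data is linearly separable with a visible gap, the learned hyperplane should classify nearly all training points correctly after a couple of thousand updates.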

Fig 3: Initial page showing the front end

Fig 8: Result showing output of the entered comment

Naive Bayes assumes an underlying probabilistic model and allows us to capture uncertainty about the model in a principled way by determining probabilities of the outcomes. It can solve diagnostic and predictive problems. This classification is named after Thomas Bayes (1702-1761), who proposed the Bayes theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined, and it offers a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in input data. Naive Bayes is a classification technique based on Bayes' theorem with an assumption of independence among predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. It is called "naive" because it incorporates the simplifying assumption that attribute values are conditionally independent, given the classification of the instance. When this assumption is met, the naive Bayes classifier outputs the MAP classification; even when it is not met, as in the case of learning to classify text, the classifier is often quite effective. Bayesian belief networks provide a more expressive representation for sets of conditional independence assumptions among subsets of the attributes. The naive Bayes model is easy to build and particularly useful for very large data sets. Along with simplicity, naive Bayes is known to outperform even highly sophisticated classification methods. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features/predictors) in a learning problem. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers, and there is no explicit search during training (as opposed to decision trees). When conditional independence is satisfied, naive Bayes corresponds to MAP classification. Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c):

P(c|x) = P(x|c)P(c) / P(x)
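A worked numeric instance of this equation, using made-up probabilities for a single word feature:

```python
# Assumed toy numbers: P(pos) = 0.6, P("great"|pos) = 0.30, P("great"|neg) = 0.05.
p_pos, p_neg = 0.6, 0.4
p_great_pos, p_great_neg = 0.30, 0.05

# P(x) by the law of total probability over the two classes.
p_great = p_great_pos * p_pos + p_great_neg * p_neg   # 0.18 + 0.02 = 0.20

# Bayes' theorem: P(c|x) = P(x|c)P(c) / P(x)
posterior = p_great_pos * p_pos / p_great
print(posterior)  # -> 0.9
```

So under these assumed numbers, a review containing "great" is positive with probability 0.9, even though the prior alone only says 0.6.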