Theory

Machine Learning (ML)
Machine learning is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead (Khan et al., 2017). Machine learning algorithms find natural patterns in data that generate insight and help you make better decisions and predictions. ML algorithms are used every day to make critical decisions in medical diagnosis, stock trading, energy load forecasting, natural language processing and more. 

Why Machine Learning?

The machine learning field is continuously evolving, and with that evolution comes a rise in demand and importance. One crucial reason machine learning has become a necessity is its ability to produce high-value predictions that can guide better decisions and smart actions in real time without human intervention. As a technology, machine learning helps analyze large volumes of data through an automated process, which has changed the way data extraction and interpretation work: automatic sets of generic methods have replaced traditional statistical techniques (Simon et al., 2016).
Data analysis has traditionally been characterized by a trial-and-error approach, one that becomes impractical when the datasets in question are large and heterogeneous. As more data becomes available, building predictive models that remain accurate becomes more difficult. Traditional statistical solutions focus on static analysis of samples that are frozen in time, which can lead to unreliable and inaccurate conclusions (Simon et al., 2016). Machine learning therefore offers a smart, dynamic alternative that delivers accuracy, efficiency and better results for real-time processing of data.
The goals of machine learning are:
  • Efficient algorithms of practical value
  • Effective use of time and space
  • Resource/Data proficiency
  • General-purpose algorithms
  • Predictable results
Supervised learning is a machine learning method that uses a known dataset (training dataset) to make predictions. The training dataset includes input data and response values. From it, the supervised learning algorithm seeks to build a model that can make predictions of the response values for a new dataset.  A test dataset is often used to validate the model.
Supervised learning includes two categories of algorithms:
Classification: for categorical response values, where the data can be separated into specific classes
Regression: for continuous response values
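To make the supervised-classification workflow concrete, the following is a minimal sketch in pure Python: a 1-nearest-neighbour classifier trained on a tiny labelled dataset. The feature values and labels are invented for illustration and are not taken from the project's data.

```python
# Minimal supervised classification: predict a categorical label for a new
# input by finding the closest point in a labelled training dataset.

def predict(train, x):
    """Return the label of the training example whose input is closest to x."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Training dataset: (input value, response label) pairs.
train = [(1.0, "negative"), (1.5, "negative"),
         (8.0, "positive"), (9.0, "positive")]

# New inputs act as a small test dataset to validate the model.
print(predict(train, 1.2))  # negative
print(predict(train, 8.5))  # positive
```

The same pattern (fit on labelled data, then predict labels for unseen data) underlies the Naive Bayes classifier used later in this project.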

Supervised learning is used to design the model for this project. Supervised pattern classification is the task of training a model on labelled training data, which can then be used to assign a pre-defined class label to new objects. Since the dataset is labelled with positive and negative reviews, supervised learning is used.

Naive Bayes classifiers, a family of classifiers based on the popular Bayes’ probability theorem, are known for producing simple yet well-performing models, especially in the fields of natural language processing and health prediction. With small sample sizes as well as large ones, Naive Bayes classifiers can outperform more powerful alternative algorithms. Thus, Naive Bayes is chosen as the supervised algorithm to perform sentiment analysis. Bayes’ theorem finds the probability of an event occurring given the probability of another event that has already occurred. It is stated mathematically as the following equation:
 P(A|B) = P(B|A) P(A) / P(B)
where A and B are events and P(B) ≠ 0.
This gives the probability of event A given that event B is true, where B is known as the evidence. P(A) is the prior probability of A, i.e. the probability of the event before the evidence is seen. The evidence is an attribute value of an unknown instance (here, event B). P(A|B) is the posterior probability of A, i.e. the probability of the event after the evidence is seen.
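A short worked example of the theorem, using invented probabilities: suppose A is the event that a review is positive and B is the event that it contains the word "great".

```python
# Worked example of Bayes' theorem with made-up numbers.
p_a = 0.5              # prior: P(positive)
p_b_given_a = 0.4      # likelihood: P("great" | positive)
p_b_given_not_a = 0.1  # P("great" | negative)

# Evidence P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Posterior P(A | B) = P(B | A) P(A) / P(B).
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # 0.8
```

Seeing the word "great" raises the probability that the review is positive from the prior 0.5 to a posterior of 0.8.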

Natural Language Processing (NLP) is a field of machine learning concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language (Emms and Luz, 2007). The ultimate objective of NLP is to read, decipher, understand, and make sense of human languages in a manner that is valuable. Most NLP techniques rely on machine learning to derive meaning from human languages. NLP entails applying algorithms to identify and extract the rules of natural language so that unstructured language data is converted into a form that computers can understand. When text has been provided, the computer utilizes algorithms to extract the meaning associated with every sentence and collect the essential data from it.
The main components of Natural Language Processing are:
  • Morphological and Lexical Analysis: It involves identifying and analyzing the structure of words. Lexicon of a language means the collection of words and phrases in a language. Lexical analysis is dividing the whole chunk of text into paragraphs, sentences, and words.
  • Syntactic Analysis: It involves analysis of words in the sentence for grammar and arranging words in a manner that shows the relationship among the words.
  • Semantic Analysis: It draws the exact meaning or the dictionary meaning from the text. The text is checked for meaningfulness by mapping syntactic structures to objects in the task domain. The semantic analyzer disregards sentences such as “hot ice-cream”.
  • Discourse Integration: The meaning of any sentence depends on the meaning of the sentence just before it, and it can also shape the meaning of the sentence that immediately follows.
  • Pragmatic Analysis: During this, what was said is re-interpreted to determine what was actually meant. It involves deriving those aspects of language that require real-world knowledge.
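The first component above, lexical analysis, can be sketched with the standard library alone: splitting a chunk of text into sentences and then into words. This is a simplified illustration with an invented example sentence; a real pipeline would typically use a dedicated NLP library such as NLTK.

```python
# Minimal lexical analysis: divide text into sentences, then into words.
import re

text = "The movie was great. The plot, however, felt thin."

# Split into sentences after sentence-ending punctuation.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Split each sentence into lowercase word tokens.
words = [re.findall(r"[a-zA-Z']+", s.lower()) for s in sentences]

print(sentences)  # two sentences
print(words[0])   # ['the', 'movie', 'was', 'great']
```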
Real-world uses of NLP include:
  • Information retrieval in Google Search
  • Google machine translation
  • Speech and voice recognition
  • Sentiment analysis 
Sentiment analysis is the contextual mining of text to identify and extract subjective information from source material, helping a business understand the social sentiment around its brand, product or service by monitoring online conversations, reviews or feedback. In this project, IMDb reviews given by audiences are analyzed as positive or negative sentiments using the NLP techniques above, through a Naive Bayes classifier model.