Code

According to the latest TIOBE Programming Community Index, Python is one of the top 10 most popular programming languages of 2017. Python is a general-purpose, high-level programming language used for developing desktop GUI applications, websites, and web applications, as well as for data science and analysis. As a high-level language, Python lets you focus on the core functionality of an application by taking care of common programming tasks. Its simple syntax rules further make it easier to keep the code base readable and the application maintainable.

Functionalities:
  • Readable and maintainable code
  • Multiple programming paradigms
  • Compatibility with major platforms and systems
  • Robust standard library
  • Open-source frameworks and tools
  • Simplified complex software development
  • Support for test-driven development
Evidence:

Python version installed on the computer
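The installed version can also be confirmed from within the interpreter itself; a minimal check:

```python
import sys

# Print the full version string of the running interpreter
print(sys.version)

# Scripts can guard against running on an unsupported major version
assert sys.version_info[0] >= 3, "Python 3 is required"
```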

Anaconda is an open-source distribution of the Python and R programming languages for scientific computing: data science, machine learning applications, large-scale data processing, predictive analytics, etc.



From Anaconda, Jupyter Notebook is used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, and machine learning.




Math operators
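The original screenshot is not reproduced here; a minimal sketch of the basic math operators:

```python
# Basic arithmetic operators in Python
a, b = 7, 3

addition = a + b         # 10
subtraction = a - b      # 4
multiplication = a * b   # 21
division = a / b         # true division, always a float
floor_division = a // b  # 2, discards the fractional part
remainder = a % b        # 1, the modulo operator
power = a ** b           # 343, exponentiation

print(addition, floor_division, remainder, power)
```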

Variable declaration and data types
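A minimal sketch of variable declaration and the built-in data types:

```python
# Python variables are declared by assignment; types are inferred
count = 42          # int
price = 9.99        # float
name = "Python"     # str
is_ready = True     # bool
nothing = None      # NoneType

# type() reveals the runtime type of a value
print(type(count).__name__)   # int
print(type(price).__name__)   # float
print(type(name).__name__)    # str

# A variable can later be rebound to a value of a different type
count = "forty-two"
print(type(count).__name__)   # str
```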





Working with strings
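A minimal sketch of common string operations:

```python
text = "Natural Language Processing"

# Indexing and slicing
first = text[0]     # 'N'
sliced = text[0:7]  # 'Natural'

# Common string methods
upper = text.upper()
words = text.split()      # ['Natural', 'Language', 'Processing']
joined = "-".join(words)  # 'Natural-Language-Processing'

# Formatting with f-strings (Python 3.6+)
message = f"'{text}' has {len(words)} words"
print(message)
```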


Loops
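A minimal sketch of the two loop forms:

```python
# A for loop iterates over any iterable, here produced by range()
squares = []
for n in range(1, 6):
    squares.append(n * n)
print(squares)  # [1, 4, 9, 16, 25]

# A while loop repeats until its condition becomes false
total, n = 0, 1
while n <= 5:
    total += n
    n += 1
print(total)    # 15
```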


Loops and conditions
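A minimal sketch combining a loop with if/elif/else conditions:

```python
# Classify each number inside a loop using conditions
labels = []
for n in range(1, 8):
    if n % 2 == 0:
        labels.append("even")
    elif n == 7:
        labels.append("lucky")
    else:
        labels.append("odd")

print(labels)  # ['odd', 'even', 'odd', 'even', 'odd', 'even', 'lucky']
```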


In Python, arrays are called lists. So, the use of lists is demonstrated.
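A minimal sketch of basic list operations:

```python
# Python lists play the role that arrays play in other languages
fruits = ["apple", "banana", "cherry"]

fruits.append("date")   # add to the end
first = fruits[0]       # indexing starts at 0
last_two = fruits[-2:]  # negative indices count from the end

# List comprehension: build a new list in one expression
lengths = [len(f) for f in fruits]

print(fruits)   # ['apple', 'banana', 'cherry', 'date']
print(lengths)  # [5, 6, 6, 4]
```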

After the basic Python code, processes for Natural Language Processing (NLP) are demonstrated. The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing of English in the Python programming language.


Natural Language Processing (NLP)
Import of the NLTK toolkit for natural language processing:



Tokenization
Tokenization is the process of breaking strings into tokens. Tokenization includes three steps:
  1. Breaking sentences into words
  2. Understanding the importance of each word with respect to the sentence
  3. Producing a structural description of the sentence
A sentence can also be tokenized with the help of N-grams, bigrams, and trigrams. Trigrams are tokens of three consecutive words and bigrams are tokens of two consecutive words, whereas N-grams are tokens of any number of consecutive words.
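NLTK provides `nltk.bigrams` and `nltk.ngrams` for this; a minimal pure-Python sketch of the same idea:

```python
def ngrams(tokens, n):
    """Return the list of tuples of n consecutive tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "the quick brown fox".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams)   # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(trigrams)  # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```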


Tokenization using bigrams


Stemming
Stemming refers to normalizing a word into its base or root form. It works by cutting off the end or beginning of the word to reduce the specific word to its base form.
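In practice this is done with NLTK's `PorterStemmer`; a toy suffix-stripping sketch of the idea:

```python
def crude_stem(word):
    """Toy stemmer: strip a few common suffixes. Real stemmers such as
    NLTK's PorterStemmer apply far more careful, staged rules."""
    for suffix in ("ing", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([crude_stem(w) for w in ["playing", "played", "plays", "cats", "run"]])
# ['play', 'play', 'play', 'cat', 'run']
```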

Lemmatization
Lemmatization takes into consideration the morphological analysis of the word. It groups together the different inflected forms of a word under its lemma. It is similar to stemming in that it maps words to one common root, but unlike stemming, the output of lemmatization is a proper word.
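NLTK's `WordNetLemmatizer` consults the full WordNet dictionary for this; a toy lookup-table sketch of the idea (the table below is illustrative, not WordNet):

```python
# Tiny hand-made lemma table standing in for a morphological dictionary
LEMMAS = {"was": "be", "mice": "mouse", "better": "good", "running": "run"}

def lemmatize(word):
    # Fall back to the word itself when no lemma is known
    return LEMMAS.get(word, word)

print([lemmatize(w) for w in ["mice", "was", "running", "cat"]])
# ['mouse', 'be', 'run', 'cat'] -- a stemmer would mangle 'mice',
# but a lemmatizer maps it to the real word 'mouse'
```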

Stop Words
Stop words are words like "I", "before", "after", etc., which help in forming a sentence but are not of much use in NLP. These types of words are listed as stop words.
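A minimal stop-word filter; NLTK ships a much larger list as `nltk.corpus.stopwords.words('english')`:

```python
# Tiny illustrative stop-word set
STOP_WORDS = {"i", "me", "the", "a", "an", "is", "before", "after"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "I like the movie a lot".split()
print(remove_stop_words(tokens))  # ['like', 'movie', 'lot']
```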

Part of Speech
Grammatical types of words, such as verbs, nouns, adjectives, and articles, indicate how a word functions. A sentence can include many parts of speech depending on the context in which it is used. POS tagging is a statistical NLP task that distinguishes the sense of a word, which is helpful in text realization.
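In practice `nltk.pos_tag` uses a trained statistical model; a toy lexicon-based sketch of what a tagger produces (the lexicon and tag set here are illustrative):

```python
# Tiny hand-made lexicon standing in for a trained tagging model
LEXICON = {"the": "DET", "a": "DET", "cat": "NOUN", "dog": "NOUN",
           "sat": "VERB", "ran": "VERB", "quickly": "ADV", "big": "ADJ"}

def tag(tokens):
    # Unknown words default to NOUN, a common baseline heuristic
    return [(t, LEXICON.get(t.lower(), "NOUN")) for t in tokens]

print(tag("The big cat sat".split()))
# [('The', 'DET'), ('big', 'ADJ'), ('cat', 'NOUN'), ('sat', 'VERB')]
```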




Named Entity Recognition
Named entity recognition locates and classifies named-entity mentions in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
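Real NER (e.g. `nltk.ne_chunk`) uses trained models; a very rough sketch that merely treats runs of capitalized words as candidate entities:

```python
import re

# Crude heuristic: consecutive capitalized words form a candidate entity
ENTITY_PATTERN = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*")

text = "I met Barack Obama in New York yesterday"
entities = ENTITY_PATTERN.findall(text)
print(entities)  # ['Barack Obama', 'New York']
```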
Syntax
The set of principles, rules, and processes in a given language is its syntax. A syntax tree is a representation of the syntactic structure of sentences or strings, formed on the basis of parts of speech.

Chunking
Chunking is a process of picking up individual pieces of information and grouping them into bigger pieces.
The above-mentioned processes are demonstrated with code examples.

The development of the sentiment analysis model using the labeled IMDb dataset from Kaggle:
Import of libraries

Loading the dataset

Pre-processing of data
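A minimal sketch of a typical pre-processing step: lowercasing, stripping punctuation, tokenizing, and dropping stop words (the stop-word set here is a small illustrative subset):

```python
import string

STOP_WORDS = {"the", "a", "an", "is", "was", "this", "it", "and"}

def preprocess(review):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    review = review.lower()
    review = review.translate(str.maketrans("", "", string.punctuation))
    return [t for t in review.split() if t not in STOP_WORDS]

print(preprocess("This movie was GREAT, and the plot is clever!"))
# ['movie', 'great', 'plot', 'clever']
```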




Plot of frequency of words
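The word frequencies behind such a plot can be computed with `collections.Counter`; the resulting pairs can then be passed to, e.g., matplotlib's `plt.bar`:

```python
from collections import Counter

# Toy token list standing in for the pre-processed reviews
tokens = ["good", "movie", "good", "plot", "great", "good"]
freq = Counter(tokens)

print(freq.most_common(1))  # [('good', 3)]
```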

Splitting the data into training and testing sets
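In practice this is one call to scikit-learn's `train_test_split`; a manual 80/20 sketch of what it does (the data here is synthetic):

```python
import random

# Toy stand-ins for the real reviews and their labels
data = [f"review_{i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# Shuffle indices with a fixed seed so the split is reproducible
indices = list(range(len(data)))
random.Random(42).shuffle(indices)

cut = int(0.8 * len(data))
train_idx, test_idx = indices[:cut], indices[cut:]

X_train = [data[i] for i in train_idx]
X_test = [data[i] for i in test_idx]
y_train = [labels[i] for i in train_idx]
y_test = [labels[i] for i in test_idx]

print(len(X_train), len(X_test))  # 8 2
```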



Supervised algorithm: Naive Bayes classifier



Multinomial Naive Bayes classifier
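The classifier can be sketched from first principles; this hand-written multinomial Naive Bayes (the class name and toy training sentences are illustrative, not the actual Kaggle data) shows what scikit-learn's `MultinomialNB` computes:

```python
import math
from collections import Counter

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = doc.split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            # log prior of the class
            score = math.log(self.class_counts[c] / n_docs)
            for token in doc.split():
                # Laplace-smoothed log likelihood of each token
                count = self.word_counts[c][token]
                score += math.log((count + 1) / (total + len(self.vocab)))
            scores[c] = score
        return max(scores, key=scores.get)

# Tiny hand-made training set standing in for the IMDb reviews
docs = ["good movie", "great film", "good great fun",
        "bad movie", "awful film", "bad awful boring"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

clf = TinyMultinomialNB().fit(docs, labels)
print(clf.predict("good fun film"))     # pos
print(clf.predict("bad boring movie"))  # neg
```

A user-entered sentence can be checked the same way, e.g. `clf.predict(input("Enter a review: "))`.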

For User Input:

Checking the validity of the model by providing user input