Dive into Natural Language Processing with Python and R.

Data Science NLP use case.


Python, together with its libraries, and the R language are powerful tools for solving Natural Language Processing tasks.

Natural language processing (NLP) is a branch of linguistics, computer science (CS) and artificial intelligence (AI) that studies the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.


Python Knowledge Base: Make coding great again.
- Updated: 2025-01-21 by Andrey BRATUS, Senior Data Analyst.




The main goal is a computer capable of "understanding" the contents of text data, including the contextual nuances of the language used. The technology should also accurately extract the information and key ideas contained in documents, and categorize and organize the documents themselves.


  1. Natural Language Processing in Python.


  2.          
    #Importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    #Importing the dataset
    dataset = pd.read_csv('my_dataset.tsv', delimiter = '\t', quoting = 3)
    
    #Cleaning the text
    import re
    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    from nltk.stem.porter import PorterStemmer
    #Build the stemmer and the stop-word set once, outside the loop
    ps = PorterStemmer()
    all_stopwords = set(stopwords.words('english'))
    all_stopwords.discard('not')  #keep 'not': it reverses the meaning of a review
    corpus = []
    for i in range(len(dataset)):
      #Keep letters only, lowercase the text and split it into words
      review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
      review = review.lower()
      review = review.split()
      #Drop stop words and reduce the remaining words to their stems
      review = [ps.stem(word) for word in review if word not in all_stopwords]
      review = ' '.join(review)
      corpus.append(review)
    print(corpus)
    
    #Creating the Bag of Words model
    from sklearn.feature_extraction.text import CountVectorizer
    cv = CountVectorizer(max_features = 1500)
    X = cv.fit_transform(corpus).toarray()
    y = dataset.iloc[:, -1].values
    
    
    #Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
    
    #Training the Naive Bayes model on the Training set
    from sklearn.naive_bayes import GaussianNB
    classifier = GaussianNB()
    classifier.fit(X_train, y_train)
    
    #Predicting the Test set results
    y_pred = classifier.predict(X_test)
    print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
    
    #Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix, accuracy_score
    cm = confusion_matrix(y_test, y_pred)
    print(cm)
    print(accuracy_score(y_test, y_pred))
    
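The cleaning step above can be illustrated on a single sentence. This is a minimal sketch using only the standard library: the `STOP_WORDS` set and the `clean` helper are hypothetical stand-ins for the NLTK stop-word list and the loop body, and stemming is left out to keep the example dependency-free.

```python
import re

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOP_WORDS = {"the", "is", "a", "this", "was", "and", "not"}

def clean(text, keep=("not",)):
    """Keep letters only, lowercase, and drop stop words except those in `keep`."""
    stop_words = STOP_WORDS - set(keep)
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = letters_only.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)

sentence = "Wow... The food was NOT good!"
cleaned = clean(sentence)                    # 'not' survives
cleaned_without_not = clean(sentence, keep=())  # 'not' is removed as a stop word
print(cleaned)
print(cleaned_without_not)
```

Comparing the two results suggests why `not` is taken out of the stop-word list: without it, a negative review and a positive one can collapse to the same tokens.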

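To make the Bag of Words step more concrete, here is a rough sketch of the idea in plain Python. The helpers `build_vocabulary` and `vectorize` are hypothetical names, not part of scikit-learn, and `CountVectorizer` applies its own tokenization rules, so its output may differ in detail.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    """Return the most frequent tokens across all documents, sorted alphabetically."""
    counts = Counter(token for doc in docs for token in doc.split())
    most_common = [token for token, _ in counts.most_common(max_features)]
    return sorted(most_common)

def vectorize(docs, vocabulary):
    """Turn each document into a row of token counts, one column per vocabulary word."""
    index = {token: i for i, token in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.split():
            if token in index:
                row[index[token]] += 1
        rows.append(row)
    return rows

# Already-cleaned example documents (made up for illustration)
docs = ["love place", "not love food", "food not good not fresh"]
vocab = build_vocabulary(docs)
matrix = vectorize(docs, vocab)
print(vocab)
for row in matrix:
    print(row)
```

Each row plays the role of one row of `X` in the pipeline above: word order is discarded and only the counts per vocabulary word remain.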

  3. Natural Language Processing in R.


  4. 
    #Importing the dataset
    dataset_original = read.delim('my_dataset.tsv', quote = '', stringsAsFactors = FALSE)
    # Cleaning the text
    # install.packages('tm')
    # install.packages('SnowballC')
    library(tm)
    library(SnowballC)
    corpus = VCorpus(VectorSource(dataset_original$Review))
    corpus = tm_map(corpus, content_transformer(tolower))
    corpus = tm_map(corpus, removeNumbers)
    corpus = tm_map(corpus, removePunctuation)
    corpus = tm_map(corpus, removeWords, stopwords())
    corpus = tm_map(corpus, stemDocument)
    corpus = tm_map(corpus, stripWhitespace)
    
    #Creating the Bag of Words model
    dtm = DocumentTermMatrix(corpus)
    dtm = removeSparseTerms(dtm, 0.999)
    dataset = as.data.frame(as.matrix(dtm))
    dataset$Liked = dataset_original$Liked
    
    # Encoding the target feature as factor
    dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
    
    # Splitting the dataset into the Training set and Test set
    # install.packages('caTools')
    library(caTools)
    set.seed(123)
    split = sample.split(dataset$Liked, SplitRatio = 0.8)
    training_set = subset(dataset, split == TRUE)
    test_set = subset(dataset, split == FALSE)
    
    # Fitting Random Forest Classification to the Training set
    # install.packages('randomForest')
    library(randomForest)
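    # Select the predictors by name instead of a hardcoded column index,
    # because the number of terms depends on the dataset
    classifier = randomForest(x = training_set[, names(training_set) != 'Liked'],
                              y = training_set$Liked,
                              ntree = 10)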
    
    # Predicting the Test set results
    y_pred = predict(classifier, newdata = test_set[, names(test_set) != 'Liked'])
    
    # Making the Confusion Matrix
    cm = table(test_set$Liked, y_pred)
    print(cm)
    

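Both pipelines end with a confusion matrix whose rows are the actual classes and whose columns are the predicted classes. The sketch below, in plain Python with made-up labels, shows how such a matrix is filled and how accuracy follows from it; `confusion_matrix_2x2` and `accuracy` are hypothetical helper names, not library functions.

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Rows are actual classes (0, 1); columns are predicted classes (0, 1)."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm

def accuracy(cm):
    """Share of correct predictions: the diagonal divided by the total count."""
    correct = cm[0][0] + cm[1][1]
    total = sum(sum(row) for row in cm)
    return correct / total

# Made-up labels for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
cm = confusion_matrix_2x2(y_true, y_pred)
acc = accuracy(cm)
print(cm)
print(acc)
```

The off-diagonal cells count the two kinds of mistakes separately, which is information a single accuracy number hides.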


