Machine learning for text analysis and NLP

Extracting Meaning: The Convergence of Machine Learning and Text Analysis

When we think about big data and analytics, the most straightforward associations that pop up include math, numbers, statistics, and maybe even spreadsheets. It’s understandable since data usually comes in the form of numbers. In fact, the most basic form of data representation is binary, based on just two digits—0 and 1.

However, big data comprises a far broader spectrum of information, from emails and loan application forms to text messages, image data, and voice recordings. All these various data types carry an incredible amount of untapped information. Text data is not an exception.

The stats suggest that, by 2022, we’ll be sending over 300 billion emails daily. For the past decade, the number of text messages sent daily has increased by more than 7%. Younger generations overwhelmingly prefer texting to phone calls. And this is just scratching the surface, as there are many other types of textual data: support tickets, insurance application forms, healthcare records, product descriptions, and many others.

Extracting meaning out of this information is an incredibly complicated task since texts may have different contexts and formats. Textual data is usually referred to as unstructured data because it doesn’t have a clear storage format or a predefined data model. Sure, you can put a sentence into an Excel cell. But how would that help you to analyze it?

The applications of text analysis are far and wide, from simple automation to advanced interactions between the person inputting the data and the system they interact with. A rudimentary example of that is a chatbot. This complexity of text analysis breeds its own rules and even fields of study, like natural language processing.

In this article, we’ll go over some of the applications of text analysis, its specific use cases, and techniques that are proven to be useful in extracting meaningful insights out of text.

NLP is the Endgame

When we’re talking about machine learning in text analysis, it’s important to understand the overall concepts and the vocabulary that might be relevant to this topic. Sometimes people confuse machine learning, natural language processing, and AI. These terms are interconnected, but they don’t mean the same thing.

Machine learning techniques decipher text to deliver a specific result, such as identifying the sentiment of a tweet. Natural language processing (NLP) is the broader field concerned with how computers process and understand human language. NLP encompasses not only machine learning but also lexicology, computational linguistics (computational lexical semantics, etc.), and many other fields. NLP is considered to be one of the hardest domains in machine learning and AI due to its complexity.

[Figure: Natural language processing, machine learning, and deep learning]

We’ll be covering both basic concepts, like machine learning text analysis, and more advanced NLP-specific themes.

Sentiment Analysis

In marketing, there are many things that standard metrics like the Net Promoter Score miss. It is, after all, in many cases, an Excel spreadsheet calculation. And this is where machine learning comes in. Being able to better identify fringe cases or incorporate more data into your understanding of the audience is crucial.

You can apply these types of machine learning solutions to a broad range of use cases since an operational machine learning system can quickly score the specific sentiment and put it on a scale that means something for your business. For example, people with highly negative support ticket submissions can be quickly identified and moved to the top of the customer support line. Other more nuanced use cases include:

  • Identifying sentiment around your brand online using publicly available data (for example, tweets)
  • Maintaining visibility into the slightest changes in customer sentiment
  • Improving marketing to specific individuals with the worst sentiment
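The support ticket example above can be sketched in a few lines of Python. This is a minimal illustration, not a production approach: the word lists and the `sentiment_score` function are hypothetical stand-ins for a trained sentiment model, and the sample tickets are made up.

```python
import re

# Hypothetical word lists for illustration only; a trained model would replace them.
NEGATIVE = {"broken", "refund", "terrible", "angry", "worst", "cancel"}
POSITIVE = {"thanks", "great", "love", "helpful", "works"}

def sentiment_score(text):
    """Return a score in [-1, 1]; negative values mean negative sentiment."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = [1 if w in POSITIVE else -1 for w in words if w in POSITIVE or w in NEGATIVE]
    return sum(hits) / len(hits) if hits else 0.0

def prioritize(tickets):
    """Order tickets so that the most negative ones come first."""
    return sorted(tickets, key=sentiment_score)

# Made-up sample tickets
tickets = [
    "Thanks, the new release works great",
    "This is the worst update, I want a refund",
    "How do I change my email address?",
]
queue = prioritize(tickets)
```

Here `queue` starts with the refund complaint, followed by the neutral question and then the positive message.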

Let’s take a look at an elementary example of a machine learning workflow that can be delivered by a sentiment classification tool.

  • Acquire an appropriate dataset. Luckily, there are plenty of them, with the IMDB movie review dataset being one of the most popular.
  • You need to pre-process and clean your data. Some machine learning models will work with raw data. But most of the tried and tested methods need the data to be structured uniformly. This means removing any funky characters, formatting, code bits present in the dataset, etc.
  • After this is done, we need to transform the text into something that’s readable by the algorithm. This is called vectorization. We create a matrix in which each row represents a review and each column represents a word from the vocabulary. Every time a specific word is used in a review, we place 1 in that word’s column of the review’s row; otherwise, the cell stays 0.

[Figure: One-hot encoding in NLP]

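Sometimes this is referred to as one-hot encoding, although strictly speaking, a matrix of word-presence flags like this is a binary bag-of-words representation. In layman’s terms, it means turning data into numeric features, readable by algorithms.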
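As a rough sketch of this vectorization step, the snippet below builds a small binary matrix in plain Python. It assumes a layout of one row per review and one column per vocabulary word; the function names and the two sample reviews are made up for illustration, and a real project would more likely rely on a library vectorizer.

```python
import re

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def binary_matrix(reviews):
    """Return (vocabulary, matrix): one row per review, one column per word."""
    vocabulary = sorted({word for review in reviews for word in tokenize(review)})
    column = {word: i for i, word in enumerate(vocabulary)}
    matrix = []
    for review in reviews:
        row = [0] * len(vocabulary)
        for word in tokenize(review):
            row[column[word]] = 1  # 1 marks presence, however often the word occurs
        matrix.append(row)
    return vocabulary, matrix

# Made-up sample reviews
reviews = ["A great, great movie!", "A boring movie."]
vocabulary, matrix = binary_matrix(reviews)
# vocabulary is ['a', 'boring', 'great', 'movie']
# matrix is [[1, 0, 1, 1], [1, 1, 0, 1]]
```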

  • Now you can start building the classification model. There are plenty of algorithms to do the job: logistic regression, perceptron (a type of neural network), and others. Most of them are pretty accurate.

[Figure: Accuracy of different NLP algorithms]

Depending on the performance of the classification, you might need to further pre-process data or choose a different algorithm. Of course, when that’s all fine-tuned, you can start thinking about the operationalization of generated insights.
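To make the classification step more concrete, here is a minimal perceptron sketch in plain Python. It is only an illustration under stated assumptions: the four training sentences are made up, labels are +1 (positive) and -1 (negative), and each review is represented by the set of words it contains instead of a full matrix.

```python
import re

def features(text):
    """Represent a review as the set of lowercase words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def train_perceptron(examples, epochs=10):
    """Learn one weight per word from (text, label) pairs; label is +1 or -1."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in examples:
            words = features(text)
            score = bias + sum(weights.get(w, 0.0) for w in words)
            predicted = 1 if score > 0 else -1
            if predicted != label:  # the perceptron updates only on mistakes
                for w in words:
                    weights[w] = weights.get(w, 0.0) + label
                bias += label
    return weights, bias

def predict(model, text):
    """Return +1 for a positive prediction and -1 for a negative one."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) for w in features(text))
    return 1 if score > 0 else -1

# Toy, made-up training data
train = [
    ("a wonderful and moving film", 1),
    ("great acting and a great story", 1),
    ("boring plot and terrible acting", -1),
    ("a dull and terrible film", -1),
]
model = train_perceptron(train)
```

After training, `predict(model, "a wonderful story")` returns 1 and `predict(model, "terrible and dull")` returns -1. On a real dataset, you would hold out part of the data to measure accuracy before deciding whether more pre-processing or a different algorithm is needed.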

This was just a very basic example and a high-level overview of the pipeline. There are other techniques and methodologies, which work with different linguistic constructs, such as phrase analysis.

Structured Data and Text

Paradoxically, one of the more advanced applications of machine learning in text analysis has nothing to do with the interpretation of the text itself, but instead with the combined study of text, structured data, and their correlations. This combination produces great results and uncovers more sophisticated insights.

For example, LSTM (long short-term memory) networks can analyze unstructured data, like news articles and tweets, in combination with structured data, like the sales of a specific fashion brand. This combination can be used to build a time series model, which outputs predictions of sales volumes for that brand. Or, this combination can be used to predict short-term asset values based on current news. The Wall Street Journal has been using machine learning and text analysis to make investment decisions for quite some time now.

Other Applications

There are many other similar applications of text analysis, which span a myriad of industries. Let’s quickly glance over a couple of examples.

Healthcare Records Analysis

Combining patient records with their biometric data can allow hospitals to identify high-risk patients based on their ongoing treatment records. Companies like Roam Analytics specialize in just that—taking healthcare records, which are being updated all the time by doctors and nursing staff, to uncover advanced health insights.

Credit Application Analysis

Along with structured data (income, number of family members, etc.), loan applications include text, like “loan purpose.” This text provides an incredible amount of signal, which couldn’t have been discovered by just analyzing the numbers in the loan applications.

For example, a text analysis study performed with the help of machine learning by the Columbia Business School suggests that there are specific words on loan applications that can point to higher default risks. People who are likely to default mention “God,” family members, and polite words like “thank you” in their applications. Some of these have a more obvious logic, like the fact that the person is trying to be more polite because they had trouble with loans in the past. But it’s up to the lender to make sense of all the other insights.

Banking Transactions Analysis

Credit card transactions contain tons of valuable textual data. Most of the transactions are supplied with standard descriptions, like “groceries,” “airline tickets,” etc. This information can be combined with the structured data that the bank already has, like the credit score and the account balance. The combination of text data from these transactions and numerical banking data can be used to predict credit risk with a high degree of accuracy.
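One simple way to combine the two kinds of data is to turn the text into a few numeric features and append them to the numeric fields, so that a single model can consume both. The sketch below assumes a hypothetical list of description categories, a 300 to 850 credit score scale, and an arbitrary scaling constant for the balance; none of these come from a real bank’s data.

```python
import re

# Hypothetical description categories used as text features
CATEGORIES = ["groceries", "airline", "casino", "pharmacy"]

def text_features(descriptions):
    """Share of transactions whose description mentions each category."""
    counts = [0] * len(CATEGORIES)
    for description in descriptions:
        words = set(re.findall(r"[a-z]+", description.lower()))
        for i, category in enumerate(CATEGORIES):
            if category in words:
                counts[i] += 1
    total = max(len(descriptions), 1)  # avoid dividing by zero for empty histories
    return [count / total for count in counts]

def combined_features(customer):
    """Concatenate scaled numeric fields with the text-derived features."""
    numeric = [
        customer["credit_score"] / 850,  # assumes a 300-850 score scale
        customer["balance"] / 10_000,    # arbitrary scaling constant
    ]
    return numeric + text_features(customer["descriptions"])

# Made-up customer record
customer = {
    "credit_score": 680,
    "balance": 2500.0,
    "descriptions": ["Groceries", "Airline tickets", "Groceries", "Casino chips"],
}
vector = combined_features(customer)
```

The resulting `vector` has six entries: two from the structured data and four from the transaction descriptions. A credit risk model would then be trained on vectors like this one.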

Entity Recognition

Think of this use case as text recognition on steroids. Understanding the context, the difference between words, and their importance is easy for us humans. But a machine by default doesn’t see a difference between the words “road” and “Sydney.” For software, it’s just a combination of transformed symbols. Being able to teach a system to recognize named entities opens up a variety of opportunities.

It is not an easy task. But deep learning algorithms have been successfully completing it. Some of the more advanced approaches to this problem include:

  • Creating model ensembles, where two or more algorithms are combined to deliver the result. Sometimes the resulting models are referred to as blenders.
  • Creating combined systems that rely on deep learning models and rules, set by the operator.
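The second approach, a model combined with operator-defined rules, can be sketched as follows. Everything here is hypothetical: `model_entities` is a crude placeholder standing in for a trained deep learning model, and the gazetteer and date pattern are examples of rules an operator might supply.

```python
import re

# Hypothetical rules supplied by the operator
GAZETTEER = {"sydney": "LOCATION", "aspirin": "DRUG"}
PATTERNS = [(re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "DATE")]

def model_entities(text):
    """Placeholder for a trained model: flags capitalized words after the first token."""
    tokens = text.split()
    return {
        token.strip(".,"): "ENTITY"
        for i, token in enumerate(tokens)
        if i > 0 and token[:1].isupper()
    }

def rule_entities(text):
    """Apply the operator's gazetteer and regular-expression rules."""
    found = {}
    for word in re.findall(r"[A-Za-z]+", text):
        label = GAZETTEER.get(word.lower())
        if label:
            found[word] = label
    for pattern, label in PATTERNS:
        for match in pattern.finditer(text):
            found[match.group()] = label
    return found

def recognize(text):
    """Merge both sources; operator rules win when they disagree."""
    entities = model_entities(text)
    entities.update(rule_entities(text))
    return entities

result = recognize("The road to Sydney was closed on 2018-10-22 near Parramatta.")
# result maps "Sydney" to LOCATION, "2018-10-22" to DATE, and "Parramatta" to ENTITY
```

Letting the rules override the model is one possible design choice; it suits cases where the operator’s knowledge is more reliable than the model for a known set of terms.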

[Figure: Deep learning-based NLP]

The applications of machine learning in entity recognition are limitless in industries heavily reliant on unstructured data (text):

  • Entity recognition can be used to extract drug test data from patient healthcare records, with models that are trained to recognize specific drug names and side effects. This way clinical data could be quickly analyzed to better understand drug effects.
  • R&D divisions can use the full corpora of research in their specific domain to find published papers that cover a highly specialized topic and sift through the irrelevant content.
  • Banking, investment, and other organizations can use public databases like the SEC’s EDGAR repository to uncover business health KPIs of their competitors and potential investment targets, or for general market research.

Topic Modeling

This technique is used to identify “hidden” topics within a text. For example, you sell books. One of the categories that you offer is called “sci-fi.” So you tag all of the books in this category as “sci-fi” and when someone purchases a book from your selection, you offer recommendations. But since “sci-fi” is a very broad topic, the user gets a ton of recommendations that they might not be interested in.

When you apply topic modeling to the text of your “sci-fi” books, the output of the algorithm delivers a variety of topics based on the text. This way, when your reader receives the next recommendation for a book, they’ll get a more precise suggestion and would be more likely to check out the book. The New York Times uses topic models to improve article recommendations for its readers.

[Figure: Topic modeling]

As you can imagine, there are numerous other applications of topic modeling techniques.

One of the most frequently and effectively used models for this technique is Latent Dirichlet Allocation (LDA), which assigns a probability to each topic the text might refer to. It can therefore express uncertainty about the results, which in its own right can be an insight. There are other algorithms in this category, such as Non-Negative Matrix Factorization (NMF), which produces non-negative topic weights rather than probabilities; a document is then typically assigned to its highest-weighted topic.
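To show what “a probability for each topic” looks like in practice, here is a compact sketch of LDA trained with collapsed Gibbs sampling, one common way to fit the model. It is a teaching sketch under assumptions: the three tiny documents are made up, the hyperparameters `alpha` and `beta` are arbitrary, and a real project would normally use an established library implementation.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Return one topic-probability list per document (collapsed Gibbs sampling)."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # per-document topic counts
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # per-topic word counts
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        current = []
        for word in doc:
            k = rng.randrange(num_topics)
            current.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(current)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove the current assignment before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed share of each topic in each document; every row sums to 1.
    return [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]

# Made-up, pre-tokenized documents
docs = [
    "spaceship alien planet spaceship".split(),
    "robot android robot machine".split(),
    "alien planet robot spaceship".split(),
]
topic_probabilities = lda_gibbs(docs, num_topics=2)
```

Each row of `topic_probabilities` is a distribution over the two topics for one document, so a document that mixes vocabulary from both topics can receive a split such as part one topic and part the other instead of a single hard label.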

The types of models that you might use for topic modeling depend on the types of data and the goals that you’re pursuing with your text analysis project. That’s why it’s essential to have the necessary expertise on board or partner with an experienced service provider that has a good track record.


It’s important to understand that this overview of a fraction of the use cases and applications of machine learning in text analysis is just scratching the surface. Machine learning text analysis is an incredibly complicated and rigorous process.

The feature engineering efforts alone could take a considerable amount of time, and the results may be less than optimal if you don’t choose the right approaches (n-grams, cosine similarity, or others). The results of your work might be incredibly rewarding, but you need a strong vision for the end goal of the project. Before even starting the project, your team needs to review the latest innovations in the niche. Practical research around the applications of recurrent neural networks (RNNs) and convolutional neural networks (CNNs) for text analysis and NLP is expanding every day. You don’t need to reinvent the wheel. Maybe there’s already a documented way of solving your business problem.
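For readers unfamiliar with the two feature engineering terms mentioned above, here is a short sketch: n-grams are sequences of n consecutive tokens, and cosine similarity measures how close two count vectors are. The sample sentences are made up, and splitting on whitespace is a simplifying assumption.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up sample sentences, compared on their bigrams (n = 2)
first = Counter(ngrams("the loan was approved".split(), 2))
second = Counter(ngrams("the loan was denied".split(), 2))
similarity = cosine_similarity(first, second)
```

The two sentences share two of their three bigrams, so `similarity` comes out at roughly 0.67.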

Having a strong team to materialize and operationalize the project is a must. Good thing that we have one in mind.

Is there an ongoing text analysis project that you’re trying to accomplish with machine learning? What is the biggest obstacle? What advice would you give to business owners considering products or solutions in this niche? Share your thoughts and ideas in the comments below!

October 22, 2018