Text Summarization Using Deep Learning
In this article, we will build a text summarizer using deep learning. We will walk through the process step by step and then implement our first text summarization model in Python.
What is Text Summarization in NLP?
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. It helps machines process and understand human language.
“Text summarization is the problem of creating a short, accurate, and fluent summary of a longer text document.”
There are broadly two different approaches that are used for text summarization:
- Extractive Summarization
- Abstractive Summarization
Let’s look at these two types in more detail.
Extractive Summarization
In this type of summarization, we identify the important sentences or phrases in the original text and extract only those. An extractive summarizer can be built, for example, with the TextRank algorithm.
Abstractive Summarization
Here, we generate new sentences that convey the meaning of the original text. The sentences produced by abstractive summarization may not appear in the source text at all.
So here we are going to build an Abstractive Text Summarizer using Deep Learning.
Introduction to Sequence-to-Sequence (Seq2Seq) Modeling
We can build a Seq2Seq model on any problem that involves sequential information. Two very common applications are:
- Neural Machine Translation
- Named Entity Recognition
In the case of Neural Machine Translation, the input is text in one language and the output is text in another language.
In the case of Named Entity Recognition, the input is a sequence of words and the output is a sequence of tags, one for each input word.
Our objective is to build an Abstractive Text Summarizer where the input is a long sequence of words and the output is a short summary of those words. So, we can model this as a Many-to-Many Seq2Seq problem.
There are two major components of a Seq2Seq model:
- Encoder
- Decoder
Encoder-Decoder
The Encoder-Decoder architecture is mainly used to solve the sequence-to-sequence (Seq2Seq) problems where the input and output sequences are of different lengths.
There are two phases in setting up the Encoder-Decoder:
- Training Phase
- Inference Phase
Understanding the Problem Statement
Customer reviews can often be long and descriptive. Analyzing these reviews manually, as you can imagine, is really time-consuming. This is where the brilliance of Natural Language Processing can be applied to generate a summary for long reviews.
Implementing Text Summarization in Python using Keras
Keras is an open-source library that provides a Python interface for artificial neural networks.
Let’s import it into our environment:
Import the Libraries
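A typical import block for this pipeline might look as follows. The exact set of libraries is an assumption (the original imports are not shown); it uses TensorFlow's bundled Keras, pandas, and NumPy:

```python
import re

import numpy as np
import pandas as pd

# Keras building blocks used later for the Seq2Seq model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping
```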
Read the dataset
Drop Duplicates and NA values
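The article does not name the dataset file, so the snippet below parses a tiny inline CSV with the same two columns (`Text`, `Summary`) used later; in practice you would call `pd.read_csv` on your reviews file instead. Duplicates and missing values are then dropped:

```python
import io
import pandas as pd

# Hypothetical stand-in for the real reviews CSV
csv_data = io.StringIO(
    "Text,Summary\n"
    "Great taffy at a great price,Great taffy\n"
    "Great taffy at a great price,Great taffy\n"
    "This oatmeal is not good,\n"
)
data = pd.read_csv(csv_data)

data.drop_duplicates(subset=["Text"], inplace=True)  # remove duplicate reviews
data.dropna(axis=0, inplace=True)                    # remove rows with missing values
```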
Preprocessing
Performing basic preprocessing steps is very important before we get to the model building part. Using messy and uncleaned text data is a potentially disastrous move. So in this step, we will drop all the unwanted symbols, characters, etc. from the text that do not affect the objective of our problem.
Here is the dictionary that we will use for expanding the contractions:
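A small slice of such a contraction dictionary (the real one covers far more contractions), with a quick word-by-word expansion:

```python
# Illustrative subset of the contraction map
contraction_mapping = {
    "ain't": "is not", "aren't": "are not", "can't": "cannot",
    "didn't": "did not", "don't": "do not", "it's": "it is",
    "i've": "i have", "won't": "will not", "you'll": "you will",
}

# Expand contractions token by token, leaving other words unchanged
text = "it's great but i've had better"
expanded = " ".join(contraction_mapping.get(w, w) for w in text.split())
```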
a) Text Cleaning
Let’s look at the first 10 reviews in our dataset to get an idea of the text preprocessing steps:
We will perform the following preprocessing tasks on our data:
- Convert everything to lowercase
- Remove HTML tags
- Contraction mapping
- Remove ('s)
- Remove any text inside parentheses ( )
- Eliminate punctuation and special characters
- Remove stopwords
- Remove short words
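The steps above can be sketched as a single cleaning function. The contraction map and stopword list below are tiny illustrative stand-ins for the full versions (NLTK's English stopword list is commonly used here):

```python
import re

# Small illustrative stand-ins; the real versions are much larger
contraction_mapping = {"can't": "cannot", "won't": "will not", "it's": "it is"}
stop_words = {"a", "an", "the", "is", "and", "of", "in", "to"}

def text_cleaner(text):
    s = text.lower()                                                 # lowercase
    s = re.sub(r"<[^>]+>", "", s)                                    # strip HTML tags
    s = " ".join(contraction_mapping.get(w, w) for w in s.split())   # expand contractions
    s = re.sub(r"'s\b", "", s)                                       # remove 's
    s = re.sub(r"\([^)]*\)", "", s)                                  # drop text in parentheses
    s = re.sub(r"[^a-z ]", "", s)                                    # punctuation/special chars
    tokens = [w for w in s.split() if w not in stop_words]           # remove stopwords
    tokens = [w for w in tokens if len(w) > 1]                       # remove short words
    return " ".join(tokens)
```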
b) Summary Cleaning
Now let's look at the first 10 rows of the summary column to get an idea of its preprocessing steps:
Define the function for this task:
Remember to add the start and end special tokens at the beginning and end of the summary:
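A minimal sketch of the summary-cleaning function, with `sostok`/`eostok` as the (assumed) start and end token names; the decoder uses these tokens to know when to begin and stop generating:

```python
import re

def summary_cleaner(text):
    s = text.lower()
    s = re.sub(r"[^a-z ]", "", s)   # keep letters and spaces only
    return " ".join(s.split())      # collapse repeated whitespace

# Wrap each cleaned summary with the special start/end tokens
cleaned = ["Great taffy!", "Not as advertised"]
summaries = ["sostok " + summary_cleaner(s) + " eostok" for s in cleaned]
```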
Now, let’s take a look at the top 5 reviews and their summary:
Understanding the distribution of the sequences
Here, we will analyze the lengths of the reviews and the summaries to get an overall idea of how the text lengths are distributed. This will help us fix the maximum sequence lengths.
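One way to inspect the distributions is to histogram the word counts of both columns; the few reviews below are illustrative stand-ins for the cleaned dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative cleaned reviews and summaries
reviews = ["great taffy at great price", "oatmeal not good", "best dog food ever bought"]
summaries = ["great taffy", "not good", "best food"]

text_word_count = [len(r.split()) for r in reviews]
summary_word_count = [len(s.split()) for s in summaries]

# Side-by-side histograms of review and summary lengths
length_df = pd.DataFrame({"text": text_word_count, "summary": summary_word_count})
length_df.hist(bins=30)
plt.savefig("length_distribution.png")
```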
For example, if the vast majority of summaries fall below a certain word count, we can cap the summary sequences at that length and truncate or pad the rest.
Now we are getting closer to the model-building part. Before that, we need to split the dataset into training and validation sets.
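A sketch of the split using scikit-learn's `train_test_split`, holding out 10% for validation; the four tiny examples are placeholders for the cleaned dataset:

```python
from sklearn.model_selection import train_test_split

# Illustrative cleaned reviews and token-wrapped summaries
cleaned_text = ["great taffy at great price", "oatmeal not good",
                "delicious product", "arrived broken"]
cleaned_summary = ["sostok great taffy eostok", "sostok not advertised eostok",
                   "sostok delicious eostok", "sostok disappointed eostok"]

x_tr, x_val, y_tr, y_val = train_test_split(
    cleaned_text, cleaned_summary,
    test_size=0.1,            # 10% holdout for validation
    random_state=0, shuffle=True)
```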
Preparing the Tokenizer
The tokenizer builds a vocabulary from the text and converts word sequences into integer sequences.
Let's build tokenizers for the text and the summary:
a) Text Tokenizer
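A sketch of the text tokenizer using Keras' `Tokenizer` and `pad_sequences`; the two reviews and the maximum length are illustrative placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_text_len = 30  # assumed cap, chosen from the length distribution

x_tr = ["great taffy at great price", "oatmeal not good"]  # illustrative reviews

x_tokenizer = Tokenizer()
x_tokenizer.fit_on_texts(x_tr)

# Convert word sequences to integer sequences and pad to a fixed length
x_tr_seq = x_tokenizer.texts_to_sequences(x_tr)
x_tr_pad = pad_sequences(x_tr_seq, maxlen=max_text_len, padding="post")

x_voc = len(x_tokenizer.word_index) + 1  # vocabulary size (+1 for padding index 0)
```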
b) Summary Tokenizer
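The summary tokenizer is built the same way, over the token-wrapped summaries; again, the sample data and maximum length are placeholders:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_summary_len = 8  # assumed cap for summaries

y_tr = ["sostok great taffy eostok", "sostok not advertised eostok"]

y_tokenizer = Tokenizer()
y_tokenizer.fit_on_texts(y_tr)

y_tr_seq = y_tokenizer.texts_to_sequences(y_tr)
y_tr_pad = pad_sequences(y_tr_seq, maxlen=max_summary_len, padding="post")

y_voc = len(y_tokenizer.word_index) + 1  # summary vocabulary size
```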
Model building
Now we move on to the model-building part. Before that, let's understand a few terms:
- Return Sequences = True: the LSTM produces the hidden state output for every timestep, not just the last one.
- Return State = True: the LSTM additionally returns its hidden state and cell state for the last timestep.
- Initial State: initializes the internal states of the LSTM for its first timestep.
- Stacked LSTM: multiple LSTM layers on top of each other, which leads to a better representation of the sequence.
Here we build a stacked LSTM for the encoder:
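A sketch of the encoder-decoder described above, with three stacked encoder LSTMs feeding their final states into a single decoder LSTM. The vocabulary sizes, embedding size, and latent dimension are placeholder values; in a real run they come from the tokenizers and the length analysis:

```python
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed
from tensorflow.keras.models import Model

# Placeholder sizes (assumptions, not from the article)
max_text_len, x_voc, y_voc = 30, 8000, 2000
embedding_dim, latent_dim = 100, 300

# Encoder: three stacked LSTM layers over the embedded review sequence
encoder_inputs = Input(shape=(max_text_len,))
enc_emb = Embedding(x_voc, embedding_dim, trainable=True)(encoder_inputs)

encoder_output1, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(enc_emb)
encoder_output2, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(encoder_output1)
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                         return_state=True)(encoder_output2)

# Decoder: a single LSTM initialised with the encoder's final states
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(y_voc, embedding_dim, trainable=True)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    dec_emb, initial_state=[state_h, state_c])

# Softmax over the summary vocabulary at every decoder timestep
decoder_outputs = TimeDistributed(Dense(y_voc, activation="softmax"))(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```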
Early stopping halts the training of the neural network at the right time by monitoring a user-specified metric. Here we monitor the validation loss (val_loss): once it starts increasing, the model stops training.
We will train the model with a batch size of 512 and validate it on a holdout set (10% of the dataset). Batch size is a deep learning term that refers to the number of training examples used in one iteration.
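To make the fit call concrete, here is a miniature, runnable stand-in: a tiny version of the model trained for one epoch on random integer data. In a real run you would use the full model above, the padded tokenizer output, `batch_size=512`, and more epochs:

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

# Tiny placeholder dimensions so this demo trains in seconds
x_voc, y_voc, latent_dim = 50, 20, 16
max_text_len, max_summary_len = 10, 5

enc_in = Input(shape=(max_text_len,))
_, h, c = LSTM(latent_dim, return_state=True)(Embedding(x_voc, 8)(enc_in))
dec_in = Input(shape=(None,))
dec_out, _, _ = LSTM(latent_dim, return_sequences=True, return_state=True)(
    Embedding(y_voc, 8)(dec_in), initial_state=[h, c])
out = TimeDistributed(Dense(y_voc, activation="softmax"))(dec_out)
model = Model([enc_in, dec_in], out)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")

# Random integer stand-ins for the padded review and summary sequences
x = np.random.randint(1, x_voc, (64, max_text_len))
y = np.random.randint(1, y_voc, (64, max_summary_len))

# Stop training once val_loss stops improving
es = EarlyStopping(monitor="val_loss", mode="min", patience=2)

# Teacher forcing: the decoder input is the summary shifted by one timestep
history = model.fit([x, y[:, :-1]], y[:, 1:],
                    epochs=1, batch_size=32,   # the article uses batch_size=512
                    validation_split=0.1,      # 10% holdout
                    callbacks=[es], verbose=0)
```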
Understanding the Diagnostic plot
Here we plot diagnostic curves to understand the behavior of the model over time.
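Plotting the training and validation loss side by side reveals over- or under-fitting. The loss values below are illustrative; in practice they come from `history.history["loss"]` and `history.history["val_loss"]`:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts
import matplotlib.pyplot as plt

# Illustrative loss values for a few epochs
train_loss = [2.1, 1.6, 1.3, 1.2, 1.15]
val_loss = [2.2, 1.8, 1.6, 1.55, 1.6]

plt.plot(train_loss, label="train")
plt.plot(val_loss, label="test")
plt.legend()
plt.savefig("loss_curve.png")
```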
Now let's build the dictionaries that convert indices back to words for the target and source vocabularies.
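These reverse lookups can be built by inverting the tokenizers' `word_index` dictionaries; the small indices below are illustrative stand-ins for the fitted tokenizers:

```python
# Illustrative word indices; in practice: x_tokenizer.word_index, y_tokenizer.word_index
x_word_index = {"great": 1, "taffy": 2}
y_word_index = {"sostok": 1, "eostok": 2, "good": 3}

# Invert index -> word for both vocabularies
reverse_source_word_index = {i: w for w, i in x_word_index.items()}
reverse_target_word_index = {i: w for w, i in y_word_index.items()}
target_word_index = y_word_index  # word -> index, used to seed the decoder
```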
Inference
Set up the inference models for the Encoder-Decoder.
Below, we define the function that implements the inference process.
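A runnable sketch of the whole inference step on a tiny untrained stand-in model: separate encoder and decoder inference models are wired from the shared layers, and `decode_sequence` generates the summary one token at a time. All dimensions and the token names (`sostok`/`eostok`) are illustrative assumptions:

```python
import numpy as np
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense, TimeDistributed
from tensorflow.keras.models import Model

# Tiny placeholder dimensions
x_voc, y_voc, latent_dim = 50, 20, 16
max_text_len, max_summary_len = 10, 5

enc_in = Input(shape=(max_text_len,))
enc_out, state_h, state_c = LSTM(latent_dim, return_sequences=True,
                                 return_state=True)(Embedding(x_voc, 8)(enc_in))

dec_in = Input(shape=(None,))
dec_emb_layer = Embedding(y_voc, 8)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_dense = TimeDistributed(Dense(y_voc, activation="softmax"))
dec_out, _, _ = decoder_lstm(dec_emb_layer(dec_in), initial_state=[state_h, state_c])
model = Model([enc_in, dec_in], decoder_dense(dec_out))  # training model (untrained here)

# Encoder inference model: maps a review to the encoder outputs and final states
encoder_model = Model(enc_in, [enc_out, state_h, state_c])

# Decoder inference model: runs a single timestep from the previous states
dec_h_in = Input(shape=(latent_dim,))
dec_c_in = Input(shape=(latent_dim,))
step_out, step_h, step_c = decoder_lstm(dec_emb_layer(dec_in),
                                        initial_state=[dec_h_in, dec_c_in])
decoder_model = Model([dec_in, dec_h_in, dec_c_in],
                      [decoder_dense(step_out), step_h, step_c])

# Illustrative vocabulary lookups
target_word_index = {"sostok": 1, "eostok": 2}
reverse_target_word_index = {1: "sostok", 2: "eostok", 3: "good", 4: "taffy"}

def decode_sequence(input_seq):
    # Encode the review, then generate the summary token by token,
    # feeding each sampled token back in until eostok or the length cap.
    _, h, c = encoder_model.predict(input_seq, verbose=0)
    target_seq = np.array([[target_word_index["sostok"]]])
    decoded = []
    while True:
        tokens, h, c = decoder_model.predict([target_seq, h, c], verbose=0)
        sampled_index = int(np.argmax(tokens[0, -1, :]))
        word = reverse_target_word_index.get(sampled_index, "")
        if word == "eostok" or sampled_index == 0 or len(decoded) >= max_summary_len - 1:
            break
        decoded.append(word)
        target_seq = np.array([[sampled_index]])
    return " ".join(decoded)

summary = decode_sequence(np.random.randint(1, x_voc, (1, max_text_len)))
```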
Let's define the functions that convert an integer sequence back to a word sequence, for the summaries as well as the reviews.
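A sketch of these two helpers; the reverse indices are illustrative stand-ins for the inverted tokenizer dictionaries, and `sostok`/`eostok` are the assumed special tokens:

```python
# Illustrative reverse lookups; in practice built from the tokenizers
reverse_source_word_index = {1: "great", 2: "taffy", 3: "price"}
reverse_target_word_index = {1: "sostok", 2: "eostok", 3: "good", 4: "taffy"}
target_word_index = {"sostok": 1, "eostok": 2}

def seq2text(input_seq):
    # Convert an integer review sequence back to words, skipping padding zeros
    return " ".join(reverse_source_word_index[i] for i in input_seq if i != 0)

def seq2summary(input_seq):
    # Same for summaries, also skipping the special start/end tokens
    special = (target_word_index["sostok"], target_word_index["eostok"])
    return " ".join(reverse_target_word_index[i] for i in input_seq
                    if i != 0 and i not in special)
```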
Here is the output that we get:
As you can see, we are not yet getting fully accurate summaries; more data and longer training can improve the results.
This is how we can perform text summarization using deep learning concepts in Python.