Using WordClouds and N-grams to visualize text data

Alifia Ghantiwala
6 min read · Apr 12, 2022


Introduction

Working with text data can be very different from working with numerical data in machine learning. The entire pipeline of data visualization, cleaning, preprocessing, tokenization, and lemmatization differs for textual data from that for plain numerical data.

The first step to building better models is to understand your data. Data visualization can be a good aid in understanding the data at hand.

In this article, we aim to visualize textual data in a way that is pleasing to look at and also provides some insight into the data.

Data Visualization and Analysis

The dataset we will work with is a consumer complaint dataset (available on Kaggle).

The features available are:

1) Product: the product for which the complaint was received.

2) Narrative: the complaint text written by the consumer.
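Before any plotting, we need the data in a DataFrame. A minimal loading sketch (the file path is a placeholder and the column names are assumed from the snippets below; adjust both to match your copy of the Kaggle dataset):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image

# Hypothetical path: point this at your copy of the Kaggle CSV.
train = pd.read_csv("../input/consumer-complaints/complaints.csv")
# Keep only the two columns used in this article; drop rows with no narrative.
train = train[["product", "narrative"]].dropna(subset=["narrative"])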

Checking the balance in Data

train["product"].value_counts()sns.countplot(data=train,x="product")
plt.xticks(rotation=45)
plt.show()

Observation: The data is highly imbalanced.

The credit reporting product has the most complaints.

Plotting Wordclouds

A wordcloud provides a visual representation of your text. The size of each word in a word cloud is determined by its frequency and relevance in the text. Word clouds are useful for quick insights into the data, which can then be investigated further.

We first take the subset of credit card complaints from the entire dataset and plot a basic word cloud for it.

credit_card = train[train["product"]=="credit_card"]

A helper function that concatenates all complaint narratives into one long lowercased string of word tokens:

def function(train):
    # Concatenate every complaint narrative into one lowercased string.
    comment_words = ""
    for i in train["narrative"]:
        val = str(i)
        tokens = val.split()
        for k in range(len(tokens)):
            tokens[k] = tokens[k].lower()
        comment_words += " ".join(tokens) + " "
    return comment_words

Plotting a basic word cloud

def plot_wordcloud(train):
    from wordcloud import WordCloud, STOPWORDS
    stopwords = set(STOPWORDS)
    comment_words = function(train)
    wordcloud = WordCloud(width=800, height=800,
                          background_color='black',
                          stopwords=stopwords,
                          min_font_size=10,
                          collocations=False).generate(comment_words)
    # Plot the WordCloud image.
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

The above plot looks pretty basic; let’s try changing the colormap of the word cloud to make it more visually appealing. I used colormap='tab20c' for this plot; you can find a list of all colormaps in the Matplotlib documentation.
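The only change from the basic call is the colormap argument; everything else stays as before:

wordcloud = WordCloud(width=800, height=800,
                      background_color='black',
                      min_font_size=10,
                      collocations=False,
                      colormap='tab20c').generate(comment_words)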

This does look a bit better, doesn’t it? But it is still pretty basic.

We now try to give the word cloud a shape by using a mask image.

First, we need an image with a completely white background and a clearly bounded foreground object.

I chose the image of a credit card below to create my mask.

To create the mask, we use PIL to load the image and convert it to a NumPy array, as can be seen in the next code snippet.

cc_mask = np.array(Image.open('../input/credit-card/credit_card.jpg'))
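One thing worth knowing: the wordcloud library treats pure-white (value 255) pixels as masked-out background, so JPEG compression noise in an “almost white” background can produce a ragged shape. A small sketch to force near-white pixels to pure white (the threshold of 230 is a guess; tune it for your image):

cc_mask = np.array(Image.open('../input/credit-card/credit_card.jpg').convert('L'))
# Anything brighter than the threshold becomes pure-white background.
cc_mask[cc_mask > 230] = 255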

In the original word cloud function, we add a mask parameter that points to cc_mask.

def plot_wordcloud(train):
    from wordcloud import WordCloud, STOPWORDS
    stopwords = set(STOPWORDS)
    comment_words = function(train)
    wordcloud = WordCloud(scale=3,
                          background_color='white',
                          mask=cc_mask,
                          max_words=150,
                          colormap='tab20c',
                          stopwords=stopwords,
                          collocations=True).generate_from_text(comment_words)
    # Plot the WordCloud image.
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

On plotting the word cloud, we get the result below.

Well, this does not look so good. Now we try with a different credit card image as the mask.
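ccc_mask below is simply a second mask built the same way as cc_mask, from the new image (the file name here is a placeholder):

ccc_mask = np.array(Image.open('../input/credit-card/credit_card_2.jpg'))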

def plot_wordcloud(train):
    from wordcloud import WordCloud, STOPWORDS
    stopwords = set(STOPWORDS)
    comment_words = function(train)
    wordcloud = WordCloud(scale=3,
                          background_color='white',
                          mask=ccc_mask,
                          max_words=150,
                          colormap='Dark2',
                          stopwords=stopwords,
                          collocations=True).generate_from_text(comment_words)
    # Plot the WordCloud image.
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

This word cloud looks like a credit card, doesn’t it? It would look even better with a border around its boundary.

Next, we add a contour to the above image as in the next code snippet and look at the results.

def plot_wordcloud(train):
    from wordcloud import WordCloud, STOPWORDS
    stopwords = set(STOPWORDS)
    comment_words = function(train)
    wordcloud = WordCloud(scale=3,
                          background_color='white',
                          mask=ccc_mask,
                          max_words=150,
                          colormap='Dark2',
                          stopwords=stopwords,
                          collocations=True,
                          contour_color="black",
                          contour_width=3).generate_from_text(comment_words)
    # Plot the WordCloud image.
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title("Credit card complaint wordcloud")
    plt.show()

Looks neat!

Similarly, after some trial and error, I was able to create wordclouds for the other product types.
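To avoid copy-pasting the plotting function for every product, one option is to parameterize it by product and mask. A sketch (the product label and mask variable in the example call are placeholders matching whatever your dataset and images use):

def plot_product_wordcloud(train, product, mask, title):
    from wordcloud import WordCloud, STOPWORDS
    subset = train[train["product"] == product]
    comment_words = function(subset)
    wordcloud = WordCloud(scale=3, background_color='white', mask=mask,
                          max_words=150, colormap='Dark2',
                          stopwords=set(STOPWORDS), collocations=True,
                          contour_color="black",
                          contour_width=3).generate_from_text(comment_words)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.title(title)
    plt.show()

# Example call; "retail_banking" is an assumed label, rb_mask an assumed mask.
plot_product_wordcloud(train, "retail_banking", rb_mask, "Retail banking complaint wordcloud")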

Product Type: Retail Banking

Product Type: Loan and Mortgage

For creating the masks for the wordclouds below, I used Canva; you could use MS Paint as well.

Product type: Debt Collection

Product Type: Credit Reporting

Unigram, Bigram, and Trigram Visualization

Looking at the most common words in the text can give us important insight into it. We use CountVectorizer to create unigrams, bigrams, and trigrams, and visualize them.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer().fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    # Pair each vocabulary word with its total count across the corpus.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# cr is the credit-reporting subset of the data.
common_words = get_top_n_words(cr['narrative'], 20)
df1 = pd.DataFrame(common_words, columns=['narrative', 'count'])
df1.groupby('narrative').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 words in narrative before removing stop words')

We can see that credit, account, and report are the most common unigrams for credit reporting product types.

Next, we look at the bigrams.

from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    # Pair each bigram with its total count across the corpus.
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

common_words = get_top_n_words(cr['narrative'], 20)
df1 = pd.DataFrame(common_words, columns=['narrative', 'count'])
df1.groupby('narrative').sum()['count'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 bigrams in narrative before removing stop words')

Visualizing bigrams gives us better context for the data. We can see that the word credit appears many times across the 20 most frequent bigrams.

For plotting the trigrams, I changed ngram_range to (3,3) when initializing the CountVectorizer.
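In other words, only the vectorizer line inside get_top_n_words changes:

vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)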

Visualizing the trigrams gives us further context for the credit reporting narratives.

We visualize trigrams for each of the products next.

Debt Collection

We can see some similar trigrams between debt collection and credit reporting products, like “fair credit reporting” and “credit reporting act”.

Loan

Retail Banking

Conclusion

We looked at how word clouds and n-grams can help us understand a dataset better by actually visualizing its context. You now know how to create beautiful word clouds that generate insights; I would suggest trying the same on a dataset of your interest.

To summarise the article, the main points we discussed were:

  • Text data visualization is different from numerical data visualization.
  • Using wordclouds we can view the most prominent words from the dataset based on their frequency.
  • Using unigrams, bigrams, and trigrams we understand the context of the data as well as similarities if any between two classes of data.
