
Clustering Reddit's Hottest Topics

Executive Summary/Synopsis

This project explores the topics of discussion on Reddit, based on thread titles, by using unsupervised learning to cluster the data.

Reddit is a website where a wide range of topics is discussed, from pop culture to politics to mundane dinner-table subjects. The analysis was done from the content contributor's point of view.

The data is analyzed with K-means clustering, a method of unsupervised learning. The thread titles are clustered to gain insight into the most talked-about topics on Reddit. The basic breakdown of the process is as follows (a compact code sketch follows the list):

1. Read the data from the given file and load it into a pandas DataFrame.
2. Clean the data by lowercasing the titles and removing punctuation and non-ASCII characters.
3. Use TF-IDF (term frequency–inverse document frequency) to vectorize the titles.
4. Use truncated SVD (singular value decomposition) to reduce the vectorized features to components explaining at least 80% of the variance.
5. Use K-means clustering to separate the titles into k clusters:
    a.) use the elbow method to select an appropriate value of k
    b.) run K-means clustering with the selected k
6. After clustering, place the labelled data into a DataFrame to analyze the results.
7. The resulting clusters cover the following topics:
    a.) U.S. politics
    b.) trivial topics (gaming / prizes)
    c.) requests for help or reports of a new discovery
    d.) tech support help requests
    e.) miscellaneous titles too spread out to cluster cleanly
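
As a quick orientation, here is a minimal sketch of this pipeline; the file name and n_components match the notebook, while k = 6 and the fixed random_state are illustrative choices discussed later.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# steps 1-2: load the titles and lowercase them (punctuation removal omitted for brevity)
titles = pd.read_csv("reddit-dmw-sample.txt", sep="\t")["title"].str.lower()

# step 3: TF-IDF vectorization with English stop words removed
tfidf = TfidfVectorizer(stop_words="english").fit_transform(titles)

# step 4: dimensionality reduction to roughly 80% explained variance
components = TruncatedSVD(n_components=2600).fit_transform(tfidf)

# step 5: K-means clustering with k chosen via elbow analysis
labels = KMeans(n_clusters=6, random_state=0).fit_predict(components)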

Data Description

Reddit is a forum-style website where a wide range of topics is discussed, from pop culture to politics to mundane dinner-table subjects.

A person must sign up by selecting a username, and can take part in discussions by commenting on an existing thread or starting one. A text file containing the authors and titles of the discussion threads was provided.

Method

The clustering strategy is to break down the entire set of titles with the TF-IDF vectorizer, then reduce its components with Truncated SVD. Using the reduced components, k is selected by elbow analysis of internal validation criteria: inertia (sum of squared Euclidean distances), the Calinski-Harabasz index, and the silhouette coefficient.
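
For reference, these criteria take their standard forms; writing $x_i$ for the reduced title vectors, $\mu_j$ for the cluster centroids, $n$ for the number of titles and $k$ for the number of clusters:

$$\text{inertia} = \sum_{i=1}^{n} \min_{j} \lVert x_i - \mu_j \rVert^2, \qquad s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad \text{CH} = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{n-k}{k-1},$$

where $a(i)$ is the mean distance from title $i$ to the other members of its cluster, $b(i)$ is the mean distance to the nearest other cluster, and $B_k$, $W_k$ are the between- and within-cluster dispersion matrices.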

In selecting k, several trials were made. The elbow analysis did not give a clear answer based on an inflection point in the internal validation criteria, but it suggested that values of k from 4 to 7 are reasonable.

After each clustering run, the top "features" per cluster were examined to check whether the grouped features shared a common theme. Because K-means involves random initialization, the results were not always consistent.
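
One way to make repeated runs more consistent (a suggestion, not what the notebook does for the final model, where no seed is set) is to fix random_state and raise n_init so that K-means keeps the best of several initializations:

from sklearn.cluster import KMeans

# X is the TSVD-reduced TF-IDF matrix (tsvdComponents) built later in the notebook
kmeans = KMeans(n_clusters=6, n_init=25, random_state=1337, max_iter=1000)
labels = kmeans.fit_predict(X)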

Initializations

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import html
import seaborn as sns
import string
import warnings
warnings.filterwarnings("ignore")

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.random_projection import sparse_random_matrix

from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score, confusion_matrix
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
from scipy.spatial.distance import cityblock

from sklearn.metrics import calinski_harabaz_score, silhouette_score

from wordcloud import WordCloud, STOPWORDS
from sklearn.manifold import TSNE

This reads the .txt file and places the text data into a pandas DataFrame.

In [2]:
df = pd.read_csv(r"reddit-dmw-sample.txt", sep='\t')
In [3]:
df.tail()
Out[3]:
Unnamed: 0 author title
5995 5995 ceryniz Illegal towing from apartment complex MD.
5996 5996 aksumighty Amid Trump surge, nearly 20,000 Massachusetts ...
5997 5997 Aneroth_Kid Studio Ghibli and Nintendo should collaborate ...
5998 5998 whatisthishownow Can a landlord place a security camera in a co...
5999 5999 akhmadsanusi Rio Haryanto Pembalap F1 Indonesia Terkenal Da...

Pre-processing

This cleans the title column. All the text values are made lowercase and the punctuation marks are removed.

In [4]:
# this code converts all entries in df['title'] to lowercase letters
df['title'] = df['title'].map(lambda x: x.lower()) 
In [5]:
# this code removes punctuation marks from entries in df['title']
df['title'] = df['title'].map(lambda x: x.translate(str.maketrans('', '', string.punctuation))) 
In [6]:
# this code removes emojis
df['title'] = df['title'].map(lambda s: s.encode('ascii', 'ignore').decode('ascii'))
In [7]:
df=df.drop(columns='Unnamed: 0') # drops Unnamed: 0 column since it's not needed
df.tail()
Out[7]:
author title
5995 ceryniz illegal towing from apartment complex md
5996 aksumighty amid trump surge nearly 20000 massachusetts vo...
5997 Aneroth_Kid studio ghibli and nintendo should collaborate ...
5998 whatisthishownow can a landlord place a security camera in a co...
5999 akhmadsanusi rio haryanto pembalap f1 indonesia terkenal da...

TFIDF

Each title is vectorized with the TF-IDF vectorizer. The dataframe shown below contains each title broken down into a vector of weights over the feature names (the vocabulary).
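
Concretely, with scikit-learn's default settings (smooth_idf=True followed by L2 normalization of each row), the weight given to term $t$ in title $d$ is

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \left( \ln\frac{1 + n}{1 + \text{df}(t)} + 1 \right),$$

where $\text{tf}(t, d)$ is the raw count of $t$ in $d$, $n$ is the number of titles and $\text{df}(t)$ is the number of titles containing $t$; each title's vector is then scaled to unit Euclidean length.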

In [8]:
vectorizer = TfidfVectorizer(stop_words='english')  # English stop words are removed
tfidf_v = vectorizer.fit_transform(df['title'])
tfidf_df = pd.DataFrame(tfidf_v.toarray(), columns=vectorizer.get_feature_names())
tfidf_df.head()
Out[8]:
001 001025 004 005 01 0181 02 04 0530 060 ... zu zucchine zuckerberg zues zuhause zuk zulus zuppa zur zurck
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 13155 columns

Exploratory Data Analysis

Some initial analysis of the lengths of the author and title entries did not suggest any hypotheses about the overall data. There are many "[deleted]" authors, but this does not necessarily affect the overall result.

A word cloud of the prominent terms in the titles is useful for forming hypotheses about what the clustered data will look like.

In [54]:
df1 = df.copy()  # work on a copy so the original dataframe is left untouched
df1['author_length'] = [len(x) for x in df1['author']]
df1['title_length'] = [len(x) for x in df1['title']]
df1.tail()
Out[54]:
author title author_length title_length
5995 ceryniz illegal towing from apartment complex md 7 40
5996 aksumighty amid trump surge nearly 20000 massachusetts vo... 10 72
5997 Aneroth_Kid studio ghibli and nintendo should collaborate ... 11 98
5998 whatisthishownow can a landlord place a security camera in a co... 16 58
5999 akhmadsanusi rio haryanto pembalap f1 indonesia terkenal da... 12 55
In [53]:
df1.groupby('author').count().sort_values('title', ascending=False)[:5]
Out[53]:
title author_length title_length
author
[deleted] 1187 1187 1187
Buycheaplow 53 53 53
bhawaniraj 28 28 28
tropicalpost 19 19 19
ZiDarkGaming 18 18 18
In [51]:
wordcloud = WordCloud(
    background_color = 'black'
    ).generate(' '.join(df['title']))  # join all cleaned titles into one string

plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
Out[51]:
(-0.5, 399.5, 199.5, -0.5)

Dimension reduction

Based on the scikit-learn documentation, PCA is intended for dense data; for data stored as a sparse matrix, Truncated SVD is a better choice for dimensionality reduction.

In [10]:
tsvd = TruncatedSVD(n_components=2600)
tsvdComponents = tsvd.fit_transform(tfidf_v)  # reduce the sparse TF-IDF matrix to 2600 dense components

Variance explained

Truncated SVD is used to break the data down into components, with the number of components chosen so that at least 80% of the variance is explained.

In [11]:
tsvd.explained_variance_ratio_.sum() 
Out[11]:
0.8084058583399314
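
Since tsvd.explained_variance_ratio_ stores the per-component ratios in order, the smallest number of components reaching a given variance target can be read off the cumulative sum instead of guessed by trial and error (approximately, since the randomized solver introduces small numerical differences). A small sketch:

import numpy as np

cum_var = np.cumsum(tsvd.explained_variance_ratio_)   # cumulative explained variance
n_for_80 = int(np.searchsorted(cum_var, 0.80)) + 1    # first component count reaching 80%
print(n_for_80, cum_var[n_for_80 - 1])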

TSNE Visualization

t-SNE can be used to visualize how the data groups together after being dimensionally reduced with TSVD. In this case, the points are dense and clumped together, which suggests that clustering may be difficult.

In [12]:
X_new= TSNE(random_state=1337).fit_transform(
    tsvdComponents)
plt.scatter(X_new[:,0], X_new[:,1])
Out[12]:
<matplotlib.collections.PathCollection at 0x1e381f1ef60>

Elbow analysis

Now that the data has been cleaned, vectorized, and reduced to components explaining 80% of the variance, the actual clustering can take place. Using the K-means clustering method, the first step is to determine the number of clusters k.

In [13]:
X = tsvdComponents 

distortions = []
inertias = []

K = range(1,10)
for k in K:
    kmeanModel = KMeans(random_state=1337, n_clusters=k)# random state is to get consistent values for inertia
    X_predict=kmeanModel.fit_predict(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    inertias.append(kmeanModel.inertia_)
In [14]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))


ax1.plot(range(1,10), distortions, 'bx-')
ax1.set_xlabel('k')
ax1.set_ylabel('Distortion')
ax1.set_title('The Elbow Method showing the optimal k')


ax2.plot(range(1,10), inertias, 'rx-')
ax2.set_xlabel('k')
ax2.set_ylabel('Inertia')
ax2.set_title('The Elbow Method showing the optimal k')


plt.show()
In [15]:
chs = []
scs = []

K = range(2,11)
for k in K:
    kmeanModel = KMeans(random_state=1337, n_clusters=k)
    X_predict=kmeanModel.fit_predict(X)
    chs.append(calinski_harabaz_score(X, X_predict))
    scs.append(silhouette_score(X, X_predict))
In [16]:
fig, (ax3, ax4) = plt.subplots(1,2, figsize=(15,3))


ax3.plot(range(2,11), chs, 'gx-')
ax3.set_xlabel('k')
ax3.set_ylabel('Calinski Harabaz')
ax3.set_title('The Elbow Method showing the optimal k')


ax4.plot(range(2,11), scs, 'yx-')
ax4.set_xlabel('k')
ax4.set_ylabel('Silhouette score')
ax4.set_title('The Elbow Method showing the optimal k')


plt.show()

Choosing k

Using the elbow method on the inertia and distortion curves, the most appropriate number of clusters looks like it could be 3, 5 or 8. The silhouette score suggests that 4 or 7 may be better. The Calinski-Harabasz score is inconclusive. For the final clustering, k = 6 was used, which lies within the 4-to-7 range identified in the Method section.

In [28]:
kmeans_tsvd = KMeans(n_clusters=6, max_iter=1000)
cluster_num = kmeans_tsvd.fit_predict(X) 
#the resulting labels for clusters 0 to 5 will be stored in cluster_num
In [ ]:
cluster_num[110:125]

Visualizing the clustering

The figure below shows how the TSVD components are clustered together when represented in a two-dimensional plane.

In [18]:
X_n= TSNE(random_state=1337).fit_transform(
    tsvdComponents)
plt.scatter(X_n[:,0], X_n[:,1], c=cluster_num)
Out[18]:
<matplotlib.collections.PathCollection at 0x1e3cde547f0>

Labeling

A dataframe matching the actual titles with the established cluster number for each row is made. It shows each title together with its cluster label from the K-means clustering with k = 6.

In [32]:
new_df=pd.DataFrame(df['title'])
new_df['cluster_num']=cluster_num
new_df[2260:2275]
Out[32]:
title cluster_num
2260 should i take a court appointed lawyer or try ... 2
2261 simple misdemeanor 2
2262 apes can outperform humans in simple short ter... 2
2263 homemade general tao 2
2264 wa read through lots of others heres my question 2
2265 cs go matchmaking 1vs5 clutch round 3030 ge 2
2266 new years resolution minecraft lets play part 2 5
2267 living with partner on permanent disability he... 2
2268 breaking down 21 apple product flops from 1984... 4
2269 til electric company in turkey charges extra u... 1
2270 rubio a vote for trump is a vote for hillary 3
2271 legal experts how does this loophole work i hi... 2
2272 prolife congressman who pressured mistress int... 2
2273 texas in a rental what is considered a modific... 2
2274 amazing 3 in 1 combo offer at aachi 2

To get the most prominent individual terms for the entire set of titles, TfidfVectorizer is applied and K-means with k = 1 is fitted, so that the terms closest to the single centroid are the most prominent overall.
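
As an aside, the centroid of a single K-means cluster is simply the column-wise mean of the data, so the same ranking can be obtained by averaging the TF-IDF matrix directly; a small equivalent sketch:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words='english')
weights = vec.fit_transform(new_df['title'])          # TF-IDF matrix over all titles
mean_w = np.asarray(weights.mean(axis=0)).ravel()     # column-wise mean = the k=1 centroid
top10 = [vec.get_feature_names()[i] for i in mean_w.argsort()[::-1][:10]]
print(top10)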

In [20]:
vectorizer = TfidfVectorizer(stop_words='english')
cl = vectorizer.fit_transform(new_df['title'])

model = KMeans(n_clusters=1)
model.fit(cl)


order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()

cluster_w=[]


for ind in order_centroids[0, :10]:
    cluster_w.append(terms[ind])
print("Top terms for whole group of text", cluster_w) 
Top terms for whole group of text ['til', 'new', 'trump', 'sanders', 'clinton', '2016', 'game', 'year', 'bernie', 'donald']

Analyzing the results

Since each title has already been clustered and labelled, the results can now be analyzed. Using the get_feature_names() method of TfidfVectorizer and the cluster_centers_ attribute of KMeans with k equal to 1, each cluster can be characterized by the terms closest to its centroid.

The code below loops over the clusters, isolating the titles of each individual cluster and plugging them into a TF-IDF vectorizer to get the feature names. The features of each cluster are then fitted with K-means with k = 1, and sorted by how close they are to that centroid.

In [30]:
print("Top terms per cluster:")
cluster_df=pd.DataFrame()

for k in range(6): # loop over the 6 clusters
    cluster_name='Cluster ' + str(k)

    vectorizer_n = TfidfVectorizer(stop_words='english')
    #the titles belonging to a cluster is plugged into a TFIDF vectorizer
    cl_n = vectorizer_n.fit_transform(new_df[new_df['cluster_num']==k]['title']) 

    model_n = KMeans(n_clusters=1)

    model_n.fit(cl_n)

    cluster_n=[]
    order_centroids_n = model_n.cluster_centers_.argsort()[:, ::-1]
    terms_n = vectorizer_n.get_feature_names()
    
    #the vectorized feature names are clustered with KMeans(n_clusters=1)
    cluster_df[cluster_name]=[terms_n[ind] for ind in list(order_centroids_n[0, :50])]
    
    for ind in order_centroids_n[0, :10]:
        cluster_n.append(terms_n[ind])
    len_=len(terms_n)
    print(cluster_name, len_, cluster_n)


    #for each cluster, the number of features are printed along with the top ten features
Top terms per cluster:
Cluster 0 384 ['help', 'need', 'number', 'legal', 'support', 'toll', '18777788969', 'technical', 'free', 'line']
Cluster 1 308 ['company', 'development', 'design', 'web', 'best', 'application', 'til', 'year', 'digital', 'mortgage']
Cluster 2 12563 ['til', 'trump', '2016', 'game', 'just', 'donald', 'car', 'food', 'best', 'number']
Cluster 3 1040 ['sanders', 'clinton', 'bernie', 'hillary', 'trump', 'iowa', 'clintons', 'donald', 'campaign', 'emails']
Cluster 4 419 ['apple', 'iphone', 'new', 'york', 'judge', 'fbi', 'til', '6s', 'times', 'case']
Cluster 5 517 ['new', 'year', 'happy', 'years', 'game', 'birthday', 'day', '2015', 'eve', '2016']

Clustered data

A new dataframe containing the most prominent terms for each cluster is made.

In [31]:
cluster_df.iloc[:25]
#Clusters' themes are: tech support, tech news, assorted data, politics, legal advice, New Year topics
Out[31]:
Cluster 0 Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5
0 help company til sanders apple new
1 need development trump clinton iphone year
2 number design 2016 bernie new happy
3 legal web game hillary york years
4 support best just trump judge game
5 toll application donald iowa fbi birthday
6 18777788969 til car clintons til day
7 technical year food donald 6s 2015
8 free digital best campaign times eve
9 line mortgage number emails case 2016
10 steam started online million 21 games
11 pa mobile new super court gaming
12 school marketing support poll sides resolution
13 got private games new feds steam
14 ticket pay need vote ruling best
15 dial launching time plan phone party
16 project cart iowa final tv homemade
17 days illegal video defeat reality til
18 carolina app make tuesday se play
19 im crisis question state fight time
20 just trump black inside rules make
21 new housing 10 supporters locked owned
22 speeding 2007 years democratic custody video
23 death service people voters oreilly gamer
24 probably tech free caucus children resolutions

Word clouds

Word clouds for each cluster are then made. Each word cloud displays the most "popular" terms for its cluster, which is a convenient way to see at a glance which terms characterize each cluster.

In [45]:
fig, axes = plt.subplots(3, 2, figsize=(70, 50))

# one word cloud per cluster, built from that cluster's top 50 terms
for k, ax in enumerate(axes.flatten()):
    text = ' '.join(cluster_df['Cluster ' + str(k)][:50])
    wordcloud = WordCloud(
        background_color = 'black'
        ).generate(text)
    ax.imshow(wordcloud, interpolation = 'bilinear')
    ax.axis('off')

plt.show()