This project explores the different topics of discussion on Reddit, based on thread titles, by using unsupervised learning to cluster the data.
Reddit is a website where a wide range of topics is discussed, from pop culture to politics and mundane dinner-table topics; the analysis here is done from the content contributor's point of view.
The data is analyzed with K-means clustering, an unsupervised learning method. The thread titles are clustered to gain insight into the most talked-about topics on Reddit. The basic breakdown of the process is (a condensed code sketch follows the list):
1. Read the data from the given file and house it in a Pandas DataFrame.
2. Clean the data by lowercasing the text and removing punctuation and emojis.
3. Use TFIDF (term frequency–inverse document frequency) to vectorize the titles.
4. Use TSVD (truncated singular value decomposition) to reduce the vectorized features to components that explain at least 80% of the variance.
5. Use K-means clustering to separate the titles into k clusters:
a.) use the elbow method to select an appropriate value of k
b.) run K-means clustering with the selected number of clusters
6. After clustering, house the labelled data in a DataFrame to properly analyze the results.
7. The resulting clusters cover topics such as:
a.) U.S. politics
b.) trivial topics (gaming / prizes)
c.) asking for help or reporting a new discovery
d.) tech support requests
e.) miscellaneous data too spread out to be clustered
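A condensed sketch of this pipeline, assuming the same file name and column names used in the notebook below; the n_components and n_clusters values here are placeholders for illustration, not the tuned values chosen later:

import string
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

df = pd.read_csv("reddit-dmw-sample.txt", sep="\t")                        # 1. load
titles = (df["title"].str.lower()
          .str.translate(str.maketrans("", "", string.punctuation)))       # 2. clean
tfidf = TfidfVectorizer(stop_words="english").fit_transform(titles)        # 3. vectorize
components = TruncatedSVD(n_components=100).fit_transform(tfidf)           # 4. reduce (placeholder size)
labels = KMeans(n_clusters=6, random_state=0).fit_predict(components)      # 5. cluster (placeholder k)
clustered = pd.DataFrame({"title": titles, "cluster": labels})             # 6. collect for analysis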
Reddit is a forum-type website where a wide range of topics is discussed, from pop culture to politics and mundane dinner-table topics.
A person signs up by selecting a username and takes part in discussions by commenting on an existing thread or starting one. A text file containing the authors and titles of the discussion threads was provided.
The clustering strategy is to vectorize the entire set of titles with the TFIDF vectorizer, then reduce its components with Truncated SVD. Using the reduced components, k is selected by elbow analysis of internal validation criteria such as inertia (the sum of squared Euclidean distances), the Calinski-Harabasz index, and the Silhouette coefficient.
In selecting k, many trials were made. The elbow analysis did not give a clear number based on the inflection point of the internal validation criteria, but it showed that a range from 4 to 7 can be treated as k.
After clustering, the top "features" of each cluster were analyzed to check whether the grouped features shared a common theme. Because K-means involves a random initialization step, the results were not always consistent.
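Because of that randomness, fixing the seed and keeping the best of several initializations helps make runs comparable; a minimal sketch (the n_init value is an arbitrary illustration, not a setting used elsewhere in this notebook):

from sklearn.cluster import KMeans

# random_state makes the centroid initialization repeatable;
# n_init keeps the best (lowest-inertia) result of several random restarts
km = KMeans(n_clusters=6, n_init=20, random_state=1337)
# labels = km.fit_predict(reduced_components)  # the TSVD-reduced matrix built later in this notebook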
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import html
import seaborn as sns
import string
import warnings
warnings.filterwarnings("ignore")
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA
from sklearn.random_projection import sparse_random_matrix
from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score, confusion_matrix
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score
from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
from scipy.spatial.distance import cityblock
from sklearn.metrics import calinski_harabaz_score, silhouette_score
from wordcloud import WordCloud, STOPWORDS
from sklearn.manifold import TSNE
This reads the .txt file and places the text data into a DataFrame.
df = pd.read_csv(r"reddit-dmw-sample.txt", sep='\t')
df.tail()
This cleans the title column: all text values are made lowercase, and punctuation marks and emojis are removed.
# this code converts all entries in df['title'] to lowercase letters
df['title'] = df['title'].map(lambda x: x.lower())
# this code removes punctuation marks from all entries in df['title']
df['title'] = df['title'].map(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
# this code removes emojis
df['title'] = df['title'].map(lambda s: s.encode('ascii', 'ignore').decode('ascii'))
df=df.drop(columns='Unnamed: 0') # drops Unnamed: 0 column since it's not needed
df.tail()
Each row of text is vectorized with the TFIDF vectorizer. The dataframe shown below contains the rows of text broken down into vectors over the feature names.
vectorizer = TfidfVectorizer(stop_words='english')  # TF-IDF vectorizer that drops English stop words
tfidf_v = vectorizer.fit_transform(df['title'])
tfidf_df = pd.DataFrame(tfidf_v.toarray(), columns=vectorizer.get_feature_names())
tfidf_df.head()
Some initial analysis of the lengths of the author and title entries didn't suggest any hypotheses about the overall data. There are many "deleted" authors, but this doesn't necessarily affect the overall result.
A word cloud of the terms that stand out in the titles is useful for forming hypotheses about what the resulting clusters might contain.
df1 = df.copy()  # work on a copy so the added length columns do not modify df
df1['author_length']=[len(x) for x in df1['author']]
df1['title_length']=[len(x) for x in df1['title']]
df1.tail()
df1.groupby('author').count().sort_values('title', ascending=False)[:5]
wordcloud = WordCloud(
    background_color = 'black'
).generate(' '.join(df1['title']))  # join all cleaned titles (new_df is not defined until later)
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
Based on the scikit-learn documentation, PCA can only be applied to dense data; for data stored in a sparse matrix, Truncated SVD is a better choice for dimensionality reduction.
tsvd = TruncatedSVD(n_components=2600)
tsvdComponents = tsvd.fit_transform(tfidf_v)  # reduce the sparse TF-IDF matrix to 2600 components
For Truncated SVD to be used to break the data down into components, those components should explain at least 80% of the variance; the check is below.
tsvd.explained_variance_ratio_.sum()
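If the component count were to be picked programmatically instead of by trial and error, the cumulative explained variance of the fitted TSVD could be scanned for the smallest count that crosses 80%; a sketch reusing the tsvd object fitted above:

# cumulative explained variance over the fitted components
cum_var = np.cumsum(tsvd.explained_variance_ratio_)
# smallest number of components whose cumulative variance reaches 0.80
# (clipped to the total number of components if the threshold is never reached)
n_80 = int(np.searchsorted(cum_var, 0.80) + 1)
print(n_80, cum_var[min(n_80, len(cum_var)) - 1])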
t-SNE can be used to visualize how the data groups together after being dimensionally reduced by TSVD. In this case, the data is very dense and clumped together, implying possible complications in clustering.
X_new= TSNE(random_state=1337).fit_transform(
tsvdComponents)
plt.scatter(X_new[:,0], X_new[:,1])
Now that the data has been cleaned, vectorized, and reduced to components explaining 80% of the variance, the actual clustering can take place. With the K-means clustering method, the first thing to do is determine the value of k.
X = tsvdComponents
distortions = []
inertias = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(random_state=1337, n_clusters=k)  # random state is to get consistent values for inertia
    X_predict = kmeanModel.fit_predict(X)
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
    inertias.append(kmeanModel.inertia_)
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,5))
ax1.plot(range(1,10), distortions, 'bx-')
ax1.set_xlabel('k')
ax1.set_ylabel('Distortion')
ax1.set_title('The Elbow Method showing the optimal k')
ax2.plot(range(1,10), inertias, 'rx-')
ax2.set_xlabel('k')
ax2.set_ylabel('Inertia')
ax2.set_title('The Elbow Method showing the optimal k')
plt.show()
chs = []
scs = []
K = range(2,11)
for k in K:
    kmeanModel = KMeans(random_state=1337, n_clusters=k)
    X_predict = kmeanModel.fit_predict(X)
    chs.append(calinski_harabaz_score(X, X_predict))
    scs.append(silhouette_score(X, X_predict))
fig, (ax3, ax4) = plt.subplots(1,2, figsize=(15,3))
ax3.plot(range(2,11), chs, 'gx-')
ax3.set_xlabel('k')
ax3.set_ylabel('Calinski Harabaz')
ax3.set_title('The Elbow Method showing the optimal k')
ax4.plot(range(2,11), scs, 'yx-')
ax4.set_xlabel('k')
ax4.set_ylabel('Silhouette score')
ax4.set_title('The Elbow Method showing the optimal k')
plt.show()
Using the elbow method based on inertia and distortion, the most appropriate number of clusters looks like 3, 5, or 8. The Silhouette score suggests that 4 or 7 may be better. The Calinski-Harabasz score is inconclusive. As a compromise within the earlier 4-to-7 range, k=6 is used for the clustering below.
kmeans_tsvd = KMeans(n_clusters=6, max_iter=1000)
cluster_num = kmeans_tsvd.fit_predict(X)
#the resulting labels for clusters 0 to 5 will be stored in cluster_num
cluster_num[110:125]
The figure below shows how the TSVD components, projected onto a two-dimensional plane with t-SNE, group into the assigned clusters.
X_n= TSNE(random_state=1337).fit_transform(
tsvdComponents)
plt.scatter(X_n[:,0], X_n[:,1], c=cluster_num)
A dataframe matching the actual titles with the established cluster numbers per row is made. It pairs each title with its cluster label from the K-means clustering with k=6.
new_df=pd.DataFrame(df['title'])
new_df['cluster_num']=cluster_num
new_df[2260:2275]
To get the most popular individual terms for the entire set of titles, the TfidfVectorizer is applied and K-means with k=1 is run, so the top terms of that single cluster cover the whole corpus.
vectorizer = TfidfVectorizer(stop_words='english')
cl = vectorizer.fit_transform(new_df['title'])
model = KMeans(n_clusters=1)
model.fit(cl)
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
cluster_w=[]
for ind in order_centroids[0, :10]:
    cluster_w.append(terms[ind])
print("Top terms for whole group of text", cluster_w)
Since each title has already been clustered and labelled, analysis can now be done. Using the get_feature_names() method of TfidfVectorizer() and the cluster_centers_ attribute of KMeans() with k equal to 1, each cluster can be characterized by the terms closest to its centroid.
The code below loops over each cluster, isolates that cluster's titles, and feeds them into the TFIDF vectorizer to get the feature names. The features are then "clustered" via K-means with k=1, and the feature names are sorted by closeness to that centroid.
print("Top terms per cluster:")
cluster_df=pd.DataFrame()
for k in range(6):  # 6 iterations, one per cluster
    cluster_name = 'Cluster ' + str(k)
    vectorizer_n = TfidfVectorizer(stop_words='english')
    # the titles belonging to this cluster are fed into a TFIDF vectorizer
    cl_n = vectorizer_n.fit_transform(new_df[new_df['cluster_num']==k]['title'])
    model_n = KMeans(n_clusters=1)
    model_n.fit(cl_n)
    cluster_n = []
    order_centroids_n = model_n.cluster_centers_.argsort()[:, ::-1]
    terms_n = vectorizer_n.get_feature_names()
    # the vectorized features are "clustered" with KMeans(n_clusters=1) and sorted by closeness to the centroid
    cluster_df[cluster_name] = [terms_n[ind] for ind in list(order_centroids_n[0, :50])]
    for ind in order_centroids_n[0, :10]:
        cluster_n.append(terms_n[ind])
    len_ = len(terms_n)
    print(cluster_name, len_, cluster_n)
    # for each cluster, the number of features is printed along with the top ten features
A new dataframe containing the most popular terms for each cluster is made.
cluster_df.iloc[:25]
#Clusters' themes are tech support, tech news, assorted data, politics, legal advice, and New Year topics
Word clouds for each cluster are then made. Each word cloud displays the most "popular" terms in that cluster, which is a convenient way to see its most representative terms at a glance.
fig, axes = plt.subplots(3, 2, figsize=(70, 50))
# one word cloud per cluster, built from that cluster's top 50 terms
for k, ax in enumerate(axes.ravel()):
    text = ' '.join(cluster_df['Cluster ' + str(k)][:50])
    wordcloud_k = WordCloud(
        background_color = 'black'
    ).generate(text)
    ax.imshow(wordcloud_k, interpolation = 'bilinear')
    ax.axis('off')
plt.show()