Workshop 4
Online Data Processing
& Text Analysis
Learn how to transform raw scraped data into actionable insights using Python — from data cleaning and NLP preprocessing to advanced text analysis techniques.
Setup & Data Cleaning
Install dependencies, load the dataset, and prepare it for analysis.
Install Required Libraries & Font Setup
Run this cell first in Google Colab. It installs all packages AND sets up a global Chinese font (Noto Sans CJK) so every chart in this workshop renders Chinese characters correctly — no repeated setup needed.
!pip install pandas matplotlib wordcloud jieba snownlp spacy networkx gensim nltk pypinyin scikit-learn seaborn
!python -m spacy download zh_core_web_sm
# ── Global Chinese Font Setup (run once, applies to ALL charts) ─────────────────
# Install Noto CJK fonts package (provides both .ttc and .otf files)
!apt-get install -y fonts-noto-cjk > /dev/null 2>&1
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import os
# Use the .otf file directly — more reliable than .ttc in matplotlib
# This path is stable across Google Colab environments
_OTF_PATH = '/usr/share/fonts/opentype/noto/NotoSansCJKsc-Regular.otf'
if os.path.exists(_OTF_PATH):
fm.fontManager.addfont(_OTF_PATH)
CJK_PROP = fm.FontProperties(fname=_OTF_PATH)
plt.rcParams['font.family'] = CJK_PROP.get_name()
print(f"✓ Chinese font loaded: Noto Sans CJK SC (Regular)")
else:
# Fallback: search for any available CJK otf/ttf
fm._load_fontmanager(try_read_cache=False)
_candidates = fm.findSystemFonts(fontpaths=['/usr/share/fonts'], fontext='otf')
_path = next((f for f in _candidates if 'CJKsc' in f and 'Regular' in f), None)
if _path:
fm.fontManager.addfont(_path)
CJK_PROP = fm.FontProperties(fname=_path)
plt.rcParams['font.family'] = CJK_PROP.get_name()
print(f"✓ Chinese font loaded: {_path.split('/')[-1]}")
else:
plt.rcParams['font.sans-serif'] = ['Noto Sans CJK SC', 'WenQuanYi Zen Hei']
plt.rcParams['axes.unicode_minus'] = False
CJK_PROP = None
print("⚠ Using fallback font settings")
print("Setup complete. CJK_PROP is available for all charts.")Load & Clean Data
Upload dataset_1.csv to your Colab session, then run this code to clean it.
| Column | Description | Example |
|---|---|---|
| text | Post content (Chinese) | 黑神话悟空是件美好事物... |
| created_at | Post timestamp | 2024-06-01 22:37 |
| screen_name | Author username | 不学了当主唱 |
| reposts_count | Number of reposts | 12 |
| comments_count | Number of comments | 5 |
| attitudes_count | Number of likes | 25 |
import pandas as pd
# Load data
df = pd.read_csv('dataset_1.csv')
print(f"Original shape: {df.shape}")
# Drop rows with no text content
df = df.dropna(subset=['text'])
# Remove duplicate posts
df = df.drop_duplicates(subset=['text'])
# Convert timestamp to datetime
df['created_at'] = pd.to_datetime(df['created_at'])
# Strip hidden whitespace from keyword column
df['keyword'] = df['keyword'].str.strip()
# Reset index after filtering
df = df.reset_index(drop=True)
print(f"Cleaned shape: {df.shape}")
df.head(3)
NLP Preprocessing
Tokenise text, remove stopwords, and normalise word forms.
Tokenisation & Stopword Removal
Split Chinese text into meaningful words using jieba, then filter out common words with no analytical value. Stopwords are loaded from all .txt files inside a folder — making it easy to manage and extend your stopword lists.
| Concept | Description | Tool |
|---|---|---|
| Tokenisation | Split text into individual words/tokens | jieba (Chinese) / nltk (English) |
| Stopwords | Common words with no analytical value (的, 了, 是) | All .txt files in a folder, one word per line |
| Result | A list of meaningful words per post | Used for word cloud, keyword stats, topic model |
Create a folder named stopwords/ containing one or more .txt files — Chinese and English stopwords can coexist in the same folder or even the same file, one word per line. Example structure:
stopwords/chinese.txt → 的 了 是 在 …
stopwords/english.txt → the a an is are …
stopwords/social_media.txt → rt via lol 转发 …
All files are merged into one unified set. English words are matched case-insensitively. Upload the folder to Google Colab, then set STOPWORDS_FOLDER to its path.
import jieba
import re
import os
# ── Step 1: Load stopwords from a folder of .txt files ──────────────────
# The folder may contain BOTH Chinese and English stopword files.
# Each .txt file should have one stopword per line (any language).
# All files are merged into a single unified stopword set.
def load_stopwords_from_folder(folder_path):
"""Read all .txt files in a folder and merge into one stopword set.
Supports Chinese, English, and mixed-language stopword files.
"""
stopwords = set()
txt_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
print(f"Found {len(txt_files)} stopword file(s): {txt_files}")
for filename in txt_files:
filepath = os.path.join(folder_path, filename)
with open(filepath, 'r', encoding='utf-8') as f:
words = {line.strip().lower() for line in f if line.strip()}
stopwords.update(words)
print(f"Total stopwords loaded: {len(stopwords)}")
return stopwords
# Set your stopwords folder path here
STOPWORDS_FOLDER = './stopwords' # upload a 'stopwords/' folder to Colab
stopwords = load_stopwords_from_folder(STOPWORDS_FOLDER)
# ── Step 2: Tokenise and filter (Chinese + English) ──────────────────────
def clean_and_tokenize(text):
# Remove URLs and @mentions
text = re.sub(r'http\S+', '', str(text))
text = re.sub(r'@\S+', '', text)
text = re.sub(r'[\s]+', ' ', text).strip()
# Tokenize with jieba (handles Chinese; also splits English words)
words = jieba.lcut(text)
filtered = []
for w in words:
w_clean = w.strip()
if len(w_clean) <= 1:
continue
# Normalise to lowercase for English stopword matching
if w_clean.lower() in stopwords:
continue
filtered.append(w_clean)
return filtered
# Apply to dataset
df['tokens'] = df['text'].apply(clean_and_tokenize)
print(df[['text', 'tokens']].head())
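jieba can split domain terms like 黑神话悟空 into fragments. If you spot broken tokens in the output, register custom words and re-run the tokenisation cell. A minimal sketch using jieba's dictionary API (the example terms are illustrative):
import jieba

# Register domain terms so jieba keeps them whole (illustrative entries —
# swap in whatever your corpus needs, then re-run the tokenisation cell)
jieba.add_word('黑神话')
jieba.add_word('黑神话悟空')

# Or load a whole file of custom terms, one per line:
# jieba.load_userdict('userdict.txt')

print(jieba.lcut('黑神话悟空发布了'))  # the custom term should now survive as one token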
Stemming & Lemmatisation (English)
For English text, reduce words to their base form to unify variants before analysis.
| Technique | Input | Output | Tool |
|---|---|---|---|
| Stemming | running, runs, ran | run (crude cut) | PorterStemmer |
| Lemmatisation | better, best | good (context-aware) | WordNetLemmatizer |
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('wordnet')
nltk.download('omw-1.4')  # WordNet's multilingual wordlists, required by recent NLTK versions
nltk.download('punkt')
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
# --- Lemmatisation (context-aware, recommended) ---
def lemmatize_english(text):
tokens = text.lower().split()
return [lemmatizer.lemmatize(w) for w in tokens]
# --- Stemming (crude but fast) ---
def stem_english(text):
tokens = text.lower().split()
return [stemmer.stem(w) for w in tokens]
# Example
sample = "The games were being played by running players"
print("Lemmatised:", lemmatize_english(sample))
print("Stemmed: ", stem_english(sample))Text Visualisation
Text Visualisation
Visualise patterns in the data: time trends, word frequency, and word clouds.
Time Trend Analysis
Plot how post volume changes over time to identify key events or discussion peaks.
import matplotlib.pyplot as plt
# Note: the global font setup cell from Setup & Data Cleaning must be run first
# if your axis labels contain Chinese characters.
# Extract month from timestamp
df['month'] = df['created_at'].dt.to_period('M').dt.to_timestamp()
monthly_counts = df.groupby('month').size().reset_index(name='count')
# Plot line chart
plt.figure(figsize=(10, 4))
plt.plot(
monthly_counts['month'],
monthly_counts['count'],
marker='o',
color='steelblue',
linewidth=2,
markersize=6
)
plt.title('Monthly Post Volume: Black Myth Wukong on Weibo', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Number of Posts')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
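Monthly bins can hide short-lived spikes. A sketch of the same trend at daily resolution, smoothed with a 7-day rolling mean (assumes the cleaned df from earlier):
import matplotlib.pyplot as plt

# Daily post counts with a 7-day centred rolling mean
daily = df.set_index('created_at').resample('D').size()
smooth = daily.rolling(window=7, center=True).mean()

plt.figure(figsize=(10, 4))
plt.plot(daily.index, daily.values, color='lightgray', label='Daily posts')
plt.plot(smooth.index, smooth.values, color='steelblue', linewidth=2, label='7-day mean')
plt.title('Daily Post Volume (7-day Rolling Mean)')
plt.xlabel('Date')
plt.ylabel('Number of Posts')
plt.legend()
plt.tight_layout()
plt.show()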
Word Cloud
Visualise the most frequent words — the larger the word, the more frequently it appears.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# WordCloud needs an explicit font_path (the global matplotlib font setup doesn't
# apply to it), so download a Chinese font into the Colab session
!wget -q -O /tmp/chinese_font.ttf \
"https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Regular.otf"
# Combine all tokens into a single string
all_tokens = [w for tokens in df['tokens'] for w in tokens]
all_words_str = ' '.join(all_tokens)
# Generate word cloud
wc = WordCloud(
font_path='/tmp/chinese_font.ttf',
width=800,
height=400,
background_color='white',
max_words=100,
colormap='Blues'
).generate(all_words_str)
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Most Frequent Words in Weibo Posts', fontsize=14)
plt.tight_layout()
plt.show()
Keyword Frequency Statistics
Count and rank the most common words in the dataset to identify key themes. The code below converts Chinese labels to pinyin so the chart renders correctly even where no CJK font is available.
from collections import Counter
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style
def to_pinyin(word):
"""Convert Chinese to pinyin; keep English/numbers as-is."""
if any('一' <= c <= '鿿' for c in word):
return ' '.join(lazy_pinyin(word, style=Style.NORMAL))
return word
# ── Count and plot top keywords ──────────────────────────────────────────────────
all_tokens = [w for tokens in df['tokens'] for w in tokens]
word_freq = Counter(all_tokens)
# Get top 20 words, convert labels to pinyin for display
top20 = word_freq.most_common(20)
words_orig, counts = zip(*top20)
words_pinyin = [to_pinyin(w) for w in words_orig]
# Plot horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(list(reversed(words_pinyin)), list(reversed(counts)), color='steelblue')
ax.set_title('Top 20 Keywords in Black Myth Wukong Posts', fontsize=14, pad=12)
ax.set_xlabel('Frequency')
plt.tight_layout()
plt.show()
print("\nTop 10 keywords (original Chinese | pinyin):")
for (word, count), pinyin in zip(top20[:10], words_pinyin[:10]):
print(f" {word} ({pinyin}): {count}")Advanced Text Analysis
Apply NLP techniques to extract deeper insights from the text.
Sentiment Analysis
Automatically classify each post as positive, negative, or neutral using SnowNLP.
| Tool | Language | Method | Score Range |
|---|---|---|---|
| SnowNLP | Chinese | Naïve Bayes | 0 (neg) → 1 (pos) |
| VADER | English | Rule-based lexicon | -1 → +1 |
| TextBlob | English | Pattern-based | -1 → +1 |
from snownlp import SnowNLP
import matplotlib.pyplot as plt
def get_sentiment(text):
try:
s = SnowNLP(str(text))
score = s.sentiments # 0 (negative) → 1 (positive)
if score > 0.6:
return 'positive'
elif score < 0.4:
return 'negative'
else:
return 'neutral'
    except Exception:  # SnowNLP can choke on empty or emoji-only text
        return 'neutral'
# Apply to a sample of 500 rows (full dataset is slow)
sample_df = df.head(500).copy()
sample_df['sentiment'] = sample_df['text'].apply(get_sentiment)
# Print distribution
print(sample_df['sentiment'].value_counts())
# Visualise
colors = {'positive': '#27ae60', 'neutral': '#95a5a6', 'negative': '#e74c3c'}
counts = sample_df['sentiment'].value_counts()
plt.figure(figsize=(6, 4))
plt.bar(counts.index, counts.values,
color=[colors.get(s, 'gray') for s in counts.index])
plt.title('Sentiment Distribution (Sample of 500 Posts)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()
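For English datasets, the table above lists VADER. A minimal sketch using NLTK's bundled VADER implementation, shown on invented sample sentences since this workshop's dataset is Chinese:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

# compound runs from -1 (most negative) to +1 (most positive)
for sentence in ["This game is absolutely amazing!",
                 "Terrible launch, full of bugs."]:
    score = sia.polarity_scores(sentence)['compound']
    print(f"{score:+.3f}  {sentence}")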
POS Tagging & Syntactic Analysis
Label each word with its grammatical role (noun, verb, adjective) to analyse how people describe the game.
| POS Tag | Meaning | Example |
|---|---|---|
| n | Noun | 游戏, 悟空, 玩家 |
| v | Verb | 发布, 购买, 期待 |
| a | Adjective | 精彩, 失望, 好看 |
| d | Adverb | 非常, 真的, 已经 |
import jieba.posseg as pseg
from collections import Counter
import matplotlib.pyplot as plt
def extract_by_pos(text, target_flag='a'):
"""Extract words by POS tag: n=noun, v=verb, a=adjective, d=adverb"""
words = pseg.cut(str(text))
return [w.word for w in words if w.flag == target_flag]
# Extract adjectives (how people describe the game)
df['adjectives'] = df['text'].apply(lambda x: extract_by_pos(x, 'a'))
# Count most common adjectives
all_adj = [w for adj_list in df['adjectives'] for w in adj_list]
adj_freq = Counter(all_adj)
print("Top 10 adjectives used to describe the game:")
for word, count in adj_freq.most_common(10):
print(f" {word}: {count}")
# You can also extract nouns or verbs:
# df['nouns'] = df['text'].apply(lambda x: extract_by_pos(x, 'n'))Named Entity Recognition (NER)
Automatically identify and classify named entities — people, organisations, locations — in the text.
| Entity Type | Description | Example (from dataset) |
|---|---|---|
| PERSON | Person names | 悟空, 孙悟空 |
| ORG | Organisations | 游戏科学, WeGame |
| GPE | Geopolitical entities | 中国, 香港, 北京 |
| PRODUCT | Product names | 黑神话悟空, PS5 |
import spacy
from collections import Counter
# Load Chinese NLP model
nlp = spacy.load('zh_core_web_sm')
def extract_entities(text):
doc = nlp(str(text)[:500]) # Limit length for speed
return [(ent.text, ent.label_) for ent in doc.ents]
# Apply to a sample
sample_df = df.head(200).copy()
sample_df['entities'] = sample_df['text'].apply(extract_entities)
# Flatten and count by entity type
all_entities = [ent for ents in sample_df['entities'] for ent in ents]
# Top organisations
org_list = [e[0] for e in all_entities if e[1] == 'ORG']
print("Top organisations:", Counter(org_list).most_common(10))
# Top persons
person_list = [e[0] for e in all_entities if e[1] == 'PERSON']
print("Top persons:", Counter(person_list).most_common(10))
# Top locations
gpe_list = [e[0] for e in all_entities if e[1] == 'GPE']
print("Top locations:", Counter(gpe_list).most_common(10))Topic Modelling with LDA
Discover hidden themes across the dataset without any manual labelling using Latent Dirichlet Allocation.
from gensim import corpora, models
# Step 1: Build dictionary from all tokenised posts
dictionary = corpora.Dictionary(df['tokens'].tolist())
# Remove very rare and very common words
dictionary.filter_extremes(no_below=5, no_above=0.5)
# Step 2: Convert to bag-of-words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]
# Step 3: Train LDA model (5 topics, 5 passes for speed)
lda_model = models.LdaModel(
corpus,
num_topics=5,
id2word=dictionary,
passes=5,
random_state=42
)
# Step 4: Print discovered topics
print("=== Discovered Topics ===")
for idx, topic in lda_model.print_topics(num_words=8):
print(f"\nTopic {idx + 1}:")
print(f" {topic}")Semantic Network (Co-occurrence Analysis)
Semantic Network (Co-occurrence Analysis)
Visualise which words appear together to reveal conceptual associations in public discourse.
import networkx as nx
import itertools
from collections import Counter
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style
def to_pinyin(word):
"""Convert Chinese to pinyin; keep English/numbers as-is."""
if any('一' <= c <= '鿿' for c in word):
return ' '.join(lazy_pinyin(word, style=Style.NORMAL))
return word
# Step 1: Build co-occurrence pairs from tokenised text
co_occur = Counter()
for tokens in df['tokens']:
tokens = tokens[:20] # Limit per post to reduce noise
for pair in itertools.combinations(set(tokens), 2):
co_occur[tuple(sorted(pair))] += 1
# Step 2: Keep top 50 most frequent pairs
top_pairs = co_occur.most_common(50)
# Step 3: Build network graph with pinyin labels
G = nx.Graph()
for (w1, w2), weight in top_pairs:
# Convert node labels to pinyin so matplotlib renders them correctly
G.add_edge(to_pinyin(w1), to_pinyin(w2), weight=weight)
# Step 4: Visualise
fig, ax = plt.subplots(figsize=(16, 12))
pos = nx.spring_layout(G, k=1.2, seed=42)
# Node size: base size + degree scaling so all nodes are large enough for labels
degree = dict(G.degree())
min_size = 2500 # minimum node size to fit label text
node_sizes = [min_size + degree[n] * 400 for n in G.nodes()]
# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='#aaaaaa', width=0.8, ax=ax, alpha=0.6)
# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='steelblue', ax=ax)
# Draw labels (fixed font size; the minimum node size above leaves room for them)
nx.draw_networkx_labels(
G, pos, ax=ax,
font_color='white',
font_size=8,
font_weight='bold'
)
ax.set_title('Semantic Network: Black Myth Wukong Weibo Posts', fontsize=14, pad=15)
ax.axis('off')
plt.tight_layout()
plt.show()
Extended Text Analysis
Deeper linguistic and statistical techniques to uncover patterns in the corpus.
N-gram Analysis
Identify frequently co-occurring word sequences (bigrams, trigrams) to capture multi-word phrases that single-word frequency misses — e.g. 'shen hua + wu kong' is more informative than either word alone.
| N-gram | Example | Insight |
|---|---|---|
| Unigram | 神话, 悟空, 游戏 | Individual word frequency |
| Bigram | 神话 + 悟空, 国产 + 游戏 | Common 2-word phrases |
| Trigram | 黑神话 + 悟空 + 游戏 | 3-word topic phrases |
from collections import Counter
from pypinyin import lazy_pinyin, Style
import matplotlib.pyplot as plt
def to_pinyin(w):
if any('一' <= c <= '鿿' for c in w):
return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
return w
# ── Build bigrams (2-word sequences) ──────────────────────────────────────
bigrams = Counter()
for tokens in df['tokens']:
for a, b in zip(tokens, tokens[1:]):
bigrams[(a, b)] += 1
# ── Build trigrams (3-word sequences) ─────────────────────────────────────
trigrams = Counter()
for tokens in df['tokens']:
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
trigrams[(a, b, c)] += 1
# ── Print top results ─────────────────────────────────────────────────────
print("Top 10 Bigrams:")
for (w1, w2), cnt in bigrams.most_common(10):
print(f" {to_pinyin(w1)} + {to_pinyin(w2)}: {cnt}")
print("
Top 10 Trigrams:")
for (w1, w2, w3), cnt in trigrams.most_common(10):
print(f" {to_pinyin(w1)} + {to_pinyin(w2)} + {to_pinyin(w3)}: {cnt}")
# ── Visualise top 15 bigrams ───────────────────────────────────────────────
top15 = bigrams.most_common(15)
labels = [f"{to_pinyin(a)} + {to_pinyin(b)}" for (a, b), _ in top15]
counts = [c for _, c in top15]
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(labels[::-1], counts[::-1], color='steelblue')
ax.set_xlabel('Frequency')
ax.set_title('Top 15 Bigrams in Black Myth Wukong Posts')
plt.tight_layout()
plt.show()
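Counting adjacent pairs by hand works well; as an alternative, gensim's Phrases model learns statistically significant collocations and merges them into single tokens. A sketch, assuming df['tokens'] from the preprocessing step:
from gensim.models.phrases import Phrases

# Learn bigram collocations that co-occur more often than chance
phrase_model = Phrases(df['tokens'].tolist(), min_count=5, threshold=10)

# Frequent pairs are merged with an underscore (e.g. 黑神话_悟空)
merged = [phrase_model[tokens] for tokens in df['tokens']]
print(merged[0][:15])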
POS Tag Distribution Chart
Visualise the grammatical composition of the corpus. The ratio of nouns to adjectives to verbs reveals how audiences frame their discussion — factual reporting vs. emotional evaluation.
| Tag | Label | Communication Significance |
|---|---|---|
| n | Noun | Topics and entities being discussed |
| v | Verb | Actions and behaviours described |
| a | Adjective | Evaluative language and sentiment |
| d | Adverb | Intensity of expression |
import jieba.posseg as pseg
from collections import Counter
import matplotlib.pyplot as plt
# POS tag full names for readability
POS_LABELS = {
'n': 'Noun', 'v': 'Verb', 'a': 'Adjective', 'd': 'Adverb',
'r': 'Pronoun', 'p': 'Preposition', 'c': 'Conjunction',
'm': 'Numeral', 'q': 'Classifier', 'x': 'Other/Symbol',
}
# Count POS tags across all posts
pos_counts = Counter()
for text in df['text']:
for word, flag in pseg.cut(str(text)):
tag = flag[:1] # Use first character of tag
if tag in POS_LABELS:
pos_counts[tag] += 1
# Sort by frequency
tags = sorted(pos_counts, key=pos_counts.get, reverse=True)
counts = [pos_counts[t] for t in tags]
labels = [f"{POS_LABELS.get(t, t)} ({t})" for t in tags]
# Plot
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#2196F3','#4CAF50','#FF9800','#E91E63','#9C27B0','#00BCD4','#FF5722','#607D8B','#795548','#FFC107']
ax.bar(labels, counts, color=colors[:len(labels)])
ax.set_ylabel('Count')
ax.set_title('POS Tag Distribution in Black Myth Wukong Posts')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()
print("
POS Distribution:")
for tag, cnt in zip(tags, counts):
print(f" {POS_LABELS.get(tag, tag):15s}: {cnt:,}")TF-IDF Keyword Extraction
TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are important to individual posts but rare across the whole corpus — each word's score is its frequency within a post multiplied by log(N ÷ number of posts containing it), revealing distinctive, document-specific keywords beyond simple frequency counts.
| Method | What it measures | Best for |
|---|---|---|
| Word Frequency | How often a word appears overall | Corpus-level themes |
| TF-IDF | How distinctive a word is to a specific post | Document-level keywords, content differentiation |
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style
def to_pinyin(w):
if any('一' <= c <= '鿿' for c in w):
return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
return w
# Convert token lists to space-joined strings for TF-IDF
corpus = [' '.join(tokens) for tokens in df['tokens']]
# Fit TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=500, min_df=5)
X = tfidf.fit_transform(corpus)
# Compute mean TF-IDF score per term across all documents
mean_scores = np.asarray(X.mean(axis=0)).flatten()
terms = tfidf.get_feature_names_out()
# Top 20 terms by mean TF-IDF
top_idx = mean_scores.argsort()[::-1][:20]
top_terms = [(terms[i], mean_scores[i]) for i in top_idx]
print("Top 20 TF-IDF Keywords:")
for term, score in top_terms:
print(f" {to_pinyin(term):25s} score={score:.4f}")
# Visualise
labels = [to_pinyin(t) for t, _ in top_terms]
scores = [s for _, s in top_terms]
fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(labels[::-1], scores[::-1], color='teal')
ax.set_xlabel('Mean TF-IDF Score')
ax.set_title('Top 20 Keywords by TF-IDF Score')
plt.tight_layout()
plt.show()
TF-IDF Document Clustering
Use TF-IDF vectors with K-Means to automatically group posts into thematic clusters, then visualise the clusters in 2D using PCA. Each cluster represents a distinct discussion thread.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style
def to_pinyin(w):
if any('一' <= c <= '鿿' for c in w):
return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
return w
# Build TF-IDF matrix
corpus = [' '.join(tokens) for tokens in df['tokens']]
tfidf = TfidfVectorizer(max_features=300, min_df=5)
X = tfidf.fit_transform(corpus)
terms = tfidf.get_feature_names_out()
# K-Means clustering (5 clusters)
N_CLUSTERS = 5
km = KMeans(n_clusters=N_CLUSTERS, random_state=42, n_init=10)
km.fit(X)
df['cluster'] = km.labels_
# Print top keywords per cluster
print("Cluster Keywords (TF-IDF centroids):")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(N_CLUSTERS):
top_words = [to_pinyin(terms[j]) for j in order_centroids[i, :8]]
size = (df['cluster'] == i).sum()
print(f" Cluster {i+1} ({size} posts): {', '.join(top_words)}")
# Visualise clusters with PCA (2D)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X.toarray())
fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#E91E63','#2196F3','#4CAF50','#FF9800','#9C27B0']
for i in range(N_CLUSTERS):
mask = km.labels_ == i
ax.scatter(coords[mask, 0], coords[mask, 1],
c=colors[i], label=f'Cluster {i+1}', alpha=0.5, s=15)
ax.set_title('TF-IDF Document Clusters (PCA 2D Projection)')
ax.legend()
plt.tight_layout()
plt.show()
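The choice of 5 clusters is arbitrary. One quick sanity check is the silhouette score (higher means better-separated clusters). A sketch, assuming the TF-IDF matrix X from the cell above:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compare candidate cluster counts; higher silhouette = better separation
for k in range(2, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")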
Sentiment × Time Cross-Analysis
Track how the ratio of positive, neutral, and negative posts shifts month by month. Sudden changes often correspond to real-world events — a game announcement, a controversy, or a release date.
from snownlp import SnowNLP
import matplotlib.pyplot as plt
import pandas as pd
# Compute sentiment score for each post
print("Computing sentiment scores (may take ~1 min for full dataset)...")
df['sentiment'] = df['text'].apply(lambda t: SnowNLP(str(t)).sentiments)
df['sent_label'] = df['sentiment'].apply(
lambda s: 'Positive' if s > 0.6 else ('Negative' if s < 0.4 else 'Neutral')
)
# Group by month and sentiment label
df['month'] = df['created_at'].dt.to_period('M').dt.to_timestamp()
monthly = df.groupby(['month', 'sent_label']).size().unstack(fill_value=0)
# Normalise to proportions (%)
monthly_pct = monthly.div(monthly.sum(axis=1), axis=0) * 100
# Plot stacked area chart
fig, ax = plt.subplots(figsize=(12, 6))
colors = {'Positive': '#4CAF50', 'Neutral': '#FFC107', 'Negative': '#F44336'}
for label in ['Positive', 'Neutral', 'Negative']:
if label in monthly_pct.columns:
ax.plot(monthly_pct.index, monthly_pct[label],
marker='o', label=label, color=colors[label], linewidth=2)
ax.set_xlabel('Month')
ax.set_ylabel('Proportion (%)')
ax.set_title('Sentiment Trend Over Time — Black Myth Wukong Weibo Posts')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()
print("
Monthly sentiment breakdown (% of posts):")
print(monthly_pct.round(1).to_string())
Engagement Correlation Analysis
Quantify the relationship between sentiment score and engagement metrics (likes, reposts, comments). Does more emotional content drive more shares? The data may surprise you.
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
# Ensure sentiment scores are computed (run the Sentiment × Time cell first)
# If not yet computed, uncomment the next two lines:
# from snownlp import SnowNLP
# df['sentiment'] = df['text'].apply(lambda t: SnowNLP(str(t)).sentiments)
# Select engagement columns
eng_cols = ['sentiment', 'reposts_count', 'comments_count', 'attitudes_count']
eng_df = df[eng_cols].dropna()
# Correlation matrix
corr = eng_df.corr()
print("Correlation matrix:")
print(corr.round(3))
# Heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: correlation heatmap
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
ax=axes[0], square=True, linewidths=0.5)
axes[0].set_title('Correlation: Sentiment vs Engagement')
# Right: scatter — sentiment vs likes
axes[1].scatter(eng_df['sentiment'], eng_df['attitudes_count'],
alpha=0.3, s=10, color='steelblue')
axes[1].set_xlabel('Sentiment Score (0=neg, 1=pos)')
axes[1].set_ylabel('Likes (attitudes_count)')
axes[1].set_title('Sentiment vs Likes')
axes[1].set_yscale('log') # Log scale for skewed engagement data
plt.tight_layout()
plt.show()
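Pearson correlation is sensitive to the heavy skew in engagement counts (hence the log scale above). Spearman rank correlation is a more robust companion check; a short sketch, assuming eng_df from the cell above:
# Rank-based correlation is robust to the long-tailed engagement distribution
print("Spearman correlation matrix:")
print(eng_df.corr(method='spearman').round(3))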
Keyword Co-occurrence Heatmap
A matrix heatmap showing how often the top N keywords appear together in the same post. Darker cells indicate stronger co-occurrence — an alternative view of the semantic network.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from pypinyin import lazy_pinyin, Style
def to_pinyin(w):
if any('一' <= c <= '鿿' for c in w):
return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
return w
# Select top N keywords
N = 15
all_tokens = [w for tokens in df['tokens'] for w in tokens]
top_words = [w for w, _ in Counter(all_tokens).most_common(N)]
# Build co-occurrence matrix
mat = np.zeros((N, N), dtype=int)
for tokens in df['tokens']:
token_set = set(tokens)
for i, a in enumerate(top_words):
for j, b in enumerate(top_words):
if i != j and a in token_set and b in token_set:
mat[i][j] += 1
# Convert labels to pinyin
labels = [to_pinyin(w) for w in top_words]
# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
mat, xticklabels=labels, yticklabels=labels,
cmap='Blues', annot=True, fmt='d', linewidths=0.3,
ax=ax
)
ax.set_title(f'Top {N} Keyword Co-occurrence Heatmap')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
User Behaviour Analysis
Understand when users post and who drives the most engagement in the community.
Posting Time Distribution
Analyse when users are most active — by hour of day and day of week. This reveals audience habits and optimal publishing windows, a key insight for media strategy.
import matplotlib.pyplot as plt
import numpy as np
# Extract hour and weekday
df['hour'] = df['created_at'].dt.hour
df['weekday'] = df['created_at'].dt.day_name()
WEEKDAY_ORDER = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# ── Left: Hourly distribution ─────────────────────────────────────────────
hour_counts = df['hour'].value_counts().sort_index()
axes[0].bar(hour_counts.index, hour_counts.values, color='steelblue', width=0.8)
axes[0].set_xlabel('Hour of Day (0–23)')
axes[0].set_ylabel('Number of Posts')
axes[0].set_title('Posting Activity by Hour')
axes[0].set_xticks(range(0, 24, 2))
axes[0].grid(axis='y', alpha=0.3)
# Annotate peak hour
peak_hour = hour_counts.idxmax()
axes[0].axvline(peak_hour, color='red', linestyle='--', alpha=0.7,
label=f'Peak: {peak_hour}:00')
axes[0].legend()
# ── Right: Weekday distribution ───────────────────────────────────────────
weekday_counts = df['weekday'].value_counts().reindex(WEEKDAY_ORDER, fill_value=0)
colors = ['#4CAF50' if d in ['Saturday','Sunday'] else '#2196F3' for d in WEEKDAY_ORDER]
axes[1].bar(WEEKDAY_ORDER, weekday_counts.values, color=colors)
axes[1].set_xlabel('Day of Week')
axes[1].set_ylabel('Number of Posts')
axes[1].set_title('Posting Activity by Weekday')
plt.setp(axes[1].get_xticklabels(), rotation=30, ha='right')
axes[1].grid(axis='y', alpha=0.3)
plt.suptitle('User Posting Behaviour — Black Myth Wukong Weibo Posts', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()
print(f"Peak posting hour: {peak_hour}:00")
print(f"Most active day: {weekday_counts.idxmax()}")Influencer Identification
Influencer Identification
Rank users by total engagement (likes + reposts + comments) to identify key opinion leaders (KOLs). Also compute average engagement per post to distinguish prolific posters from genuinely influential ones.
| Metric | Formula | What it reveals |
|---|---|---|
| Total Engagement | Likes + Reposts + Comments | Overall reach and impact |
| Avg Engagement / Post | Total ÷ Post Count | Content quality vs. volume |
| Repost Rate | Reposts ÷ Total Engagement | Content shareability |
import matplotlib.pyplot as plt
import pandas as pd
# Compute total engagement per user
df['engagement'] = df['reposts_count'] + df['comments_count'] + df['attitudes_count']
# Aggregate by user
user_stats = df.groupby('screen_name').agg(
total_posts=('text', 'count'),
total_engagement=('engagement', 'sum'),
avg_engagement=('engagement', 'mean'),
total_likes=('attitudes_count', 'sum'),
total_reposts=('reposts_count', 'sum'),
total_comments=('comments_count', 'sum'),
).sort_values('total_engagement', ascending=False)
print("Top 10 Most Influential Users:")
print(user_stats.head(10).to_string())
# ── Visualise top 15 users by total engagement ────────────────────────────
top15 = user_stats.head(15)
fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(top15.index[::-1], top15['total_engagement'][::-1], color='steelblue')
# Colour-code by post count
max_posts = top15['total_posts'].max()
for bar, (_, row) in zip(bars, top15[::-1].iterrows()):
intensity = row['total_posts'] / max_posts
bar.set_alpha(0.4 + 0.6 * intensity)
ax.set_xlabel('Total Engagement (Likes + Reposts + Comments)')
ax.set_title('Top 15 Users by Total Engagement')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
# ── Engagement rate (avg per post) ────────────────────────────────────────
print("
Top 10 by Average Engagement per Post (min 3 posts):")
high_avg = user_stats[user_stats['total_posts'] >= 3].sort_values('avg_engagement', ascending=False)
print(high_avg[['total_posts','avg_engagement','total_likes','total_reposts']].head(10).to_string())
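The metric table above defines a repost rate that this cell never computes. A short sketch, assuming user_stats from the cell above:
# Repost rate = reposts ÷ total engagement (per the metric table above)
denom = user_stats['total_engagement'].where(user_stats['total_engagement'] > 0)
user_stats['repost_rate'] = user_stats['total_reposts'] / denom

print("\nTop 10 by Repost Rate (min 3 posts, most shareable content):")
shareable = user_stats[user_stats['total_posts'] >= 3].sort_values('repost_rate', ascending=False)
print(shareable[['total_posts', 'total_reposts', 'repost_rate']].head(10).round(3).to_string())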