Hong Kong Baptist University · School of Communication

Workshop 4
Online Data Processing
& Text Analysis

Learn how to transform raw scraped data into actionable insights using Python — from data cleaning and NLP preprocessing to advanced text analysis techniques.

Dataset
Weibo Posts
黑神话悟空 (Black Myth Wukong)
Records
9,414
Jan – Jun 2024
Language
Python 3
Google Colab compatible
PART 01

Setup & Data Cleaning

Install dependencies, load the dataset, and prepare it for analysis.

Install Required Libraries & Font Setup

Run this cell first in Google Colab. It installs all packages AND sets up a global Chinese font (Noto Sans CJK) so every chart in this workshop renders Chinese characters correctly — no repeated setup needed.

install.sh
!pip install pandas matplotlib wordcloud jieba snownlp spacy networkx gensim nltk pypinyin
!python -m spacy download zh_core_web_sm

# ── Global Chinese Font Setup (run once, applies to ALL charts) ─────────────────
# Install Noto CJK fonts package (provides both .ttc and .otf files)
!apt-get install -y fonts-noto-cjk > /dev/null 2>&1

import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import os

# Use the .otf file directly — more reliable than .ttc in matplotlib
# This path is stable across Google Colab environments
_OTF_PATH = '/usr/share/fonts/opentype/noto/NotoSansCJKsc-Regular.otf'

if os.path.exists(_OTF_PATH):
    fm.fontManager.addfont(_OTF_PATH)
    CJK_PROP = fm.FontProperties(fname=_OTF_PATH)
    plt.rcParams['font.family'] = CJK_PROP.get_name()
    print(f"✓ Chinese font loaded: Noto Sans CJK SC (Regular)")
else:
    # Fallback: search for any available CJK otf/ttf
    fm._load_fontmanager(try_read_cache=False)
    _candidates = fm.findSystemFonts(fontpaths=['/usr/share/fonts'], fontext='otf')
    _path = next((f for f in _candidates if 'CJKsc' in f and 'Regular' in f), None)
    if _path:
        fm.fontManager.addfont(_path)
        CJK_PROP = fm.FontProperties(fname=_path)
        plt.rcParams['font.family'] = CJK_PROP.get_name()
        print(f"✓ Chinese font loaded: {_path.split('/')[-1]}")
    else:
        plt.rcParams['font.sans-serif'] = ['Noto Sans CJK SC', 'WenQuanYi Zen Hei']
        plt.rcParams['axes.unicode_minus'] = False
        CJK_PROP = None
        print("⚠ Using fallback font settings")

print("Setup complete. CJK_PROP is available for all charts.")
Packages
pandas — data manipulation · matplotlib — visualisation · wordcloud — word cloud · jieba — Chinese tokenisation · snownlp — Chinese sentiment · spacy — NER · gensim — topic modelling · networkx — semantic network

Load & Clean Data

Upload dataset_1.csv to your Colab session, then run this code to clean it.

Column           Description              Example
text             Post content (Chinese)   黑神话悟空是件美好事物...
created_at       Post timestamp           2024-06-01 22:37
screen_name      Author username          不学了当主唱
reposts_count    Number of reposts        12
comments_count   Number of comments       5
attitudes_count  Number of likes          25
load_clean.py
import pandas as pd

# Load data
df = pd.read_csv('dataset_1.csv')
print(f"Original shape: {df.shape}")

# Drop rows with no text content
df = df.dropna(subset=['text'])

# Remove duplicate posts
df = df.drop_duplicates(subset=['text'])

# Convert timestamp to datetime
df['created_at'] = pd.to_datetime(df['created_at'])

# Strip hidden whitespace from keyword column
df['keyword'] = df['keyword'].str.strip()

# Reset index after filtering
df = df.reset_index(drop=True)

print(f"Cleaned shape: {df.shape}")
df.head(3)
Expected result
After cleaning, 5,627 of the original 9,414 rows remain. The dropped rows are mostly reposts with no original text content.
PART 02

NLP Preprocessing

Tokenise text, remove stopwords, and normalise word forms.

Tokenisation & Stopword Removal

Split Chinese text into meaningful words using jieba, then filter out common words with no analytical value. Stopwords are loaded from all .txt files inside a folder — making it easy to manage and extend your stopword lists.

Concept        Description                                          Tool
Tokenisation   Split text into individual words/tokens              jieba (Chinese) / nltk (English)
Stopwords      Common words with no analytical value (的, 了, 是)   All .txt files in a folder, one word per line
Result         A list of meaningful words per post                  Used for word cloud, keyword stats, topic model
Stopwords Folder Structure (Chinese + English)
Prepare a folder named stopwords/ containing one or more .txt files — Chinese and English stopwords can coexist in the same folder or even the same file, one word per line. Example structure:

stopwords/chinese.txt → 的 了 是 在 …
stopwords/english.txt → the a an is are …
stopwords/social_media.txt → rt via lol 转发 …

All files are merged into one unified set. English words are matched case-insensitively. Upload the folder to Google Colab, then set STOPWORDS_FOLDER to its path.
tokenize.py
import jieba
import re
import os

# ── Step 1: Load stopwords from a folder of .txt files ──────────────────
# The folder may contain BOTH Chinese and English stopword files.
# Each .txt file should have one stopword per line (any language).
# All files are merged into a single unified stopword set.

def load_stopwords_from_folder(folder_path):
    """Read all .txt files in a folder and merge into one stopword set.
    Supports Chinese, English, and mixed-language stopword files.
    """
    stopwords = set()
    txt_files = [f for f in os.listdir(folder_path) if f.endswith('.txt')]
    print(f"Found {len(txt_files)} stopword file(s): {txt_files}")
    for filename in txt_files:
        filepath = os.path.join(folder_path, filename)
        with open(filepath, 'r', encoding='utf-8') as f:
            words = {line.strip().lower() for line in f if line.strip()}
            stopwords.update(words)
    print(f"Total stopwords loaded: {len(stopwords)}")
    return stopwords

# Set your stopwords folder path here
STOPWORDS_FOLDER = './stopwords'  # upload a 'stopwords/' folder to Colab
stopwords = load_stopwords_from_folder(STOPWORDS_FOLDER)

# ── Step 2: Tokenise and filter (Chinese + English) ──────────────────────

def clean_and_tokenize(text):
    # Remove URLs and @mentions
    text = re.sub(r'http\S+', '', str(text))
    text = re.sub(r'@\S+', '', text)
    text = re.sub(r'[\s]+', ' ', text).strip()

    # Tokenize with jieba (handles Chinese; also splits English words)
    words = jieba.lcut(text)

    filtered = []
    for w in words:
        w_clean = w.strip()
        if len(w_clean) <= 1:
            continue
        # Normalise to lowercase for English stopword matching
        if w_clean.lower() in stopwords:
            continue
        filtered.append(w_clean)
    return filtered

# Apply to dataset
df['tokens'] = df['text'].apply(clean_and_tokenize)
print(df[['text', 'tokens']].head())

Stemming & Lemmatisation (English)

For English text, reduce words to their base form to unify variants before analysis.

Technique       Input                   Output                    Tool
Stemming        running, runs           run (crude suffix cut)    PorterStemmer
Lemmatisation   better (as adjective)   good (dictionary-aware)   WordNetLemmatizer

Note that stemming cannot handle irregular forms ("ran" stays "ran"), and WordNet lemmatisation only maps "better" to "good" when told the word is an adjective.
lemmatize.py
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# --- Lemmatisation (dictionary-aware, recommended) ---
# Note: lemmatize() assumes nouns by default; pass pos='v' or pos='a'
# to reduce verbs ("running" -> "run") or adjectives ("better" -> "good").
def lemmatize_english(text):
    tokens = text.lower().split()
    return [lemmatizer.lemmatize(w) for w in tokens]

# --- Stemming (crude but fast) ---
def stem_english(text):
    tokens = text.lower().split()
    return [stemmer.stem(w) for w in tokens]

# Example
sample = "The games were being played by running players"
print("Lemmatised:", lemmatize_english(sample))
print("Stemmed:   ", stem_english(sample))
PART 03

Text Visualisation

Visualise patterns in the data: time trends, word frequency, and word clouds.

Time Trend Analysis

Plot how post volume changes over time to identify key events or discussion peaks.

time_trend.py
import matplotlib.pyplot as plt

# The global CJK font configured in the Part 01 setup cell applies here
# automatically, so any Chinese axis labels render correctly.

# Extract month from timestamp
df['month'] = df['created_at'].dt.to_period('M').dt.to_timestamp()
monthly_counts = df.groupby('month').size().reset_index(name='count')

# Plot line chart
plt.figure(figsize=(10, 4))
plt.plot(
    monthly_counts['month'],
    monthly_counts['count'],
    marker='o',
    color='steelblue',
    linewidth=2,
    markersize=6
)
plt.title('Monthly Post Volume: Black Myth Wukong on Weibo', fontsize=14)
plt.xlabel('Month')
plt.ylabel('Number of Posts')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
Expected insight
Post volume peaks in May–June 2024, coinciding with the game's official release announcement.
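Monthly bins can hide short spikes. As a refinement, you can resample by day to pinpoint the exact date of a peak. A minimal sketch with toy timestamps (with the real data, reuse the cleaned `df` with its datetime `created_at` column):

```python
import pandas as pd

# Toy stand-in for the cleaned dataset
df = pd.DataFrame({'created_at': pd.to_datetime([
    '2024-05-30 21:10', '2024-06-01 22:37',
    '2024-06-01 23:05', '2024-06-02 09:41'])})

# Daily post counts: index by timestamp, then resample by calendar day
daily = df.set_index('created_at').resample('D').size()

peak_day = daily.idxmax()
print(f"Peak day: {peak_day.date()} with {daily.max()} posts")
```

Plotting `daily` with the same matplotlib code as the monthly chart gives a finer-grained trend line for linking spikes to specific news events.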

Word Cloud

Visualise the most frequent words — the larger the word, the more frequently it appears.

wordcloud_gen.py
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import os

# Reuse the Noto CJK font installed by the Part 01 setup cell;
# download Source Han Sans as a fallback if it is missing.
FONT_PATH = '/usr/share/fonts/opentype/noto/NotoSansCJKsc-Regular.otf'
if not os.path.exists(FONT_PATH):
    FONT_PATH = '/tmp/chinese_font.ttf'
    !wget -q -O /tmp/chinese_font.ttf \
      "https://github.com/adobe-fonts/source-han-sans/raw/release/OTF/SimplifiedChinese/SourceHanSansSC-Regular.otf"

# Combine all tokens into a single string
all_tokens = [w for tokens in df['tokens'] for w in tokens]
all_words_str = ' '.join(all_tokens)

# Generate word cloud
wc = WordCloud(
    font_path=FONT_PATH,
    width=800,
    height=400,
    background_color='white',
    max_words=100,
    colormap='Blues'
).generate(all_words_str)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.title('Most Frequent Words in Weibo Posts', fontsize=14)
plt.tight_layout()
plt.show()

Keyword Frequency Statistics

Count and rank the most common words in the dataset to identify key themes. The code below converts Chinese labels to pinyin with pypinyin, so the bar chart renders reliably regardless of font availability.

keyword_freq.py
from collections import Counter
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style

def to_pinyin(word):
    """Convert Chinese to pinyin; keep English/numbers as-is."""
    if any('一' <= c <= '鿿' for c in word):
        return ' '.join(lazy_pinyin(word, style=Style.NORMAL))
    return word

# ── Count and plot top keywords ──────────────────────────────────────────────────
all_tokens = [w for tokens in df['tokens'] for w in tokens]
word_freq = Counter(all_tokens)

# Get top 20 words, convert labels to pinyin for display
top20 = word_freq.most_common(20)
words_orig, counts = zip(*top20)
words_pinyin = [to_pinyin(w) for w in words_orig]

# Plot horizontal bar chart
fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(list(reversed(words_pinyin)), list(reversed(counts)), color='steelblue')
ax.set_title('Top 20 Keywords in Black Myth Wukong Posts', fontsize=14, pad=12)
ax.set_xlabel('Frequency')
plt.tight_layout()
plt.show()

print("\nTop 10 keywords (original Chinese | pinyin):")
for (word, count), pinyin in zip(top20[:10], words_pinyin[:10]):
    print(f"  {word} ({pinyin}): {count}")
PART 04

Advanced Text Analysis

Apply NLP techniques to extract deeper insights from the text.

Sentiment Analysis

Automatically classify each post as positive, negative, or neutral using SnowNLP.

Tool       Language   Method               Score Range
SnowNLP    Chinese    Naïve Bayes          0 (neg) → 1 (pos)
VADER      English    Rule-based lexicon   -1 → +1
TextBlob   English    Pattern-based        -1 → +1
sentiment.py
from snownlp import SnowNLP
import matplotlib.pyplot as plt

def get_sentiment(text):
    try:
        s = SnowNLP(str(text))
        score = s.sentiments   # 0 (negative) → 1 (positive)
        if score > 0.6:
            return 'positive'
        elif score < 0.4:
            return 'negative'
        else:
            return 'neutral'
    except Exception:
        return 'neutral'

# Apply to a sample of 500 rows (full dataset is slow)
sample_df = df.head(500).copy()
sample_df['sentiment'] = sample_df['text'].apply(get_sentiment)

# Print distribution
print(sample_df['sentiment'].value_counts())

# Visualise
colors = {'positive': '#27ae60', 'neutral': '#95a5a6', 'negative': '#e74c3c'}
counts = sample_df['sentiment'].value_counts()
plt.figure(figsize=(6, 4))
plt.bar(counts.index, counts.values,
        color=[colors.get(s, 'gray') for s in counts.index])
plt.title('Sentiment Distribution (Sample of 500 Posts)')
plt.ylabel('Count')
plt.tight_layout()
plt.show()

POS Tagging & Syntactic Analysis

Label each word with its grammatical role (noun, verb, adjective) to analyse how people describe the game.

POS Tag   Meaning     Example
n         Noun        游戏, 悟空, 玩家
v         Verb        发布, 购买, 期待
a         Adjective   精彩, 失望, 好看
d         Adverb      非常, 真的, 已经
pos_tagging.py
import jieba.posseg as pseg
from collections import Counter
import matplotlib.pyplot as plt

def extract_by_pos(text, target_flag='a'):
    """Extract words by POS tag: n=noun, v=verb, a=adjective, d=adverb"""
    words = pseg.cut(str(text))
    return [w.word for w in words if w.flag == target_flag]

# Extract adjectives (how people describe the game)
df['adjectives'] = df['text'].apply(lambda x: extract_by_pos(x, 'a'))

# Count most common adjectives
all_adj = [w for adj_list in df['adjectives'] for w in adj_list]
adj_freq = Counter(all_adj)

print("Top 10 adjectives used to describe the game:")
for word, count in adj_freq.most_common(10):
    print(f"  {word}: {count}")

# You can also extract nouns or verbs:
# df['nouns'] = df['text'].apply(lambda x: extract_by_pos(x, 'n'))
Communication insight
Adjectives reveal public evaluation and emotional tone — a key metric for media and PR research.

Named Entity Recognition (NER)

Automatically identify and classify named entities — people, organisations, locations — in the text.

Entity Type   Description             Example (from dataset)
PERSON        Person names            悟空, 孙悟空
ORG           Organisations           游戏科学, WeGame
GPE           Geopolitical entities   中国, 香港, 北京
PRODUCT       Product names           黑神话悟空, PS5
ner_spacy.py
import spacy
from collections import Counter

# Load Chinese NLP model
nlp = spacy.load('zh_core_web_sm')

def extract_entities(text):
    doc = nlp(str(text)[:500])  # Limit length for speed
    return [(ent.text, ent.label_) for ent in doc.ents]

# Apply to a sample
sample_df = df.head(200).copy()
sample_df['entities'] = sample_df['text'].apply(extract_entities)

# Flatten and count by entity type
all_entities = [ent for ents in sample_df['entities'] for ent in ents]

# Top organisations
org_list = [e[0] for e in all_entities if e[1] == 'ORG']
print("Top organisations:", Counter(org_list).most_common(10))

# Top persons
person_list = [e[0] for e in all_entities if e[1] == 'PERSON']
print("Top persons:", Counter(person_list).most_common(10))

# Top locations
gpe_list = [e[0] for e in all_entities if e[1] == 'GPE']
print("Top locations:", Counter(gpe_list).most_common(10))

Topic Modelling with LDA

Discover hidden themes across the dataset without any manual labelling using Latent Dirichlet Allocation.

lda_model.py
from gensim import corpora, models

# Step 1: Build dictionary from all tokenised posts
dictionary = corpora.Dictionary(df['tokens'].tolist())

# Remove very rare and very common words
dictionary.filter_extremes(no_below=5, no_above=0.5)

# Step 2: Convert to bag-of-words corpus
corpus = [dictionary.doc2bow(tokens) for tokens in df['tokens']]

# Step 3: Train LDA model (5 topics, 5 passes for speed)
lda_model = models.LdaModel(
    corpus,
    num_topics=5,
    id2word=dictionary,
    passes=5,
    random_state=42
)

# Step 4: Print discovered topics
print("=== Discovered Topics ===")
for idx, topic in lda_model.print_topics(num_words=8):
    print(f"\nTopic {idx + 1}:")
    print(f"  {topic}")
Expected topics might include
Game mechanics · Price discussion · Visual quality · Cultural references · Community reactions

Semantic Network (Co-occurrence Analysis)

Visualise which words appear together to reveal conceptual associations in public discourse.

semantic_network.py
import networkx as nx
import itertools
from collections import Counter
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style

def to_pinyin(word):
    """Convert Chinese to pinyin; keep English/numbers as-is."""
    if any('一' <= c <= '鿿' for c in word):
        return ' '.join(lazy_pinyin(word, style=Style.NORMAL))
    return word

# Step 1: Build co-occurrence pairs from tokenised text
co_occur = Counter()
for tokens in df['tokens']:
    tokens = tokens[:20]  # Limit per post to reduce noise
    for pair in itertools.combinations(set(tokens), 2):
        co_occur[tuple(sorted(pair))] += 1

# Step 2: Keep top 50 most frequent pairs
top_pairs = co_occur.most_common(50)

# Step 3: Build network graph with pinyin labels
G = nx.Graph()
for (w1, w2), weight in top_pairs:
    # Convert node labels to pinyin so matplotlib renders them correctly
    G.add_edge(to_pinyin(w1), to_pinyin(w2), weight=weight)

# Step 4: Visualise
fig, ax = plt.subplots(figsize=(16, 12))
pos = nx.spring_layout(G, k=1.2, seed=42)

# Node size: base size + degree scaling so all nodes are large enough for labels
degree = dict(G.degree())
min_size = 2500  # minimum node size to fit label text
node_sizes = [min_size + degree[n] * 400 for n in G.nodes()]

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='#aaaaaa', width=0.8, ax=ax, alpha=0.6)

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='steelblue', ax=ax)

# Draw labels with font size scaled to fit inside nodes
nx.draw_networkx_labels(
    G, pos, ax=ax,
    font_color='white',
    font_size=8,
    font_weight='bold'
)

ax.set_title('Semantic Network: Black Myth Wukong Weibo Posts', fontsize=14, pad=15)
ax.axis('off')
plt.tight_layout()
plt.show()
Interpretation
Nodes with more connections (higher degree centrality) are conceptually central to the discussion. Clusters of connected words represent coherent themes.
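The degree centrality mentioned above can be computed directly with networkx. A minimal sketch on a toy graph (node names are illustrative; with the real data you would pass the co-occurrence graph `G` built in the cell above):

```python
import networkx as nx

# Toy co-occurrence graph (edge weight = co-occurrence count)
G = nx.Graph()
G.add_edge('wukong', 'game', weight=10)
G.add_edge('wukong', 'china', weight=6)
G.add_edge('game', 'price', weight=4)

# Degree centrality: fraction of other nodes each node connects to
centrality = nx.degree_centrality(G)
for node, score in sorted(centrality.items(), key=lambda x: -x[1]):
    print(f"{node}: {score:.2f}")
```

Nodes with the highest scores are the conceptual anchors of the discussion; sorting them gives a quick ranking to report alongside the network plot.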
PART 05

Extended Text Analysis

Deeper linguistic and statistical techniques to uncover patterns in the corpus.

N-gram Analysis

Identify frequently co-occurring word sequences (bigrams, trigrams) to capture multi-word phrases that single-word frequency misses — e.g. 'shen hua + wu kong' is more informative than either word alone.

N-gram    Example                    Insight
Unigram   神话, 悟空, 游戏           Individual word frequency
Bigram    神话 + 悟空, 国产 + 游戏   Common 2-word phrases
Trigram   黑神话 + 悟空 + 游戏       3-word topic phrases
ngram_analysis.py
from collections import Counter
from pypinyin import lazy_pinyin, Style
import matplotlib.pyplot as plt

def to_pinyin(w):
    if any('一' <= c <= '鿿' for c in w):
        return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
    return w

# ── Build bigrams (2-word sequences) ──────────────────────────────────────
bigrams = Counter()
for tokens in df['tokens']:
    for a, b in zip(tokens, tokens[1:]):
        bigrams[(a, b)] += 1

# ── Build trigrams (3-word sequences) ─────────────────────────────────────
trigrams = Counter()
for tokens in df['tokens']:
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[(a, b, c)] += 1

# ── Print top results ─────────────────────────────────────────────────────
print("Top 10 Bigrams:")
for (w1, w2), cnt in bigrams.most_common(10):
    print(f"  {to_pinyin(w1)} + {to_pinyin(w2)}: {cnt}")

print("\nTop 10 Trigrams:")
for (w1, w2, w3), cnt in trigrams.most_common(10):
    print(f"  {to_pinyin(w1)} + {to_pinyin(w2)} + {to_pinyin(w3)}: {cnt}")

# ── Visualise top 15 bigrams ───────────────────────────────────────────────
top15 = bigrams.most_common(15)
labels = [f"{to_pinyin(a)} + {to_pinyin(b)}" for (a, b), _ in top15]
counts = [c for _, c in top15]

fig, ax = plt.subplots(figsize=(10, 7))
ax.barh(labels[::-1], counts[::-1], color='steelblue')
ax.set_xlabel('Frequency')
ax.set_title('Top 15 Bigrams in Black Myth Wukong Posts')
plt.tight_layout()
plt.show()
Expected insight
Top bigram is shen hua + wu kong (5,910 occurrences), confirming the game title is the dominant phrase. guo chan + you xi (domestic game) reveals strong national identity framing.

POS Tag Distribution Chart

Visualise the grammatical composition of the corpus. The ratio of nouns to adjectives to verbs reveals how audiences frame their discussion — factual reporting vs. emotional evaluation.

Tag   Label       Communication Significance
n     Noun        Topics and entities being discussed
v     Verb        Actions and behaviours described
a     Adjective   Evaluative language and sentiment
d     Adverb      Intensity of expression
pos_distribution.py
import jieba.posseg as pseg
from collections import Counter
import matplotlib.pyplot as plt

# POS tag full names for readability
POS_LABELS = {
    'n': 'Noun', 'v': 'Verb', 'a': 'Adjective', 'd': 'Adverb',
    'r': 'Pronoun', 'p': 'Preposition', 'c': 'Conjunction',
    'm': 'Numeral', 'q': 'Classifier', 'x': 'Other/Symbol',
}

# Count POS tags across all posts
pos_counts = Counter()
for text in df['text']:
    for word, flag in pseg.cut(str(text)):
        tag = flag[:1]  # Use first character of tag
        if tag in POS_LABELS:
            pos_counts[tag] += 1

# Sort by frequency
tags = sorted(pos_counts, key=pos_counts.get, reverse=True)
counts = [pos_counts[t] for t in tags]
labels = [f"{POS_LABELS.get(t, t)} ({t})" for t in tags]

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#2196F3','#4CAF50','#FF9800','#E91E63','#9C27B0','#00BCD4','#FF5722','#607D8B','#795548','#FFC107']
ax.bar(labels, counts, color=colors[:len(labels)])
ax.set_ylabel('Count')
ax.set_title('POS Tag Distribution in Black Myth Wukong Posts')
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()

print("\nPOS Distribution:")
for tag, cnt in zip(tags, counts):
    print(f"  {POS_LABELS.get(tag, tag):15s}: {cnt:,}")

TF-IDF Keyword Extraction

TF-IDF (Term Frequency–Inverse Document Frequency) identifies words that are important to individual posts but rare across the whole corpus — revealing distinctive, document-specific keywords beyond simple frequency counts.

Method           What it measures                               Best for
Word Frequency   How often a word appears overall               Corpus-level themes
TF-IDF           How distinctive a word is to a specific post   Document-level keywords, content differentiation
tfidf_keywords.py
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style

def to_pinyin(w):
    if any('一' <= c <= '鿿' for c in w):
        return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
    return w

# Convert token lists to space-joined strings for TF-IDF
corpus = [' '.join(tokens) for tokens in df['tokens']]

# Fit TF-IDF vectorizer
tfidf = TfidfVectorizer(max_features=500, min_df=5)
X = tfidf.fit_transform(corpus)

# Compute mean TF-IDF score per term across all documents
mean_scores = np.asarray(X.mean(axis=0)).flatten()
terms = tfidf.get_feature_names_out()

# Top 20 terms by mean TF-IDF
top_idx = mean_scores.argsort()[::-1][:20]
top_terms = [(terms[i], mean_scores[i]) for i in top_idx]

print("Top 20 TF-IDF Keywords:")
for term, score in top_terms:
    print(f"  {to_pinyin(term):25s}  score={score:.4f}")

# Visualise
labels = [to_pinyin(t) for t, _ in top_terms]
scores = [s for _, s in top_terms]

fig, ax = plt.subplots(figsize=(10, 8))
ax.barh(labels[::-1], scores[::-1], color='teal')
ax.set_xlabel('Mean TF-IDF Score')
ax.set_title('Top 20 Keywords by TF-IDF Score')
plt.tight_layout()
plt.show()

TF-IDF Document Clustering

Use TF-IDF vectors with K-Means to automatically group posts into thematic clusters, then visualise the clusters in 2D using PCA. Each cluster represents a distinct discussion thread.

tfidf_clustering.py
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import numpy as np
import matplotlib.pyplot as plt
from pypinyin import lazy_pinyin, Style

def to_pinyin(w):
    if any('一' <= c <= '鿿' for c in w):
        return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
    return w

# Build TF-IDF matrix
corpus = [' '.join(tokens) for tokens in df['tokens']]
tfidf = TfidfVectorizer(max_features=300, min_df=5)
X = tfidf.fit_transform(corpus)
terms = tfidf.get_feature_names_out()

# K-Means clustering (5 clusters)
N_CLUSTERS = 5
km = KMeans(n_clusters=N_CLUSTERS, random_state=42, n_init=10)
km.fit(X)
df['cluster'] = km.labels_

# Print top keywords per cluster
print("Cluster Keywords (TF-IDF centroids):")
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(N_CLUSTERS):
    top_words = [to_pinyin(terms[j]) for j in order_centroids[i, :8]]
    size = (df['cluster'] == i).sum()
    print(f"  Cluster {i+1} ({size} posts): {', '.join(top_words)}")

# Visualise clusters with PCA (2D)
pca = PCA(n_components=2, random_state=42)
coords = pca.fit_transform(X.toarray())

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#E91E63','#2196F3','#4CAF50','#FF9800','#9C27B0']
for i in range(N_CLUSTERS):
    mask = km.labels_ == i
    ax.scatter(coords[mask, 0], coords[mask, 1],
               c=colors[i], label=f'Cluster {i+1}', alpha=0.5, s=15)
ax.set_title('TF-IDF Document Clusters (PCA 2D Projection)')
ax.legend()
plt.tight_layout()
plt.show()
Expected clusters
Typical clusters found: Game mechanics & gameplay · Price & purchase discussion · Cultural identity (domestic game pride) · Community & social sharing · Media & news coverage
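The choice of five clusters is a judgment call. One common heuristic is the silhouette score: fit K-Means for several values of k and prefer the k with the highest score. A minimal sketch on synthetic data (with the real data, you would pass the TF-IDF matrix `X` from the cell above instead of `make_blobs` output):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Synthetic stand-in for the TF-IDF matrix
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters. On sparse TF-IDF text the absolute values are typically low, so compare k values against each other rather than against an absolute threshold.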

Sentiment × Time Cross-Analysis

Track how the ratio of positive, neutral, and negative posts shifts month by month. Sudden changes often correspond to real-world events — a game announcement, a controversy, or a release date.

sentiment_over_time.py
from snownlp import SnowNLP
import matplotlib.pyplot as plt
import pandas as pd

# Compute sentiment score for each post
print("Computing sentiment scores (may take ~1 min for full dataset)...")
df['sentiment'] = df['text'].apply(lambda t: SnowNLP(str(t)).sentiments)
df['sent_label'] = df['sentiment'].apply(
    lambda s: 'Positive' if s > 0.6 else ('Negative' if s < 0.4 else 'Neutral')
)

# Group by month and sentiment label
df['month'] = df['created_at'].dt.to_period('M').dt.to_timestamp()
monthly = df.groupby(['month', 'sent_label']).size().unstack(fill_value=0)

# Normalise to proportions (%)
monthly_pct = monthly.div(monthly.sum(axis=1), axis=0) * 100

# Plot stacked area chart
fig, ax = plt.subplots(figsize=(12, 6))
colors = {'Positive': '#4CAF50', 'Neutral': '#FFC107', 'Negative': '#F44336'}
for label in ['Positive', 'Neutral', 'Negative']:
    if label in monthly_pct.columns:
        ax.plot(monthly_pct.index, monthly_pct[label],
                marker='o', label=label, color=colors[label], linewidth=2)

ax.set_xlabel('Month')
ax.set_ylabel('Proportion (%)')
ax.set_title('Sentiment Trend Over Time — Black Myth Wukong Weibo Posts')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("\nMonthly sentiment breakdown (% of posts):")
print(monthly_pct.round(1).to_string())
Communication research value
Sentiment trend lines are a core tool in computational communication research — they allow researchers to link public opinion shifts to specific media events or policy announcements.

Engagement Correlation Analysis

Quantify the relationship between sentiment score and engagement metrics (likes, reposts, comments). Does more emotional content drive more shares? The data may surprise you.

engagement_correlation.py
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd

# Ensure sentiment scores are computed (run sentiment_over_time.py first)
# If not yet computed, uncomment the next two lines:
# from snownlp import SnowNLP
# df['sentiment'] = df['text'].apply(lambda t: SnowNLP(str(t)).sentiments)

# Select engagement columns
eng_cols = ['sentiment', 'reposts_count', 'comments_count', 'attitudes_count']
eng_df = df[eng_cols].dropna()

# Correlation matrix
corr = eng_df.corr()
print("Correlation matrix:")
print(corr.round(3))

# Heatmap
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: correlation heatmap
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm',
            ax=axes[0], square=True, linewidths=0.5)
axes[0].set_title('Correlation: Sentiment vs Engagement')

# Right: scatter — sentiment vs likes
axes[1].scatter(eng_df['sentiment'], eng_df['attitudes_count'],
                alpha=0.3, s=10, color='steelblue')
axes[1].set_xlabel('Sentiment Score (0=neg, 1=pos)')
axes[1].set_ylabel('Likes (attitudes_count)')
axes[1].set_title('Sentiment vs Likes')
axes[1].set_yscale('log')  # Log scale for skewed engagement data

plt.tight_layout()
plt.show()
Expected finding
Correlation between sentiment and engagement is typically weak (r ≈ 0.05–0.10) for this dataset, suggesting engagement is driven more by content topic (e.g. news value) than emotional tone alone.
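Because engagement counts are heavily skewed (hence the log scale above), Pearson's r can be distorted by a few viral posts. Spearman rank correlation is a more robust companion check. A minimal sketch with toy arrays, assuming scipy is available (the values here are illustrative; with the real data, pass `eng_df['sentiment']` and `eng_df['attitudes_count']`):

```python
from scipy.stats import spearmanr

# Toy data: sentiment scores and like counts, including one viral outlier
sentiment = [0.1, 0.3, 0.5, 0.7, 0.9]
likes = [2, 5, 4, 900, 12]

# Spearman correlates the ranks, so the 900-like outlier
# counts only as "highest rank", not as an extreme value
rho, pval = spearmanr(sentiment, likes)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3f})")
```

Reporting both Pearson and Spearman coefficients is good practice for skewed social-media metrics.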

Keyword Co-occurrence Heatmap

A matrix heatmap showing how often the top N keywords appear together in the same post. Darker cells indicate stronger co-occurrence — an alternative view of the semantic network.

cooccurrence_heatmap.py
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from pypinyin import lazy_pinyin, Style

def to_pinyin(w):
    if any('一' <= c <= '鿿' for c in w):
        return ' '.join(lazy_pinyin(w, style=Style.NORMAL))
    return w

# Select top N keywords
N = 15
all_tokens = [w for tokens in df['tokens'] for w in tokens]
top_words = [w for w, _ in Counter(all_tokens).most_common(N)]

# Build co-occurrence matrix
mat = np.zeros((N, N), dtype=int)
for tokens in df['tokens']:
    token_set = set(tokens)
    for i, a in enumerate(top_words):
        for j, b in enumerate(top_words):
            if i != j and a in token_set and b in token_set:
                mat[i][j] += 1

# Convert labels to pinyin
labels = [to_pinyin(w) for w in top_words]

# Plot heatmap
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
    mat, xticklabels=labels, yticklabels=labels,
    cmap='Blues', annot=True, fmt='d', linewidths=0.3,
    ax=ax
)
ax.set_title(f'Top {N} Keyword Co-occurrence Heatmap')
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()
Interpretation
The heatmap complements the semantic network: it quantifies co-occurrence strength numerically, making it easier to compare pairs and identify the tightest conceptual clusters.
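Raw co-occurrence counts inevitably favour frequent words. To compare pairs of unequal frequency more fairly, you can normalise each count by how many posts contain either word — the Jaccard index. A pure-Python sketch on toy token lists (with the real data, iterate over `df['tokens']`):

```python
# Toy tokenised posts standing in for df['tokens']
posts = [['wukong', 'game'], ['wukong', 'game', 'price'],
         ['game', 'price'], ['wukong', 'china']]

def jaccard(word_a, word_b, token_lists):
    """|posts containing both| / |posts containing either|."""
    has_a = {i for i, toks in enumerate(token_lists) if word_a in toks}
    has_b = {i for i, toks in enumerate(token_lists) if word_b in toks}
    union = has_a | has_b
    return len(has_a & has_b) / len(union) if union else 0.0

# 2 posts contain both words, 4 posts contain at least one
print(jaccard('wukong', 'game', posts))  # -> 0.5
```

Replacing the raw matrix `mat` with Jaccard values before plotting gives a heatmap where a dark cell means two words occur together whenever either appears, independent of overall frequency.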
PART 06

User Behaviour Analysis

Understand when users post and who drives the most engagement in the community.

Posting Time Distribution

Analyse when users are most active — by hour of day and day of week. This reveals audience habits and optimal publishing windows, a key insight for media strategy.

posting_time_distribution.py
import matplotlib.pyplot as plt
import numpy as np

# Extract hour and weekday
df['hour'] = df['created_at'].dt.hour
df['weekday'] = df['created_at'].dt.day_name()

WEEKDAY_ORDER = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ── Left: Hourly distribution ─────────────────────────────────────────────
hour_counts = df['hour'].value_counts().sort_index()
axes[0].bar(hour_counts.index, hour_counts.values, color='steelblue', width=0.8)
axes[0].set_xlabel('Hour of Day (0–23)')
axes[0].set_ylabel('Number of Posts')
axes[0].set_title('Posting Activity by Hour')
axes[0].set_xticks(range(0, 24, 2))
axes[0].grid(axis='y', alpha=0.3)

# Annotate peak hour
peak_hour = hour_counts.idxmax()
axes[0].axvline(peak_hour, color='red', linestyle='--', alpha=0.7,
                label=f'Peak: {peak_hour}:00')
axes[0].legend()

# ── Right: Weekday distribution ───────────────────────────────────────────
weekday_counts = df['weekday'].value_counts().reindex(WEEKDAY_ORDER, fill_value=0)
colors = ['#4CAF50' if d in ['Saturday','Sunday'] else '#2196F3' for d in WEEKDAY_ORDER]
axes[1].bar(WEEKDAY_ORDER, weekday_counts.values, color=colors)
axes[1].set_xlabel('Day of Week')
axes[1].set_ylabel('Number of Posts')
axes[1].set_title('Posting Activity by Weekday')
plt.setp(axes[1].get_xticklabels(), rotation=30, ha='right')
axes[1].grid(axis='y', alpha=0.3)

plt.suptitle('User Posting Behaviour — Black Myth Wukong Weibo Posts', fontsize=13, y=1.02)
plt.tight_layout()
plt.show()

print(f"Peak posting hour: {peak_hour}:00")
print(f"Most active day: {weekday_counts.idxmax()}")
Expected insight
Peak posting hours are typically 21:00–23:00 (evening leisure time). Weekend activity may be higher for gaming-related content.
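The two bar charts can also be combined into a single hour-by-weekday matrix, which shows whether the evening peak holds on weekends as well as weekdays. A minimal sketch with toy timestamps (with the real data, reuse `df['created_at']` and plot the pivot with `sns.heatmap`):

```python
import pandas as pd

# Toy timestamps standing in for df['created_at']
ts = pd.to_datetime(['2024-06-01 21:30', '2024-06-01 22:10',
                     '2024-06-03 09:00', '2024-06-03 21:45'])
frame = pd.DataFrame({'created_at': ts})
frame['hour'] = frame['created_at'].dt.hour
frame['weekday'] = frame['created_at'].dt.day_name()

# Rows = weekday, columns = hour, values = post counts
pivot = frame.pivot_table(index='weekday', columns='hour',
                          values='created_at', aggfunc='count', fill_value=0)
print(pivot)
```

Feeding `pivot` (reindexed with `WEEKDAY_ORDER`) into a seaborn heatmap gives the standard activity grid used in audience research.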

Influencer Identification

Rank users by total engagement (likes + reposts + comments) to identify key opinion leaders (KOLs). Also compute average engagement per post to distinguish prolific posters from genuinely influential ones.

Metric                  Formula                      What it reveals
Total Engagement        Likes + Reposts + Comments   Overall reach and impact
Avg Engagement / Post   Total ÷ Post Count           Content quality vs. volume
Repost Rate             Reposts ÷ Total Engagement   Content shareability
influencer_identification.py
import matplotlib.pyplot as plt
import pandas as pd

# Compute total engagement per user
df['engagement'] = df['reposts_count'] + df['comments_count'] + df['attitudes_count']

# Aggregate by user
user_stats = df.groupby('screen_name').agg(
    total_posts=('text', 'count'),
    total_engagement=('engagement', 'sum'),
    avg_engagement=('engagement', 'mean'),
    total_likes=('attitudes_count', 'sum'),
    total_reposts=('reposts_count', 'sum'),
    total_comments=('comments_count', 'sum'),
).sort_values('total_engagement', ascending=False)

print("Top 10 Most Influential Users:")
print(user_stats.head(10).to_string())

# ── Visualise top 15 users by total engagement ────────────────────────────
top15 = user_stats.head(15)

fig, ax = plt.subplots(figsize=(12, 8))
bars = ax.barh(top15.index[::-1], top15['total_engagement'][::-1], color='steelblue')

# Colour-code by post count
max_posts = top15['total_posts'].max()
for bar, (_, row) in zip(bars, top15[::-1].iterrows()):
    intensity = row['total_posts'] / max_posts
    bar.set_alpha(0.4 + 0.6 * intensity)

ax.set_xlabel('Total Engagement (Likes + Reposts + Comments)')
ax.set_title('Top 15 Users by Total Engagement')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# ── Engagement rate (avg per post) ────────────────────────────────────────
print("\nTop 10 by Average Engagement per Post (min 3 posts):")
high_avg = user_stats[user_stats['total_posts'] >= 3].sort_values('avg_engagement', ascending=False)
print(high_avg[['total_posts','avg_engagement','total_likes','total_reposts']].head(10).to_string())
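The repost-rate metric from the table above is not computed in the cell. A minimal sketch of how it could be added, using toy per-user totals (with the real data, reuse the `user_stats` frame built above):

```python
import pandas as pd

# Toy per-user totals standing in for user_stats
user_stats = pd.DataFrame({
    'total_reposts':  [40, 5],
    'total_comments': [30, 10],
    'total_likes':    [30, 85],
}, index=['user_a', 'user_b'])

total_eng = (user_stats['total_reposts'] + user_stats['total_comments']
             + user_stats['total_likes'])
# Repost rate = reposts as a share of all engagement (shareability)
user_stats['repost_rate'] = user_stats['total_reposts'] / total_eng
print(user_stats['repost_rate'])
```

A high repost rate with modest totals points to highly shareable content, distinct from the volume-driven leaders in the total-engagement ranking.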