Emotion Analysis: Execution Report

A complete account of the model evolution from V1 failure to V2 success.

Step 1: Model V1 Experiment & Failure Analysis

What we did: We started by loading the small, manually annotated dataset (annotation_sentiment_study.xlsx). We cleaned the data and attempted to train our first model (Model V1) using standard BERT.

Specific Actions:

import html
import re

import emoji
import pandas as pd

# ===============================
# Load Malay SMS abbreviations
# CSV format:
# Column 0: Abbreviation
# Column 1: Original
# ===============================
abbr_df = pd.read_csv("malay_sms_abbreviations.csv")

# Normalize abbreviations (important!)
abbr_df['Abbreviation'] = abbr_df['Abbreviation'].astype(str).str.lower().str.strip()
abbr_df['Original'] = abbr_df['Original'].astype(str).str.lower().str.strip()

# Build abbreviation dictionary
ABBR_MAP = dict(zip(abbr_df['Abbreviation'], abbr_df['Original']))

# Compile regex once for efficiency
ABBR_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, ABBR_MAP.keys())) + r')\b'
)

def expand_malay_abbreviations(text: str) -> str:
    return ABBR_PATTERN.sub(lambda m: ABBR_MAP[m.group(0)], text)

# ===============================
# Encoding / mojibake fix
# ===============================
def fix_encoding_mojibake(text):
    try:
        # Reverse GBK-as-UTF-8 mojibake; leave text unchanged if the round-trip fails
        return text.encode('gbk').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# ===============================
# BERT preprocessing function
# ===============================
def preprocess_bert(text: str) -> str:
    text = str(text)

    # 1. Fix encoding / mojibake
    text = fix_encoding_mojibake(text)

    # 1.1 Unescape HTML entities
    text = html.unescape(text)

    # 2. Lowercase (use ONLY if using uncased BERT)
    text = text.lower()

    # 3. Expand Malay SMS abbreviations ⭐⭐⭐
    text = expand_malay_abbreviations(text)

    # 4. Demojize (emoji → text)
    text = emoji.demojize(text)

    # 5. Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)

    # 6. Remove user mentions
    text = re.sub(r"@[A-Za-z0-9_.]+", "", text)

    # 7. Remove HTML tags
    text = re.sub(r"<.*?>", "", text)

    # 8. Keep hashtag text (remove # only)
    text = re.sub(r"#(\w+)", r"\1", text)

    # 9. Reduce excessive elongation (max 3 chars)
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)

    # 10. Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()

    return text

# ===============================
# Apply preprocessing
# ===============================
print("Applying BERT-optimized preprocessing with Malay SMS expansion...")
df['Processed_Tweet'] = df['Tweet'].apply(preprocess_bert)

# ===============================
# Display comparison
# ===============================
pd.set_option('display.max_colwidth', None)
df[['Tweet', 'Processed_Tweet']]

Step 2: Data Preparation for Augmentation

What we did: We addressed the data shortage. Since manual labeling is slow, we prepared raw data to be labeled by an AI.

Specific Actions:
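The code for this step is not reproduced in the report. A minimal sketch of the kind of preparation it describes (deduplicating, dropping empty rows, and sampling a batch to send to the labeler) might look like the following; the column name `Tweet`, the helper name, and the sample size are assumptions for illustration:

```python
import pandas as pd

def prepare_for_labeling(df: pd.DataFrame, text_col: str = "Tweet",
                         n: int = 500, seed: int = 42) -> pd.DataFrame:
    """Deduplicate, drop empty texts, and sample a batch for AI labeling."""
    out = df.copy()
    out[text_col] = out[text_col].astype(str).str.strip()
    out = out[out[text_col] != ""]              # drop empty texts
    out = out.drop_duplicates(subset=text_col)  # remove exact duplicates
    n = min(n, len(out))                        # never request more rows than exist
    return out.sample(n=n, random_state=seed).reset_index(drop=True)
```

The fixed `random_state` keeps the sampled batch reproducible across labeling runs.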

Step 3: Dual-Dataset Labeling Strategy

What we did: We utilized a local Large Language Model (Ollama/Qwen) to label data. Crucially, we didn't just label new data; we re-labeled the old data too.

Specific Actions:

messages = [
    {
        "role": "system",
        "content": (
            "You are an emotion classification system. "
            "Reply with ONLY ONE emotion label from the provided list."
        )
    },
    {
        "role": "user",
        "content": f"""
Classify the MAIN emotion in the text below.

Text:
{text}

Choose EXACTLY ONE label from:
Happiness
Anger
Sadness
Neutral
Fear

Rules:
- Reply with ONLY the label
- No explanation
- If unclear, reply Neutral
"""
    }
]
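Even with strict prompt rules, an LLM reply occasionally carries stray punctuation, casing noise, or a full sentence, so the raw reply still needs validation before it becomes a label. A small sketch of the post-processing we would pair with the prompt above (the `normalize_label` helper is an assumption, not code shown in the report):

```python
import re

VALID_LABELS = {"Happiness", "Anger", "Sadness", "Neutral", "Fear"}

def normalize_label(reply: str) -> str:
    """Map a raw LLM reply onto one of the five labels; fall back to Neutral."""
    # Take the first word-like token and title-case it
    # (handles replies like "anger." or "  SADNESS\n")
    match = re.search(r"[A-Za-z]+", reply or "")
    candidate = match.group(0).title() if match else ""
    return candidate if candidate in VALID_LABELS else "Neutral"
```

With the Ollama Python client, the raw reply would come from something like `ollama.chat(model="qwen", messages=messages)` and then pass through `normalize_label`; the model name and exact response shape depend on the installed client version.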

Step 4: Cleaning & Training Model V2

What we did: This is the core retraining phase. We cleaned the newly added dataset and trained Model V2 using the augmented, consistent data.

Specific Actions:
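The cleaning code itself is not reproduced here. A minimal sketch of combining the re-labeled original data with the AI-labeled additions and keeping only rows with a valid emotion label (column names `Processed_Tweet` and `Label` are assumptions):

```python
import pandas as pd

VALID_LABELS = {"Happiness", "Anger", "Sadness", "Neutral", "Fear"}

def build_training_set(old_df: pd.DataFrame, new_df: pd.DataFrame) -> pd.DataFrame:
    """Combine the re-labeled original data with the AI-labeled additions,
    keeping only rows whose label is one of the five valid emotions."""
    combined = pd.concat([old_df, new_df], ignore_index=True)
    combined["Label"] = combined["Label"].astype(str).str.strip().str.title()
    combined = combined[combined["Label"].isin(VALID_LABELS)]  # drop malformed labels
    return combined.drop_duplicates(subset="Processed_Tweet").reset_index(drop=True)
```

Because both datasets were labeled by the same LLM with the same prompt, the combined set is label-consistent, which is the property Model V2's retraining depends on.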

Model V2 Results

The final evaluation of Model V2 on the test set produced the following metrics:

Step 5: Large-Scale Prediction (88k Rows)

What we did: We took the successfully trained Model V2 and ran it over the full raw dataset (~88,000 rows).

Specific Actions:
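Scoring 88k rows in one pass would exhaust GPU memory, so inference has to be batched. A sketch of the batching scaffold (the `classify_batch` callable stands in for the fine-tuned Model V2 pipeline and is an assumption):

```python
from typing import Callable, Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield fixed-size chunks so the 88k texts never sit in GPU memory at once."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def predict_all(texts: List[str],
                classify_batch: Callable[[List[str]], List[str]],
                batch_size: int = 64) -> List[str]:
    """Run a batch classifier over every text. `classify_batch` maps a list of
    strings to a list of labels (e.g. a fine-tuned BERT inference pipeline)."""
    labels: List[str] = []
    for batch in batched(texts, batch_size):
        labels.extend(classify_batch(batch))
    return labels
```

In practice `classify_batch` would wrap the Model V2 tokenizer and forward pass; the batch size is tuned to available GPU memory.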

Step 6: Analysis of Predicted Results

What we did: We performed a sociological analysis on the 88,000 predictions generated by Model V2.

Temporal Trends

We analyzed how emotions fluctuated over the collected period.
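A sketch of how such a trend table can be built with pandas, counting each emotion per month (the column names `Date` and `Predicted_Emotion` are assumptions):

```python
import pandas as pd

def emotion_trends(df: pd.DataFrame, date_col: str = "Date",
                   label_col: str = "Predicted_Emotion") -> pd.DataFrame:
    """Monthly count of each emotion; rows = month, columns = emotion."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    return (out.groupby([out[date_col].dt.to_period("M"), label_col])
               .size()
               .unstack(fill_value=0))   # missing (month, emotion) pairs become 0
```

The resulting frame plots directly with `trends.plot()` to show how each emotion fluctuated over the collection period.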

Word Clouds (All Emotions)

We generated word clouds for every emotion category to verify the model's associations. Below are all the clouds generated:

[Word cloud visualizations 1–9]
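Each cloud is driven by the word frequencies within one emotion's tweets. A sketch of the frequency step (the helper name is an assumption; the resulting `Counter` is the shape that `WordCloud.generate_from_frequencies()` in the wordcloud package accepts):

```python
from collections import Counter
from typing import Iterable

def top_words(texts: Iterable[str], n: int = 50) -> Counter:
    """Word frequencies for one emotion's tweets, trimmed to the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return Counter(dict(counts.most_common(n)))
```

Inspecting these top words per emotion is how we verified that the model's associations (e.g. which vocabulary dominates Anger versus Fear) are plausible.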

Influencer & Temporal Analysis

Finally, we examined user behavior, correlating follower counts with predicted emotion and mapping posting activity by hour of day.
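The paragraph above can be sketched as two simple aggregations (the column names `Predicted_Emotion`, `Followers`, and `Date` are assumptions):

```python
import pandas as pd

def influencer_summary(df: pd.DataFrame):
    """Mean follower count per predicted emotion, plus tweet volume by hour of day."""
    by_emotion = df.groupby("Predicted_Emotion")["Followers"].mean()
    hours = pd.to_datetime(df["Date"]).dt.hour
    by_hour = df.groupby(hours).size()
    return by_emotion, by_hour
```

`by_emotion` shows whether high-follower accounts skew toward particular emotions, and `by_hour` maps when each day's activity peaks.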