A complete account of the model evolution from V1 failure to V2 success.
What we did: We started by loading the small, manually annotated dataset (annotation_sentiment_study.xlsx). We cleaned the data and attempted to train our first model (Model V1) using standard BERT.
Specific Actions:
import html
import re

import emoji
import pandas as pd

# Load the small, manually annotated dataset described above
df = pd.read_excel("annotation_sentiment_study.xlsx")

# ===============================
# Load Malay SMS abbreviations
# CSV format:
#   Column 0: Abbreviation
#   Column 1: Original
# ===============================
abbr_df = pd.read_csv("malay_sms_abbreviations.csv")
# Normalize abbreviations (important!)
abbr_df['Abbreviation'] = abbr_df['Abbreviation'].astype(str).str.lower().str.strip()
abbr_df['Original'] = abbr_df['Original'].astype(str).str.lower().str.strip()
# Build abbreviation dictionary
ABBR_MAP = dict(zip(abbr_df['Abbreviation'], abbr_df['Original']))
# Compile regex once for efficiency
ABBR_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, ABBR_MAP.keys())) + r')\b'
)

def expand_malay_abbreviations(text: str) -> str:
    return ABBR_PATTERN.sub(lambda m: ABBR_MAP[m.group(0)], text)
# ===============================
# Encoding / mojibake fix
# ===============================
def fix_encoding_mojibake(text):
    # Repair text whose original UTF-8 bytes were mis-decoded as GBK;
    # if the round-trip fails, the text was not mojibake, so return it unchanged.
    try:
        return text.encode('gbk').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
# ===============================
# BERT preprocessing function
# ===============================
def preprocess_bert(text: str) -> str:
    text = str(text)
    # 1. Fix encoding / mojibake
    text = fix_encoding_mojibake(text)
    # 1.1 Unescape HTML entities
    text = html.unescape(text)
    # 2. Lowercase (use ONLY if using uncased BERT)
    text = text.lower()
    # 3. Expand Malay SMS abbreviations ⭐⭐⭐
    text = expand_malay_abbreviations(text)
    # 4. Demojize (emoji → text)
    text = emoji.demojize(text)
    # 5. Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # 6. Remove user mentions
    text = re.sub(r"@[A-Za-z0-9_.]+", "", text)
    # 7. Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # 8. Keep hashtag text (remove # only)
    text = re.sub(r"#(\w+)", r"\1", text)
    # 9. Reduce excessive elongation (max 3 chars)
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # 10. Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
# ===============================
# Apply preprocessing
# ===============================
print("Applying BERT-optimized preprocessing with Malay SMS expansion...")
df['Processed_Tweet'] = df['Tweet'].apply(preprocess_bert)
# ===============================
# Display comparison
# ===============================
pd.set_option('display.max_colwidth', None)
df[['Tweet', 'Processed_Tweet']]
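With the text preprocessed, Model V1 can be fine-tuned on the annotated data. The following is a minimal sketch using Hugging Face Transformers; the checkpoint name, the "Label" column, and the 80/20 stratified split are assumptions for illustration, not settings recorded in the project files.
# Minimal Model V1 fine-tuning sketch.
# ASSUMPTIONS: the checkpoint name, a "Label" column in df, and the 80/20 split
# are illustrative; they are not confirmed by the project files.
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

label_names = sorted(df["Label"].unique())          # assumed label column
label2id = {name: i for i, name in enumerate(label_names)}

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)

MODEL_NAME = "bert-base-multilingual-uncased"       # assumed "standard BERT" checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def to_dataset(frame):
    # Tokenize the preprocessed tweets and attach integer labels.
    ds = Dataset.from_pandas(frame[["Processed_Tweet", "Label"]], preserve_index=False)
    ds = ds.map(lambda b: tokenizer(b["Processed_Tweet"], truncation=True,
                                    padding="max_length", max_length=128), batched=True)
    return ds.map(lambda b: {"labels": [label2id[l] for l in b["Label"]]}, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_names),
    id2label={i: n for n, i in label2id.items()},
    label2id=label2id,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model_v1", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=to_dataset(train_df),
    eval_dataset=to_dataset(test_df),
)
trainer.train()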
What we did: We addressed the data shortage. Since manual labeling is slow, we prepared raw data to be labeled by an AI.
Specific Actions:
- Loaded abai_geram (Excel).xlsx, which contains over 90,000 rows.
- Extracted abai_geram_3000.csv for the LLM to process, as sketched below.
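A minimal sketch of that extraction step, assuming a simple random sample; the actual selection criteria are not documented here.
# Sample 3,000 rows from the full ~90,000-row file for LLM labeling.
# ASSUMPTION: a simple random sample; the original selection criteria are not recorded.
import pandas as pd

raw_df = pd.read_excel("abai_geram (Excel).xlsx")
sample_df = raw_df.sample(n=3000, random_state=42)
sample_df.to_csv("abai_geram_3000.csv", index=False)
print(f"Saved {len(sample_df)} of {len(raw_df)} rows for labeling.")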
What we did: We utilized a local Large Language Model (Ollama/Qwen) to label data. Crucially, we didn't just label new data; we re-labeled the old data too.
Specific Actions:
- Prompted the LLM to label each tweet from the abai_geram dataset, using the message structure below:
# Prompt template for the LLM ("text" holds the tweet being classified)
messages = [
    {
        "role": "system",
        "content": (
            "You are an emotion classification system. "
            "Reply with ONLY ONE emotion label from the provided list."
        )
    },
    {
        "role": "user",
        "content": f"""
Classify the MAIN emotion in the text below.
Text:
{text}
Choose EXACTLY ONE label from:
Happiness
Anger
Sadness
Neutral
Fear
Rules:
- Reply with ONLY the label
- No explanation
- If unclear, reply Neutral
"""
    }
]
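A hedged sketch of sending this prompt to a local Ollama server follows. The qwen2.5 model tag, the official ollama Python client, and the Tweet column name are assumptions rather than recorded project settings.
# Label the 3,000-row sample with a local Ollama model.
# ASSUMPTIONS: the "ollama" Python client, a "qwen2.5" model tag, and a "Tweet"
# column in the CSV; the exact model and version used are not recorded here.
import ollama
import pandas as pd

VALID_LABELS = {"Happiness", "Anger", "Sadness", "Neutral", "Fear"}

def classify_emotion(text: str) -> str:
    # Re-uses the `messages` structure shown above, with the tweet filled in.
    messages = [
        {
            "role": "system",
            "content": (
                "You are an emotion classification system. "
                "Reply with ONLY ONE emotion label from the provided list."
            ),
        },
        {
            "role": "user",
            "content": (
                "Classify the MAIN emotion in the text below.\n"
                f"Text:\n{text}\n"
                "Choose EXACTLY ONE label from:\n"
                "Happiness\nAnger\nSadness\nNeutral\nFear\n"
                "Rules:\n- Reply with ONLY the label\n- No explanation\n- If unclear, reply Neutral"
            ),
        },
    ]
    reply = ollama.chat(model="qwen2.5", messages=messages)["message"]["content"].strip()
    return reply if reply in VALID_LABELS else "Neutral"   # guard against off-list replies

sample_df = pd.read_csv("abai_geram_3000.csv")
sample_df["LLM_Label"] = sample_df["Tweet"].apply(classify_emotion)
sample_df.to_csv("abai_geram_3000_labeled.csv", index=False)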
What we did: This is the core retraining phase. We cleaned the newly added dataset and trained Model V2 using the augmented, consistent data.
Specific Actions:
- Cleaned the newly labeled abai_geram data using the Step 1 pipeline to ensure feature compatibility.
The final evaluation of Model V2 on the test set produced the following metrics:
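The sketch below shows one way such test-set metrics can be computed with scikit-learn, assuming Model V2 reuses the Trainer-style setup sketched for Model V1; trainer, to_dataset, test_df, and label_names refer to that earlier sketch.
# Evaluate the retrained model on the held-out test set.
# ASSUMPTION: `trainer`, `to_dataset`, `test_df`, and `label_names` follow the
# Model V1 sketch above, retrained on the augmented data.
import numpy as np
from sklearn.metrics import classification_report

pred_output = trainer.predict(to_dataset(test_df))
y_pred = np.argmax(pred_output.predictions, axis=-1)
y_true = pred_output.label_ids

print(classification_report(y_true, y_pred, target_names=label_names, digits=3))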
What we did: We took the successfully trained Model V2 and deployed it against the full raw dataset.
Specific Actions:
- Generated predictions for the entire raw abai_geram dataset.
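A minimal sketch of this batch-prediction step with a Transformers text-classification pipeline; the model_v2 save path and the Tweet column name are assumptions, and preprocess_bert is the Step 1 function.
# Apply Model V2 to the full raw dataset.
# ASSUMPTIONS: Model V2 was saved to "model_v2", and the raw file has a "Tweet" column.
import pandas as pd
from transformers import pipeline

clf = pipeline("text-classification", model="model_v2")

full_df = pd.read_excel("abai_geram (Excel).xlsx")
full_df["Processed_Tweet"] = full_df["Tweet"].apply(preprocess_bert)   # Step 1 pipeline

predictions = clf(full_df["Processed_Tweet"].tolist(), batch_size=64, truncation=True)
full_df["Predicted_Emotion"] = [p["label"] for p in predictions]
full_df.to_csv("abai_geram_predictions.csv", index=False)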
What we did: We performed a sociological analysis on the 88,000 predictions generated by Model V2.
We analyzed how emotions fluctuated over the collected period.
We generated word clouds for every emotion category to verify the model's associations. Below are all the clouds generated:
Finally, we looked at user behavior, correlating followers with sentiment and mapping activity by hour.
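A compact sketch of these analyses, continuing from the batch-prediction sketch above; the Timestamp and Followers column names are assumptions about the raw data, as is the use of matplotlib and wordcloud.
# Sociological analysis over the predicted emotions.
# ASSUMPTIONS: full_df (from the batch-prediction sketch) has "Timestamp" and
# "Followers" columns alongside "Predicted_Emotion" and "Processed_Tweet".
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

full_df["Timestamp"] = pd.to_datetime(full_df["Timestamp"])

# 1. Emotion fluctuation over the collected period (daily counts per emotion)
daily = (full_df.groupby([full_df["Timestamp"].dt.date, "Predicted_Emotion"])
                .size().unstack(fill_value=0))
daily.plot(figsize=(12, 4), title="Daily emotion counts")
plt.savefig("emotion_over_time.png")

# 2. One word cloud per emotion category
for emotion, group in full_df.groupby("Predicted_Emotion"):
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(group["Processed_Tweet"].astype(str)))
    wc.to_file(f"wordcloud_{emotion}.png")

# 3. User behaviour: follower counts per emotion, and activity by hour of day
print(full_df.groupby("Predicted_Emotion")["Followers"].describe())
hourly = full_df.groupby(full_df["Timestamp"].dt.hour).size()
hourly.plot(kind="bar", figsize=(10, 3), title="Tweets per hour of day")
plt.savefig("activity_by_hour.png")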