A complete account of the model evolution from V1 failure to V2 success.
What we did: We started by loading the small, manually annotated dataset (annotation_sentiment_study.xlsx). We cleaned the data and attempted to train our first model (Model V1) using standard BERT.
Specific Actions:
import html
import re

import emoji
import pandas as pd

# Load the small, manually annotated dataset described above
df = pd.read_excel("annotation_sentiment_study.xlsx")

# ===============================
# Load Malay SMS abbreviations
# CSV format:
#   Column 0: Abbreviation
#   Column 1: Original
# ===============================
abbr_df = pd.read_csv("malay_sms_abbreviations.csv")
# Normalize abbreviations (important!)
abbr_df['Abbreviation'] = abbr_df['Abbreviation'].astype(str).str.lower().str.strip()
abbr_df['Original'] = abbr_df['Original'].astype(str).str.lower().str.strip()
# Build abbreviation dictionary
ABBR_MAP = dict(zip(abbr_df['Abbreviation'], abbr_df['Original']))
# Compile regex once for efficiency
ABBR_PATTERN = re.compile(
    r'\b(' + '|'.join(map(re.escape, ABBR_MAP.keys())) + r')\b'
)

def expand_malay_abbreviations(text: str) -> str:
    return ABBR_PATTERN.sub(lambda m: ABBR_MAP[m.group(0)], text)
# ===============================
# Encoding / mojibake fix
# ===============================
def fix_encoding_mojibake(text):
    # Repair text whose original UTF-8 bytes were mis-decoded as GBK;
    # if the round-trip fails, the text was not mojibake, so return it unchanged.
    try:
        return text.encode('gbk').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text
# ===============================
# BERT preprocessing function
# ===============================
def preprocess_bert(text: str) -> str:
    text = str(text)
    # 1. Fix encoding / mojibake
    text = fix_encoding_mojibake(text)
    # 1.1 Unescape HTML entities
    text = html.unescape(text)
    # 2. Lowercase (use ONLY if using uncased BERT)
    text = text.lower()
    # 3. Expand Malay SMS abbreviations ⭐⭐⭐
    text = expand_malay_abbreviations(text)
    # 4. Demojize (emoji → text)
    text = emoji.demojize(text)
    # 5. Remove URLs
    text = re.sub(r"http\S+|www\S+", "", text)
    # 6. Remove user mentions
    text = re.sub(r"@[A-Za-z0-9_.]+", "", text)
    # 7. Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # 8. Keep hashtag text (remove # only)
    text = re.sub(r"#(\w+)", r"\1", text)
    # 9. Reduce excessive elongation (max 3 chars)
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)
    # 10. Normalize whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text
# ===============================
# Apply preprocessing
# ===============================
print("Applying BERT-optimized preprocessing with Malay SMS expansion...")
df['Processed_Tweet'] = df['Tweet'].apply(preprocess_bert)
# ===============================
# Display comparison
# ===============================
pd.set_option('display.max_colwidth', None)
df[['Tweet', 'Processed_Tweet']]
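With the text preprocessed, Model V1 can be fine-tuned on the annotated data. The following is a minimal sketch using Hugging Face Transformers; the checkpoint name, the "Label" column, and the 80/20 stratified split are assumptions for illustration, not settings recorded in the project files.
# Minimal Model V1 fine-tuning sketch.
# ASSUMPTIONS: the checkpoint name, a "Label" column in df, and the 80/20 split
# are illustrative; they are not confirmed by the project files.
from datasets import Dataset
from sklearn.model_selection import train_test_split
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

label_names = sorted(df["Label"].unique())          # assumed label column
label2id = {name: i for i, name in enumerate(label_names)}

train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["Label"], random_state=42
)

MODEL_NAME = "bert-base-multilingual-uncased"       # assumed "standard BERT" checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

def to_dataset(frame):
    # Tokenize the preprocessed tweets and attach integer labels.
    ds = Dataset.from_pandas(frame[["Processed_Tweet", "Label"]], preserve_index=False)
    ds = ds.map(lambda b: tokenizer(b["Processed_Tweet"], truncation=True,
                                    padding="max_length", max_length=128), batched=True)
    return ds.map(lambda b: {"labels": [label2id[l] for l in b["Label"]]}, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(label_names),
    id2label={i: n for n, i in label2id.items()},
    label2id=label2id,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="model_v1", num_train_epochs=3,
                           per_device_train_batch_size=16, learning_rate=2e-5),
    train_dataset=to_dataset(train_df),
    eval_dataset=to_dataset(test_df),
)
trainer.train()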
What we did: We addressed the data shortage. Since manual labeling is slow, we prepared raw data to be labeled by an AI.
Specific Actions:
- Loaded abai_geram (Excel).xlsx, which contains over 90,000 rows.
- Extracted abai_geram_3000.csv for the LLM to process, as sketched below.
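A minimal sketch of that extraction step, assuming a simple random sample; the actual selection criteria are not documented here.
# Sample 3,000 rows from the full ~90,000-row file for LLM labeling.
# ASSUMPTION: a simple random sample; the original selection criteria are not recorded.
import pandas as pd

raw_df = pd.read_excel("abai_geram (Excel).xlsx")
sample_df = raw_df.sample(n=3000, random_state=42)
sample_df.to_csv("abai_geram_3000.csv", index=False)
print(f"Saved {len(sample_df)} of {len(raw_df)} rows for labeling.")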
What we did: We utilized a local Large Language Model (Ollama/Qwen) to label data. Crucially, we didn't just label new data; we re-labeled the old data too.
Specific Actions:
- Prompted the LLM to label each tweet from the abai_geram dataset, using the message structure below:
# Prompt template for the LLM ("text" holds the tweet being classified)
messages = [
    {
        "role": "system",
        "content": (
            "You are an emotion classification system. "
            "Reply with ONLY ONE emotion label from the provided list."
        )
    },
    {
        "role": "user",
        "content": f"""
Classify the MAIN emotion in the text below.
Text:
{text}
Choose EXACTLY ONE label from:
Happiness
Anger
Sadness
Neutral
Fear
Rules:
- Reply with ONLY the label
- No explanation
- If unclear, reply Neutral
"""
    }
]
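A hedged sketch of sending this prompt to a local Ollama server follows. The qwen2.5 model tag, the official ollama Python client, and the Tweet column name are assumptions rather than recorded project settings.
# Label the 3,000-row sample with a local Ollama model.
# ASSUMPTIONS: the "ollama" Python client, a "qwen2.5" model tag, and a "Tweet"
# column in the CSV; the exact model and version used are not recorded here.
import ollama
import pandas as pd

VALID_LABELS = {"Happiness", "Anger", "Sadness", "Neutral", "Fear"}

def classify_emotion(text: str) -> str:
    # Re-uses the `messages` structure shown above, with the tweet filled in.
    messages = [
        {
            "role": "system",
            "content": (
                "You are an emotion classification system. "
                "Reply with ONLY ONE emotion label from the provided list."
            ),
        },
        {
            "role": "user",
            "content": (
                "Classify the MAIN emotion in the text below.\n"
                f"Text:\n{text}\n"
                "Choose EXACTLY ONE label from:\n"
                "Happiness\nAnger\nSadness\nNeutral\nFear\n"
                "Rules:\n- Reply with ONLY the label\n- No explanation\n- If unclear, reply Neutral"
            ),
        },
    ]
    reply = ollama.chat(model="qwen2.5", messages=messages)["message"]["content"].strip()
    return reply if reply in VALID_LABELS else "Neutral"   # guard against off-list replies

sample_df = pd.read_csv("abai_geram_3000.csv")
sample_df["LLM_Label"] = sample_df["Tweet"].apply(classify_emotion)
sample_df.to_csv("abai_geram_3000_labeled.csv", index=False)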
What we did: This is the core retraining phase. We cleaned the newly added dataset and trained Model V2 using the augmented, consistent data.
Specific Actions:
- Cleaned the newly labeled abai_geram data using the Step 1 pipeline to ensure feature compatibility.
The final evaluation of Model V2 on the test set produced the following metrics:
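The sketch below shows one way such test-set metrics can be computed with scikit-learn, assuming Model V2 reuses the Trainer-style setup sketched for Model V1; trainer, to_dataset, test_df, and label_names refer to that earlier sketch.
# Evaluate the retrained model on the held-out test set.
# ASSUMPTION: `trainer`, `to_dataset`, `test_df`, and `label_names` follow the
# Model V1 sketch above, retrained on the augmented data.
import numpy as np
from sklearn.metrics import classification_report

pred_output = trainer.predict(to_dataset(test_df))
y_pred = np.argmax(pred_output.predictions, axis=-1)
y_true = pred_output.label_ids

print(classification_report(y_true, y_pred, target_names=label_names, digits=3))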
What we did: We took the successfully trained Model V2 and deployed it against the full raw dataset.
Specific Actions:
- Generated predictions for the entire raw abai_geram dataset.
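A minimal sketch of this batch-prediction step with a Transformers text-classification pipeline; the model_v2 save path and the Tweet column name are assumptions, and preprocess_bert is the Step 1 function.
# Apply Model V2 to the full raw dataset.
# ASSUMPTIONS: Model V2 was saved to "model_v2", and the raw file has a "Tweet" column.
import pandas as pd
from transformers import pipeline

clf = pipeline("text-classification", model="model_v2")

full_df = pd.read_excel("abai_geram (Excel).xlsx")
full_df["Processed_Tweet"] = full_df["Tweet"].apply(preprocess_bert)   # Step 1 pipeline

predictions = clf(full_df["Processed_Tweet"].tolist(), batch_size=64, truncation=True)
full_df["Predicted_Emotion"] = [p["label"] for p in predictions]
full_df.to_csv("abai_geram_predictions.csv", index=False)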
What we did: We performed a sociological analysis on the 88,000 predictions generated by Model V2.
We analyzed how emotions fluctuated over the collected period.
We generated word clouds for every emotion category to verify the model's associations. Below are all the clouds generated:
Finally, we looked at user behavior, correlating followers with sentiment and mapping activity by hour.
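A compact sketch of these analyses, continuing from the batch-prediction sketch above; the Timestamp and Followers column names are assumptions about the raw data, as is the use of matplotlib and wordcloud.
# Sociological analysis over the predicted emotions.
# ASSUMPTIONS: full_df (from the batch-prediction sketch) has "Timestamp" and
# "Followers" columns alongside "Predicted_Emotion" and "Processed_Tweet".
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

full_df["Timestamp"] = pd.to_datetime(full_df["Timestamp"])

# 1. Emotion fluctuation over the collected period (daily counts per emotion)
daily = (full_df.groupby([full_df["Timestamp"].dt.date, "Predicted_Emotion"])
                .size().unstack(fill_value=0))
daily.plot(figsize=(12, 4), title="Daily emotion counts")
plt.savefig("emotion_over_time.png")

# 2. One word cloud per emotion category
for emotion, group in full_df.groupby("Predicted_Emotion"):
    wc = WordCloud(width=800, height=400, background_color="white")
    wc.generate(" ".join(group["Processed_Tweet"].astype(str)))
    wc.to_file(f"wordcloud_{emotion}.png")

# 3. User behaviour: follower counts per emotion, and activity by hour of day
print(full_df.groupby("Predicted_Emotion")["Followers"].describe())
hourly = full_df.groupby(full_df["Timestamp"].dt.hour).size()
hourly.plot(kind="bar", figsize=(10, 3), title="Tweets per hour of day")
plt.savefig("activity_by_hour.png")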