I am a PhD student at MBZUAI, advised by Alham Fikri Aji and Yova Kementchedjhieva. My work focuses on multimodal foundation models, post-training and alignment, grounded vision-language systems, and large-scale evaluation.
Previously, I was a Research Engineer at Singapore Management University, where I worked on multilingual and multimodal interpretation with Chong-Wah Ngo. Before that, I earned my B.Eng. in Computer Science from Institut Teknologi Bandung, where I worked with Ayu Purwarianti on synthetic data generation for explainable multimodal reasoning.
My goal is to build multimodal systems that can properly utilize multimodal evidence and interaction feedback rather than relying on modality-specific shortcuts.
Modality Utilization. I study how multimodal models allocate attention and capacity across modalities, and why they often under-utilize informative signals in favor of dominant ones. My work focuses on understanding when and why models fail to fully exploit available modalities, leading to sparse utilization, shortcut learning, hallucination, or over-reliance on language. This direction is reflected in the Synthetic-VQA-NLE framework, which enables the generation of explainable and sound synthetic VQA data; SeeingCulture, which diagnoses the lack of domain awareness that distorts visual grounding; and ConfusedTourists, which shows how contextual perturbations can trigger biased behavior in grounding systems.
Post-Training on Action-Conditioned Supervision Signals. To mitigate the above problems, I work on both training and non-training adaptations that improve cross-modal grounding and utilization. My current interests include post-training signals that recover or sharpen modality-specific abilities, distillation, and supervision schemes that could eventually support action-conditioned multimodal systems in which perception, language, and decision-making are tightly coupled. This direction is reflected most directly in LinguDistill, which uses cross-modal distillation to recover the language-centric abilities of VLMs, and also connects back to Synthetic-VQA-NLE as an earlier step toward better-grounded supervision. I am currently leading research on world-model evaluation to assess plan-action cross-modal consistency, and am also involved in research on memory-based VLMs to strengthen modality grounding and prevent modality-specific forgetting.
Large-Scale Evaluation. I also investigate model robustness under varying resource conditions and modality availability. This involves designing evaluation protocols and infrastructure that probe modality reliance, cross-modal consistency, and broad inclusivity at scale. This direction is reflected in the award-winning WorldCuisine; SEACrowd, both as a paper and through my active involvement in its organization; ProxyLM; and DataRubrics, which together study scalable benchmarking, performance prediction, and data quality assessment.
Preprint
CVPR 2026 Findings
CVPR 2026
EMNLP 2025
MRL @ EMNLP 2025
NAACL 2025
NAACL 2025
COLING 2025
Preprint
EMNLP 2024
APSIPA ASC 2024