Patrick Amadeus Irawan

PhD Student
MBZUAI
patrick.irawan@mbzuai.ac.ae


About Me

I’m a PhD student at MBZUAI, advised by Alham Fikri Aji. My work investigates the understanding-generation gap in multimodal models and builds solutions to close it.

Before my PhD, I worked as a Research Engineer at Singapore Management University with Chong-Wah Ngo, where I focused on multilingual and multimodal interpretation. I started this path during my Bachelor’s in Computer Science at Institut Teknologi Bandung, working with Ayu Purwarianti on synthetically scaling explainable VQA data.

My goal is to build multimodal systems that perceive, reason, and generate grounded responses over complex real-world inputs, rather than ones that get “lucky” on benchmarks by exploiting shortcuts.

Research Interests

I am interested in understanding the discrepancy between multimodal understanding and generation, and in developing methods to narrow that gap.

Multimodal models often appear to understand an image well when asked to describe it in a dominant modality (e.g., text). But when the same model is asked to use that understanding to generate in a non-dominant modality, such as images or video, it fails in multiple respects. My research starts by diagnosing where and why this breakdown happens, then develops solutions that reduce the discrepancy.

Understanding

I study how models decide what to attend to, and why they over-rely on language signals while underusing visual inputs. This over-reliance leads to shortcut learning, hallucination, and weak grounding across different settings: biased visual grounding under semantically aligned perturbations (ConfusedTourists), fragile perceptual attention exposed through counting tasks (CountingTricks), and domain gaps that degrade both high-level understanding (SeeingCulture) and the quality of elicited reasoning (Synthetic-VQA-NLE).

Generation

These failure modes point toward where to intervene. I work on post-training methods that recover missing abilities and strengthen cross-modal alignment using adaptive distillation. LinguDistill shows that language ability degrades during visual training and that distillation can recover it, confirming the gap is real and empirically addressable. On the generation side, I am currently working on world model evaluation for plan-action consistency and memory-based VLMs to achieve better grounding robustness over out-of-distribution data.

Evaluation

Measuring progress on this gap also requires evaluation setups that are faithful to real-world conditions. I design benchmarks that stress-test model behavior under distribution shifts, missing modalities, and limited resources (WorldCuisines, SEACrowd, DataRubrics), so that improvements on the generation side can be tracked reliably and at scale.

Updates

Publications

2026
  1. LinguDistill: Recovering Linguistic Ability in Vision Language Models via Selective Cross-Modal Distillation [Preprint]
    Patrick Amadeus Irawan, Elang Fuadi, Satendra Kumar, Alham Fikri Aji, Yova Kementchedjhieva
    Preprint, 2026.
    Proposes a selective distillation strategy to recover linguistic competence in VLMs without giving up multimodal capability.

  2. Vision Language Models are Confused Tourists [CVPR 2026 Findings]
    Patrick Amadeus Irawan, Ikhlasul Akmal Hanif, Muhammad Dehan Al Kautsar, Genta Indra Winata, Fajri Koto, Alham Fikri Aji
    Computer Vision and Pattern Recognition Conference (CVPR), 2026 Findings.
    Studies how VLMs misread culturally conflicting visual situations, exposing grounding failures that are invisible to standard benchmarks.

  3. M4-RAG: A Massive-Scale Multilingual Multi-Cultural Multimodal RAG [CVPR 2026]
    David Anugraha, Patrick Amadeus Irawan, Anshul Singh, En-Shiun Annie Lee, Genta Indra Winata
    Computer Vision and Pattern Recognition Conference (CVPR), 2026.
    Evaluates whether multimodal retrieval actually helps multilingual and multicultural question answering at scale, and where it fails.

2025
  1. Seeing Culture: A Benchmark for Visual Reasoning and Grounding [EMNLP 2025]
    Burak Satar, Zhixin Ma, Patrick Amadeus Irawan, Wilfried A. Mulyawan, Jing Jiang, Ee-Peng Lim, Chong-Wah Ngo
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.
    Builds a benchmark for culture-sensitive visual reasoning and grounding, pushing evaluation beyond object recognition into contextual interpretation.

  2. Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations [MRL @ EMNLP 2025]
    Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
    Multilingual Representation Learning Workshop at EMNLP, 2025.
    Introduces entropy-based crosslingual representations that treat language modeling uncertainty as an end-to-end learnable signal.

  3. WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [NAACL 2025]
    Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, and others
    North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
    Co-leads a benchmark that tests multilingual and multicultural VQA through food, culture, and visual context rather than English-centric priors.

  4. ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models [NAACL 2025]
    David Anugraha, Genta Indra Winata, Chenyue Li, Patrick Amadeus Irawan, En-Shiun Annie Lee
    North American Chapter of the Association for Computational Linguistics (NAACL), 2025.
    Predicts multilingual model performance with cheaper proxy models, reducing evaluation cost when exploring large design spaces.

  5. Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models [COLING 2025]
    Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti
    International Conference on Computational Linguistics (COLING), 2025.
    Develops a more efficient pipeline for generating VQA explanations with VLMs, improving synthetic supervision for grounded reasoning.

  6. Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability [Preprint]
    Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, and others
    Preprint, 2025.
    Proposes an automated scorecard for dataset quality and accountability, making data auditing more systematic and comparable.

2024
  1. SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages [EMNLP 2024]
    Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James V Miranda, Jennifer Santoso, Elyanah Aco, ..., Patrick Amadeus Irawan, and others
    Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
    Contributes to a multilingual multimodal data hub and benchmark suite centered on Southeast Asian languages, expanding evaluation beyond high-resource settings.

  2. Leveraging IoT and Machine Learning for Efficient Rice Stock Monitoring and Prediction [APSIPA ASC 2024]
    Nana Sutisna, Aditya Prawira Nugroho, Christopher Jeffrey, Patrick Amadeus Irawan, Rizky Ramadhana, Ronggur Mahendra, Michael Jonathan, Infall Syafalni, Trio Adiono
    Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2024.
    Applies machine learning and sensor systems to real-world stock monitoring, reflecting the engineering side of my research background.

Experience & Service

Selected Experience

Reviewing