SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages

Accepted in EMNLP 2025, SEACrowd Data Catalogue

Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people. However, prevailing AI models suffer from a significant lack of representation of text, image, and audio datasets from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of high-quality datasets, compounded by the dominance of English training data, which raises concerns about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative initiative that consolidates a comprehensive resource hub, filling the resource gap with standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of AI in SEA.

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

arXiv Review

Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models (LVLMs) through the use of language models. While existing methods for creating Vision Question-Answering with Natural Language Explanation (VQA-NLE) datasets can provide explanations, they rely heavily on human annotations that are time-consuming and costly. In this study, we propose a novel approach that leverages LVLMs to efficiently generate high-quality synthetic VQA-NLE datasets. By evaluating our synthetic data, we show how advanced prompting techniques can lead to the production of high-quality VQA-NLE data. Our findings indicate that the proposed method is up to 20x faster than human annotation, with only a minimal decrease in qualitative metrics, achieving quality that is nearly equivalent to human-annotated data. Furthermore, we show that incorporating visual prompts significantly enhances the relevance of text generation. Our study paves the way for more efficient and robust automated generation of multi-modal NLE data, offering a promising alternative to costly human annotation.

ProxyLM: Predicting Language Model Performance on Multilingual Tasks via Proxy Models

Published on arXiv

Performance prediction is a method to estimate the performance of Language Models (LMs) on various Natural Language Processing (NLP) tasks, mitigating the computational costs associated with model capacity and fine-tuning data. Our paper introduces ProxyLM, a scalable framework for predicting LM performance using proxy models in multilingual tasks. These proxy models act as surrogates, approximating the performance of the LM of interest. By leveraging proxy models, ProxyLM significantly reduces the computational overhead of task evaluations, achieving up to a 37.08x speedup compared to traditional methods, even with our smallest proxy models. Additionally, our methodology adapts to languages previously unseen by pre-trained LMs, outperforming the state of the art by 1.89x as measured by root-mean-square error (RMSE). This framework streamlines model selection, enabling efficient deployment and iterative LM enhancements without extensive computational resources.
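The idea of predicting a target LM's score from a proxy model's score, and judging the predictor by RMSE, can be illustrated with a minimal sketch. This is not the ProxyLM implementation: the single-feature least-squares regressor, the function names, and the toy scores below are all assumptions made for illustration.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(predictions, targets):
    """Root-mean-square error between predicted and observed scores."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical scores (not from the paper): one small proxy model and one
# large target model evaluated on the same four language/task settings.
proxy_scores = [10.0, 20.0, 30.0, 40.0]
target_scores = [21.0, 41.0, 61.0, 81.0]

slope, intercept = fit_line(proxy_scores, target_scores)
predictions = [slope * x + intercept for x in proxy_scores]
print(f"RMSE on the fitted settings: {rmse(predictions, target_scores):.4f}")
```

In practice the predictor would be evaluated on held-out settings rather than on the data it was fitted to; the toy example only shows the mechanics.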

Automatic Cattle Activity Monitoring for Anomaly Detection Using YOLOv5 and Seasonal Trend Loess

Published in GEMASTIK XV Data Mining Competition 2022

Livestock farming is one of the crucial sectors for a country’s food security. Unfortunately, the development of the livestock sector in Indonesia has been declining since 2016, primarily due to losses resulting from disruptions in livestock productivity. Therefore, a technological solution is needed to address this issue. In this research, we propose a solution that uses computer vision and statistical techniques to detect anomalies in the changing patterns of livestock activities, with cattle as the representative livestock. Computer vision is employed to detect cattle activities, and the results are processed into a time series of the percentage of cattle engaged in each activity. This time series is then analyzed using the Seasonal-Trend decomposition using Loess (STL) technique to identify anomalies marked by changes in activity patterns. When anomalies are detected, early warnings can be provided to farmers, enabling them to take preventive or corrective actions to safeguard livestock productivity. In this study, CCTV footage of cattle activities is used, and the detector is trained starting from a pre-trained YOLOv5 model. The model is evaluated using mean Average Precision averaged over IoU thresholds from 0.50 to 0.95 (mAP50-95). On a limited dataset (346 images), a mAP50-95 value of 0.747 is obtained.
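The step of flagging anomalies from the seasonal pattern of an activity time series can be sketched in plain Python. This is a simplified stand-in, not STL itself: it replaces Loess smoothing with a per-phase median baseline and omits the trend component. The function names, the period, the threshold, and the toy series are illustrative assumptions, not values from the study.

```python
from statistics import median

def seasonal_residuals(series, period):
    """Subtract a per-phase median baseline (a crude seasonal component)."""
    baseline = [median(series[phase::period]) for phase in range(period)]
    return [value - baseline[i % period] for i, value in enumerate(series)]

def flag_anomalies(series, period, threshold=3.0):
    """Return indices whose residual is far from typical, using a robust z-score."""
    residuals = seasonal_residuals(series, period)
    center = median(residuals)
    mad = median([abs(r - center) for r in residuals])
    # 1.4826 rescales the median absolute deviation to match a standard
    # deviation under normality; the fallback avoids division by zero.
    scale = 1.4826 * mad if mad > 0 else 1e-9
    return [i for i, r in enumerate(residuals) if abs(r - center) / scale > threshold]

# Hypothetical data: percentage of cattle observed grazing, four readings per
# day over six days, with a small alternating day-to-day fluctuation.
pattern = [20.0, 60.0, 80.0, 30.0]
series = [pattern[i % 4] + (1.0 if (i // 4) % 2 else -1.0) for i in range(24)]
series[13] = 5.0  # injected drop in activity on the fourth day

anomalies = flag_anomalies(series, period=4)
print(anomalies)
```

A per-phase median is used instead of a mean so that the injected drop does not distort the baseline it is compared against.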

Recommended citation: Khelli, et al. (2022) “Automatic Cattle Activity Monitoring for Anomaly Detection Using YOLOv5 and Seasonal Trend Loess” GEMASTIK XV.

Medical Decision Analysis Using Random Forest Classifier on High Variance Data

Published in Informatika Institut Teknologi Bandung

Medical decision-making must be effective and accountable. One way to reach a medical decision is to combine decision-making models with automation systems. Decision trees are an effective method for producing classifications with high accuracy and fast execution, owing to a tree data structure that represents the information in an exploitable form. Accuracy can be further improved with a random forest, that is, an ensemble of many decision trees, which mitigates potential errors in highly complex models. The validity of this method is discussed in depth in this paper.
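The way a random forest combines many decision trees can be sketched with a toy example. This is not the paper's implementation: each tree here is a one-split stump on a single feature, trained on a bootstrap sample, and the forest predicts by majority vote. The function names, the number of trees, and the data are illustrative assumptions.

```python
import random
from collections import Counter

def majority_label(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]

def train_stump(samples):
    """Fit a one-split decision tree on (feature, label) pairs."""
    fallback = majority_label([label for _, label in samples])
    best_errors, best_split = len(samples) + 1, None
    values = sorted({x for x, _ in samples})
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [label for x, label in samples if x <= threshold]
        right = [label for x, label in samples if x > threshold]
        left_label, right_label = majority_label(left), majority_label(right)
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if errors < best_errors:
            best_errors, best_split = errors, (threshold, left_label, right_label)
    if best_split is None:  # every sampled feature value was identical
        return lambda x: fallback
    threshold, left_label, right_label = best_split
    return lambda x: left_label if x <= threshold else right_label

def train_forest(samples, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(samples) for _ in samples]) for _ in range(n_trees)]

def forest_predict(forest, x):
    """Aggregate the individual tree predictions by majority vote."""
    return majority_label([tree(x) for tree in forest])

# Toy, well-separated data (not medical data): one feature, binary label.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0), (7.0, 1), (8.0, 1), (9.0, 1), (10.0, 1)]
forest = train_forest(data)
print(forest_predict(forest, 2.0), forest_predict(forest, 9.0))
```

Because each stump sees a different bootstrap sample, individual trees can disagree near the class boundary; voting averages out those disagreements.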

Recommended citation: Irawan, P. A. (2021) “Medical Decision Analysis Using Random Forest Classifier on High Variance Data” Informatika Rinaldi Munir.