IQAD: Iraqi Arabic Dialect Dataset for Multi-Regional Dialect Classification Using Conventional and Machine Learning Approaches
DOI:
https://doi.org/10.51173/jt.v7i3.2695
Keywords:
Machine Learning, Dialect, Language, Classification, SVM
Abstract
The main contribution of this work is a dataset for identifying Iraqi Arabic dialects from written text. As Iraqi dialectal Arabic becomes more common on social media platforms, accurate dialect identification has become an important step for tasks such as sentiment analysis, social media monitoring, and linguistic studies. We collected, annotated, and prepared 53,146 unique text samples from social media, divided into three major Iraqi dialects: Middle, Western, and Southern. The corpus contains 78,582 unique tokens. The dataset was preprocessed to clean and prepare it for classification tasks. To verify its quality, we ran experiments with two classification approaches: a dictionary-based method and a TF-IDF-based SVM classifier. The SVM outperformed the dictionary-based classifier, achieving 74% accuracy and F1-score, whereas the dictionary-based classifier peaked at 63.6% accuracy and a 63.4% F1-score. The results show that the dataset supports dialect classification and has potential for use in future Iraqi Arabic NLP applications and research.
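As a rough illustration of the dictionary-based approach and of the two reported metrics, the sketch below scores a text against one word list per dialect and then computes accuracy and F1. It is a minimal sketch under stated assumptions, not the authors' implementation: the lexicon entries are placeholder tokens rather than real dialect vocabulary, the fallback label for texts with no matches is arbitrary, and macro averaging of F1 is assumed because the abstract does not say which averaging was used.

```python
# Hypothetical lexicons: placeholder tokens, NOT real Iraqi dialect vocabulary.
LEXICONS = {
    "Middle": {"mid_a", "mid_b"},
    "Western": {"west_a", "west_b"},
    "Southern": {"south_a", "south_b"},
}


def classify(text, lexicons=LEXICONS, default="Middle"):
    """Return the dialect whose lexicon matches the most tokens in `text`.

    Ties go to the label listed first in `lexicons`; texts with no
    matching token get `default` (an arbitrary choice for this sketch).
    """
    tokens = text.split()
    scores = {
        label: sum(tok in words for tok in tokens)
        for label, words in lexicons.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (macro averaging is an assumption)."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labelled samples built from the placeholder tokens above.
samples = [
    ("mid_a hello mid_b", "Middle"),
    ("west_a west_b x", "Western"),
    ("south_a x", "Southern"),
    ("x west_a south_a south_b", "Southern"),
    ("no cue words", "Western"),  # no lexicon hit, so the default label is used
]
y_true = [label for _, label in samples]
y_pred = [classify(text) for text, _ in samples]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

A lexicon lookup of this kind ignores word weighting and context, which is one plausible reason a TF-IDF-based SVM can score higher on the same data.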
License
Copyright (c) 2025 Noora Aljubouri, Naderi Hassan

This work is licensed under a Creative Commons Attribution 4.0 International License.










