Fast Ways to Detect Outliers

Emad Obaid Merza; Nashaat Jasim Mohammed

doi:10.51173/jt.v3i1.287

Authors

Emad Obaid Merza, Information Technology Department, Technical College of Management-Baghdad, Middle Technical University, Baghdad, Iraq.
Nashaat Jasim Mohammed, Information Technology Department, Technical College of Management-Baghdad, Middle Technical University, Baghdad, Iraq.

Keywords:

outlier, outlier detection, big data, normal distribution, Z-Score, Hampel's test

Abstract

The rapid growth of data collection has produced very large datasets, and such datasets naturally contain outliers: values that are unusually small or large compared with the rest of the data. Outliers distort the statistical analysis of the data, so their influence must be reduced. They can also be valuable in their own right, for example as indicators of the geological activity that precedes natural disasters such as earthquakes, forest fires, and floods. Outlier detection is therefore important in many fields. This research aims to develop simple methods for detecting outliers in big data, because many recently developed detection methods are computationally complex or are efficient only when the sample size is small. Using an experimental approach, we propose three methods. The first is based on the standard deviation and was tested against the normal distribution method and the Z-Score method. The second depends on the maximum and minimum values of the data, and the third depends on the range between successive data points. The results of the second and third methods were compared with those of Hampel's test. Accuracy was measured using the confusion matrix. The results of the first method agreed with those of the normal distribution method and the Z-Score method, and the third method outperformed Hampel's test. We also conclude that Hampel's test has a serious weakness when zero values make up more than 50% of the data.
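
The two reference methods named in the abstract, and the weakness reported for Hampel's test, can be illustrated with a minimal Python sketch. It is not the authors' implementation, and the three proposed methods are not reproduced because the abstract does not specify them. The cutoffs (3 for the Z-Score, 4.5 times the median absolute deviation for Hampel's test), the function names, and the sample data are all assumptions chosen for illustration.

```python
import statistics


def z_score_outliers(data, threshold=3.0):
    """Flag points whose absolute z-score exceeds `threshold` (assumed cutoff)."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    if std == 0:
        return [False] * len(data)  # constant data: nothing stands out
    return [abs(x - mean) / std > threshold for x in data]


def hampel_outliers(data, k=4.5):
    """Flag points whose deviation from the median exceeds k * MAD.

    k = 4.5 is an assumed cutoff; the abstract does not state one.
    """
    med = statistics.median(data)
    deviations = [abs(x - med) for x in data]
    mad = statistics.median(deviations)  # median absolute deviation
    return [d > k * mad for d in deviations]


def confusion_counts(predicted, actual):
    """Return (TP, FP, FN, TN) for two equal-length lists of booleans."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    return tp, fp, fn, tn


# Hypothetical data: 19 values near 10 and one obvious outlier (50).
clean = [10] * 8 + [9] * 4 + [11] * 4 + [12] * 2 + [8] + [50]
print(sum(z_score_outliers(clean)), sum(hampel_outliers(clean)))  # 1 1

# Hypothetical data in which 60% of the values are zero.
mostly_zero = [0] * 6 + [1, 2, 1, 3]
labels = [False] * 9 + [True]  # assumed ground truth: only the 3 is an outlier
flags = hampel_outliers(mostly_zero)
print(confusion_counts(flags, labels))  # (1, 3, 0, 6)
```

When more than half of the values are zero, both the median and the median absolute deviation are zero, so the Hampel cutoff collapses to zero and every non-zero value is flagged. In the second example this yields three false positives, which is consistent with the weakness described in the abstract.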

Author Biography

Assist. Prof. Dr. Nashaat Jasim Mohammed Anber, Information Technology Department, Technical College of Management-Baghdad, Middle Technical University, Baghdad, Iraq.
