Detecting Source Code Plagiarism in Student Assignment Submissions Using Clustering Techniques
DOI:
https://doi.org/10.51173/jt.v6i2.1851
Keywords:
Source Code, C++ Programming Language, Python, Plagiarism, Machine Learning
Abstract
In practical courses, graduate students are required to submit programming assignments, which are susceptible to various forms of plagiarism. Detecting plagiarized code in an academic setting is of paramount importance, given the prevalence of publications and papers. Plagiarism, defined as the unauthorized replication of written work without proper acknowledgment, has become a critical concern with the advent of information and communication technology (ICT) and the widespread availability of scholarly publications online. In addition, the extensive use of freeware text editors poses challenges for detecting source code plagiarism. Numerous studies have investigated algorithms for revealing different types of plagiarism, including source code plagiarism. In this research, we propose a strategy that combines a modified TF-IDF (Term Frequency-Inverse Document Frequency) weighting with K-means clustering, achieving a precision of 99.2%. We also explore hierarchical clustering, which yields an even higher precision of 99.5% compared to previous techniques. We implement the approach in Python with relevant libraries, providing a robust and efficient system for detecting source code plagiarism in student assignment submissions.
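To make the TF-IDF plus K-means idea concrete, the following is a minimal sketch in pure Python. It is an illustration under stated assumptions, not the authors' implementation: it uses plain smoothed TF-IDF rather than the modified weighting the abstract mentions, a simple cosine-based K-means with deterministic seeding, and four hypothetical submissions (`alice.py`, `bob.py`, `carol.py`, `dave.py`) invented for the example. All function names are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(source):
    """Split source text into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", source)


def tfidf_vectors(token_lists):
    """Build L2-normalised TF-IDF vectors, stored as sparse dicts."""
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF, so terms present in every file keep a small weight.
        vec = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, max_iter=20):
    """K-means on unit vectors, assigning each to its most similar centroid."""
    # Deterministic farthest-point seeding (an assumption of this sketch):
    # start from the first vector, then repeatedly add the vector that is
    # least similar to the centroids chosen so far.
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        nxt = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(dict(nxt))
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                continue  # keep the old centroid if a cluster empties
            total = Counter()
            for v in members:
                total.update(v)  # component-wise sum of member vectors
            norm = math.sqrt(sum(w * w for w in total.values())) or 1.0
            centroids[j] = {t: w / norm for t, w in total.items()}
    return labels


# Hypothetical submissions: two near-copies of each of two solutions.
submissions = {
    "alice.py": "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n",
    "bob.py": "def total(values):\n    s = 0\n    for v in values:\n        s += v\n    return s\n",
    "carol.py": "def reverse(text):\n    return text[::-1]\n",
    "dave.py": "def reverse(word):\n    return word[::-1]\n",
}
names = list(submissions)
vectors = tfidf_vectors([tokenize(src) for src in submissions.values()])
labels = kmeans(vectors, k=2)

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
print(clusters)
```

In this toy run the renamed-variable copies land in the same cluster because keywords, operators, and structure dominate the shared TF-IDF weight. A real deployment would more likely rely on an established library for vectorization and clustering, and would still need a reviewer to inspect the flagged groups, since cluster membership alone is not evidence of copying.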
License
Copyright (c) 2024 Raddam Sami Mehsen, Majharoddin M. Kazi, Hiren Joshi
This work is licensed under a Creative Commons Attribution 4.0 International License.