dc.description.abstract |
This study reports our investigation and experiments on the development of paraphrase detection
and its application in automatic plagiarism detection for Afan Oromo texts. Paraphrasing means
restating a sentence in another form, for example by replacing a keyword with a synonym, adding
a phrase to a word, or adding more detail to a particular word, in order to convey the
same message without compromising the meaning. However, with the rapid growth of digital
media and paraphrasing tools, paraphrasing also creates more opportunities to commit paraphrase
plagiarism, which is difficult to detect. Paraphrase plagiarism remains a persistent problem
because most plagiarism detection systems (many of them commercial) are designed to detect
word co-occurrences and light modifications and cannot detect heavy semantic, structural, or
paraphrase-level changes. Paraphrase detection
is a natural language processing task that involves determining the degree to which two text
segments are related, and it plays an important role in detecting paraphrase plagiarism. Paraphrase detection has
many applications in the field of natural language processing and understanding, such as machine
translation, information retrieval, and question answering. Many studies have been reported
on detecting paraphrases for resource-rich languages such as English, Chinese, German, and
French. However, to the best of the researcher's knowledge, no formal study has been reported
for resource-scarce Ethiopian languages such as Afan Oromo, Amharic, Somali, and Sidama.
Therefore, this study aimed to design and develop an automatic paraphrase
detection model for Afan Oromo texts using deep learning techniques. To this end, a dataset was
gathered and prepared from Afan Oromo documents publicly available at the Addis Ababa
University Institutional Repository. We first performed text preprocessing and data
annotation in cooperation with domain experts. We used 80% of the data to train the deep
learning models and the remaining 20% to test their performance.
The convolutional neural network model achieved an accuracy of 67% with fastText
word embeddings, which is a promising result for automatic paraphrase detection for Afan
Oromo texts. |
en_US |