Abstract:
Paraphrase identification (PI) is the task of automatically determining
whether two sentences are similar enough in meaning to be classified as
paraphrases. It is usually framed as a binary classification problem,
although the criteria for semantic equivalence (that is, the same or almost
the same meaning) are difficult to define precisely and can vary from task
to task. A paraphrase is an alternative expression with the same (or
similar) meaning. For example, "መርሳት" is a paraphrased form of
"ማስታስአሇመቻሌ". The identification of paraphrases and the degree of their
semantic similarity have proven useful in many NLP applications (Erfaneh
Gharavi, Kayvan Bijari and Kiarash Zahirnia, 2017). For example, it can be
used as a feature to enhance other NLP tasks such as information retrieval,
machine translation scoring, text summarization, and question answering.
Although many paraphrase identification systems have been developed for
texts in various natural languages, no such research has yet been conducted
for the Amharic language.
The proposed approach considers two word embedding methods, word2vec and
fastText, and three deep learning models, namely BiLSTM-GRN, a Siamese
network, and a Feature Fusion Network (FFN), to detect paraphrased
sentences automatically and to compare the accuracy of the models. It is
intended to help people detect paraphrased sentences accurately and
quickly, in order to avoid duplicate sentences that entail the same meaning
and to detect plagiarism.
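To make the Siamese idea concrete, the following is a minimal sketch, not the model evaluated in this work. It assumes a shared encoder that simply mean-pools word vectors (a trained Siamese network would use learned layers instead), toy embeddings in place of word2vec or fastText vectors, and a hypothetical cosine-similarity threshold of 0.8. The English tokens, the vector values, and the function names are all invented for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for pretrained word2vec/fastText
# embeddings. Tokens and values are invented for illustration only.
TOY_EMBEDDINGS = {
    "forget": [0.9, 0.1, 0.0],
    "cannot": [0.5, 0.2, 0.1],
    "remember": [0.8, 0.3, 0.0],
    "rain": [0.0, 0.1, 0.9],
}
EMBEDDING_DIM = 3


def encode(tokens):
    """Shared encoder: mean-pool the embeddings of the known tokens.

    Both sentences go through this same function, which is the defining
    property of a Siamese architecture (shared parameters in both branches).
    """
    vectors = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vectors:
        return [0.0] * EMBEDDING_DIM
    return [sum(v[i] for v in vectors) / len(vectors)
            for i in range(EMBEDDING_DIM)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def is_paraphrase(tokens_a, tokens_b, threshold=0.8):
    """Binary decision: paraphrase if the encoded sentences are close enough.

    The threshold is an assumed value, not one reported in this work.
    """
    return cosine(encode(tokens_a), encode(tokens_b)) >= threshold
```

In a trained model the encoder weights and the decision boundary would be learned from the annotated sentence pairs rather than fixed by hand as they are here.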
Since there is no publicly available Amharic paraphrase dataset, the data
used in this work were gathered from the publicly available online Addis
Ababa University Institutional Repository, which contains a collection of
Amharic-language Master of Arts theses. The prepared dataset consists of
sentence pairs annotated with the help of a linguistic expert in the
domain. 80% of the data is used to train and develop the deep learning
models, and the remaining 20% is used to test their performance. The
Siamese neural network model achieved an accuracy of 0.9583 with fastText
word embeddings, a promising result for automatic paraphrase detection in
Amharic, outperforming the BiLSTM-GRN and FFN models.
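The 80/20 split and the accuracy figure can be illustrated with a short sketch. It assumes each example is a (sentence_a, sentence_b, label) tuple and that the data are shuffled before splitting. The abstract does not state how the split was drawn, so the shuffling, the seed, and the function names are hypothetical.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle labelled sentence pairs and split them into train and test.

    Shuffling with a fixed seed is an assumption made for reproducibility.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(gold_labels, predicted_labels):
    """Fraction of test pairs whose predicted label matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)
```

For example, calling `split_pairs` on 10 labelled pairs yields 8 training pairs and 2 test pairs, and `accuracy` is then computed only on predictions for the held-out test pairs.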