DOI: 10.5593/SGEM2014/B21/S7.035


E. Sharapova
Wednesday 1 October 2014 by Libadmin2014

References: 14th International Multidisciplinary Scientific GeoConference SGEM 2014, www.sgem.org, SGEM2014 Conference Proceedings, ISBN 978-619-7105-10-0 / ISSN 1314-2704, June 19-25, 2014, Book 2, Vol. 1, 273-278 pp

This paper considers the problem of fuzzy duplicate detection. We outline the basic approaches to detecting duplicate texts and review the existing methods of fuzzy duplicate detection, the best known of which is the shingle method. We then present an algorithm for fuzzy duplicate detection. The text of a document is replaced by a filtered copy: HTML tags, punctuation marks, special characters, and stop words are removed, replacement characters are processed, and the words are stemmed. The filtered text is divided into word sequences of fixed length, and an MD5 hash is computed for each sequence. The number of matching MD5 hashes measures how closely two documents match. If all hashes in the two documents match, the documents are full duplicates. If no hashes match, the documents are different. If only some of the hashes match, the documents are fuzzy duplicates. The algorithm was implemented in the AVTOR.NET system. The paper reports the results of testing the algorithm, which showed good results for all types of tests.
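The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the AVTOR.NET implementation. The stop-word list, the sequence length of 3, the use of overlapping sequences, and the function names `filter_text`, `shingle_hashes`, and `classify` are all assumptions made for illustration. Stemming and the processing of replacement characters are left out because the abstract does not say how they are done.

```python
import hashlib
import re

# Illustrative stop-word list (an assumption; the abstract does not give one).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "to"}


def filter_text(raw):
    """Return the words that remain after filtering the raw document text."""
    text = re.sub(r"<[^>]+>", " ", raw)             # remove HTML tags
    text = re.sub(r"[^\w\s]|_", " ", text.lower())  # remove punctuation and special characters
    return [word for word in text.split() if word not in STOP_WORDS]


def shingle_hashes(words, length=3):
    """Return the set of MD5 hex digests of all fixed-length word sequences.

    Overlapping sequences (a sliding window) are an assumption of this sketch.
    """
    if not words:
        return set()
    if len(words) < length:
        runs = [words]  # a short text yields a single, shorter sequence
    else:
        runs = [words[i:i + length] for i in range(len(words) - length + 1)]
    return {hashlib.md5(" ".join(run).encode("utf-8")).hexdigest() for run in runs}


def classify(doc_a, doc_b, length=3):
    """Compare two documents and return (verdict, number of shared hashes)."""
    hashes_a = shingle_hashes(filter_text(doc_a), length)
    hashes_b = shingle_hashes(filter_text(doc_b), length)
    shared = len(hashes_a & hashes_b)
    if shared == 0:
        return "different", 0
    if hashes_a == hashes_b:
        return "full duplicate", shared
    return "fuzzy duplicate", shared


if __name__ == "__main__":
    print(classify("<p>The quick brown fox jumps over the lazy dog.</p>",
                   "Quick brown fox jumps over lazy dog!"))
    print(classify("quick brown fox jumps over lazy dog",
                   "quick brown fox jumps over sleepy cat"))
```

Because the hashes are collected into sets, repeated sequences within a document count once. A production system would also need a threshold or a normalized score to decide how many shared hashes make two texts fuzzy duplicates, a choice the abstract does not specify.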

Keywords: fuzzy duplicate, text, duplicate detection.


© Copyright 2001 International Multidisciplinary Scientific GeoConference & EXPO SGEM. All Rights Reserved.

Creative Commons License