Efficient Hash Function for Duplicate Elimination in Dictionaries
Date issued
2009
Authors
Publisher
Slovenská technická univerzita v Bratislavě
Abstract
Fast elimination of duplicate data is needed in many areas, especially when processing
textual data. A solution to this problem was recently found for geometrical data,
using a hash function to speed up the process. A hash function is extremely
efficient when incremental elimination is required, especially for processing large data
sets. In this paper, a new construction of the hash function is presented that yields short
clusters with only a few collisions. The proposed hash function is not a perfect hash
function, but its properties are similar to those of one. The hash function
takes advantage of the relatively large amount of memory available on modern computers and
works well with large data sets.
Experiments have shown that different approaches should be used for different types
of languages, because the structures of Slavonic and Anglo-Saxon languages differ.
Therefore, tests were performed with a Czech dictionary of 2.5 million words and an
English dictionary of 130 thousand words. The algorithm was also tested on a few other
languages. Experimental results are also presented in this paper.
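
The idea of incremental duplicate elimination with a hash table can be illustrated with a minimal Python sketch. The abstract does not give the construction of the proposed hash function, so the polynomial hash below is only a generic stand-in, and the names DedupTable, add, max_cluster_length and eliminate_duplicates, as well as the default table size, are assumptions made for illustration.

```python
from typing import Iterable, List


class DedupTable:
    """Chained hash table that stores each word at most once."""

    def __init__(self, num_buckets: int = 1 << 16) -> None:
        # A large table trades memory for short clusters (bucket lists).
        self.buckets: List[List[str]] = [[] for _ in range(num_buckets)]
        self.size = 0

    def _index(self, word: str) -> int:
        # Stand-in polynomial hash over UTF-8 bytes; NOT the paper's function.
        h = 0
        for byte in word.encode("utf-8"):
            h = (h * 31 + byte) & 0xFFFFFFFF
        return h % len(self.buckets)

    def add(self, word: str) -> bool:
        """Insert word; return True if it was new, False if it is a duplicate."""
        bucket = self.buckets[self._index(word)]
        if word in bucket:  # only the words in one cluster are compared
            return False
        bucket.append(word)
        self.size += 1
        return True

    def max_cluster_length(self) -> int:
        """Length of the longest cluster, a rough measure of hash quality."""
        return max(len(bucket) for bucket in self.buckets)


def eliminate_duplicates(words: Iterable[str]) -> List[str]:
    """Return the words in first-seen order with duplicates removed."""
    table = DedupTable()
    return [word for word in words if table.add(word)]
```

Because each call to add inspects a single cluster, words can be processed one at a time as they arrive, which is what makes the elimination incremental; the shorter the clusters, the fewer string comparisons each insertion needs.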
Subject(s)
hash function, hash table, data structure
Citation
Algoritmy 2009: 18th Conference on Scientific Computing, p. 382-391.