World Library  
Flag as Inappropriate
Email this Article

String metric

Article Id: WHEBN0009809306
Reproduction Date:

Title: String metric  
Author: World Heritage Encyclopedia
Language: English
Subject: Levenshtein distance, Damerau–Levenshtein distance, Approximate string matching, Most frequent k characters, Apostolico–Giancarlo algorithm
Collection: Metrics, String Similarity Measures
Publisher: World Heritage Encyclopedia
Publication
Date:
 

String metric

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures distance ("inverse similarity") between two text strings for approximate string matching or comparison and in fuzzy string searching. A necessary requirement for a string metric (e.g. in contrast to string matching) is fulfillment of the triangle inequality. For example the strings "Sam" and "Samuel" can be considered to be close. A string metric provides a number indicating an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the Levenshtein Distance (also known as Edit Distance). It operates between two input strings, returning a number equivalent to the number of substitutions and deletions needed in order to transform one input string into another. Simplistic string metrics such as Levenshtein distance have expanded to include phonetic, token, grammatical and character-based methods of statistical comparisons.

String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, Web interfaces, e.g. Ajax-style suggestions as you type, data integration, and semantic knowledge integration.

Contents

  • List of string metrics 1
  • Selected string measures examples 2
  • References 3
  • External links 4

List of string metrics

Selected string measures examples

Name Example
Hamming distance "karolin" and "kathrin" is 3.
Levenshtein distance and Damerau–Levenshtein distance kitten and sitting have a distance of 3.
  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end).
Jaro–Winkler distance JaroWinklerDist("MARTHA","MARHTA")=d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-\frac{2}{2}}{6}\right) = 0.944
  • m is the number of matching characters;
  • t is half the number of transpositions("MARTHA"[3]!=H, "MARHTA"[3]!=T).
Most frequent k characters MostFreqKeySimilarity('research', 'seeking', 2) = 2

References

  1. ^ Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks.". pp. 73–78. 

External links

  • http://www.dcs.shef.ac.uk/~sam/stringmetrics.html A fairly complete overview Archive copy at the Wayback Machine
  • Carnegie Mellon University open source library
  • StringMetric project a Scala library of string metrics and phonetic algorithms
  • Natural project a JavaScript natural language processing library which includes implementations of popular string metrics
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and USA.gov, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for USA.gov and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
 
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
 
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.
 


Copyright © World Library Foundation. All rights reserved. eBooks from Project Gutenberg are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.