World Library  
Flag as Inappropriate
Email this Article

Proximity search (text)

 

Proximity search (text)

In text processing, a proximity search looks for documents where two or more separately matching term occurrences are within a specified distance, where distance is the number of intermediate words or characters. In addition to proximity, some implementations may also impose a constraint on the word order, in that the order in the searched text must be identical to the order of the search query. Proximity searching goes beyond the simple matching of words by adding the constraint of proximity and is generally regarded as a form of advanced search.

For example, a search could be used to find "red brick house", and match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered or spread across a page or in unrelated articles in an anthology.

Contents

  • Rationale 1
  • Boolean syntax and operators 2
  • Usage in commercial search engines 3
  • See also 4
  • Notes 5

Rationale

The basic linguistic assumption of proximity searching is that the proximity of the words in a document implies a relationship between the words. Given that authors of documents try to formulate sentences which contain a single idea, or cluster of related ideas within neighboring sentences or organized into paragraphs, there is an inherent, relatively high, probability within the document structure that words used together are related. On the other hand, when two words are on the opposite ends of a book, the probability of a relationship between the words is relatively weak. By limiting search results to only include matches where the words are within the specified maximum proximity, or distance, the search results are assumed to be of higher relevance than the matches where the words are scattered.

Commercial internet search engines tend to produce too many matches (known as recall) for the average search query. Proximity searching is one method of reducing the number of pages matches, and to improve the relevance of the matched pages by using word proximity to assist in ranking. As an added benefit, proximity searching helps combat spamdexing by avoiding webpages which contain dictionary lists or shotgun lists of thousands of words, which would otherwise rank highly if the search engine was heavily biased toward word frequency.

Boolean syntax and operators

Note that a proximity search can designate that only some keywords must be within a specified distance. Proximity searching can be used with other search syntax and/or controls to allow more articulate search queries. Sometimes query operators like NEAR, NOT NEAR, FOLLOWED BY, NOT FOLLOWED BY, SENTENCE or FAR are used to indicate a proximity-search limit between specified keywords: for example, "brick NEAR house".

Usage in commercial search engines

In regards to implicit/automatic versus explicit proximity search, as of November 2008, most Internet search engines only implement an implicit proximity search functionality. That is, they automatically rank those search results higher where the user keywords have a good "overall proximity score" in such results. If only two keywords are in the search query, this has no difference from an explicit proximity search which puts a NEAR operator between the two keywords. However, if three or more than three keywords are present, it is often important for the user to specify which subsets of these keywords expect a proximity in search results. This is useful if the user wants to do a prior art search (e.g. finding an existing approach to complete a specific task, finding a document that discloses a system that exhibits a procedural behavior collaboratively conducted by several components and links between these components).

Web search engines which support proximity search via an explicit proximity operator in their query language include Walhello, Exalead, Yandex, Yahoo!, Altavista, and Bing:

  • When using the Walhello search-engine, the proximity can be defined by the number of characters between the keywords.[1]
  • The search engine Exalead allows the user to specify the required proximity, as the maximum number of words between keywords. The syntax is (keyword1 NEAR/n keyword2) where n is the number of words.[2]
  • Yandex uses the syntax keyword1 /n keyword2 to search for two keywords separated by at most n - 1 words, and supports a few other variations of this syntax.[3]
  • Yahoo! and Altavista both support an undocumented NEAR operator.[4][5] The syntax is keyword1 NEAR keyword2.
  • Google Search supports AROUND(#).[6]
  • Bing supports NEAR. [7] The syntax is keyword1 near:n keyword2 where n=the number of maximum separating words.

Ordered search within the Google and Yahoo! search engines is possible using the asterisk (*) full-word wildcards: in Google this matches one or more words,[8] and an in Yahoo! Search this matches exactly one word.[9] (This is easily verified by searching for the following phrase in both Google and Yahoo!: "addictive * of biblioscopy".)

To emulate unordered search of the NEAR operator can be done using a combination of ordered searches. For example, to specify a close co-occurrence of "house" and "dog", the following search-expression could be specified: "house dog" OR "dog house" OR "house * dog" OR "dog * house" OR "house * * dog" OR "dog * * house".

See also

Notes

  1. ^ "About Walhello", visited 23 December 2009
  2. ^ "Web Search Syntax", visited 23 December 2009
  3. ^ Yandex help page on query language (in Russian)
  4. ^ "Successful Yahoo! proximity query" (22 Feb 2010)
  5. ^ "Unsuccessful Yahoo! proximity query" (22 Feb 2010)
  6. ^ "GuidingTech: Meet Google Search's Little Known AROUND Operator"
  7. ^ "How to Use Bing’s Advanced Search Operators"
  8. ^ "More Google Search Help" visited 23 December 2009
  9. ^ "Review of Yahoo! Search", by Search Engine Showdown, visited 23 December 2009
This article was sourced from Creative Commons Attribution-ShareAlike License; additional terms may apply. World Heritage Encyclopedia content is assembled from numerous content providers, Open Access Publishing, and in compliance with The Fair Access to Science and Technology Research Act (FASTR), Wikimedia Foundation, Inc., Public Library of Science, The Encyclopedia of Life, Open Book Publishers (OBP), PubMed, U.S. National Library of Medicine, National Center for Biotechnology Information, U.S. National Library of Medicine, National Institutes of Health (NIH), U.S. Department of Health & Human Services, and USA.gov, which sources content from all federal, state, local, tribal, and territorial government publication portals (.gov, .mil, .edu). Funding for USA.gov and content contributors is made possible from the U.S. Congress, E-Government Act of 2002.
 
Crowd sourced content that is contributed to World Heritage Encyclopedia is peer reviewed and edited by our editorial staff to ensure quality scholarly research articles.
 
By using this site, you agree to the Terms of Use and Privacy Policy. World Heritage Encyclopedia™ is a registered trademark of the World Public Library Association, a non-profit organization.
 


Copyright © World Library Foundation. All rights reserved. eBooks from Project Gutenberg are sponsored by the World Library Foundation,
a 501c(4) Member's Support Non-Profit Organization, and is NOT affiliated with any governmental agency or department.