Inverted index

From Wikipedia, the free encyclopedia

(Redirected from Inverted file index)
Jump to: navigation, search

An inverted index (also referred to as postings file or inverted file) is an index structure storing a mapping from words to their locations in a document or a set of documents, allowing full text search. It is the most popular data structure used in document retrieval systems.

There are two main variants of inverted indexes: A record level inverted index (or inverted file index or just inverted file) contains a list of references to documents for each word. A word level inverted index (or full inverted index or inverted list) additionally contains the positions of each word within a document.[1] The latter form offers more functionality (like phrase searches), but needs more time and space to be created.

Contents

Given the texts T0 = "it is what it is", T1 = "what is it" and T2 = "it is a banana", we have the following inverted file index (where the integers in the set notation brackets refer to the subscripts of the text symbols, T0, T1 etc.):

"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

A term search for the terms "what", "is" and "it" would give the set \{0,1\} \cap \{0,1,2\} \cap \{0,1,2\} = \{0,1\}.

With the same texts, we get the following full inverted index, where the pairs are document numbers and local word numbers. Like the document numbers, local word numbers also begin with zero. So, "banana": {(2, 3)} means the word "banana" is in the third document (T2), and it is the fourth word in that document (position 3).

"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)} 
"what":   {(0, 2), (1, 0)}

If we run a phrase search for "what is it" we get hits for all the words in both document 0 and 1. But the terms occur only consecutively in document 1.

The inverted index data structure is a central component of a typical search engine indexing algorithm. A goal of a search engine implementation is to optimize the speed of the query: find the documents where word X occurs. Once a forward index is developed, which stores lists of words per document, it is next inverted to develop an inverted index. Querying the forward index would require sequential iteration through each document and to each word to verify a matching document. The time, memory, and processing resources to perform such a query are not always technically realistic. Instead of listing the words per document in the forward index, the inverted index data structure is developed which lists the documents per word.

With the inverted index created, the query can now be resolved by jumping to the word id (via random access) in the inverted index. Random access is generally regarded as being faster than sequential access.

In pre-computer times, Concordances to important books were manually assembled. These were effectively inverted indexes with a small amount of accompanying commentary, that required a tremendous amount of effort to produce.

  1. ^ Ribeiro, Berthier de Araújo Neto; Baeza-Yates, R. (1999). Modern information retrieval. Reading, Mass: Addison-Wesley Longman, 192. ISBN 0-201-39829-X. 
  • Donald Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89685-0. Pages 560–563 of section 6.5: Retrieval on Secondary Keys.
  • Justin Zobel, Alistair Moffat and Kotagiri Ramamohanarao, Inverted files versus signature files for text indexing. ACM Transactions on Database Systems (TODS), Volume 23, Issue 4 (December 1998), Pages: 453 - 490.
  • Justin Zobel and Alistair Moffat, Inverted Files for Text Search Engines . ACM Computing Surveys 38(2).

Advanced Search
Included Web Search Engines


Safe Search

close

Top Matching Results

Occasionally Search.com will highlight specialized results that are based on the context of your query. Examples of specialized results include specific links to news, images, or video.

Top Matching Results may highlight information from other Search.com pages, content from the CNET Network of sites, or third party content. The listings are based purely on relevance. Search.com does not receive payment for listings in this section but our partners that provide this data may get paid for listing these products.

Sponsored Links

This section contains paid listings which have been purchased by companies that want to have their sites appear for specific search terms and related content. These listings are administered, sorted and maintained by a third party and are not endorsed by Search.com.

Search Results

Search.com sends your search query to several search engines at one time and integrates the results into one list which has been sorted by relevance using Search.com's proprietary algorithm. You can customize the list of search engines included in your metasearch from the preferences.

The search engines that are used in your metasearch may allow companies to pay to have their Web sites included within the results. To view the Paid Inclusion policy for a specific search engine, please visit their Web site. Search.com does not accept payment or share revenue with any search engine partner for listings in this section.