Probabilistic latent semantic analysis


Probabilistic latent semantic analysis (PLSA), also known as probabilistic latent semantic indexing (PLSI, especially in information retrieval circles), is a statistical technique for the analysis of two-mode and co-occurrence data. PLSA evolved from latent semantic analysis, adding a sounder probabilistic model. PLSA has applications in information retrieval and filtering, natural language processing, machine learning from text, and related areas. It was introduced in 1999 by Thomas Hofmann,[1] and it is related to non-negative matrix factorization.

Standard latent semantic analysis stems from linear algebra and downsizes the occurrence tables (usually via a singular value decomposition). Probabilistic latent semantic analysis, by contrast, is based on a mixture decomposition derived from a latent class model. This results in a more principled approach which has a solid foundation in statistics.

Considering observations in the form of co-occurrences (w,d) of words and documents, PLSA models the probability of each co-occurrence as a mixture of conditionally independent multinomial distributions:

$$P(w,d) = \sum_c P(c)\, P(d \mid c)\, P(w \mid c) = P(d) \sum_c P(c \mid d)\, P(w \mid c)$$

The first formulation is the symmetric formulation, where w and d are both generated from the latent class c in similar ways (using the conditional probabilities P(d | c) and P(w | c)). The second formulation is the asymmetric formulation, where, for each document d, a latent class is chosen conditional on the document according to P(c | d), and a word is then generated from that class according to P(w | c). Although we have used words and documents in this example, the co-occurrence of any pair of discrete variables may be modelled in exactly the same way.
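The equivalence of the two formulations follows from Bayes' rule, P(c) P(d | c) = P(d) P(c | d). The following is a minimal sketch in plain Python, not a reference implementation: it draws arbitrary distributions for a toy model (2 classes, 3 documents, 4 words, all sizes chosen only for illustration), builds the joint P(w,d) both ways, and compares the results. The function and variable names are hypothetical.

```python
import random

def random_distribution(size, rng):
    """Return `size` positive numbers that sum to 1."""
    raw = [rng.random() + 0.1 for _ in range(size)]
    total = sum(raw)
    return [value / total for value in raw]

def symmetric_joint(p_c, p_d_given_c, p_w_given_c):
    """joint[w][d] = sum_c P(c) P(d|c) P(w|c)."""
    classes = range(len(p_c))
    doc_count, word_count = len(p_d_given_c[0]), len(p_w_given_c[0])
    return [[sum(p_c[c] * p_d_given_c[c][d] * p_w_given_c[c][w] for c in classes)
             for d in range(doc_count)]
            for w in range(word_count)]

def asymmetric_parameters(p_c, p_d_given_c):
    """Bayes' rule: P(d) = sum_c P(c) P(d|c) and P(c|d) = P(c) P(d|c) / P(d)."""
    classes = range(len(p_c))
    doc_count = len(p_d_given_c[0])
    p_d = [sum(p_c[c] * p_d_given_c[c][d] for c in classes)
           for d in range(doc_count)]
    p_c_given_d = [[p_c[c] * p_d_given_c[c][d] / p_d[d] for c in classes]
                   for d in range(doc_count)]
    return p_d, p_c_given_d

def asymmetric_joint(p_d, p_c_given_d, p_w_given_c):
    """joint[w][d] = P(d) * sum_c P(c|d) P(w|c)."""
    classes = range(len(p_w_given_c))
    word_count = len(p_w_given_c[0])
    return [[p_d[d] * sum(p_c_given_d[d][c] * p_w_given_c[c][w] for c in classes)
             for d in range(len(p_d))]
            for w in range(word_count)]

# Toy model with arbitrary (assumed) sizes and randomly drawn parameters.
rng = random.Random(0)
n_classes, n_docs, n_words = 2, 3, 4
p_c = random_distribution(n_classes, rng)
p_d_given_c = [random_distribution(n_docs, rng) for _ in range(n_classes)]
p_w_given_c = [random_distribution(n_words, rng) for _ in range(n_classes)]

joint_sym = symmetric_joint(p_c, p_d_given_c, p_w_given_c)
p_d, p_c_given_d = asymmetric_parameters(p_c, p_d_given_c)
joint_asym = asymmetric_joint(p_d, p_c_given_d, p_w_given_c)

# Largest entry-wise difference; should be zero up to rounding error.
max_gap = max(abs(a - b)
              for row_a, row_b in zip(joint_sym, joint_asym)
              for a, b in zip(row_a, row_b))
print(max_gap)
```

The two tables agree up to floating-point rounding, and the entries of each sum to 1, as expected of a joint distribution.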

The aspect model used in probabilistic latent semantic analysis is reported to have severe overfitting problems.[2] The number of parameters grows linearly with the number of documents. In addition, although PLSA is a generative model of the documents in the collection it is estimated on, it is not a generative model of new documents.
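The linear growth can be made concrete by counting parameters in the asymmetric formulation. The sketch below is illustrative only: the function name is hypothetical, and it assumes the convention of counting free parameters (each distribution over n outcomes contributes n - 1 values, since its entries must sum to 1).

```python
def plsa_parameter_count(n_classes, n_words, n_docs):
    """Free parameters of the asymmetric PLSA formulation (assumed convention).

    P(w|c): one distribution over words per class.
    P(c|d): one distribution over classes per document.
    P(d):   one distribution over documents.
    """
    word_params = n_classes * (n_words - 1)
    mixing_params = n_docs * (n_classes - 1)
    doc_params = n_docs - 1
    return word_params + mixing_params + doc_params

# Each extra document adds (n_classes - 1) mixing weights plus one entry of P(d).
base = plsa_parameter_count(n_classes=10, n_words=1000, n_docs=100)
one_more = plsa_parameter_count(n_classes=10, n_words=1000, n_docs=101)
print(base, one_more - base)
```

Under this convention every additional training document adds n_classes parameters, while the word distributions P(w | c) stay the same size. The per-document mixing weights P(c | d) exist only for documents in the training collection, which is also why the model does not directly describe new documents.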

PLSA may be used in a discriminative setting, via Fisher kernels.[3]

  • Hierarchical extensions:
    • Asymmetric: MASHA ("Multinomial ASymmetric Hierarchical Analysis") [4]
    • Symmetric: HPLSA ("Hierarchical Probabilistic Latent Semantic Analysis") [5]
  • Generative models: Several models, among them latent Dirichlet allocation,[2] have been developed to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents.
  • Higher-order data: Although this is rarely discussed in the scientific literature, PLSA extends naturally to higher-order data (three modes and higher), i.e., it can model co-occurrences over three or more variables. In the symmetric formulation above, this is done simply by adding conditional probability distributions for these additional variables. This is the probabilistic analogue to non-negative tensor factorisation.
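
As an illustration of the higher-order case, the symmetric formulation with one additional variable reads P(w,d,u) = sum over c of P(c) P(w | c) P(d | c) P(u | c). The following plain-Python sketch assumes a hypothetical third variable u and hand-picked toy distributions; none of the names or numbers come from the literature.

```python
from itertools import product

def three_mode_joint(p_c, p_w_given_c, p_d_given_c, p_u_given_c):
    """Return {(w, d, u): sum_c P(c) P(w|c) P(d|c) P(u|c)}."""
    classes = range(len(p_c))
    shape = (len(p_w_given_c[0]), len(p_d_given_c[0]), len(p_u_given_c[0]))
    joint = {}
    for w, d, u in product(*(range(n) for n in shape)):
        joint[(w, d, u)] = sum(
            p_c[c] * p_w_given_c[c][w] * p_d_given_c[c][d] * p_u_given_c[c][u]
            for c in classes
        )
    return joint

# Toy parameters (assumed): 2 latent classes, 2 values per observed variable.
p_c = [0.6, 0.4]
p_w_given_c = [[0.7, 0.3], [0.2, 0.8]]
p_d_given_c = [[0.5, 0.5], [0.9, 0.1]]
p_u_given_c = [[0.25, 0.75], [0.5, 0.5]]  # the added third mode

joint3 = three_mode_joint(p_c, p_w_given_c, p_d_given_c, p_u_given_c)

# Summing out u recovers the ordinary two-mode joint P(w,d).
joint2 = {}
for (w, d, u), prob in joint3.items():
    joint2[(w, d)] = joint2.get((w, d), 0.0) + prob

print(sum(joint3.values()))
```

Summing the three-mode table over u gives back the two-mode joint of the symmetric formulation, because each P(u | c) sums to 1 within its class.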

  1. ^ Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
  2. ^ Blei, David M.; Andrew Y. Ng, Michael I. Jordan (2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3: 993-1022. 
  3. ^ Thomas Hofmann, Learning the Similarity of Documents: an information-geometric approach to document retrieval and categorization, Advances in Neural Information Processing Systems 12, pp. 914-920, MIT Press, 2000
  4. ^ Alexei Vinokourov and Mark Girolami, A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections, in Information Processing and Management, 2002
  5. ^ Eric Gaussier, Cyril Goutte, Kris Popat and Francine Chen, A Hierarchical Model for Clustering and Categorising Documents, in "Advances in Information Retrieval -- Proceedings of the 24th BCS-IRSG European Colloquium on IR Research (ECIR-02)", 2002
