TF–idf

TF–idf (term frequency–inverse document frequency) is a measure of the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. It refines the simple bag-of-words model by allowing the weight of each word to depend on the rest of the corpus. It has often been used as a weighting factor in information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used TF–idf. The inverse document frequency component was introduced by Karen Spärck Jones (1972), and variants of the weighting scheme have been used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

For example, consider the df (document frequency) and idf (inverse document frequency) of some words in Shakespeare's 37 plays. "Romeo", "Falstaff", and "salad" appear in very few plays, so on seeing one of these words, one can get a good idea of which play it might be. In contrast, "good" and "sweet" appear in every play and are completely uninformative as to which play it is.
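
The relationship between document frequency and idf in the Shakespeare example can be illustrated with a small sketch. It is not taken from the source: the three-document corpus is a hypothetical stand-in for the plays, the function names are illustrative, and it assumes one common variant of the weighting (raw term counts for tf, and idf computed as the base-10 logarithm of N/df, where N is the number of documents). Other variants use different logarithm bases or smoothing.

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for the plays; each document is
# already split into lowercase tokens.
docs = {
    "doc_a": "romeo loves sweet juliet".split(),
    "doc_b": "falstaff drinks sweet wine".split(),
    "doc_c": "sweet salad for the good king".split(),
}


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # set(): count a term once per document
    return df


def inverse_document_frequency(docs):
    """Assumed variant: idf(t) = log10(N / df(t))."""
    n_docs = len(docs)
    df = document_frequency(docs)
    return {term: math.log10(n_docs / count) for term, count in df.items()}


def tf_idf(docs):
    """Weight each term in each document by its raw count times its idf."""
    idf = inverse_document_frequency(docs)
    return {
        name: {term: count * idf[term] for term, count in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


df = document_frequency(docs)
idf = inverse_document_frequency(docs)
weights = tf_idf(docs)
```

Under these assumptions, a term that occurs in every document (here "sweet") has N/df equal to 1, so its idf and all of its TF–idf weights are zero, while a term confined to a single document (here "romeo") receives the largest idf in the corpus.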