From the introduction we know that the query is the text that the user gives to the system to search for.
But in information retrieval, the text we enter is not necessarily what the system reads. For example, we might enter the query "What is love but a word among words. It is just an imaginary concept created by the feeble." but the system may remove stop words, stem it, and tokenize it into something like "love word amon imag concept creat fee". The system will then search the index for these words instead of the original ones.
Why would the system break up the query, one might ask?
Well, it turns out it is easier to find and return the relevant documents if some of the noise is removed from the query.
Some words, like "is", "it", "be", "and", and "was", do not add any meaning to the query. A query X with the text "Who is George Washington" has the same meaning as the query Y "George Washington". Both queries tell the system to search for "George Washington", but query X carries extra function words that only matter to humans.
In some cases these words can even hurt the retrieval of important documents. Say a simple document D has the text "George Washington was the first president of The United States of America." All terms in query Y are found in D, while only some terms in X match D. We will see in the Boolean model that D would be considered not relevant under query X but relevant under query Y.
These unnecessary words are called stop words, and most systems remove them from both the query and the system's index.
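To make this concrete, here is a tiny sketch of strict Boolean AND matching, where a document counts as relevant only if it contains every query term. The function name and the matching rules are my own illustration, not any particular system's implementation.

```python
def boolean_match(query, document):
    """Strict AND matching: every query term must appear in the document."""
    doc_terms = set(document.lower().replace(".", "").split())
    return all(term.lower() in doc_terms for term in query.split())

D = "George Washington was the first president of The United States of America."

print(boolean_match("Who is George Washington", D))  # False: "who" and "is" are not in D
print(boolean_match("George Washington", D))         # True: both terms are in D
```

Under this model the stop words in query X sink the match, exactly as described above.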
Here are some stop words that Google removes from its searches: StopWords
OK, the words from the link may not be the exact stop words Google removes, but they are close enough.
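A minimal stop-word filter might look like the sketch below. The word list here is a small illustrative sample, not Google's actual list.

```python
# A small illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"who", "is", "it", "be", "and", "was", "the", "a", "of"}

def remove_stop_words(query):
    """Lowercase the query and drop any term found in the stop-word list."""
    return [t for t in query.lower().split() if t not in STOP_WORDS]

print(remove_stop_words("Who is George Washington"))  # ['george', 'washington']
```

After this step, queries X and Y from above become identical.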
There is also stemming: the removal of affixes from words to produce a stem. Affixes such as suffixes and prefixes are stripped from the word; if a word is plural, the 's' may be removed.
For example:
cats => cat
connections => connect
connection => connect
mice => mice
houses => hous
precaution => caut
Notice that some of the stemmed words are not actual words (hous, caut). This depends on the stemmer used. Words can be stemmed using lookup tables, n-grams, or algorithms.
If an algorithm is used, the system can produce words like hous and caut. The system may just follow a simple set of rules: if the word ends in 'es', remove it; if the word ends in 'ly', remove it; and so on.
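A toy suffix-stripping stemmer following that kind of rule set could look like this. Real stemmers such as Porter's apply far more careful conditions; this sketch happily produces non-words like "hous", just as described. The suffix list is my own illustrative choice.

```python
# Suffixes are tried longest-first so "sses" is stripped before "s".
SUFFIXES = ["sses", "ies", "es", "s", "ly", "ing", "tion"]

def simple_stem(word):
    """Strip the first matching suffix, keeping a stem of reasonable length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(simple_stem("houses"))   # hous
print(simple_stem("quickly"))  # quick
print(simple_stem("cats"))     # cat
```

The length check is what keeps short words like "is" from being mangled into nothing.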
Non-lexical stems like hous and whit do not hurt the search, because the index is stemmed as well. So if someone searches for "white house" and it is stemmed to "whit hous", all references to the words "white house" in the index are also stemmed to "whit hous". The "whit hous" in the query matches the "whit hous" in the index. Happy.
One famous algorithmic stemmer is the Porter stemmer. You can try an online version here.
Lookup tables, on the other hand, are like dictionaries: the word is looked up, but a stem is returned instead of a meaning. If the word cannot be found in the lookup table, the original word is used.
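A lookup-table stemmer is just a dictionary with a fallback. The table entries here are illustrative examples, not a real stemming table.

```python
# Illustrative word-to-stem table; a real one would hold thousands of entries.
STEM_TABLE = {"mice": "mouse", "connections": "connect", "ran": "run"}

def table_stem(word):
    """Return the table's stem, or the original word when it is not listed."""
    return STEM_TABLE.get(word, word)

print(table_stem("mice"))     # mouse
print(table_stem("giraffe"))  # giraffe (not in the table, so unchanged)
```

Note how this approach handles irregular forms like "mice", which rule-based stemmers typically leave untouched.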
N-grams are sequences of words or letters, where n is the number of elements. I won't go into n-grams here; the topic is far too broad. But basically, if a query term like "crayone" is misspelt and the vocabulary contains "crayons", the system may use a probability or similarity measure to decide whether to replace the original word.
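One crude way to do that comparison is character-bigram overlap: break each word into two-letter chunks and pick the vocabulary word sharing the most chunks with the misspelt term. The Dice coefficient used below is one simple similarity choice among many, and the tiny vocabulary is made up for illustration.

```python
def bigrams(word):
    """Set of two-character chunks, e.g. 'cat' -> {'ca', 'at'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (total bigrams in each word)."""
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

vocab = ["crayons", "canyon", "rayon"]
best = max(vocab, key=lambda w: dice("crayone", w))
print(best)  # crayons
```

Real systems combine this kind of similarity with term frequencies and context before rewriting a query term.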
There are other ways to change the query, such as lemmatization.
So far, all the changes I have talked about are part of query reformulation.
Next we'll talk about query expansion, which is adding more words to the query to expand its search.
Wednesday, May 23, 2012
Learning Information Retrieval 1
Hi, I'm here to talk to you about Information Retrieval (IR). I know IR sounds like some irrelevant field: maybe librarians used it to retrieve a book from the library, or maybe IR is getting information from a database. Well, that is not true. IR is not an obsolete field but one that is alive and growing. As a matter of fact, companies use IR everywhere: when you use a search engine, that's information retrieval, and so is Siri on the iPhone. IR is NOT about SELECT statements; that's data retrieval. IR is about sending queries like "Who is Batman" to an IR system, which returns an answer.
Let's get started.
Concept of Information Retrieval
IR is about using a query to search a corpus for relevant documents.
What do I mean by that?
When a user googles "who is George Washington?" (the query), it goes to Google's systems, Google searches the web (the corpus), and a list of websites (the relevant documents) is returned. Now you know who George is. Technically this is not how Google works, because Google can't search the whole internet in 0.032 seconds to find the best documents, but we'll ignore the details until later.
Information retrieval is broken down in three parts:
1. The query: a list of terms that the user wants to search for.
2. The indexing: the system keeps an index, basically a database, so it can match the query against documents.
3. The retrieval of documents: Self explanatory.
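The three parts can be sketched in miniature: build an inverted index from a toy corpus, then retrieve the documents matching the query's terms. The corpus, document IDs, and OR-style retrieval rule are all illustrative assumptions, not how any production engine works.

```python
from collections import defaultdict

corpus = {
    1: "george washington first president",
    2: "batman dark knight",
    3: "washington state capital",
}

# Indexing: map each term to the set of documents containing it.
index = defaultdict(set)
for doc_id, text in corpus.items():
    for term in text.split():
        index[term].add(doc_id)

# Retrieval: collect every document matching any term of the query.
query = "george washington"
results = set().union(*(index[t] for t in query.split()))
print(sorted(results))  # [1, 3]
```

Document 3 shows why ranking matters: it matches "washington" but has nothing to do with George.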
In the next episodes I will talk more about these concepts. Later on I will discuss the various models that information retrieval has, such as the Boolean model, the vector space model, etc. I will even talk about PageRank, the ranking model that Google uses.
And from the above query "who is George Washington?", an astute reader will notice that exact matching will not work, which implies the query needs to be translated into another form. We will talk about Natural Language Processing and Machine Learning in the distant, distant future.
For now, understand that the point of IR is not to search a structured data collection like a database, but to search any unstructured collection of data like the World Wide Web, which is full of webpages, images, videos, etc.
In the next part I will introduce you to query reformulation. This is a fundamental technique that all IR systems use. As the name suggests, it translates the user's query into another query for better recall.