Wednesday, May 23, 2012

Learning Information Retrieval 1

Hi, I'm here to talk to you about Information Retrieval (IR). I know IR sounds like some irrelevant field: maybe something librarians used to retrieve a book from the library, or maybe just getting information out of a database. Well, that is not true. IR is not an obsolete field but one that is alive and growing. As a matter of fact, a great many companies rely on IR: when you use a search engine, that's information retrieval, and so is asking Siri a question on the iPhone. IR is NOT about SELECT statements; that's data retrieval. IR is about sending a query like "Who is Batman" to an IR system, which returns an answer.

Let's get started.

Concept of Information Retrieval

IR is about using a query to search a corpus for relevant documents.

What do I mean by that?

When a user googles "who is George Washington?" (the query), it goes to Google's systems, and Google searches the web (the corpus) and returns a list of websites (the relevant documents). Now you know who George is. Technically this is not how Google works, because Google can't search the whole internet in 0.032 seconds to find the best documents, but we'll ignore the details until later.

Information retrieval breaks down into three parts:

1. The query: a list of terms that the user wants to search for.
2. The indexing: the system keeps an index, basically a database of the documents, so it can match the query against them.
3. The retrieval of documents: the system returns the documents that match the query.
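To make the three parts concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not how any real search engine works: the three-document corpus, the function names (`tokenize`, `build_index`, `retrieve`) and the "count the matching terms" ranking are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical toy corpus: document id -> text.
CORPUS = {
    1: "George Washington was the first president of the United States",
    2: "Batman is a fictional superhero who protects Gotham City",
    3: "Washington is a state in the Pacific Northwest",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def build_index(corpus):
    """Indexing: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def retrieve(query, index):
    """Retrieval: rank documents by how many distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by the smaller document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


INDEX = build_index(CORPUS)
print(retrieve("George Washington", INDEX))
```

The query "George Washington" puts document 1 first because it contains both terms, while document 3 only contains "washington". A structure like `INDEX`, which maps terms to documents, is usually called an inverted index.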

In the next episodes I will talk more about these concepts. Later on I will discuss the various models that information retrieval has, such as the boolean model, the vector space model, etc. I will even talk about PageRank, the link-analysis algorithm that Google is known for.

From the query above, "who is George Washington?", an astute reader will notice that exact matching will not work, which implies the query needs to be translated into another form. We will talk about Natural Language Processing and Machine Learning in the distant, distant future.
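Here is a small Python sketch of why exact matching fails for that query. The document text and the helper names (`exact_match`, `normalize`, `term_overlap`) are assumptions made up for this example.

```python
import string

# Hypothetical document that clearly answers the question.
DOCUMENT = "George Washington was the first president of the United States"
QUERY = "who is George Washington?"


def exact_match(query, document):
    """True only if the whole query string appears verbatim in the document."""
    return query.lower() in document.lower()


def normalize(text):
    """Lowercase, strip punctuation, and return the set of terms."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def term_overlap(query, document):
    """Terms that the query and the document have in common."""
    return normalize(query) & normalize(document)


print(exact_match(QUERY, DOCUMENT))           # False
print(sorted(term_overlap(QUERY, DOCUMENT)))  # ['george', 'washington']
```

The document never contains the literal string "who is George Washington?", so the exact match fails even though the document is obviously relevant. Once the query is broken into terms, the useful ones ("george", "washington") match.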

For now, understand that IR is not about searching a structured data collection like a database, but about searching an unstructured collection of data like the World Wide Web, which is full of webpages, images, videos, etc.

In the next part I will introduce you to query reformulation. This is a fundamental technique that most IR systems use. As the name suggests, it translates the user's query into another query for better recall, that is, so that fewer relevant documents are missed.
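As a small preview, here is one way query reformulation could look in Python. This is only a sketch under assumptions: the stop word list, the `RELATED_TERMS` table and the `reformulate` function are hypothetical, and real systems use much larger resources and smarter techniques.

```python
import string

# Hypothetical stop words: common words that carry little meaning in a query.
STOP_WORDS = {"who", "is", "the", "a", "an", "of"}

# Hypothetical table of related terms used to expand the query.
RELATED_TERMS = {
    "batman": ["dark knight", "caped crusader"],
}


def reformulate(query):
    """Drop stop words, then add related terms after each remaining term."""
    cleaned = query.lower().translate(str.maketrans("", "", string.punctuation))
    terms = [term for term in cleaned.split() if term not in STOP_WORDS]
    expanded = []
    for term in terms:
        for candidate in [term] + RELATED_TERMS.get(term, []):
            if candidate not in expanded:  # keep order, skip duplicates
                expanded.append(candidate)
    return expanded


print(reformulate("Who is Batman?"))
# ['batman', 'dark knight', 'caped crusader']
```

Adding related terms is what helps recall: a document that only says "the Dark Knight" can now match a query that only said "Batman".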
