Sunday, January 31, 2010
Unit #4
The articles for week #4 talk about:
1- A good searcher's skills fall into three categories:
- General principles of searching, academic skills, and system-dependent skills.
2- Profile of a good searcher
- Good communication skills and people orientation
- Self confidence
- Patience and perseverance
- Logical and flexible approach to problem solving
- Memory for details
- Good organization and efficient work habits
- Spelling, grammar and typing skills
- Motivation for additional training
- Subject area knowledge
- Willingness to share knowledge with others
3- Interaction in Information Retrieval: Selection and
Effectiveness of Search Terms
4- Methodology for Data Collection and the Data Corpus
Sunday, January 24, 2010
Muddiest point and discussion question #2:
I do not know: is creating a blog required for this course or not?
I just want to ask to make sure.
Unit #3
Notes about required readings (I took the notes directly from the chapters):
IIR section 1.4
Talks about the extended Boolean model versus ranked retrieval.
The Boolean retrieval model contrasts with ranked retrieval models such as the vector space model, in which users largely use free-text queries, that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides which documents best satisfy the query.
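The contrast can be sketched in a few lines of Python; the toy documents and the naive "count matching terms" ranking score below are my own illustration, not the chapter's:

```python
# Boolean retrieval returns an unordered set of matching documents;
# ranked retrieval scores every document and returns an ordering.
# Toy corpus (invented for illustration).
docs = {
    1: "information retrieval systems",
    2: "boolean retrieval model",
    3: "ranked retrieval with free text queries",
}

def boolean_and(query_terms, docs):
    """Boolean AND: the set of doc ids containing ALL query terms."""
    return {d for d, text in docs.items()
            if all(t in text.split() for t in query_terms)}

def ranked(query_terms, docs):
    """Naive ranking: order all docs by how many query terms they contain."""
    scores = {d: sum(t in text.split() for t in query_terms)
              for d, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(boolean_and(["retrieval", "model"], docs))  # only docs with both terms
print(ranked(["retrieval", "model"], docs))       # every doc, best first
```

Note how the Boolean query simply excludes documents 1 and 3, while the ranked query still returns them, just lower in the ordering.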
Chapter 6
Talks about scoring, term weighting, and the vector space model.
This chapter consists of three main ideas.
1. It introduces parametric and zone indexes in Section 6.1, which serve two purposes. First, they allow us to index and retrieve documents by metadata such as the language in which a document is written. Second, they give us a simple means for scoring (and thereby ranking) documents in response to a query.
2. Next, in Section 6.2 we develop the idea of weighting the importance of a term in a document, based on the statistics of occurrence of the term.
3. In Section 6.3 we show that by viewing each document as a vector of such weights, we can compute a score between a query and each document. This view is known as vector space scoring.
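The three ideas come together in vector space scoring. Here is a minimal sketch, assuming a made-up three-document corpus and the simplest tf-idf weighting (the book discusses several weighting variants; this is just one):

```python
import math

# Toy corpus (invented); each document becomes a vector of tf-idf weights.
docs = ["the cat sat", "the dog sat", "the cat and the dog"]
vocab = sorted({t for d in docs for t in d.split()})

def tf_idf_vector(text, corpus):
    """One weight per vocabulary term: term frequency * inverse doc frequency."""
    tokens = text.split()
    N = len(corpus)
    vec = []
    for term in vocab:
        tf = tokens.count(term)
        df = sum(term in d.split() for d in corpus)
        idf = math.log(N / df) if df else 0.0
        vec.append(tf * idf)
    return vec

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query_vec = tf_idf_vector("cat", docs)
for i, d in enumerate(docs):
    print(i, round(cosine(query_vec, tf_idf_vector(d, docs)), 3))
```

For the query "cat", the two documents containing "cat" score above zero and the dog-only document scores zero; note that "the" gets weight 0 because it occurs in every document (idf = log 1 = 0).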
Chapters 11 and 12
Talk about probabilistic information retrieval.
There is more than one possible retrieval model which has a probabilistic basis. Chapter 11 will introduce probability theory and the Probability Ranking Principle (Sections 11.1–11.2), and then concentrate on the Binary Independence Model (Section 11.3), which is the original and still most influential probabilistic retrieval model. Finally, it will introduce related but extended methods which use term counts, including the empirically successful Okapi BM25 weighting scheme, and Bayesian network models for IR (Section 11.4).
In Chapter 12, it then presents the alternative probabilistic language modeling approach to IR.
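A rough sketch of the Okapi BM25 scoring mentioned above; the tiny corpus is invented, the parameter values k1 = 1.5 and b = 0.75 are common illustrative defaults, and the idf form used here is one common variant, not the book's exact presentation:

```python
import math

# Toy corpus (invented), pre-tokenized into term lists.
docs = [
    "probabilistic models of information retrieval".split(),
    "the binary independence model".split(),
    "language models for information retrieval".split(),
]
N = len(docs)
avgdl = sum(len(d) for d in docs) / N  # average document length

def bm25(query, doc, k1=1.5, b=0.75):
    """Sum BM25 term weights over the query terms for one document."""
    score = 0.0
    for term in query:
        df = sum(term in d for d in docs)        # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        tf = doc.count(term)                     # term frequency in this doc
        # Saturating tf component, normalized by document length.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

for i, d in enumerate(docs):
    print(i, round(bm25(["information", "retrieval"], d), 3))
```

The key contrast with raw tf-idf is the saturating tf component: repeated occurrences of a term add less and less to the score, and longer documents are penalized via the length normalization.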
Saturday, January 16, 2010
Notes for unit #2
Required Readings:
Notes for chapters (1,2,3)
IIR sections 1.2, 1.3, chapters 2 and 3
Section 1.2 explains in steps how we can build an inverted index, and Section 1.3 explains how we can process a query using an inverted index. These two sections also provide figures that help the reader imagine how each step works.
The major steps in inverted index construction:
1. Collect the documents to be indexed.
2. Tokenize the text.
3. Do linguistic preprocessing of tokens.
4. Index the documents that each term occurs in.
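The four steps above, plus Section 1.3's query processing by intersecting postings lists, can be sketched in Python. The code is my own illustration (the toy documents echo the book's schizophrenia example), and the "linguistic preprocessing" step is reduced to simple case folding:

```python
# Toy document collection, keyed by document id.
docs = {
    1: "Breakthrough drug for schizophrenia",
    2: "New schizophrenia drug",
    3: "New approach for treatment of schizophrenia",
}

index = {}
for doc_id in sorted(docs):                # 1. collect the documents to be indexed
    tokens = docs[doc_id].split()          # 2. tokenize the text
    terms = [t.lower() for t in tokens]    # 3. linguistic preprocessing (here: just case folding)
    for term in terms:                     # 4. index the documents each term occurs in
        postings = index.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)        # postings stay sorted, no duplicates

def intersect(p1, p2):
    """Boolean AND of two postings lists of sorted doc ids."""
    p2_set = set(p2)
    return [d for d in p1 if d in p2_set]

print(index["schizophrenia"])                           # postings list for one term
print(intersect(index["new"], index["schizophrenia"]))  # docs matching "new AND schizophrenia"
```

The book's actual intersection algorithm walks both sorted lists with two pointers in linear time; the set-based version above gives the same result on this toy scale.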
Chapter (2) (The term vocabulary and postings lists)
Chapter 2 talks about how the basic unit of a document can be defined and how the character sequence that it comprises is determined. Then it defines tokenization, which is the process of chopping character streams into tokens, while linguistic preprocessing deals with building equivalence classes of tokens, which are the sets of terms that are indexed.
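A small sketch of that pipeline: chop the character stream into tokens, then map each token into its equivalence class (a term). The `normalize` rules here (case folding plus stripping a plural "s") are my own illustrative assumptions, not the book's exact algorithm:

```python
import re

def tokenize(text):
    """Chop a character stream into tokens (here: maximal letter runs)."""
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    """Map a token to its equivalence class: case-fold, strip a plural 's'.
    Deliberately crude -- real preprocessing uses stemming/lemmatization."""
    term = token.lower()
    if term.endswith("s") and len(term) > 3:
        term = term[:-1]
    return term

text = "The automobiles, the Automobile..."
tokens = tokenize(text)
terms = sorted({normalize(t) for t in tokens})
print(tokens)  # raw tokens from the character stream
print(terms)   # the indexed terms (equivalence classes)
```

The point is that "automobiles", "Automobile", and "automobile" all collapse into the single indexed term "automobile", so a query for any one of them matches documents containing the others.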
Chapter(3) Dictionaries and tolerant retrieval
- Develops data structures that help the search for terms in the vocabulary of an inverted index.
- We study the idea of a wildcard query: a query such as *a*e*i*o*u*, which seeks documents containing any term that includes all five vowels in sequence.
- Users make spelling errors either by accident, or because the term they are searching for (e.g., Herman) has no unambiguous spelling in the collection. The chapter details a number of techniques for correcting spelling errors in queries, one term at a time as well as for an entire string of query terms.
- Studies a method for seeking vocabulary terms that are phonetically close to the query term(s).
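The spelling-correction techniques in chapter 3 build on edit distance between the mistyped query term and vocabulary terms. A standard Levenshtein dynamic-programming sketch (the small vocabulary and the misspelling "hermen" are invented examples):

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                 # delete all of a's first i chars
    for j in range(n + 1):
        dp[0][j] = j                 # insert all of b's first j chars
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution (or match)
    return dp[m][n]

vocabulary = ["herman", "hermann", "german", "sherman"]
query = "hermen"
best = min(vocabulary, key=lambda term: edit_distance(query, term))
print(best, edit_distance(query, best))
```

A real system would not scan the whole vocabulary like this; the chapter's k-gram indexes exist precisely to narrow the candidate set before computing edit distances.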
------------
Notes for FOA Sections 1.2-1.5 in Chapter 1. A PDF version of Ch. 1 is on the author's site: http://www.cs.ucsd.edu/~rik/foa/.
Chapter (1) shows that there are many tools that are useful for searching collections and other media.
People use many tools of FOA:
- Language
- Writing
- But then people started to write a lot, and we ended up with so many books that people had a difficult time finding what they wanted. Therefore, people went to the library to find what they needed.
- Now people have started to use the WWW to find what they need.
- What happens between two people, when one asks a question and the other answers, is the same thing that happens between a person and a search engine.
Also, this chapter defines indexing and mentions that there are two kinds of indexing, which are:
1) manual indexing
2) automatic indexing.
Muddiest point and discussion question #1:
As you said, there is a connection between information retrieval (IR) and databases.
So don't you think that we, as librarians, should have a specific class focused just on databases and on how we can create a database and write queries?