Saturday, November 14, 2009

Unit 10 Reading Notes- Web Search and OAI Protocol

1) David Hawking, Web Search Engines: Part 1 and Part 2
Part 1
-Hawking focuses on the data processing "miracle" of search engines that sort through hundreds of millions of queries every day, by examining the problems that whole-of-web search engines face, and techniques available to solve these problems
-infrastructure: large search engines operate multiple, geographically distributed data centers; services are built up from clusters of commodity PCs, the types of which depend on various factors; the total number of servers for the largest engines is estimated to be in the hundreds of thousands; clusters or individual servers can be dedicated to specialized functions (e.g. crawling, indexing, etc.); large-scale replication is required to handle the necessary throughput
-crawling algorithms: crawler initializes queue with one or more "seed" URLs; a good seed URL will link to many high quality websites; crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue, then scans content for links to other URLs and adds each one to the queue
-crawling algorithm must address the following issues:
1) speed
2) politeness
3) excluded content
4) duplicate content
5) continuous crawling
6) spam rejection
-crawlers are highly complex parallel systems communicating with millions of Web servers, and as such there are many issues involved with engineering a Web-scale crawler
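
To make the queue-based crawl described above more concrete for myself, here is a minimal sketch in Python. It is only an illustration of the basic loop, not how a real Web-scale crawler is built: the names `crawl`, `LinkExtractor`, and `FAKE_WEB` are my own, the `fetch` function is supplied by the caller, and a small in-memory dictionary stands in for real HTTP requests. Of the issues listed above it handles only duplicate URLs (through the `seen` set); speed, politeness, excluded content, and spam rejection are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed URLs.

    fetch(url) is a caller-supplied function that returns the page HTML,
    or None if the page cannot be retrieved.
    """
    queue = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)      # every URL ever queued, to avoid duplicates
    visited = []           # URLs fetched successfully, in crawl order
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}

order = crawl(["http://example.test/"], FAKE_WEB.get)
print(order)
```

Passing `fetch` in as a parameter keeps the loop separate from the networking, so the same function could be handed a real HTTP fetcher later.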

Part 2
-reviews algorithms and data structures necessary to index 400 terabytes of text on the Web and deliver high-quality results
-indexing algorithms: search engines use an inverted file to rapidly identify the documents that contain a given indexing term; the inverted file is built in two phases (scanning and inversion)
-real indexers: store additional information in the postings, such as term frequency or position; aspects of real indexers include:
-scaling up
-term look up
-compression
-phrases
-anchor text
-link popularity score
-query-independent score
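
The two indexing phases can be sketched roughly as follows. This is a toy version under my own assumptions, not the article's implementation: the function names (`scan`, `invert`, `search`) and the sample documents are made up, everything is held in memory, and none of the refinements in the list above (compression, phrases, anchor text, link scores) are included. Positions are kept in each posting, so term frequency is simply the length of the position list.

```python
import re
from collections import defaultdict


def scan(docs):
    """Scanning phase: emit one (term, doc_id, position) posting per word."""
    postings = []
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(re.findall(r"\w+", text.lower())):
            postings.append((term, doc_id, position))
    return postings


def invert(postings):
    """Inversion phase: sort postings by term, then group them by term."""
    index = defaultdict(lambda: defaultdict(list))
    for term, doc_id, position in sorted(postings):
        index[term][doc_id].append(position)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_sets = [set(index.get(term, {})) for term in terms]
    return sorted(set.intersection(*doc_sets))


# Hypothetical sample documents, identified by their list position.
DOCS = [
    "web search engines crawl the web",
    "search engines index text",
    "crawlers fetch pages",
]

index = invert(scan(DOCS))
print(search(index, "search engines"))
print(index["web"][0])
```

A query is answered by looking up each term's postings and intersecting the document sets, which is why the lookup is fast compared with scanning every document.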

Part 2 also includes an outline of the techniques real search engines use to 'speed things up' given the vast amount of information they have to sort through to produce quality results quickly

This two-part series of articles is extremely helpful in explaining the basics of how search engines function. I thought the articles were easy to read and actually pretty interesting.


2) Current developments and future trends for the OAI protocol for metadata harvesting
OAI- Open Archives Initiative
I didn't fully understand what the OAI was at first, but it became clearer as I read more of the article. The article did provide a brief explanation of the OAI, including its mission to "provide a worldwide virtual library of language resources" through the development of community-based standards for archiving. The examples provided, the Sheet Music Consortium and the National Science Digital Library, were very interesting; I had never heard of either of them before reading the article. The article gave an overview of the standards and objectives for searching OAI repositories, and of the future work needed to improve this further.
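
To get a feel for what "metadata harvesting" looks like in practice, I sketched how a harvester might build an OAI-PMH request. The `ListRecords` verb and the `metadataPrefix`, `set`, and `from` arguments come from the protocol itself, with `oai_dc` being the Dublin Core format; the repository address and the function name `list_records_url` are hypothetical, and the sketch only builds the URL without contacting any server.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # restrict the harvest to one set
    if from_date:
        params["from"] = from_date    # only records changed since this date
    return base_url + "?" + urlencode(params)


# Hypothetical repository base URL, used for illustration only.
url = list_records_url("http://repository.example.org/oai", from_date="2009-01-01")
print(url)
```

A service provider would issue requests like this against many repositories and then build its search interface over the collected records.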

3) “The Deep Web: Surfacing Hidden Value”
"Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed."
I thought this analogy for searching on the Internet was a really great one. It highlights the challenges of searching the Web, as well as the incredible content that is buried out there and could be accessed with the development of the right technologies. The list of findings that BrightPlanet published from its study of the Deep Web is somewhat astonishing; it is hard to believe that such an amazing quantity of information is available on the Web and is currently largely inaccessible to the average searcher. The thorough explanations of how search engines function, and of what sort of technology is necessary to access the Deep Web, were truly interesting and helped increase my knowledge a good deal. Also, I shared the opinion of some of my classmates that the illustrations and graphs in this article really helped make the point sink in for me.
