Saturday, November 14, 2009
Unit 10 Reading Notes- Web Search and OAI Protocol
1) David Hawking, Web Search Engines: Part 1 and Part 2
Part 1
-Hawking examines the data processing "miracle" of search engines, which sort through hundreds of millions of queries every day, by laying out the problems that whole-of-web search engines face and the techniques available to solve them
-infrastructure: large search engines operate multiple, geographically distributed data centers; services are built up from clusters of commodity PCs, the types of which depend on various factors; the total number of servers for the largest engines is estimated to be in the hundreds of thousands; clusters or individual servers can be dedicated to specialized functions (e.g. crawling, indexing, etc.); large-scale replication is required to handle the necessary throughput
-crawling algorithms: the crawler initializes a queue with one or more "seed" URLs; a good seed URL will link to many high-quality websites; crawling proceeds by making an HTTP request to fetch the page at the first URL in the queue, then scanning its content for links to other URLs and adding each new one to the queue (a minimal sketch follows these notes)
-crawling algorithm must address the following issues:
1) speed
2) politeness
3) excluded content
4) duplicate content
5) continuous crawling
6) spam rejection
-crawlers are highly complex parallel systems communicating with millions of Web servers, and as such there are many issues involved with engineering a Web-scale crawler
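To make the queue-based algorithm concrete, here is a minimal sketch in Python. It is a toy under my own assumptions, not how Hawking's production crawlers work: it skips the politeness, excluded-content, duplicate-content, continuous-crawling, and spam-rejection issues listed above, and any seed URLs passed in are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects href values from <a> tags as the page is scanned."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_pages=100):
    queue = deque(seed_urls)   # frontier, initialized with seed URLs
    seen = set(seed_urls)      # crude duplicate control: never queue a URL twice
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue           # unreachable or unreadable page: move on
        fetched += 1
        parser = LinkParser()
        parser.feed(html)      # scan content for links to other URLs
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)  # add each new URL to the queue
        yield url

# usage: for page in crawl(["http://example.org/"]): print(page)
```

Since the queue is first-in-first-out, this fetches pages in breadth-first order from the seeds, which is the basic shape the article describes before all the engineering issues above come into play.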
Part 2
-reviews algorithms and data structures necessary to index 400 terabytes of text on the Web and deliver high-quality results
-indexing algorithms: search engines use an inverted file to rapidly identify documents containing query terms; the index is built in two phases, scanning and inversion (see the sketch after this list)
-real indexers store additional information in the postings, such as term frequency or position; aspects of real indexers include:
-scaling up
-term look up
-compression
-phrases
-anchor text
-link popularity score
-query-independent score
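As a toy illustration of the two-phase build (scan, then invert) and of postings that record positions, here is a short Python sketch. It is only meant to show the idea; real indexers work at terabyte scale with compression, on-disk structures, and the extra scores listed above, none of which appears here.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Two-phase build: scan documents for (term, doc_id, position)
    records, then invert so each term maps to its postings list."""
    # Phase 1: scanning -- emit one record per term occurrence
    records = []
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(text.lower().split()):
            records.append((term, doc_id, position))
    # Phase 2: inversion -- group the records by term into postings
    index = defaultdict(list)
    for term, doc_id, position in sorted(records):
        index[term].append((doc_id, position))
    return index

docs = ["web search engines index the web",
        "inverted files make search fast"]
index = build_inverted_index(docs)
print(index["search"])   # postings with positions: [(0, 1), (1, 3)]
```

The postings list is what lets the engine jump straight to the documents containing a query term instead of scanning every page, and storing positions is what makes phrase queries possible.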
Part 2 also outlines the techniques real search engines use to 'speed things up,' given the vast amount of information they must sort through to produce quality results quickly.
This two-part series of articles is extremely helpful in explaining the basics of how search engines function. I thought the articles were easy to read and actually pretty interesting.
2) Current developments and future trends for the OAI protocol for metadata harvesting
OAI- Open Archives Initiative
I didn't fully understand what the OAI was at first, but it became clearer as I read more of the article. The article did provide a brief explanation of the OAI, including its mission to "provide a worldwide virtual library of language resources" through the development of community-based standards for archiving. The examples provided of the Sheet Music Consortium and the National Science Digital Library were very interesting- I had never heard of either of these before reading the article. The article gave an overview of the standards and objectives for searching OAI repositories, and the future work necessary to further improve them.
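What helped it click for me is that the protocol itself is quite simple: a harvester issues plain HTTP requests carrying one of six verbs (Identify, ListRecords, GetRecord, and so on) and gets XML back. A minimal sketch in Python, where the base URL below is a placeholder rather than any real repository:

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint -- substitute a real repository's OAI-PMH base URL.
BASE_URL = "http://example.org/oai"

# ListRecords is one of the six OAI-PMH verbs; oai_dc (unqualified
# Dublin Core) is the metadata format every repository must support.
params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
with urlopen(BASE_URL + "?" + urlencode(params)) as response:
    print(response.read()[:500])   # start of the XML response envelope
```

That low barrier to entry is presumably why services like the Sheet Music Consortium can harvest metadata from so many different repositories.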
3) “The Deep Web: Surfacing Hidden Value”
"Searching on the Internet today can be compared to dragging a net across the surface of the ocean. While a great deal may be caught in the net, there is still a wealth of information that is deep, and therefore, missed. "
I thought this analogy for searching on the Internet was a really great one. It highlights the challenges of searching on the Web, as well as the incredible content that's buried out there and could be accessed with the development of the right technologies. The list of findings that BrightPlanet published from their study of the Deep Web is somewhat astonishing; it is hard to believe that such an amazing quantity of information is available on the web and is currently largely inaccessible to the average searcher. The thorough explanations of how search engines function, and of what sort of technology is necessary to access the Deep Web, were truly interesting and increased my knowledge a good deal. Also, I shared the opinion of some of my other classmates that the illustrations and graphs in this article really helped make the point sink in for me.
Thursday, November 5, 2009
Blog Comments- Week 9 XML
I commented on Veronica's blog here:
http://infinitetechnology.blogspot.com/2009/10/reading-notes-for-week-9-or-10.html
and Rachel's blog here:
http://knivesnmatches.blogspot.com/2009/11/readings-for-1110.html
Muddiest Point
No muddiest point from last week's class- though I'm sure I'll have plenty from next week's :)
Week 9 Readings- XML
1) Introduction to XML
2) A Survey of XML Standards
3) Extending Your Markup: an XML Tutorial
4) W3Schools XML Schema Tutorial
To be honest, I was somewhat confused about what exactly XML is and what it's used for even after reading all of the articles and tutorials. From what I understood, XML is somewhat similar to HTML except that you can use it for anything, as opposed to just using it for documents. Put another way, it's an easier way to describe hierarchical data, in a form that is both machine- and human-readable. I got very confused by all the acronyms the articles threw out- what I got of DTD is that it's an important companion to XML because it tells you what format to expect and lets you validate that a document is in the proper form. SGML I did not understand exactly, except that it was a precursor to XML and HTML. I thought all the articles and tutorials were thorough, but I don't feel like I had enough background knowledge to fully understand them. Maybe I'm wrong, but I felt like XML is supposed to be simple but seems to often be misused (as the articles kept stressing the differences between XML and "PROPER" XML), possibly because it is so strict. I'm looking forward to next week's lecture to learn more about what XML is and how it can be used.
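For what it's worth, here is a tiny made-up example of what the tutorials describe: an XML snippet whose nested tags carry hierarchical data, plus a few lines of Python (using the standard library's xml.etree module) that read it back. The catalog/book structure is purely hypothetical.

```python
import xml.etree.ElementTree as ET

# A made-up XML document: the tag names are invented to fit the data,
# and the nesting expresses the hierarchy (a catalog contains books).
xml_doc = """<?xml version="1.0"?>
<catalog>
  <book id="b1">
    <title>Introduction to XML</title>
    <author>Example Author</author>
  </book>
</catalog>"""

root = ET.fromstring(xml_doc)   # raises ParseError if not well-formed
for book in root.findall("book"):
    print(book.get("id"), book.find("title").text)   # -> b1 Introduction to XML
```

Note that this only checks that the document is well-formed (the syntax rules are obeyed); checking that it is valid, i.e. that it actually has the structure a DTD or XML Schema promises, is the separate step the articles kept stressing.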