Search Engine
Introduction
In this lab you will be creating a basic web search engine. You will be given a web crawler program that collects information such as the title and keywords from the pages on a given web site and produces an index. An index in our case is simply a text file storing information about each page that is visited. Your search engine will read the information that the web crawler saved in the index and reorganize it to make it easier to search. Your program will then allow the user to perform keyword searches to find pages. The pages that are found will then be displayed for the user in an order determined by a highly simplified version of Google's PageRank algorithm.
Getting Started
Update the 132Labs project from the SVN repository. Instructions for updating the labs project can be found under Updating the Labs Project from the SVN Repository in the How-To Document for the course.
In addition to updating your 132Labs project you will also need to add some library code to the build path of your project. To add the necessary libraries use the following steps:
Open the libs package within the lab08.searching package.
Right-click the htmlparser.jar file, select "Build Path" and choose "Add to Build Path".
Right-click the htmllexer.jar file, select "Build Path" and choose "Add to Build Path".
Right-click the norbert-0.3.2.jar file, select "Build Path" and choose "Add to Build Path".
Background
Web Page Basics
Every page on the web is identified by a uniform resource locator (URL). For example, my home page has the URL:
http://users.dickinson.edu/~braught/index.html
The http:// part identifies the protocol that will be used to transmit the page (http = hypertext transfer protocol). The users.dickinson.edu part identifies the web server on which the page is located, and the ~braught/index.html part identifies the directory and file that contain the web page.
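The three parts of a URL described above can also be pulled apart in code. The following is a minimal sketch using the standard java.net.URI class; the class name UrlParts and its method names are made up for this illustration and are not part of the lab code.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of the lab's classes.
class UrlParts {
    // The protocol (called the "scheme" by java.net.URI), e.g. "http".
    static String protocol(String url) {
        return URI.create(url).getScheme();
    }

    // The web server (host) on which the page is located.
    static String server(String url) {
        return URI.create(url).getHost();
    }

    // The directory and file on the server, including the leading "/".
    static String path(String url) {
        return URI.create(url).getPath();
    }

    public static void main(String[] args) {
        String url = "http://users.dickinson.edu/~braught/index.html";
        System.out.println("protocol: " + protocol(url));
        System.out.println("server:   " + server(url));
        System.out.println("path:     " + path(url));
    }
}
```

Note that getPath() keeps the leading slash, so the directory and file come back as /~braught/index.html.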
The content of a web page is written in a language called Hypertext Markup Language (HTML). The structure of a basic web page is shown below:
The main text of the page goes in the body. There can be links to other pages in the body as well.
The key feature of HTML pages is that their content is labeled using tags. In HTML, tags begin with < and end with > and usually come in pairs. For example, the tag <TITLE> indicates the start of the title for the page and the tag </TITLE> indicates the end of the title. The tag pair that begins with <A HREF="..."> and ends with </A> indicates a clickable link to another page. The string following HREF= is the URL of the page to which the link goes. The text between the <A HREF="..."> and </A> tags is the clickable link text that appears in the browser.
Web Crawlers
A web crawler is a program that downloads a page from the web, examines its contents, follows the links that it contains and constructs an index of the search terms that are relevant to each page. The web crawler that you have been given for this lab uses some of the tags described above to identify words that may be useful in finding the pages relevant to the search terms entered by a user. Specifically, our web crawler identifies each word in the <TITLE> and each keyword listed in a <META> tag as search terms for the page. For example, if the above page were indexed the search terms would be "The", "title", "goes", "here", "list", and "keywords". In addition, if the web crawler finds any links to a page, it also identifies each word in the link text as a search term for the page to which the link points. In the above example, the words "links", "to", "other" and "pages" would be identified as search terms for the page with the URL http://www.mysite.org/stuff.html.
The web crawler that you are given collects the search terms for each page that it visits and writes the information for each page to a text file. The text file contains 4 lines for each URL visited by the web crawler. An example of these lines for two of the pages on Dickinson's web site is shown here:
The first line gives the title of the page. The second line is the URL of the page. The third line contains a comma-delimited list of all of the search terms that the web crawler has associated with the page. The fourth line is the number of incoming links to the page (i.e. links from other pages to this page).
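To make the four-line record format concrete, here is a minimal sketch of how such records might be read. The class IndexRecordSketch, its Entry holder and the sample values in main are all made up for this illustration; they are not the lab's PageInfo or KeyWordList classes, and the sketch assumes there are no blank lines between records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the four-line record format; not the lab's classes.
class IndexRecordSketch {
    // Simple holder for one record (title, URL, search terms, link count).
    static class Entry {
        final String title;
        final String url;
        final List<String> terms;
        final int linkCount;

        Entry(String title, String url, List<String> terms, int linkCount) {
            this.title = title;
            this.url = url;
            this.terms = terms;
            this.linkCount = linkCount;
        }
    }

    // Converts the lines of an index file into entries, four lines at a time.
    static List<Entry> parse(List<String> lines) {
        if (lines.size() % 4 != 0) {
            throw new IllegalArgumentException("Expected 4 lines per page");
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += 4) {
            String title = lines.get(i);
            String url = lines.get(i + 1);
            String termLine = lines.get(i + 2).trim();
            // Split the comma-delimited terms, ignoring spaces around commas.
            List<String> terms = termLine.isEmpty()
                    ? new ArrayList<String>()
                    : Arrays.asList(termLine.split("\\s*,\\s*"));
            int linkCount = Integer.parseInt(lines.get(i + 3).trim());
            entries.add(new Entry(title, url, terms, linkCount));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Made-up sample record, used only to exercise the parser.
        List<String> lines = Arrays.asList(
                "Example Page", "http://www.example.org/", "example, page", "3");
        for (Entry e : parse(lines)) {
            System.out.println(e.title + " | " + e.url + " | "
                    + e.terms + " | " + e.linkCount);
        }
    }
}
```

For a real index file, the lines could be obtained with Files.readAllLines from java.nio.file, or read one at a time with a Scanner or BufferedReader.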
To run the provided web crawler:
Open the webcrawler package in the lab08.searching package in your 132Labs project.
Run the WebCrawler.java file.
When the web crawler has finished it will write the index file to your 132Labs project folder. The name of the file will be the URL you entered followed by ".index". For example, if you enter www.dickinson.edu the index file will be www.dickinson.edu.index. You should be sure to run the web crawler once and open the resulting index file with a text editor.
Sorting Search Results: PageRank
When the results of a search are returned to the user, ideally they will appear in order from the most likely to be relevant to the least likely to be relevant. Techniques for sorting search results to achieve this goal are an area of significant research by companies such as Google, Microsoft, Yahoo, IBM and many others. Much of Google's early success came because its search engine came much closer to this ideal ordering of search results than the others did. One of the key components of the algorithm that Google uses to order its search results is called PageRank:
"PageRank relies on the uniquely democratic nature of the web by using its vast link structure as an indicator of an individual page's value. In essence, Google interprets a link from page A to page B as a vote, by page A, for page B." (Wikipedia).
"PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value." (Google)
We will be using a highly simplified version of the actual PageRank algorithm to sort the results of our searches. As mentioned above, the fourth line of our index entry for each page contains the incoming link count for the page. In the terminology of PageRank, the incoming link count is the number of votes that the page received. Your search results will be presented in order of decreasing incoming link count (i.e. the pages with the most votes first.)
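The voting idea can be illustrated with a small sketch. In the lab the web crawler has already counted the incoming links for you, so the code below is only meant to show where such counts come from; the class VoteCountSketch and the page names A, B and C are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "links as votes"; not part of the lab code.
class VoteCountSketch {
    // Each link is a pair {fromPage, toPage}; a link is one vote for toPage.
    static Map<String, Integer> countVotes(String[][] links) {
        Map<String, Integer> votes = new HashMap<>();
        for (String[] link : links) {
            votes.merge(link[1], 1, Integer::sum);
        }
        return votes;
    }

    // Orders the pages that received votes by decreasing vote count.
    // Ties are broken by page name so that the order is predictable.
    static List<String> rank(Map<String, Integer> votes) {
        List<String> pages = new ArrayList<>(votes.keySet());
        pages.sort((a, b) -> {
            int byVotes = Integer.compare(votes.get(b), votes.get(a));
            return byVotes != 0 ? byVotes : a.compareTo(b);
        });
        return pages;
    }

    public static void main(String[] args) {
        // Made-up link structure: A links to B and C, and C links to B.
        String[][] links = { {"A", "B"}, {"C", "B"}, {"A", "C"} };
        System.out.println(rank(countVotes(links)));
    }
}
```

In this sketch a page with no incoming links (such as A) never appears in the map, so it is not ranked at all. The real PageRank algorithm also weights each vote by the importance of the page casting it, which this simplified version ignores.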
Design
The lab08.searching package in your 132Labs project contains partial stubs for three classes that you will use to create your search engine.
PageInfo
The PageInfo class is simply a container for the data about a given web page. It contains the title, URL and incoming link count as well as accessors for those values.
KeyWord
There will ultimately be one KeyWord object for every search term that is associated with the web pages that have been indexed. For the sample index file given earlier, there would be KeyWord objects for "dickinson", "college", "liberal", "arts", "private", "selective" etc. Each KeyWord object will also have a list containing a PageInfo object for every web page that listed the keyword as one of its search terms. So, for the sample index file given earlier, the PageInfo object for the "www.dickinson.edu" page will be in the list of PageInfo objects for the keywords "dickinson", "college", "liberal", "arts", "private", "selective" etc. The PageInfo object for the "sustainability" page will be in the list for all of those keywords and also in the lists for the keywords "sustainability", "LEARN", "MORE" etc.
KeyWordList
The KeyWordList class will hold a list of all of the KeyWord objects and will provide the methods that do the actual searching. The constructor for this class will read in an index file produced by the web crawler and build the list of KeyWord objects.
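The relationship between keywords and pages described above is often called an inverted index: each keyword leads to the pages that list it. The sketch below shows the idea using a map from keyword to page URLs. It is an illustration only, not the design required for the lab, which uses KeyWord, PageInfo and KeyWordList objects; the class InvertedIndexSketch, its method names, and the choice to lower-case keywords are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of an inverted index; not the lab's design.
class InvertedIndexSketch {
    // keyword (lower-cased) -> URLs of the pages that list it as a search term
    private final Map<String, List<String>> pagesByKeyword = new TreeMap<>();

    // Records the page under every one of its search terms.
    void addPage(String url, String... terms) {
        for (String term : terms) {
            String key = term.toLowerCase();
            List<String> pages =
                    pagesByKeyword.computeIfAbsent(key, k -> new ArrayList<>());
            if (!pages.contains(url)) {
                pages.add(url);
            }
        }
    }

    // Returns the pages listed under a keyword, or an empty list if none.
    List<String> pagesFor(String term) {
        return pagesByKeyword.getOrDefault(term.toLowerCase(),
                Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage("http://www.dickinson.edu/", "dickinson", "college");
        index.addPage("http://www.dickinson.edu/about/sustainability/",
                "dickinson", "college", "sustainability", "LEARN", "MORE");
        System.out.println(index.pagesFor("college"));
        System.out.println(index.pagesFor("sustainability"));
    }
}
```

Here the lookup for "college" finds both pages, while the lookup for "sustainability" finds only the sustainability page, mirroring the description of the KeyWord lists above.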
The Assignment
This assignment is divided into three parts. The first part involves the implementation and testing of the parts of the above classes that are used to construct the KeyWord list. The second part of the assignment involves adding and implementing methods involved in sorting and searching the KeyWordList and sorting the search results. The third part is to write a program that performs a search.
Part 1: Building the KeyWordList
For part 1 of the lab you should implement and test the methods defined in the following classes:
PageInfo
KeyWord
KeyWordList
NOTE: For the PageInfo and KeyWord classes you should implement and test the entire class. For the KeyWordList class you should only implement and test the constructors and the addKeyWord and getKeyWords methods. Many of the methods in these classes will be quite easy for you at this point. The most challenging part will be implementing the KeyWordList constructor that reads the index file produced by the web crawler. Do not implement the sort, search, searchAll or searchAny methods in the KeyWordList class at this point. You will be completing those methods in part 2 of the lab.
Part 2: Sorting and Searching
In this part of the lab you will be adding methods to the PageInfo and KeyWord classes that allow them to be sorted and searched using the sorting and searching algorithms built into Java's Collections class. You will also be implementing the sort, search, searchAll and searchAny methods in the KeyWordList class. You should complete the following steps:
Make the PageInfo class implement the Comparable<PageInfo> interface. Pages should be ordered such that pages with higher link counts come before those with lower link counts. In the case of ties, the tied pages should be ordered alphabetically by title. The alphabetization should be case insensitive. Be sure to add JUnit tests for this functionality to your PageInfoTest class.
Make the KeyWord class implement the Comparable<KeyWord> interface. KeyWord objects should be ordered alphabetically. The alphabetization should be case insensitive. Be sure to add JUnit tests for this functionality to your KeyWordTest class.
Implement the sort, search, searchAll and searchAny methods in the KeyWordList class. You do not need to write your own sort and search implementations. Instead, make use of the sort and binarySearch methods in Java's Collections class.
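The steps above rely on the Comparable interface together with Collections.sort and Collections.binarySearch. The following sketch shows how those pieces fit together on a stand-in class. RankedPage and its fields are hypothetical and are not the lab's PageInfo class, and the second helper searches a plain list of strings rather than KeyWord objects.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in used to illustrate Comparable; not the lab's PageInfo.
class RankedPage implements Comparable<RankedPage> {
    final String title;
    final int linkCount;

    RankedPage(String title, int linkCount) {
        this.title = title;
        this.linkCount = linkCount;
    }

    // Higher link counts sort first; ties are ordered by title, ignoring case.
    @Override
    public int compareTo(RankedPage other) {
        int byLinks = Integer.compare(other.linkCount, this.linkCount);
        if (byLinks != 0) {
            return byLinks;
        }
        return this.title.compareToIgnoreCase(other.title);
    }

    // Sorts a copy of the pages with Collections.sort and returns the titles.
    static List<String> sortedTitles(List<RankedPage> pages) {
        List<RankedPage> copy = new ArrayList<>(pages);
        Collections.sort(copy); // uses compareTo
        List<String> titles = new ArrayList<>();
        for (RankedPage p : copy) {
            titles.add(p.title);
        }
        return titles;
    }

    // Finds a word in a list that is already sorted ignoring case.
    // Returns its index, or a negative number if the word is absent.
    static int indexOfIgnoreCase(List<String> sortedWords, String word) {
        return Collections.binarySearch(sortedWords, word,
                String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        List<RankedPage> pages = new ArrayList<>();
        pages.add(new RankedPage("beta", 2));
        pages.add(new RankedPage("Alpha", 2));
        pages.add(new RankedPage("zeta", 5));
        System.out.println(sortedTitles(pages));

        List<String> words = new ArrayList<>();
        Collections.addAll(words, "dickinson", "College", "arts");
        Collections.sort(words, String.CASE_INSENSITIVE_ORDER);
        System.out.println(indexOfIgnoreCase(words, "college"));
    }
}
```

Keep in mind that binarySearch only gives meaningful results when the list is already sorted by the same ordering that is used for the search.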
Part 3: A Search Program
Create a program to search an indexed web site (NOTE: Bonus #1 may be done in lieu of Part #3). Your program should be in a new class named SearchEngine. The SearchEngine class should have a main method that allows the user to perform searches of a web site that has been indexed using the web crawler. A run of the Search Engine might look something like the following (bold text has been entered by the user):
    What index file do you want to use: www.dickinson.edu.index
    Enter search term(s) [Q to quit]: college sustainability
    Require all terms (y/n): y
    Results:
      sustainability
      http://www.dickinson.edu/about/sustainability/
    Enter search term(s): college sustainability
    Require all terms (y/n): n
    Results:
      www.dickinson.edu
      http://www.dickinson.edu/
      sustainability
      http://www.dickinson.edu/about/sustainability/
    Enter search term(s) [Q to quit]: Q
    Bye bye.
Note that the first search would be done with the searchAll method and the second would be done with the searchAny method. Your main method must provide some means for users to invoke these two different types of search.
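The difference between the two kinds of search can be thought of as set intersection versus set union. The sketch below illustrates that idea on a map from keyword to page URLs. The class AllAnySketch, its sampleIndex helper and the map-based index are assumptions made for this example; your KeyWordList methods will work with KeyWord and PageInfo objects instead.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of "all terms" vs "any term"; not the lab's API.
class AllAnySketch {
    // Pages that match at least one of the terms (set union).
    static Set<String> searchAny(Map<String, Set<String>> index,
                                 List<String> terms) {
        Set<String> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(index.getOrDefault(term,
                    Collections.<String>emptySet()));
        }
        return result;
    }

    // Pages that match every one of the terms (set intersection).
    static Set<String> searchAll(Map<String, Set<String>> index,
                                 List<String> terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> pages = index.getOrDefault(term,
                    Collections.<String>emptySet());
            if (result == null) {
                result = new TreeSet<>(pages); // start from the first term
            } else {
                result.retainAll(pages);       // keep only the common pages
            }
        }
        return result == null ? new TreeSet<String>() : result;
    }

    // Small index based on the two pages in the sample run above.
    static Map<String, Set<String>> sampleIndex() {
        String home = "http://www.dickinson.edu/";
        String sustain = "http://www.dickinson.edu/about/sustainability/";
        Map<String, Set<String>> index = new HashMap<>();
        index.put("college", new TreeSet<>(Arrays.asList(home, sustain)));
        index.put("sustainability", new TreeSet<>(Arrays.asList(sustain)));
        return index;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("college", "sustainability");
        System.out.println("all: " + searchAll(sampleIndex(), terms));
        System.out.println("any: " + searchAny(sampleIndex(), terms));
    }
}
```

With the terms college and sustainability, the "all" search keeps only the sustainability page while the "any" search returns both pages, which matches the sample run. The sets here are ordered by URL, so the results would still need to be sorted by link count before being displayed.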
Submitting your solution
Turn in your solution by committing the 132Labs project to the SVN server. Information about how to turn in your completed project to the SVN server can be found under Turning in the Labs Project to the SVN Repository in the How-To Document for the course.
Bonus Features
Each of the following features may be implemented for extra credit. These features may be implemented in any order. All implemented features must be documented using JavaDoc comments. Note that some of these features will require you to learn additional Java on your own.
Create a graphical user interface (GUI) for your search engine in a new class named SearchEngineGUI. The SearchEngineGUI class must have a main method that launches the program and displays the GUI. The GUI must allow the user to choose an index file, enter search terms, select the type of search to perform (i.e. searchAll or searchAny) and display the search results. (+2)
Instead of having the user select the type of search (i.e. searchAll or searchAny), allow the user to use "AND" and "OR" between search terms. For example, the search Cat AND Cute OR Dog would return all pages that have both cat and cute as search terms as well as all pages that have dog as a search term. Note that AND has higher precedence than OR. (+3)
Instead of using the sort method in Java's Collections class, research and implement the quick sort algorithm to sort the KeyWordList. You should write the code for quick sort in the sort method of the KeyWordList (though you may also use helper methods). (+2)
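For the AND/OR bonus feature, one way to respect the precedence rule is to split the query on OR first and then split each piece on AND. The sketch below illustrates this under several assumptions: the operators are written in upper case with spaces around them, search terms are lower-cased before lookup, and the index is a simple map from keyword to page names. The class BooleanQuerySketch and the page names p1 to p4 are made up for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration of AND binding tighter than OR; not the lab's API.
class BooleanQuerySketch {
    static Set<String> evaluate(Map<String, Set<String>> index, String query) {
        Set<String> result = new TreeSet<>();
        // Each OR-separated group contributes its pages to the final union.
        for (String group : query.trim().split("\\s+OR\\s+")) {
            Set<String> groupPages = null;
            // Within a group, every AND-separated term must match.
            for (String term : group.trim().split("\\s+AND\\s+")) {
                Set<String> pages = index.getOrDefault(
                        term.trim().toLowerCase(),
                        Collections.<String>emptySet());
                if (groupPages == null) {
                    groupPages = new TreeSet<>(pages);
                } else {
                    groupPages.retainAll(pages);
                }
            }
            if (groupPages != null) {
                result.addAll(groupPages);
            }
        }
        return result;
    }

    // Made-up index: p1..p4 are placeholder page names.
    static Map<String, Set<String>> sampleIndex() {
        Map<String, Set<String>> index = new HashMap<>();
        index.put("cat", new TreeSet<>(Arrays.asList("p1", "p2")));
        index.put("cute", new TreeSet<>(Arrays.asList("p2", "p3")));
        index.put("dog", new TreeSet<>(Arrays.asList("p4")));
        return index;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(sampleIndex(), "Cat AND Cute OR Dog"));
    }
}
```

In this sample index only p2 has both cat and cute, so the query returns p2 together with the dog page p4. Splitting on OR before AND is what gives AND the higher precedence; a query language with parentheses would need a real parser instead.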