Introduction 
In this project you will implement, analyze, optimize and compare the performance of quick sort, merge sort and selection sort. Quick sort and merge sort are both recursive divide-and-conquer algorithms with O(n lg n) average-case running time. Merge sort is a Θ(n lg n) sort, while quick sort is O(n^2) in the worst case. Selection sort is a Θ(n^2) algorithm. You will need to collect empirical data illustrating the best, worst and average cases for each of these algorithms. The goal of this project is to use this data to optimize the performance of quick sort and merge sort.
Both quick sort and merge sort are recursive and thus have a base case. For both these algorithms the natural base case occurs when n=1, i.e. the list to be sorted has only one element. With n=1 the list is already sorted and there is nothing more to be done! The recursive cases of the quick sort and merge sort algorithms are fairly complex and of course require recursive calls to implement. This complexity and recursion result in large constants hidden in the asymptotic notation. Several O(n^2) sorting algorithms, like selection sort, are relatively simple and thus have small constants hidden in their asymptotic running times. Therefore, for sufficiently small values of n, the O(n^2) selection sort will run faster than quick sort or merge sort. An empirical analysis of quick sort, merge sort and selection sort can be used to determine the problem sizes for which selection sort will be faster.
Suppose it is determined that selection sort is faster than merge sort for n <= n0. When merge sort is applied to a list of length n, the list is repeatedly divided in half and each half is recursively merge sorted. Eventually the size of the halves being merge sorted will be <= n0. At this point it would be faster to invoke a selection sort to sort each half than it would be to continue recursively invoking merge sort. Thus, merge sort can be optimized by changing the base case from n=1 to n<=n0 and using selection sort to implement the base case. A similar argument can be made for quick sort.
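The change of base case described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the class name HybridMergeSortSketch, the helper names, and the cutoff value N0 = 16 are all hypothetical, and the real value of n0 has to come from your own measurements.

```java
import java.util.Arrays;

class HybridMergeSortSketch {
    // Hypothetical cutoff (must be >= 1); replace it with the n0 you measure.
    static final int N0 = 16;

    // Sorts the whole array in place.
    static void mergeSort(int[] a) {
        if (a.length > 1) {
            mergeSort(a, new int[a.length], 0, a.length - 1);
        }
    }

    // Sorts a[lo..hi] (inclusive), using tmp as scratch space for merging.
    private static void mergeSort(int[] a, int[] tmp, int lo, int hi) {
        if (hi - lo + 1 <= N0) {
            // Base case: n <= n0, so hand the small range to selection sort.
            selectionSort(a, lo, hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        mergeSort(a, tmp, lo, mid);
        mergeSort(a, tmp, mid + 1, hi);
        merge(a, tmp, lo, mid, hi);
    }

    // Plain selection sort restricted to a[lo..hi] (inclusive).
    private static void selectionSort(int[] a, int lo, int hi) {
        for (int i = lo; i < hi; i++) {
            int min = i;
            for (int j = i + 1; j <= hi; j++) {
                if (a[j] < a[min]) {
                    min = j;
                }
            }
            int t = a[i];
            a[i] = a[min];
            a[min] = t;
        }
    }

    // Merges the sorted runs a[lo..mid] and a[mid+1..hi] back into a.
    private static void merge(int[] a, int[] tmp, int lo, int mid, int hi) {
        System.arraycopy(a, lo, tmp, lo, hi - lo + 1);
        int i = lo;
        int j = mid + 1;
        for (int k = lo; k <= hi; k++) {
            if (i > mid) {
                a[k] = tmp[j++];
            } else if (j > hi) {
                a[k] = tmp[i++];
            } else if (tmp[j] < tmp[i]) {
                a[k] = tmp[j++];
            } else {
                a[k] = tmp[i++];
            }
        }
    }

    public static void main(String[] args) {
        int[] data = {9, 4, 7, 1, 8, 2};
        mergeSort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Setting N0 to 1 in this sketch gives back the natural base case, which makes it easy to compare the optimized and unoptimized versions with the same code.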
The goal of this project is to optimize the performance of quick sort and merge sort. This can be done by determining the values of n0 at which your implementations of merge sort and quick sort should invoke a selection sort as the base case. To reach this goal you will implement the quick, merge and selection sort algorithms. Once implemented you will collect best, worst and average case empirical data on the performance of these algorithms. You will use this data to determine the values of n0 at which selection sort should be invoked as the base case for merge and quick sort. Finally, you will compare the best, worst and average case performance of the optimized versions of quick sort and merge sort to see if the optimization was effective.
Writing the Sorts 
The program SortingProject.java gives you a head start on the project. This program accepts 5 command line arguments:
The provided program contains the basic framework for the project and a very inefficient implementation of the selection sort algorithm. You will need to optimize the selection sort algorithm and then implement the quick sort and merge sort algorithms. Chapter 19 in the text discusses quick sort and merge sort. Each of your sorts must be implemented as a static method that accepts a reference to an array of integers as a parameter. The method should then use the respective sorting algorithm to sort the integers in the array.
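As a rough illustration of the required method shape, the sketch below shows a quick sort written as a static method that takes only an int array and delegates to a recursive helper. The class name, the helper names, the middle-element pivot and the partition scheme are all assumptions made for this example; the version in Chapter 19 of the text and the framework in SortingProject.java may differ.

```java
import java.util.Arrays;

class QuickSortSketch {
    // Entry point with the required shape: a static method taking an int[].
    static void quickSort(int[] a) {
        quickSort(a, 0, a.length - 1);
    }

    // Sorts a[lo..hi] (inclusive).
    private static void quickSort(int[] a, int lo, int hi) {
        if (lo >= hi) {
            return; // natural base case: zero or one element
        }
        int p = partition(a, lo, hi);
        quickSort(a, lo, p - 1);
        quickSort(a, p + 1, hi);
    }

    // Partitions a[lo..hi] around the middle element (an assumed pivot choice)
    // and returns the pivot's final index.
    private static int partition(int[] a, int lo, int hi) {
        int mid = lo + (hi - lo) / 2;
        swap(a, mid, hi); // park the pivot at the end
        int pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++) {
            if (a[i] < pivot) {
                swap(a, i, store);
                store++;
            }
        }
        swap(a, store, hi); // move the pivot between the two parts
        return store;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        int[] data = {5, 3, 8, 1, 9, 2};
        quickSort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

Which inputs trigger the worst case depends on the pivot choice, so the best-case and worst-case data you collect should be generated with your own pivot rule in mind.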
Obtaining the Data 
Chapter 4 in the text talks about many of the issues associated with empirical evaluation of algorithms. It also provides a simple but inaccurate means of timing programs. The Java virtual machine also provides profiling tools that can be used to measure the amount of time spent in each method of a program. Because each of your sorts will presumably be written as a method, the Java VM profiling tools will provide you with exactly the data that you need.
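For comparison, a simple wall-clock timing approach might look like the sketch below. It is not necessarily the method given in the text: the use of System.nanoTime, the class name SimpleTimerSketch and the method timeSorts are assumptions for this example, and Arrays.sort merely stands in for one of your own sort methods.

```java
import java.util.Arrays;
import java.util.Random;

class SimpleTimerSketch {
    // Returns the total elapsed nanoseconds spent sorting numSorts fresh
    // copies of input. The input array itself is left unmodified.
    static long timeSorts(int[] input, int numSorts) {
        long total = 0;
        for (int s = 0; s < numSorts; s++) {
            // Copy outside the timed region so only the sort is measured.
            int[] copy = Arrays.copyOf(input, input.length);
            long start = System.nanoTime();
            Arrays.sort(copy); // stand-in: call your own sort method here
            total += System.nanoTime() - start;
        }
        return total;
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = rng.nextInt();
        }
        long nanos = timeSorts(data, 1000);
        System.out.println("Total ns for 1000 sorts: " + nanos);
    }
}
```

Measurements taken this way include noise from just-in-time compilation, garbage collection and other processes on the machine, which is one reason a single sort of a small array cannot be timed reliably and many sorts are timed together.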
As mentioned above, the Java VM contains a profiling agent that can be used to measure the amount of time that your program spends in each method. Unfortunately, this data is not in a particularly user-friendly format, so we'll use another program to process this information into a format that we can use. To have the Java VM produce the profiling information, use the following command line:
Looking at the data in the SelSort.10 file is not particularly informative. To view this data in a useful way we can use the PerfAnal program. This program reads output files from the hprof profiling agent and displays them graphically. Among many other things this program allows you to see the running time for each method in your program. To run the PerfAnal program you will first need to download it:
The PerfAnal program is run using the following command:
The key information for this project will be the time spent in each of the sorting methods. While it would be possible to run the SortingProject program, generate the profile data, run the PerfAnal program and read the times by hand for a wide range of problem sizes, it would be tedious at best. So, to facilitate the batch processing of a range of problem sizes I have modified the PerfAnal program so that it can automatically save the processed profile data to a file. For example, the command:
While saving the data from PerfAnal to a file is nice, it still does not make it possible to process a whole range of problem sizes automatically. To do that we'll use a shell script. A shell script is a text file that contains a sequence of commands to be executed by a shell (e.g. bash). Conceptually, executing a shell script is the equivalent of typing each of the commands at the command prompt. However, shell scripts can also contain variables, loops and conditionals that make them very useful. The following sites have information on writing shell scripts:
Fortunately for you, you will not have to write a script from scratch. I have written the script SelSort.bash, which runs the selection sort in the SortingProject program for a range of problem sizes and collects the relevant data. To use SelSort.bash you will need to save it into your account and make sure that it is an executable file. To save SelSort.bash, right click on the above link and choose "Save Link As...". To make the SelSort.bash file executable you will need to use the Unix chmod command:
You should study the SelSort.bash script and understand how it works. You will need to modify it and extend it to handle your other sorts. As written, SelSort.bash runs the selection sort algorithm on problem sizes of 10, 20 and 30 numbers (not a very complete list!). For each problem size the list is sorted 1000 times for each trial and 3 trials (3 is insufficient for real data!) are performed. Thus, the time for each trial will represent the time taken to sort a list of 10, 20 or 30 numbers 1000 times. SelSort.bash produces two files:
You will need to decide which problem sizes to collect data for. Chapter 4 of the text talks about how to select test data for evaluation of an algorithm. Remember that the goal here is to find the value of n0 for which selection sort should be used as the base case for quick sort and merge sort. So I suggest collecting rough data at first to narrow down the range of problem sizes in which n0 will be found. When you know the range of problem sizes in which n0 lies, you should collect more complete data in that range. Your final report should contain graphs with all of your data.
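Once per-sort averages are available for the same problem sizes, locating the crossover can be sketched as a simple scan. Everything here is hypothetical: the class and method names are made up for illustration, the arrays are assumed to be in ascending order of problem size, and the numbers in main are invented placeholders, not measurements.

```java
class CrossoverSketch {
    // Returns the largest measured size up to which selection sort was never
    // slower than the other sort, or -1 if it was already slower at the
    // smallest measured size. Assumes sizes is in ascending order and that
    // all three arrays have the same length.
    static int estimateN0(int[] sizes, double[] selectionTicks, double[] otherTicks) {
        int n0 = -1;
        for (int i = 0; i < sizes.length; i++) {
            if (selectionTicks[i] <= otherTicks[i]) {
                n0 = sizes[i];
            } else {
                break; // first size where selection sort loses
            }
        }
        return n0;
    }

    public static void main(String[] args) {
        // Invented placeholder values purely to exercise the method.
        int[] sizes = {10, 20, 30, 40};
        double[] selection = {1.0, 3.0, 7.0, 13.0};
        double[] merge = {2.0, 4.0, 6.0, 8.0};
        System.out.println(estimateN0(sizes, selection, merge)); // prints 20
    }
}
```

With coarse data this only brackets n0 between two measured sizes (20 and 30 in the placeholder example), which is the range where finer-grained data should then be collected.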
In understanding how the SelSort.bash script works you might find the following resources useful:
Displaying the Data 
You will use the gnuplot program to produce graphs of your data. Gnuplot is a plotting program for unix. The following sources have information about gnuplot:
Getting a basic plot to appear in gnuplot is fairly simple. For example, to plot the Sum Ticks: column of the SelSort.ave.dat file as a function of Prob. Size you could use the following commands:
Fortunately for you, you will not have to master gnuplot on your own. The file SelSort.gp contains a script for gnuplot that will plot the relevant data from SelSort.ave.dat in an appropriate way. This script plots the average number of ticks required to sort an array of the given size. To do this the script divides the Sum Ticks: column by the Num. Trial: column and by the Num Sorts: column and plots the result of this against the Prob. Size: column.
To use the SelSort.gp script use the command:
A plot similar to the following should appear on the screen:
Notice that the plot also places error bars on each data point that reflect the standard deviation of the times averaged into that point. The standard deviation is a measure of the variability of the times that went into the average. If the times in the average vary widely, the standard deviation will be large. Conversely, if the times are all closely bunched, the standard deviation will be smaller. Roughly, if the times are approximately normally distributed, we can interpret the error bars to mean that about 68% of the measured times for each point were within the error bars for that point.
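The averaging and the standard deviation behind the error bars can be sketched as follows. This is an illustration only: the class and method names are hypothetical, and the sample (n - 1) form of the standard deviation is an assumption, since the provided scripts may use the population form instead.

```java
class TrialStatsSketch {
    // Converts each trial's total ticks into ticks per individual sort.
    static double[] perSort(double[] trialTicks, int numSorts) {
        double[] out = new double[trialTicks.length];
        for (int i = 0; i < trialTicks.length; i++) {
            out[i] = trialTicks[i] / numSorts;
        }
        return out;
    }

    // Arithmetic mean of the values.
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Sample standard deviation (divides by n - 1); returns 0 for fewer
    // than two values, where the spread is undefined.
    static double sampleStdDev(double[] xs) {
        if (xs.length < 2) {
            return 0.0;
        }
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    public static void main(String[] args) {
        // Placeholder tick totals for 3 trials of 1000 sorts each.
        double[] ticks = perSort(new double[] {1000.0, 2000.0, 3000.0}, 1000);
        System.out.println("mean = " + mean(ticks));
        System.out.println("std dev = " + sampleStdDev(ticks));
    }
}
```

With only three trials the standard deviation is itself a very rough estimate, which is another reason to run more trials for the data in your report.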
The Write Up 
Your write up for this project should contain 5 sections:
Bonus Extensions 