Computer Science 332
Analysis of Algorithms

Dickinson College
Spring Semester 2001
Grant Braught

Project #1 - Empirical Optimization of Sorting

Introduction

In this project you will implement, analyze, optimize and compare the performance of quick sort, merge sort and selection sort. Quick sort and merge sort are both recursive divide-and-conquer algorithms with O(n lg n) average case running time. Merge sort is a THETA(n lg n) sort, while quick sort is O(n^2) in the worst case. Selection sort is a THETA(n^2) algorithm. You will need to collect empirical data illustrating the best, worst and average cases for each of these algorithms. The goal of this project is to use this data to optimize the performance of quick sort and merge sort.

Both quick sort and merge sort are recursive and thus have a base case. For both these algorithms the natural base case occurs when n=1, i.e. the list to be sorted has only one element. With n=1 the list is already sorted and there is nothing more to be done! The recursive cases of the quick sort and merge sort algorithms are fairly complex and of course require recursive calls to implement. This complexity and recursion result in large constants hidden in the asymptotic notation. Several O(n^2) sorting algorithms, like selection sort, are relatively simple and thus have small constants hidden in their asymptotic running times. Therefore, for sufficiently small values of n, the O(n^2) selection sort will run faster than quick sort or merge sort. An empirical analysis of quick sort, merge sort and selection sort can be used to determine the problem sizes for which selection sort will be faster.

Suppose it is determined that selection sort is faster than merge sort for n <= n0. When merge sort is applied to a list of length n, the list is repeatedly divided in half and each half is recursively merge sorted. Eventually the size of the halves being merge sorted will be <= n0. At this point it would be faster to invoke a selection sort to sort each half than it would be to continue recursively invoking merge sort. Thus, merge sort can be optimized by changing the base case from n=1 to n<=n0 and using selection sort to implement the base case. A similar argument can be made for quick sort.
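The change of base case described above can be sketched in Java as follows. This is only an illustration under stated assumptions, not the code in SortingProject.java: the class name HybridMergeSort, the method names, and the idea of passing n0 as a parameter are all hypothetical choices made for this example.

```java
import java.util.Arrays;

// Sketch of merge sort with a selection-sort base case.
// Class name, method names and the n0 parameter are illustrative only.
class HybridMergeSort {

    // Sorts a[lo..hi) in place with selection sort.
    static void selectionSort(int[] a, int lo, int hi) {
        for (int i = lo; i < hi - 1; i++) {
            int min = i;
            for (int j = i + 1; j < hi; j++) {
                if (a[j] < a[min]) min = j;
            }
            int tmp = a[i]; a[i] = a[min]; a[min] = tmp;
        }
    }

    // Sorts the whole array; sublists of length <= n0 go to selection sort.
    static void sort(int[] a, int n0) {
        mergeSort(a, new int[a.length], 0, a.length, Math.max(n0, 1));
    }

    private static void mergeSort(int[] a, int[] tmp, int lo, int hi, int n0) {
        if (hi - lo <= n0) {           // base case: n <= n0 instead of n == 1
            selectionSort(a, lo, hi);
            return;
        }
        int mid = lo + (hi - lo) / 2;
        mergeSort(a, tmp, lo, mid, n0);
        mergeSort(a, tmp, mid, hi, n0);
        merge(a, tmp, lo, mid, hi);
    }

    // Merges the sorted runs a[lo..mid) and a[mid..hi) using tmp as scratch space.
    private static void merge(int[] a, int[] tmp, int lo, int mid, int hi) {
        int i = lo, j = mid, k = lo;
        while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
        while (i < mid) tmp[k++] = a[i++];
        while (j < hi) tmp[k++] = a[j++];
        System.arraycopy(tmp, lo, a, lo, hi - lo);
    }

    public static void main(String[] args) {
        int[] data = {5, 2, 9, 1, 5, 6, 0, 3};
        sort(data, 3);                 // 3 is an arbitrary cutoff, not a measured n0
        System.out.println(Arrays.toString(data));
    }
}
```

With n0 = 1 the sketch behaves like an ordinary merge sort, so the same method can be timed with and without the optimization.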

The goal of this project is to optimize the performance of quick sort and merge sort. This can be done by determining the values of n0 at which your implementations of merge sort and quick sort should invoke a selection sort as the base case. To reach this goal you will implement the quick, merge and selection sort algorithms. Once implemented you will collect best, worst and average case empirical data on the performance of these algorithms. You will use this data to determine the values of n0 at which selection sort should be invoked as the base case for merge and quick sort. Finally, you will compare the best, worst and average case performance of the optimized versions of quick sort and merge sort to see if the optimization was effective.
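Collecting best, worst and average case data requires inputs of different shapes. The sketch below shows one possible way to generate three common candidates: ascending, descending and random lists. The class and method names are hypothetical, and which shape actually triggers the best or worst case depends on your implementation. For example, a quick sort that always picks the first element as its pivot degrades to quadratic time on an already sorted list.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative input generators; names are not part of SortingProject.java.
class TestInputs {

    // Already sorted list 1, 2, ..., n.
    static int[] ascending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i + 1;
        return a;
    }

    // Reverse sorted list n, n-1, ..., 1.
    static int[] descending(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = n - i;
        return a;
    }

    // n random values in the inclusive range [low, high]; a fixed seed
    // makes the same list reproducible across runs.
    static int[] random(int n, int low, int high, long seed) {
        Random rng = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = low + rng.nextInt(high - low + 1);
        return a;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ascending(5)));
        System.out.println(Arrays.toString(descending(5)));
        System.out.println(Arrays.toString(random(5, 1, 1000, 42L)));
    }
}
```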

Writing the Sorts

The program SortingProject.java gives you a head start on the project. This program accepts 5 command line arguments:

  1. SORT: An integer indicating the sorting algorithm to be used.
    1. Selection Sort
    2. Quick Sort
    3. Merge Sort
    4. Optimized Quick Sort
    5. Optimized Merge Sort
  2. SIZE: The number of integers to be sorted.
  3. TIMES: The number of times to perform the sort.
  4. LOW: The low end of the range of values to be sorted.
  5. HIGH: The high end of the range of values to be sorted.

    So the command line:

    uses the selection sort algorithm to sort 10 numbers in the range 1 to 1000. The list of numbers that is being sorted is generated randomly.

    The time required to sort a list of 10 numbers, even with an O(n^2) algorithm, is smaller than the resolution of the Java VM profiling agent. Therefore, we cannot measure the time it takes to sort this small list. Instead we will find the total time it takes to sort many lists. Then the total time can be divided by the number of lists that were sorted to find the average time to sort a list of the specified length. For example the command line:

    will sort 1000 lists of 10 numbers in the range of 1 to 1000. The profiling agent will report the total time for these 1000 sorts, which we will eventually divide by 1000 to find the average time per sort.
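The averaging idea can be sketched in Java as follows. This is a hedged illustration only: the class and method names are hypothetical, java.util.Arrays.sort stands in for the sort being measured, and System.nanoTime stands in for the profiling agent that the project actually uses.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch: time many sorts of small random lists and report the average.
// Names are illustrative; wall-clock timing here is a stand-in for hprof.
class AverageTiming {

    // Builds a list of `size` random integers in the inclusive range [low, high].
    static int[] randomList(int size, int low, int high, Random rng) {
        int[] a = new int[size];
        for (int i = 0; i < size; i++) a[i] = low + rng.nextInt(high - low + 1);
        return a;
    }

    // Average time per sort = total time for all sorts / number of sorts.
    static double averagePerSort(long totalTime, int times) {
        return (double) totalTime / times;
    }

    public static void main(String[] args) {
        int size = 10, times = 1000, low = 1, high = 1000;
        Random rng = new Random();
        long total = 0;
        for (int t = 0; t < times; t++) {
            int[] list = randomList(size, low, high, rng);
            long start = System.nanoTime();
            Arrays.sort(list);             // stand-in for the sort being measured
            total += System.nanoTime() - start;
        }
        System.out.println("average ns per sort: " + averagePerSort(total, times));
    }
}
```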

    The provided program contains the basic framework for the project and a very inefficient implementation of the selection sort algorithm. You will need to optimize the selection sort algorithm and then implement the quick and merge sort algorithms. Chapter 19 in the text discusses quick and merge sort. Each of your sorts must be implemented as a static method that accepts a reference to an array of integers as a parameter. The method should then use the respective sorting algorithm to sort the integers in the array.

    Obtaining the Data

    Chapter 4 in the text talks about many of the issues associated with empirical evaluation of algorithms. It also provides a simple but inaccurate means of timing programs. The Java virtual machine also provides profiling tools that can be used to measure the amount of time spent in each method of a program. Because each of your sorts will presumably be written as a method the Java VM profiling tools will provide you with exactly the data that you need.

    As mentioned above the Java VM contains a profiling agent that can be used to measure the amount of time that your program spends in each method. Unfortunately this data is not in a particularly user friendly format. So we'll use another program to process this information into a format that we can use. To have the Java VM produce the profiling information you use the command line:

    This command uses the hprof profiling agent to collect data on the SortingProject program and store it in the file SelSort.10. The SortingProject program will use selection sort to sort a list of 10 numbers in the range 0 to 1000, 1000 times. The -classic flag causes the JVM to run in a mode that is compatible with the hprof profiling agent.

    Looking at the data in the SelSort.10 file is not particularly informative. To view this data in a useful way we can use the PerfAnal program. This program reads output files from the hprof profiling agent and displays them graphically. Among many other things this program allows you to see the running time for each method in your program. To run the PerfAnal program you will first need to download it:

    The PerfAnal program is run using the following command:

    where SelSort.10 is the profiling data that you want to view. The following web site contains complete documentation for PerfAnal:

    The key information for this project will be the time spent in each of the sorting methods. While it would be possible to run the SortingProject program, generate the profile data, run the PerfAnal program and read the times by hand for a wide range of problem sizes, it would be tedious at best. So, to facilitate the batch processing of a range of problem sizes I have modified the PerfAnal program so that it can automatically save the processed profile data to a file. For example, the command:

    will cause PerfAnal to process the profile data in the SelSort.10 file and save it in a file named SelSort.10.perf.dat. NOTE: This command will work on the PerfAnal.jar file that you downloaded from this page but not with the one from Sun because I had to hack it to make it work this way! So be sure you download PerfAnal.jar from the link on this page.

    While saving the data from PerfAnal to a file is nice, it still does not make it possible to process a whole range of problem sizes automatically. To do that we'll use a shell script. A shell script is a text file that contains a sequence of commands to be executed by a shell (e.g. bash). Conceptually, executing a shell script is the equivalent of typing each of the commands at the command prompt. However, shell scripts can also contain variables, loops and conditionals that make them very useful. The following sites have information on writing shell scripts:

    Fortunately for you, you will not have to write a script from scratch. I have written the script SelSort.bash that runs the selection sort in the SortingProject program for a range of problem sizes and collects the relevant data. To use SelSort.bash you will need to save it into your account and make sure that it is an executable file. To save SelSort.bash right click on the above link and choose "Save Link As...". To make the SelSort.bash file executable you will need to use the unix chmod command:

    If you haven't used chmod before you can learn about it using the man chmod command.

    You should study the SelSort.bash script and understand how it works. You will need to modify it and extend it to handle your other sorts. As written, SelSort.bash runs the selection sort algorithm on problem sizes of 10, 20 and 30 numbers (not a very complete list!). For each problem size the list is sorted 1000 times for each trial and 3 trials (3 is insufficient for real data!) are performed. Thus, the time for each trial will represent the time taken to sort a list of 10, 20 or 30 numbers 1000 times. SelSort.bash produces two files:

    You will need to decide the problem sizes for which to collect data. Chapter 4 of the text talks about how to select test data for evaluation of an algorithm. Remember that the goal here is to find the value of n0 for which selection sort should be used as the base case for quick sort and merge sort. So I suggest collecting rough data at first to narrow down the range of problem sizes in which n0 will be found. When you know the range of problem sizes in which n0 lies you should collect more complete data in that range. Your final report should contain graphs with all of your data.

    In understanding how the SelSort.bash script works you might find the following resources useful:

    Displaying the Data

    You will use the gnuplot program to produce graphs of your data. Gnuplot is a plotting program for unix. The following sources have information about gnuplot:

    It is fairly easy to get a simple plot to appear using gnuplot. For example to plot the Sum Ticks: column of the SelSort.ave.dat file as a function of Prob. Size you could use the following commands:

    Fortunately for you, you will not have to master gnuplot on your own. The file SelSort.gp contains a script for gnuplot that will plot the relevant data from SelSort.ave.dat in an appropriate way. This script plots the average number of ticks required to sort an array of the given size. To do this the script divides the Sum Ticks: column by the Num. Trial: column and by the Num Sorts: column and plots the result of this against the Prob. Size: column.
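The arithmetic that the plotting script performs for each data point can be written out as a short Java sketch. The class and method names are hypothetical and the numbers in main are made up for illustration, not measured data.

```java
// Sketch of the per-point arithmetic: average ticks for one sort.
// Names and sample values are illustrative only.
class AverageTicks {

    // Sum Ticks divided by the number of trials and by the number of
    // sorts per trial gives the average ticks needed for a single sort.
    static double ticksPerSort(double sumTicks, int numTrials, int numSorts) {
        return sumTicks / numTrials / numSorts;
    }

    public static void main(String[] args) {
        // Hypothetical row: 6000 ticks in total over 3 trials of 1000 sorts each.
        System.out.println(ticksPerSort(6000.0, 3, 1000));   // prints 2.0
    }
}
```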

    To use the SelSort.gp script use the command:

    A plot similar to the following should appear on the screen:

    Notice that the plot also places error bars on each data point that reflect the standard deviation of the times that the point represents. The standard deviation is a measure of the variability of the times that went into the average. Basically, if the times in the average vary widely then the standard deviation will be large. Conversely, if the times are all closely bunched then the standard deviation will be smaller. Roughly, if the times are approximately normally distributed, we can interpret the error bars to mean that about 68% of the measured times for each point were within the error bars for that point.
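For reference, the mean and standard deviation of a set of trial times can be computed as in the following Java sketch. It assumes the sample standard deviation (dividing by the number of trials minus one); the handout does not say which variant the provided scripts use, and the class and method names are hypothetical.

```java
// Sketch: mean and sample standard deviation of trial times.
// Assumes the n-1 (sample) form; names are illustrative only.
class TrialStats {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) sum += v;
        return sum / x.length;
    }

    // Square root of the summed squared deviations from the mean,
    // divided by (number of trials - 1).
    static double stdDev(double[] x) {
        if (x.length < 2) {
            throw new IllegalArgumentException("need at least two trials");
        }
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) ss += (v - m) * (v - m);
        return Math.sqrt(ss / (x.length - 1));
    }

    public static void main(String[] args) {
        double[] closelyBunched = {10.0, 11.0, 12.0};   // made-up trial times
        double[] widelySpread = {1.0, 11.0, 21.0};      // same mean, more spread
        System.out.println(mean(closelyBunched) + " +/- " + stdDev(closelyBunched));
        System.out.println(mean(widelySpread) + " +/- " + stdDev(widelySpread));
    }
}
```

Both made-up data sets have the same mean of 11, but the second has a standard deviation ten times larger, which would show up as a much longer error bar.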

    The Write Up

    Your write up for this project should contain 5 sections:

    Introduction:
    Describe the project, what the goals are, why it is worth investigating. The job of this section of the paper is to motivate the work that follows. Everything in the following sections is directed at reaching the goals of the project that are stated here.

    Background:
    This section should contain a description of each of the sorting algorithms (pseudo-code) and an asymptotic analysis of each. A simple statement of their complexity is insufficient; you should have at least a paragraph and possibly some equations to explain the asymptotic bounds. You should give a best, worst and average case analysis for each algorithm (if possible). You should state which data sets produce the best and worst cases if possible. This section lays the theoretical groundwork that is used to direct the experiments described in the next section.

    Experiments:
    This section describes the experiments that you performed, relates them to the theoretical foundations of the previous section and states how the results will help to achieve the goals of the project. This section will contain the meat of this project. It will describe all of the experiments you performed and how you collected the data to find n0 for quick and merge sort. It will also describe the experiments that you performed to compare the optimized versions of quick and merge sort. The results of these experiments are not presented in this section!

    Results & Discussion:
    This section presents and discusses the results of the experiments described in the previous section.

    Conclusions:
    This section looks at the results of your work in light of the goals of the project. You must conclude, based on the results of your experiments, whether or not the project met its goals. You should use the results of the experiments to make the argument for your conclusion. If the results dictate it, it is perfectly valid to make the argument that the experiments show that it is not possible to achieve the goals! Finally, you should highlight any interesting directions for further research.

    Bonus Extensions

    1. Consider the other O(n^2) sorts (Rank, Bubble, Insertion) and select the one that provides the best optimization to the quick and merge sorts.
    2. Analyze the effect of using the optimize option in the java compiler.
    3. Compare your implementations to the java.util.Arrays.sort method (quick sort for primitive data types).
    4. Determine if java.util.Arrays.sort uses selection sort for small values of n or if it has a base case of n=1.
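Before comparing running times against java.util.Arrays.sort, it can be useful to confirm that your sorts produce the same output as it does. The following Java sketch shows one way to do that. The class and method names are hypothetical, and the lambda in main is only a placeholder where one of your own static sort methods would be passed.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.Consumer;

// Sketch: check a sorting method against java.util.Arrays.sort.
// Names are illustrative only.
class SortChecker {

    // Returns true if `sorter` orders a copy of `input` exactly as
    // Arrays.sort does. The input array itself is left unchanged.
    static boolean agreesWithArraysSort(Consumer<int[]> sorter, int[] input) {
        int[] expected = input.clone();
        int[] actual = input.clone();
        Arrays.sort(expected);
        sorter.accept(actual);
        return Arrays.equals(expected, actual);
    }

    public static void main(String[] args) {
        Random rng = new Random();
        int[] data = new int[100];
        for (int i = 0; i < data.length; i++) data[i] = 1 + rng.nextInt(1000);

        // Placeholder sorter; replace the lambda with a reference to your own method.
        Consumer<int[]> sorter = a -> Arrays.sort(a);
        System.out.println("agrees: " + agreesWithArraysSort(sorter, data));
    }
}
```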