Searching and Sorting

Document history:
Written by AJH/CN in 1996.
Minor editing, additions and conversion to HTML format RIK Oct 2000

1. Introduction

Two very common operations carried out on data stored in a computer are sorting and searching. We shall initially assume the data to be stored in a one dimensional array (or vector) although we shall later examine alternative data structures.

These two processes are very much interrelated, since most algorithms for searching for a particular item in an array require that the data is in alphabetic/numeric order.

Internal and External Sorting

Internal sorting refers to the sorting of data stored in a computer's main memory in the form of an array. In this case, each of the elements of the array is accessible with equal speed.

External sorting refers to the sorting of data stored in auxiliary memory, such as on magnetic tape or disc; this is normally referred to as a file. In such cases, the data may be only accessible sequentially; i.e. record 1 must be read before record 2 etc.

The algorithms used to sort/search data will obviously depend on whether it is sorted in main or auxiliary memory. We shall initially consider internal sorting/searching algorithms.

Keys

Normally when sorting or searching an array, each element of the array will actually be a record; for example, a record containing the name and examination mark of a student.

If such an array is in order, then the field used to determine the order is known as the key. In the example just given, it is likely that the name field will be the key; i.e. the array will be sorted such that the names are in alphabetical order. When searching the array, we would examine the name field of each record. When the required record is found, the information required will be in the mark field of the same record.

Each entry in a computerised personnel file might consist of a record with four fields:

employee_number, employee_name, date_of_birth, address

The array could be in alphabetical order of the employees' names. It two or more employees had identical names, then their records could be sorted according to their date of birth. If two or more employees had identical names and dates of birth, then their records could be sorted according to their addresses. In this case, the key is said to be 'according to address, within date of birth within name'.

In a practical data processing application, because employees might share names, dates of birth and addresses an additional field: employee_number is likely to be allocated to ensure that each employee record has a unique key.

While sorting or searching an array, it will only be the record field containing the key that will be examined; the nature of the remainder of the record will be of less concern. Thus for simplicity the following sections assume each element of the array contains only a single integer numeric value. In a more practical application each element of of the array would contain a structure record with member fields.
 

2. Search Algorithms

 

2.1 A general search function


In general, a function to search an array for a particular value will require a function as shown below:

int search( int value );

It is assumed in this example that the array to be searched is a global variable: a that contains 100 elements. The integer parameter: value is the value to be searched for; obviously this could be a character or other type of variable as required depending upon the key being used. The integer value returned indicates the position in the array at which the required value was found if successful or -1 to indicate an unsuccessful search.
 

2.2 Sequential or Linear Search


The simplest way of determining if a particular value is present in an array is to examine each entry in turn until the required value has been found. If the required value does not appear in the array, then the action taken depends on whether the array is sorted or not.
 

2.3 Sequential Search of an unsorted array


When searching an unsorted array, it is not until the last element used has been examined that it is proven that a particular value does not appear in the array. Thus searching must continue either until the required value is found or until all elements of the array in use have been examined. The following program segment searches through array 'a' until it finds the required value. This program must terminate if 'value' is not found in array 'a'. This is achieved by checking the value of 'posn'.

int posn=0;
for (posn=0; posn<100; posn++){ 
        if (a[posn]==value )
                break;
}
In this case, however the only indication of whether the search has terminated because 'value' has been found or because the whole array has been searched and value has not been found is posn having an invalid value of 100. This may be better signalled by returning -1 if the search was unsuccessful. A complete sequential function based on the above is:
int search( int value ){
        int posn;
        for (posn=0; posn<100; posn++)
                if (a[posn]==value )
                        return posn;
        return -1;
}

2.4 Sequential Search of a Sorted Array


If the elements of 'a' are in ascending order, then it is possible to determine that the required value is not in the list as soon as an element of the array is examined and found to be larger than 'value'. The change to the for statement is shown below.

for (posn=0; posn<100 && a[posn]<=value; posn++)
This will obviously speed up the search when the required value is not in the array.
 

2.5 But not all available positions in the array are used !


We also don't want to search beyond the last position in the array used. In practical applications the number of records present in an array is likely to be less than the size of an array and the program will need to keep track of the maximum index used. This can be achieved by giving our search function an additional integer parameter: high to indicate the maximum array index used (this is one less than the number of elements used given we start counting them at 0 not 1). This is achieved below:

int search( int value, int high ){
        int posn;
        for (posn=0; posn<=high && a[posn] <= value ; posn++)
                if (a[posn]==value )
                        return posn;
        return -1;
}

2.6 Binary or Logarithmic Search

A sequential search will obviously be very inefficient if a large list of items are to be searched. It is, however, the only solution if the data is unsorted. For sorted data the binary search is more efficient.

A binary search is analogous to the way you look up a word in a dictionary or a name in a telephone directory. You might grasp the next paragraph more easily if you first think of a word and then look it up in a dictionary, or look up the telephone number for a friend or business in the phone book, writing down as you carry out your search exactly how you decide where you will make each guess about where in the dictionary or phone book you will next look for the entry.

The middle element of the list to be searched is compared with the required value from which it can be determined which half of the list the required value is in. The process in then repeated on the relevant half of the list. Each iteration of the process reduces the number of items still to be searched by a factor of 2, until eventually only one item remains.

int search( int value, int high){ /* implements binary split search algorithm */ 
        int low = 0, mid;
        do
        {
                mid = ( low + high ) / 2;
                if ( value < a[mid] )
                        high = mid - 1;
                if ( value > a[mid] )
                        low = mid + 1;
        } while ( value != a[mid] && low <= high ); 
        if ( value == a[mid] )
                return mid;
        else
                return -1;
}

2.7 Efficiency of Search Algorithms

The time taken for a search routine to find a key (or to determine that the key is not in the array) will depend on the actual data used. Thus all figures will be average figures assuming randomly distributed data. The time will be proportional to the number of comparisons required which in turn will depend on the number of items to be searched (n).
 
 
Number of Comparisons Unsorted Sorted
Sequential found n/2 n/2
Sequential not found n n/2
Binary found - 1+log2n
Binary not found - 1+log2

Thus the sequential search takes on average n/2 comparisons to find a key; ie 500,000 comparisons to search 1,000,000 items. The binary search takes about 20 comparisons to search the same number of items. Frequently we will simply quote the order of the algorithm; ie the sequential search is of order n whilst the binary search is of order log2n.
 
 

3. Sorting Algorithms

3.1 Record Exchanges

Most sorting routines involve a number of passes through the array to be sorted. Each pass will involve comparing the key fields of various records, and depending on the outcome of the comparisons, the swapping of some records. ‘C’ enables all of a record to be moved in a single statement.

Swapping two records will require three moves and involve a third temporary record. Assuming the following declarations:

struct student_info {
        char name[20];
        int mark;
}
typedef struct student_info STUDENT_INFO;
STUDENT_INFO student[100],temp;
Then to swap records i and j we must use:
temp = student[i];
/* copy all data on student[i] into temp */
student[i] = student[j];
student[j] = temp;
 

3.2 Efficiency of Sorting Algorithms

The time taken to sort an array of records will depend on both the number of comparisons and on the number of record moves required. The figures given below can only be approximations, assuming randomly distributed data.
 

3.3 Exchange Sort

Algorithm

Compare the first item with each item in turn. Exchange the pair if the first item is larger. At the end of the first pass, the smallest item will be at the start. Repeat the second and subsequent items after the ith pass (n-i) items remain to be sorted. (n-1) passes will be required.

Efficiency

pass comparisons exchanges moves
1st (n-1)  (n-1)/2 exchanges (average) 3(n-1)/2
2nd (n-2) (n-2)/2 exchanges (average) 3(n-2)/2
all passes = (n-1)+(n-2) .... +1 = n(n-1)/2 =~ n2/2 = n(n-1)/4=~ n2/4 (average) ~3n2/4 moves

 

3.3 Selection Sort

In order to place the record with the smallest key at the start of the array the exchange sort will, on average, have to exchange records (n-1)/2 times with 3 moves per exchange. If the record contains a large amount of data, the time taken is considerable. The selection sort searches the array for the record with the smallest key, and then swaps this record for the first record in the array. The second pass repeats the process except that it starts with and swaps the second record for the record with the smallest remaining key. Thus each pass only requires one record swap.

Algorithm

Search the array for the record with the smallest key.
Exchange this record with record in the first position.
Repeat, searching for the smallest element in the remainder of the array and place this in the second position etc.
(n-1) passes will be required.
 

Efficiency

pass comparisons exchanges moves
1st n-1  1 exchange 3(n-1)/2
2nd n-2 1 exchanges 3(n-2)/2
all passes = (n-1)+(n-2) .... +1 = n(n-1)/2 =~ n2/2  = n-1 (maximum) ~3n moves

 

3.4 Bubble Sort

Algorithm

Compare the 1st and 2nd items and exchange, if necessary, so that the smaller is in the 1st position.
Repeat for the 2nd and 3rd pair, 3rd and 4th pair etc until a pass through all adjacent pairs has been made.
At the end of this first pass the last item will be in its proper place (i.e. it will be the largest).

Perform a second pass on the first (n-1) items, after which the last 2 items will be in place.
(n-1) passes will be required to sort an array containing n items.

At the end of the ith pass the last i items will be ordered.
If the pass is made in which no exchanges are required, then the list is in order,
  even if less than (n-1) passes have been made.

Efficiency

pass comparisons exchanges moves
1st (n-1)  (n-1)/2 exchanges (average) 3(n-1)/2
2nd (n-2) (n-2)/2 exchanges (average) 3(n-2)/2
all passes = (n-1)+(n-2) .... +1 = n(n-1)/2 =~ n2/2 = n(n-1)/4=~ n2/4 (average) ~3n2/4 moves

 
 

3.5 Insertion Sort

Algorithm

Compare item 1 & 2;. Exchange if necessary so that the smaller is in position 1. For items 3... n, search the already ordered portion of the array for the correct point to insert the next item. (item i, say). In order to insert the item at this point, it will be necessary to move all the following sorted items up one, so that the last ordered item overwrites the location previously occupied by item i. Note that the search for the correct place to insert each item in the already sorted part of the array may be achieved using either a linear or a binary search, which leads to either a linear insertion sort, or to a binary insertion sort.
 

Efficiency of linear insertion sort

pass comparisons moves
i'th i/2 (average) (i/2) +1
all passes = (1.5+2.5+3.5 ... +n-0.5) = n(n-1)/4 =~ n2/4 =~ n2/4 

 
 

Efficiency of a binary insertion sort

pass comparisons moves
i'th log2i (average) (i/2) +1
all passes = (log21 + log22 +log23 ... + log2(n-1) )  <  nlog2 =~ n2/4 

 

3.6 Merge Sort

Algorithm

If the number of keys to be sorted is only one, then return because the data is already sorted.
Otherwise:
  1. Divide the array to be sorted into 2 halves and sort each half.
  2. Then merge (see below) both array halves into a sorted temporary array.
  3. Block copy records from the sorted temporary array back into the permanent array.

Merging is accomplished by comparing the 2 lowest (uncopied) keys and copying the lowest into a temporary array.
Repeat comparing uncopied keys in both halves until all keys in at least one half have been copied.
Finally copy any remaining (already sorted) keys in the other half.

This algorithm is recursive, in the sense that it calls upon itself. Each time the sort is called it calls itself twice, once for each array (portion) half. This continues with each half calling upon the sort twice. until halves contain only 1 record.

Recursion is a topic about which there will be more later in this module.
 

Efficiency

This algorithm requires twice as much memory, but for large arrays it is very much faster than any sort requiring a number of moves or comparisons to the order of  n2 . E.G to sort 10,000 records an n2 order algorithm will need around 100 million operations while for a nlog2n order algorithm only 130 thousand are needed, a difference of 800 to 1. Another recursive sort algorithm is available (Quick Sort) which is even more efficient. Students who are interested will find details in  Donald Knuth's book: "The art of computer programming".
 
passes comparisons moves
log2n (nlog2n)/2 2nlog2n

 
 

3.7 Summary of Sort Algorithm Efficiency

 
Type of sort comparisons moves
Exchange n2/2 3n2/4
Selection n2/2 3n
Bubble n2/2 3n2/2
Linear Insertion n2/2 n2/4
Binary Insertion < nlog2n n2/4
Merge  (nlog2n)/2 2nlog2n

 
 

4. Sorting Algorithm 'C' Source Codes

/* Exchange sort */

void sort(void)
{
        int temp,passno,i;
        for ( passno = 0 ; passno < n - 1 ; passno++)
                for ( i = passno+1 ; i < n ; i++ )
                if ( a[i] < a[passno] )
                {
                        temp = a[i];
                        a[i] = a[passno];
                        a[passno] = temp;
                }
}

/*----------------------------------------------------------------------------------------*/

        45      84      12      9       34      1       62      87      3       51
        1       84      45      12      34      9       62      87      3       51
        1       3       84      45      34      12      62      87      9       51
        1       3       9       84      45      34      62      87      12      51
        1       3       9       12      84      45      62      87      34      51
        1       3       9       12      34      84      62      87      45      51
        1       3       9       12      34      45      84      87      62      51
        1       3       9       12      34      45      51      87      84      62
        1       3       9       12      34      45      51      62      87      84
        1       3       9       12      34      45      51      62      84      87

/*--------------------------------------------------------------------------------------*/


/*Selection sort*/
/* Search positions first..n of array                                           */
/* for the position of the smallest value                                       */

void sort(void)
{
        int temp,passno,i,smallest;
        for ( passno = 0 ; passno < n - 1 ; passno++)
        {
                smallest = search(passno);
                temp = a[smallest];
                a[smallest] = a[passno];
                a[passno] = temp;
        }
}

int search(int first)
{
        int tinyest, posn;
        tinyest = first;
        for ( posn = first+1; posn < n; posn++)
                if ( a[posn] < a[tinyest] )
                        tinyest = posn;
        return tinyest;
}


/*----------------------------------------------------------------------------------------*/

        45      84      12      9       34      1       62      87      3       51
        1       84      12       9      34       45     62      87      3       51
        1       3       12       9      34      45      62      87      84      51
        1       3       9       12      34      45      62      87      84      51
        1       3       9       12      34      45      62      87      84      51
        1       3       9       12      34      45      62      87      84      51
        1       3       9       12      34      45      62      87      84      51
        1       3       9       12      34      45      51      87      84      62
        1       3       9       12      34      45      51      62      84      87
        1       3       9       12      34      45      51      62      84      87

/*----------------------------------------------------------------------------------------*/

/*Standard bubble sort */

void sort(void)
{
        int temp,passno,i;
        for ( passno = 0 ; passno < n - 1 ; passno++)
                for ( i = 0 ; i < n-passno ; i++ )
                if ( a[i] > a[i+1] )
                {
                        temp = a[i];
                        a[i] = a[i+1];
                        a[i+1] = temp;
                }
}
/*----------------------------------------------------------------------------------------*/

        45      84      12      9       34      1       62      87      3       51
        45      12      9       34       1       62     84       3      51      87
        12      9       34       1      45      62       3      51      84      87
        9       12      1       34      45       3      51      62      84      87
        9       1       12      34       3      45      51      62      84      87
        1       9       12       3      34      45      51      62      84      87
        1       9       3       12      34      45      51      62      84      87
        1       3       9       12      34      45      51      62      84      87
        1       3       9       12      34      45      51      62      84      87
        1       3       9       12      34      45      51      62      84      87

/*-----------------------------------------------------------------------------------------*/
/* Insertion Sort */
/* Linear search of the sorted portion of the array                             */
/* to find the correct position to insert 'key'.                                */
/* Position returned to 'insert'.                                               */

void sort(void)
{
        int key,passno,i,insert;
        for ( passno = 0 ; passno < n - 1 ; passno++)
        {
                key = a[passno+1];
                insert = search(passno,key);
                for(i = passno; i >= insert; i--)
                        a[i+1] = a[i];
                a[insert] = key;
        }
}

int search(int high, int key)
{
        int insert=0;
        while ( key >= a[insert] && insert <= high )
                insert++;
        return insert;
}


/*-----------------------------------------------------------------------------------------*/

        45      84      12       9      34       1      62      87       3      51
        45      84      12       9      34       1      62      87       3      51
        12      45      84       9      34       1      62      87       3      51
         9      12      45      84      34       1      62      87       3      51
         9      12      34      45      84       1      62      87       3      51
         1       9      12      34      45      84      62      87       3      51
         1       9      12      34      45      62      84      87       3      51
         1       3       9      12      34      45      62      84      87      51
         1       3       9      12      34      45      51      62      84      87

/*-----------------------------------------------------------------------------------------*/

Merge Sort

Example source code for a merge sort program is available.