Heaps and the Heapsort

Introduction to heaps

A heap is a data structure used to implement an efficient priority queue. The idea is to make it efficient to extract the element with the lowest priority - the next item in the queue to be processed. This would also be easy using a sorted linked list, putting the lowest priority at the head of the list, which makes extracting the item at the head of the queue require 1 operation. But this would also require O(N) operations to insert a new item into the queue which is too slow. The heap data structure ensures than only (O)log2N operations are needed to insert an item into the queue, (O)1 operations are needed to read the lowest priority item and (O)log2N operations are needed to restore the heap after this item is extracted.

A heap can be visualised as a binary tree in which every layer is filled from the left. For every layer to be full, the tree would have to have a size exactly equal to 2n-1, e.g. a value for size in the series 1, 3, 7, 15, 31, 63, 127, 255 etc. So to be practical enough to allow for any particular size, a heap has every layer filled except for the bottom layer which is filled from the left.

This usage of the term heap is unrelated to usage for the pool of memory available for dynamic allocation, i.e. using malloc(). It does relate to the winner of a competive process, e.g. a football league, as being "at the top of the heap".

In this diagram nodes are labelled based on position, and not their contents. Nodes are numbered from 1 to size, where size is the number of nodes in the heap. Also note that the left child of each node is numbered node*2 and the right child is numbered node*2+1. The parent of every node is obtained using integer division (throwing away the remainder) such that for a node i it's parent has position i/2 . Because this numbering system makes it very easy to move between nodes and their children or parents a heap is commonly implemented as an array with element 0 unused. The main disadvantage of using an array, as opposed to a dynamically allocated tree, to implement a heap is that a maximum size needs to be specified based on the allocation of the array. A minor disadvantage is that the array will be 1 greater than the actual maximum heap size if array position 0 is unused. The advantage is that this approach makes the programming of a heap less complicated.

The heap structure property

For a heap based on the above structure to be maintained, every layer must be complete except the bottom layer, which must be filled from the left. This means that every node insertion must involve moving a node into the bottom layer filling this from the left. As this structure implements a priority queue, and the highest priority node is removed from the top, deletion of this node requires initially replacing this with one from the bottom layer, removing it from the right.

The heap ordering property

A data structure with the shape described above becomes useful if data within it is organised, so that the key of every node is smaller or equal to the keys of its 2 (or sometimes one) children. For a child to have a key smaller than that of its parent would violate this condition. When a heap is organised like this, it can be useful as a priority queue, because the lowest key will always be at the top of the heap and most easy to remove. This is called a min heap. This ordering property is reversed (a max heap) if it is desired for the highest key should always to be removed first. When a heap is ordered in this manner, insertion and removal of keys will require movement of nodes along a single path from the top to bottom. As the maximum lengths of insertion and deletion bubbling paths is log2n for a heap containing n nodes, this value enables us to derive the maximum numbers of comparisons and moves. The rest of these notes assume a min heap will be used.

Removal of top priority node

Removal of the top node creates a hole at the top which is "bubbled" downwards by moving values below it upwards, until the hole is in a position where it can be replaced with the rightmost node from the bottom layer. This process restores the heap ordering property.

Insertion of new node in heap

When a node is inserted into the heap a hole is first created at the next right position available within the bottom layer. If the bottom layer is full, a new layer is started from the left. If an array is used to implement the heap, a size check must first be performed to avoid array overflow. Values above a hole within the structure are bubbled down into the hole, so the hole "bubbles up" to the position where the hole can receive the value to be inserted while maintaining the heap ordering property.

Heap sort

Using a heap to sort data involves performing n insertions followed by N delete min operations as described above. Memory usage will depend upon whether the data already exists in memory or whether the data is on disk. Allocating the array to be used to store the heap will be more efficient if N, the number of records, can be known in advance. Dynamic allocation of the array will then be possible, and this is likely to be preferable to preallocating the array and terminating the program if this size is too small.

Source Code

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* max array size for heap */
#define MAXHEAP 100

/* DATA could be any record */
typedef char DATA ;

typedef struct heap {
	DATA a[MAXHEAP]; /* array element 0 unused */
	int size; /* size is number of nodes within heap */
} HEAP;


DATA gettop(HEAP *h); /* returns and removes top element from heap */
void insert(HEAP *h, DATA d); /* inserts record d into heap */
int left(HEAP *h,int i); /* computes and returns i's left child index or 0 */
int right(HEAP *h,int i); /* computes and returns i's right child index or 0 */
int parent(HEAP *h,int i); /* computes and returns i's parent index or 0 */
int cmp(DATA a,DATA b); /* compares a and b, similarly to strcmp() 
			   returns -1 if(a<b), 
			   returns 0  if(a == b) or
			   return  +1 if(a>b) */

int main(void){
	HEAP h; 
	char str[]="a quick brown fox jumps over the lazy dog";
	int len,i;
	h.size=0; /* empty heap, no records yet */
	len=strlen(str);
	for(i=0;i<len;i++){ /* assemble heap */
		if(str[i]!=' '){ /* ignore spaces */
			insert(&h,(DATA)str[i]);
			printf("%c",str[i]);
		}
	}
	printf("\ncreated heap of size: %d\n",h.size);
		
	/* print heap out */
	while(h.size)
		printf("%c",(char)gettop(&h));
	printf("\nextracted data from heap, size now: %d\n",h.size);
	return 0;

}

int left(HEAP *h,int i){ /* computes and returns i's left child index */
	if(i>h->size||i<1){
		printf("left: impossible index %d\n",i);
		exit(1);
	}
	if(i*2 > h->size)
		return 0; /* i on lowest layer has no left child */
	return i*2; /* return index of left child */ 
}

int right(HEAP *h,int i){ /* computes and returns i's right child index */
	if(i>h->size||i<1){
		printf("right: impossible index %d\n",i);
		exit(1);
	}
	if(i*2+1 > h->size)
		return 0; /* i on lowest layer has no right child */
	return i*2+1; /* return index of right child */ 
}
	

int parent(HEAP *h,int i){ /* computes and returns i's parent index */
	if(i>h->size||i<1){
		printf("parent: impossible index %d\n",i);
		exit(1);
	}
	if(i==1)
		return 0; /* top has no parent */
	else
		return i/2;
}

int cmp(DATA a,DATA b){ /* compares a and b, 
			   returns -1 if(a<b), 
			   returns 0  if(a == b) or
			   return  +1 if(a>b) */
	if(a<b) return -1;
	else if(a==b) return 0;
	else return 1;
}

DATA gettop(HEAP *h){ /* returns and removes top element from heap */
	DATA top,last;
	int i,j,k;	
	if(h->size == 0){ /* error case for empty heap */
		printf("gettop: empty heap\n");
		exit(1);
	}
	top=h->a[1];  	
	if(h->size == 1){ /* special case for heap with only 1 node */
		h->size--;
		return top; /* heap is now empty */
	} else {
		/* unlink and store last node in heap */
		last=h->a[h->size--];
		/* need to rearrange heap following extract, */
		i=1;
		while(1){
			j=left(h,i);
			k=right(h,i);
			if(j && k){ /* i has both left and right children */
				if(cmp(last,h->a[j])<0 &&
				   cmp(last,h->a[k])<0)
					break; /* heap OK so quit loop */
				if(cmp(h->a[j],h->a[k])<=0){ 
					/* move hole down to left */
					h->a[i]=h->a[j];
					i=j;
				} else {	
					/* move hole down to right */
					h->a[i]=h->a[k];
					i=k;
				}
			} else if(j){ /* i has a left but no right child */
				if(cmp(last,h->a[j])<0)
					break; /* heap OK, quit loop */
				/* move hole to left */
				h->a[i]=h->a[j];
				i=j;
			} else  /* node i now at bottom of heap */
				break;
		}
		/* replace unlinked last node in hole now at lowest layer */
		h->a[i]=last;
		return top;
	}	
}
	
	
void insert(HEAP *h, DATA d){ /* inserts d into heap */
	int i,j;
	if(h->size==0){ /* special case of empty heap */
		h->size++;
		h->a[1]=d;
	} else {
		h->size++; /* create hole at bottom of heap */
		i=h->size;
		while(1){ /* move hole to position where d can be inserted */
			j=i;
			i=parent(h,i);
			if(!i){ /* hole must now be at top so insert here */
				h->a[1]=d;
				return;
			}
			if(cmp(d,h->a[i])>=0){ 
				/* heap OK so insert below parent here */
				h->a[j]=d;
				return;
			}
			/* bubble parent value down, and hole up */
			h->a[j]=h->a[i];
		}
	}
}

Test results

[rich@saturn heap]$ ./heap1
aquickbrownfoxjumpsoverthelazydog
created heap of size: 33
aabcdeefghijklmnoooopqrrstuuvwxyz
extracted data from heap, size now: 0
Note that heaps are inherently unsuited for removing duplicate keys, and are inherently suited for storing data where duplicate keys exist.