11. CGI data storage and concurrency issues


11.1 CGI data storage

CGI programs can read and write files in much the same ways that any Python program can. When combined with approaches explored in previous weeks for tracking users and sessions this enables information to be maintained by the server about the state of an interaction with a particular user, e.g. someone playing an on-line game requiring a number of moves, where the server has to remember the state of play, e.g. the position of pieces on a game board for every game user currently playing.

This data needs to be stored persistently in files or database tables to handle any session requiring more than one move from the user. This is because the CGI program will need to be restarted every time the client makes a move or submits a form or runs the CGI from a URL. When the CGI program has identified the user and the session from the input, it can then read the data associated with that session from the relevant database table or file. The program will then process its move and send data back to the client's browser, and updates this stored data as required.

Python makes it easier to do this, thanks to the object serialisation and storage facilities provided by the pickle and shelve modules. These store single or multiple program objects directly in disk files, with the parsing and unparsing between internal and external storage formats taking place "under the hood" within the pickle or shelve module at object load and save time.

11.2 Designing a database table class for CGI use

We will be looking at data security strategies in the CGI environment in more detail later. Before we do let's consider an approach for storing data using CGI programs, that makes these easier to write. Much real-world data consists of various facts about a number of things, objects or persons, where a particular object can have a number of facts stored about it. The collection of facts (called fields) about a particular object are called a record. For example a record might store the mark, student number and name of a particular student. If we have more than 1 student to store similar facts about we would have a record for each one.

In database parlance, this kind of collection of data is thought of in terms of rows and columns, where a set of facts about an object is a record or row, while the same fact or field about a number of different students (e.g. their ages in years) is represented as a database column. Here is an example:

Patrick Murphy50.933
Ravi Shankhar66.629

One reason for designing a class for this requirement, instead of just using functions, is that this allows us to have multiple objects of this class within the same CGI program, with each object corresponding to a particular table which is stored within a named file. For example in a product ordering system, one table might store current product names, identifying codes and prices, another table would store the dispatch addresses of known customers, while a third table might store items within the current "shopping trolley" of a particular customer.

The Python language, as we shall see, is powerful and flexible enough to provide useful CGI program facilities for managing and presenting all of these table objects within the source code of a single class, while being simple enough for relatively inexperienced programmers to use such a class relatively easily compared to other languages.

11.3 CGI table class methods


For a class to manage one such collection of data (a table) flexibly and usefully in a CGI environment the following operations (methods) need to be defined:

  1. Loading the table from a file.     For this data to be consistent across users and uses of the same CGI program it will be stored in a file when the program isn't running. Before the table can be modified it will be loaded from the file. We will be using the pickle module to handle the details, as this can load and save any structured object from and to a file.
  2. Constructing objects of the table class     We will call the table load method when an object of our table class is created. If the table already exists in a pickle file it will be loaded. The first time we create an object for a particular table the file won't exist and this table will be empty, i.e. having 0 rows.
  3. Adding a record    Some collections of data, e.g. hexadecimal samples from a data logger taken at specific time intervals might not need to have a unique "key" for each record. Many tables do need keys to identify records, and in this case a particular column is typically used to store the key field. E.G. a student number will be used uniquely to identify a student within a student records database. In this situation a new record would only be added to this collection if the keyfield within the new record wasn't already present within another record within the table. It would be incorrect, for example to store 2 module marks for the same module for the same student, so before adding the record a check will be made to ensure a record for the same student is not already present.
  4. Modifying a record    For a record already in the database table to be modified it first needs to be identified. Again this is achieved through use of a keyfield, or unique column value. Any other fields apart from the keyfield could be modified. If a keyfield is to be changed, the record would instead be deleted and then added.
  5. Deleting a record    Again for a record to be deleted it has to be identified through a keyfield or unique column.
  6. Finding a record    In order to identify a record for modification or deletion it has to be found. In the case of records stored as a list, with one record per list item, finding a record means obtaining the list index, which in Python is an integer starting at 0. Also if adding a record is done in a manner to avoid duplicating a unique keyfield, if it already exists this fact has to be determined by finding the record whose unique column already has this value. For our purposes we will therefore specify the keyfield or unique column as a parameter to our find method. This method will return either the index (0 or greater) of the record found within the list, or the impossible index value of -1 if the record is not present in the table.
  7. Displaying the records     To be able to print the data out we will map it to an HTML table. Column heading are displayed within <th> start and end tags. Rows will start with <tr> tags and end with </tr> tags. Within each row each field will start with <td> tags and end with </td> tags . This will be achieved by nesting a loop of field print statements within a loop of row print statements.
  8. Saving a modified table to the file    Once the table has been changed using the add, modify or delete methods the table will be saved to its file using the pickle module to handle the details.

Other table operations

Sometimes it might be useful to be able to display a subset of records, e.g all students with marks between 40% and less than 60%. However, if we develop a class which can handle the above generic operations, a specialised method could be provided within a subclass designed to suit a particular application. It is possible to imagine many other specialised operations to suit particular requirements, e.g. where multiple columns are used to obtain a unique row index which would require a specialised find method. We are likely to achieve a cleaner and more reusable design if we organise these specialised application specific details into subclasses of our main table class using inheritance.

11.4 CGI table class data

What data will be associated with an object of this class and how will we organise this data ? One way to store this table would be as a list of lists. While this is a feasible approach, I will use a list of dictionaries, so that the user doesn't have to remember in which order the columns are stored. Each row will correspond to a dictionary. Each column will be a set of items with the same dictionary key, but having different values in each row or dictionary.


Each object of this class, or table of data will require a unique filename for saving and loading purposes.

Meta data

To be able to print the data out we will need to map it to an HTML table. To do this, it will be useful to store some additional data for each column. This "meta" data (or data about data) will include:

  1. The order in which we want to print out the columns (Python dictionaries are unordered)
  2. The column headings, which might be the same as the key names, or more descriptive if required. E.G a suitable dictionarly key name for a column might be "snum" but the column heading might be "Student Number".
  3. The escape codes used for formatting columns within a Python print statement, e.g. %.2f could be used to print a floating point column containing money data to 2 decimal places, e.g. to represent pounds and pence, placing a literal pound sign () in front of the float conversion.

While a dictionary used as a row is inherently unordered, we need to know in which order to print the columns. To do this I will use a dictionary of lists for the meta data, one list for the column names, another list for the column heading texts, and another list for the format specifiers, so that all these lists are in same order as that in which columns will be printed.

11.5 CGI table class source code

   Class to provide support for simple database tables for
   CGI applications with semi-persistent data.

   table class contains methods to load, save, print as HTML,
   and to find, add, remove and modify rows. Objects store a
   list of dictionaries, each dictionary is one row. Objects also
   store meta data, used to control column operations,
   and the filename in which the updated object is
   saved as a pickle file.

   See tester() code for usage examples, and read the
   source code and Notes week 11 for the full documentation.

   Distributed under the terms of the GNU Public License,
   version 2 or higher, see http://www.gnu.org/ for details.

   This program comes without warranty of any kind.

   Richard Kay, 03/04/02"""

import cgiutils

ColumnException="column error" # incorrectly coded column value
                               # in application using the table class
class table:
  """ class which stores a table using a pickle file and do
  add/mod/del row operations on the table. table data is a list of
  dictionaries each dictionary is for a row. meta data is a dictionary
  of lists each list holding ordered information about columns for
  display purposes """

  def __init__(self,meta,filename):
    # meta is information about data, is a set of lists in matching order.
    # each element of a list is information about a column (col) in data.
    # meta["keys"] = list of keys in the column order of printing
    # meta["colheads"] = column headings for printing
    # meta["formats"] = print formats e.g. '%.2f', '%s', '%d'
    # filename is where the data is loaded from or saved to
    # data is a list of dictionaries saved in filename

  def load(self): # loads data from file, or empty list
    """ loads serialised data object from pickle file """
    import pickle
      return data
    except(IOError): # at start no rows in table
      return []

  def save(self):
    """ saves structured object into pickle file"""
    import pickle

  def find(self,value,column):
    """ returns index of 1st record whose column equals value or returns -1"""
    if not column in self.meta["keys"]:
      raise ColumnException
    for rec in self.data:
      if rec[column] == value:
        return i
    return -1 # value not present in column

  def addrow(self,row,uniq_col=None):
    """ adds row. If uniq_col specified, row with same column value must
        not already exist or row will not be added. Other data must be
        validated before calling this method.  """
    keys=self.meta["keys"][:] # need a copy of keys list, not another
    # reference to it or sort will put meta["keys"] into wrong order !!
    if ((not keys == rowkeys) or
        (uniq_col and not uniq_col in keys)):
      raise ColumnException
    if uniq_col:
      if atrow != -1:
        cgiutils.html_end("Can't add record, keyfield already exists.")
    self.data.append(row) # add row at end of data

  def modrow(self,row,col):
    """ modifies row. col required and row with same column value must
        already exist or row will not be modified. Other data must be
        validated prior to calling this method."""
    keys=self.meta["keys"][:] # need a copy of keys list, not another
    # reference to it or sort will put meta["keys"] into wrong order !!
    if ((not keys == rowkeys) or
        (not col in keys)):
      raise ColumnException
    if atrow == -1:
      cgiutils.html_end("Can't modify record, keyfield doesn't exist.")
      self.data[atrow]=row # change row position with modified data

  def delrow(self,value,col):
    """ deletes row. col required and row with same column value must
        already exist or row will not be deleted """
    if not col in keys:
      raise ColumnException
    if atrow == -1:
        cgiutils.html_end("Can't delete record, keyfield doesn't exist.")
    del self.data[atrow] # remove row whose col == value

  def tab2html(self,caption='',align='center',bgcolor='#ffffff',border=5):
    """ prints data as a HTML table using meta data for formatting """
    print "<hr>"
    print "<p>"
    print """<table align="%s"
                    border=%d >""" % (align,bgcolor,border)
    if caption: # print caption
      print "<b><CAPTION> %s </CAPTION></b>" % caption
    # print column headings
    print "<tr>"
    for heading in self.meta["colheads"]:
      print "<th> %s </th>" % heading
    print "</tr>"
    # loop through rows
    for row in self.data:
      print "<tr>"
      # loop through cols
      for key in self.meta["keys"]:
        formatstr="<td> %s </td>" % self.meta["formats"][i]
        # substitute %s|%f|%d type format conversion
        print formatstr % row[key] # substitute value
      print "</tr>"
    print "</table>"
    print "</p><br>"

11.6 Testing and debugging the table class

To do this we will be adopting a similar approach to the one we used to test cgiutils.py . A generally accepted python convention is to provide some test code so that when a Python module is run as a standalone program the test routine exercises the classes and functions provided within the module. This also provides some of the source code which can be adapted for suitable applications (e.g. the meta data which defines the table column output formatting.)

def tester():
  """ Runs tests on table class. Each use of the menu is intended to
      create valid HTML output which can be cut/pasted/saved using
      an HTML source editor e.g. Quanta plus. HTML validation/viewing
      can be done using HTML development environment. The table class
      will create a pickle file: temp.pkl and use this to store data
      between program runs. """
  # meta data for test table
        "colheads":["Student No.","Name","Mark"],
  recs=[{"snum":"12345670","name":"Peter Smith"   ,"mark":45.2},
        {"snum":"01234567","name":"Wu Chan"       ,"mark":68.7},
        {"snum":"80123456","name":"Siobhan Murphy","mark":59.2}]

  while 1:
    # instantiate table object with no records (first time round)
    print "<pre>" # menu works interactively and in a html
    print "enter   to test"
    print "  1     add record"
    print "  2     modify record"
    print "  3     delete record"
    print "  4     show table as HTML"
    print "  5     Quit"
    test=int(raw_input("option: "))
    print "</pre>"
    if test == 1:
      recno=int(raw_input("which record to add (0-2): "))
    elif test == 2:
      recno=int(raw_input("which record to change (0-2): "))
      mark=float(raw_input("enter new mark (0.0-100.0): "))
    elif test == 3:
      recno=int(raw_input("which record to delete (0-2): "))
    elif test == 4:
      tbl.tab2html(caption="unit test code")
    elif test == 5:

if __name__ == "__main__": # unit test condition

11.7 CGI program concurrency and file storage

When we run a Python program interactively in standalone mode a single user runs the program at any one time. This may seem obvious, but this fact can lead to wrong assumptions which we can't afford to make in a high volume CGI environment where many people unknown to each other will be using the same CGI program on the same server at the same time. Of course in a low-volume test and development environment the "single-user at any one time" assumption might still be effectively valid, because if there are a few users of an application the chances of a file update collision or "race condition" through simultaneous use is low.

The problem arises when more than one person attempts to update a file on the server using a CGI program at the same time. This can cause the file to become corrupted.

Generally one of the following approaches are taken to address this issue:

a. Live with the risk

It may be possible to estimate and accept the risk that data can become corrupted. To give a very rough idea of this risk, for example if there are 2 updates of a file every hour and each update takes 1 second to complete, the risk of both updates occuring during the same second is 1/3600 within a particular hour. If this risk were taken every hour of the week, and exclusive access for 1 second were required to successfully update this file, the chances of a simultaneous update during a particular week would be approximately 1/3600 (chance per hour) x 168 (no. of hours in a week) = .047 or about 1 in 20. In this situation we could expect the file to become corrupted and need restoring from backups (you are keeping regular backups aren't you ?) perhaps once every 20 weeks.

In practice the calculation will be more complex, as an accurate estimate either requires very extensive testing, or detailed knowledge of how the operating system and disk drives etc. update files. However, if it is possible to identify when file corruption occurs, and notify a maintainer in this event and restore data from backup copies it may be possible to accept temporary loss of system availability and live with this risk, e.g. for an experimental or low volume application.

b. Use file locking

Instances of the CGI program must obtain and release a resource lock associated with file update access, such that:

  1. i. Only 1 program instance can obtain the lock at any one time.
  2. ii. A program must obtain the lock before it can update the file associated with the resource lock.
  3. iii. The program releases the lock after the file is updated.

This approach is cheap, in the sense that memory does not need to be permanently occupied by a continuously running database management system or queued update process. However, it will probably not achieve the same throughput as a multi-user database product designed for the purpose. This is especially true if the CGI program doesn't run on the same computer as that managing the filesystem on which the table is stored, as will be the case with network-attached file storage systems.

Issues which need to be addressed with file locking software include how to handle stale locks, e.g. when a process has locked a file and died before unlocking it. Another design issue to consider is whether to timeout a blocked file-update request or let it go on retrying indefinitely. One approach which may be taken with stale locks is to break them after a certain period of time, but this carries a small risk of file corruption if the process which had acquired the apparantly stale lock comes back to life unexpectedly after the lock is broken.

c. Use a queue or FIFO arrangement

It is possible to arrange for CGI program instances to feed update instructions independantly to another program as a queued sequence. In this arrangement a number of CGI programs can feed data into the back of the queue or FIFO (First In First Out), but only one program (responsible for all updates) can dequeue data from the front of the queue and use it to update the protected file. We can picture this arrangement as being like a number of people independently joining the back of a bus queue, with the bus driver signalling people at the front of the queue to get onto the bus one at a time. In this case the multiple CGI programs are acting independently of each other in putting update requests at the back of the queue and the single update program, or bus driver, is responsible for updating the protected resource in a safe manner.

Operating systems such as Unix and Linux provide FIFO's as memory structures. This is fast and practical for programs running on the same CPU. However, for multiple CGI program instances operating on a cluster of networked machines using a shared network attached filesystem a different approach is needed. This configuration is typically operated by a customer website-hosting company. Here a FIFO can be implemented by using unique filenames within a particular directory on the filesystem to queue each table-update request. These filenames will combine the time queued, the computer host identifier and process ID of the CGI program creating it, guaranteeing all filenames will be unique and (if dequeued based on the time queued) the file that has been in the queue longest is always at the front of the queue.

Using this approach, we could either run the protected update process all of the time, or would need a resource-locking facility as in b. to ensure that only one instance of the update program (the bus driver in the analogy above) is allowed to continue into its protected region (the bus driver's seat) if a new instance of the update program is needed. This approach is more efficient than file locking if the pattern of updates for the protected table is very "bursty" with long periods of inactivity interspersed with occasional flurries of activity involving multiple table updates occuring close to each other. The advantage in this situation is that no continuously running process occupying memory is needed because the update program can be allowed to terminate once activity has ceased for a certain period. E.G. the bus driver goes to the rest room (the update program terminates) after he or she hasn't seen any passengers for 5 minutes or if another bus driver is already in the driver's seat (the update program detects another instance of itself with a lower process ID).

However, a single resource lock and unlock cycle (to ensure that you never have 2 bus drivers dequeing passengers at the same time) can be used to secure a greater number of table updates. To continue with the bus queue analogy, you could either have one bus driver on rota at all times (read program occupying memory), or you could have a pool of bus drivers one of whom will come out of the rest room and try to get into the driver's seat when a passenger arrives at an empty bus and rings a bell. If 2 passengers arrive at an empty bus at the same time, and ring for a bus drivers at the same time, a resource lock ensures that only 1 bus driver can get into the driver's seat (only one update program instance can enter its protected region at the same time). By placing the resource lock on the driver's seat, as opposed to creating and removing a resource lock each time a passenger gets onto the bus (i.e. each time the protected file is updated) greater throughput can be obtained because fewer locks need to be set and unset to achieve the same number of updates, and the critical region of the update process can handle updates sequentially at greater speed.

d. Use a concurrent multi-user database software product

For very many CGI environments this is the preferred option. Various free and commercial products use the techniques described above (e.g. FIFOs and resource locking at the table or row level) to provide data persistence within the multi-user environment. Such products include MySQL, Postgresql (free software), and Oracle and Informix (proprietary). Python has various DBI (Data Base Interface) modules to connect to these popular RDBMS (Relational Database Management Systems) in a portable manner. One very clear advantage is that programmers writing applications for these tools can delegate the concurrency issues described above to the RDBMS, without needing to give these as much consideration.

Some of the more advanced of these (e.g. PostgresQL and Oracle) also manage transactions. The transaction concept requires a transaction, or set of logically connected updates, to occur in its entirety or not at all. This is achieved through use of a log where the state of rows affected by a transaction before the transaction are archived, and the fact that the transaction is started are logged. If the transaction does not complete, e.g. due to a system crash leaving the completion of the transaction in an unknown condition, this fact is verified from the log on restart of the database management system, and the transaction is rolled back to the initial state. An example of such a transaction might be where a purchased item is recorded as having been picked up by a delivery company in a dispatch notes table, and this item is also deducted from a warehouse items in-stock table. In this situation it should be impossible for one table to be updated, without the other update also taking place.

RDBMS will generally involve a number of background processes occupying memory continuously, and web-hosting companies will charge their customers higher fees for hosting web sites which using these more advanced facilities.

11.8 The Mailman LockFile.py module

Which concurrency approach described above suits us best ?

Initially the cheapest and simplest is to accept the risk. However, this is not the cheapest option if there is any significant cost associated with loss of data and system downtime. For someone operating a low-budget active website, using a RDBMS is robust, but this can double the site-hosting price. The scope of the syllabus (Website Programming Applications IV) for which these notes were prepared do not extend to relational database management and SQL, the Structured Query Language used to program these resources. To help students climb the learning curve described above starting with simpler, lower-volume web-sites I will describe an approach which reuses the Python file-locking module supplied with the Mailman free-software web-based mailing list management system

The advantage of this approach is therefore partly cost, partly about keeping these notes and demonstration source files simple enough for student use, and partly that the Mailman LockFile module already exists, is very robust, is reusable when combined with other free software released under the GNU Public License, and is relatively easy to import and interface to the tablcgi module described above.

The part of the LockFile.py source file which documents much of the application interface is reproduced below. The entire module is provided as a download . To find out more about it you will have to read its source code. The download contains the source obtained from http://www.list.org/ except for a minor modification to create logfiles in the working directory, instead of a temporary directory.

Exerpt from LockFile.py taken from Mailman sources

# Exceptions that can be raised by this module
class LockError(Exception):
    """Base class for all exceptions in this module."""

class AlreadyLockedError(LockError):
    """An attempt is made to lock an already locked object."""

class NotLockedError(LockError):
    """An attempt is made to unlock an object that isn't locked."""

class TimeOutError(LockError):
    """The timeout interval elapsed before the lock succeeded."""

class LockFile:
    """A portable way to lock resources by way of the file system.

    This class supports the following methods:

    __init__(lockfile[, lifetime[, withlogging]]):
        Create the resource lock using lockfile as the global lock file.  Each
        process laying claim to this resource lock will create their own
        temporary lock files based on the path specified by lockfile.
        Optional lifetime is the number of seconds the process expects to hold
        the lock.  Optional withlogging, when true, turns on lockfile logging
        (see the module docstring for details).

        Set a new lock lifetime.  This takes affect the next time the file is
        locked, but does not refresh a locked file.

        Return the lock's lifetime.

    refresh([newlifetime[, unconditionally]]):
        Refreshes the lifetime of a locked file.  Use this if you realize that
        you need to keep a resource locked longer than you thought.  With
        optional newlifetime, set the lock's lifetime.   Raises NotLockedError
        if the lock is not set, unless optional unconditionally flag is set to

        Acquire the lock.  This blocks until the lock is acquired unless
        optional timeout is greater than 0, in which case, a TimeOutError is
        raised when timeout number of seconds (or possibly more) expires
        without lock acquisition.  Raises AlreadyLockedError if the lock is
        already set.

        Relinquishes the lock.  Raises a NotLockedError if the lock is not
        set, unless optional unconditionally is true.

        Return 1 if the lock is set, otherwise 0.  To avoid race conditions,
        this refreshes the lock (on set locks).


11.9 Adding file locking to the tablcgi module

The tablcgi.table() class methods concerned with updating the file used to store the table are addrow(), modrow() and delrow() . These all update the file using the save() method, which turns out to have been a convenient design choice, as it greatly simplifies the addition of file locking, by doing all file updating in one place.

The approach taken is to import LockFile and use this to create a resource lock object and request a lock on it prior to opening, saving and closing the file, and to attempt to unlock the resource lock afterwards in all cases. If an exception is raised it is assumed that the requested update did not take place. The return code of save() is set to 1 to indicate likely success, and 0 to indicate likely failure. This return code from save() is then returned by addrow(), modrow() or delrow() so that suitable error or success actions can be taken by the CGI application. For example, if a file can't be updated because it is too busy to obtain a lock on it within a reasonable period (say 15 seconds) the user attempting to carry out this update might be asked to try again. In the case of an update which gathers statistics but isn't a critical part of a transaction (e.g. counting the number of posts by a user when posting a message to an email list) the user could either be told about the failure of this part of the transaction or a design choice might be made to ignore it.

The modified save method of tablcgi.py is shown below:

  def save(self):
    """ saves structured object into pickle file. Uses Mailman
    LockFile() module. returns 1 if update successful, otherwise 0."""
    import pickle
    import LockFile
    # change line below to =0 to switch off logging
        lock.lock(15) # set a lock of 15 secs duration on the table resource
        lock.unlock() # must always attempt to remove the lock afterwards
    except LockFile.LockError:  # superclass catching all LockFile exceptions
      return 0 # didn't update stored table
      return 1 # did update stored table

The only other change needed is to the last lines of addrow(), delrow() and modrow() so that instead of just calling the self.save() method, they call this method and return the 0 (failure) or 1 (success) return code of self.save to the calling application as follows:

    return self.save()

This extended version of tablcgi.py together with LockFile.py is included in the downloads with a set of CGI and HTML component source files which implement a simple mailing list management CGI in the downloads section, which will be described in next week's notes.