in070304


-----Original Message-----
From: John Murtari [mailto:jmurtari@thebook.com]
Sent: Saturday, February 24, 2007 3:15 PM
To: Bennett, Barbara; ANSI, ATTN PSA Dept.
Cc: Keith W. Hare; Deutsch, Donald
Subject: Public Review Comments on Proposed SQL Standard: ISO/IEC FCD 9075-2

DATE: February 24, 2007

REGARDING:

 Public Review Comment on Proposed SQL standard: ISO/IEC FCD 9075-2
 Title: Information technology - Database Languages - SQL - Part 2:
        Foundation (SQL/Foundation)

TO:  "Barbara Bennett" <bbennett@itic.org>, "ANSI, ATTN PSA Dept." <psa@ansi.org>
CC: "Keith W. Hare" <Keith@jcc.com>, "Don Deutsch" <Donald.deutsch@oracle.com>

SUBMITTED BY: John Murtari, jmurtari@thebook.com,
          Software Workshop Inc., 55 E. Genesee St., Baldwinsville, NY 13027
          Office: 315-635-1968, x-211


Part I: SUMMARY OF PROPOSED CHANGE

That the SQL standard be extended with the addition of an optional language statement to display increased statistical data regarding database server activity.  This change to be referred to as 'Extended Statistics'.


1. Justification:

The advent of the Web has seen an explosion of SQL database usage in a user community which has less training as DBAs and has high performance requirements. At present statistical reporting is a very implementation specific area.  If statistics are available they are usually at the server level, with very little support for reporting activity at the individual database or user.  Some implementations require the DBA configure complex modules to achieve usage recording.

The goal of this change is to provide DBAs and also application developers with an easy way to monitor database server activity. To improve user satisfaction with their SQL database by making it easier to monitor performance and identify bad application usage patterns and other bottlenecks.


2. Syntax Summary/Examples:

The summary syntax for the proposed new statement is:

  SHOW STATISTICS ( * | Var list) FROM Class [WHERE Var ( '<' | '>' | '=' ) num]
     [LIKE 'Instance pattern'] [ORDER BY Var] [HISTORY]  [LIMIT num]


In usage a DBA could enter:

  SHOW STATISTICS select, update, insert FROM user WHERE
                select > 100 HISTORY LIMIT 24

and see a display of all user activity for the last 24 hours in hourly increments with the number of queries of each type issued by each system user.

An implementation specific configuration element would allow the DBA to call for collection of specified data, e.g.

   statistics_class_list="
   user,max-50,time-60,units-m,(queries,select,update,bytes_sent),
   db,max-50,time-10,units-h,(bytes_sent,aborted_clients,qcache_hits),
   host,max-5,time-3,units-d,(select,update,connects,bytes_sent)"

In the above we want the server to track information on three different Classes: user, db, and host. 

For each Class, the 'max-' precedes the number of Instances of that Class to be supported.  The 'time-' precedes the number of historical time units to store and 'units-' designates the period of interest:
'm' minutes, 'h' hours, and 'd' days.


3. Alternatives:

The present standard supports the INFORMATION_SCHEMA as a way to support meta data collection from the server.

    standard: SELECT SCHEMA_NAME AS `Database` FROM
        INFORMATION_SCHEMA.SCHEMATA [WHERE SCHEMA_NAME LIKE 'wild']

    MySQL alternate: SHOW DATABASE [LIKE 'wild']

This may present a valid approach, although with a goal of ease of use, the amount of syntax required may be intimidating.  Perhaps either implementation could be allowed.


4. Performance:

The collection of statistical data with finer granularity and the storage of historical data will have a performance impact on the database server.  The amount of impact is server dependent.  This definition allows each implementation to define what can be collected and gives DBAs control over the extent and frequency of data collection.


5. Security:

At present some database servers allow any user access to gross figures on server performance, i.e. total traffic, number of queries, etc...  This proposed change would allow data to be collected and reported on a per user/host/database basis.  This would definitely call for better control and each implementation should support a configuration element which specifies the users allowed to have statistical access, e.g.

    statistics_users="john,domenic,christina"



Part II: SYNTAX and SEMANTIC DETAILS

(Note: I apologize for not including BNF productions for the proposed syntax.  Time was limited as these review comments were prepared.)

1. Overview

In Extended Statistics the different types of items that can be reported on are called Classes.  In its present version these are:

    db - a database.
    user - a user.
    host - the host machine of a connection.
    server - a database server.

Within a Class, specific instantiations are called instances. An instance of a user could be domenic or christina.  Implementations would be allowed to define additional Classes.

For each instance we are allowed to track individual items called Vars.  These could be items such as: number of select queries, bytes sent, cache hits, aborted connections.  Some Vars would be shared among Classes, other could only have meaning when a specific class is selected.

The implementation would define the total available set of Classes and Vars supported.  A configuration element would allow the DBA to configure the extent of data collection, e.g.
 
    statistics_class_list="
    user,max-50,time-60,units-m,(queries,select,update,bytes_sent),
    db,max-50,time-10,units-h,(bytes_sent,aborted_clients,qcache_hits),
    host,max-5,time-3,units-d,(select,update,connects,bytes_sent)"

For each Class, the 'max-' precedes the number of Instances of that Class to be supported.  The 'time-' precedes the number of historical time units to store and 'units-' designates the period of interest:
'm' minutes, 'h' hours, and 'd' days.


2. Language syntax

A goal of the implementation was to keep the syntax used to query the statistics as close to standard SQL as possible.  The syntax described below should look very familiar, especially if you replace SHOW STATISTICS with SELECT.


SHOW STATISTICS ( * | Var list) FROM Class [WHERE var ( '<' | '>' | '=' ) num]
     [LIKE 'Instance pattern'] [ORDER BY Var] [HISTORY]  [LIMIT num]


The significant differences are the use of LIKE to match specific Instances and the new keyword HISTORY to produce historical output.
The LIMIT option operates differently depending on whether HISTORY has been specified.  Without HISTORY it operates as in normal SQL, limiting the number of result rows displayed.  If HISTORY is chosen it functions as a time limit, e.g. show me just the last 3 time units for each user.


3. Usage examples


# Limiting output to just two rows.
mysql> SHOW STATISTICS * FROM user LIMIT 2;
+-------------+--------+--------+------------+
| user        | select | update | Bytes_sent |
+-------------+-----------------+------------+
| kelly       | 8      | 0      | 27350      |
| brian       | 1      | 0      | 28303      |
+-------------+--------+--------+------------+
2 rows in set (0.00 sec)


# Using HISTORY to see historical data. 0 represents the current hour, # -1, -2, etc. are the prior hours. Use of WHERE clause to remove # uninteresting data.

mysql> SHOW STATISTICS
mysql> select,insert,Queries,Connects,Qcache_hits,Bytes_sent
       FROM user WHERE select > 5000 HISTORY limit 2;
+----------+-------+--------+--------+---------+----------+-------------+------------+
| user     | hours | select | insert | Queries | Connects | Qcache_hits | Bytes_sent |
+----------+-------+--------+--------+---------+----------+-------------+------------+
| soulmate | 0     | 5289   | 127    | 14623   | 548      | 7432        | 5775053    |
| soulmate | -1    | 22502  | 590    | 60238   | 2717     | 28737       | 25642797   |
| mandala  | -1    | 39282  | 659    | 156091  | 1971     | 111875      | 117291646  |
| handymao | -3    | 5082   | 21     | 11549   | 991      | 2998        | 6681964    |
+----------+-------+--------+--------+---------+----------+-------------+------------+
4 rows in set (0.00 sec)


# Using LIKE to limit hosts displayed. The LIKE clause automatically matches # against the instance of the class.
mysql> SHOW STATISTICS * FROM host LIKE '%kids%' ORDER BY Bytes_sent;
+--------------------+--------+--------+------------+
| host               | select | update | Bytes_sent |
+--------------------+--------+--------+------------+
| www.akidsright.org | 8      | 0      | 53113      |
| s1.kids-right.org  | 2      | 0      | 495        |
+--------------------+--------+--------+------------+
2 rows in set (0.00 sec)                        



Part III:  FUTURE OPTIONS / EXPERIENCE / EVALUATION

A working implementation of Extended Statistics will soon be released for the MySQL database server.  Committee members may have access for test purposes to a prerelease version of a MySQL server supporting Extended Statistics.

In our experience as a hosting provider, we have many different users accessing MySQL servers in support of their websites.  We were frustrated by an inability to easily track high server utilization to a specific user/database.  Extended Statistics has allowed our DBAs to monitor usage and has become a very useful tool.

While the statistical data itself is read only, we can already see a need to allow a DBA to define and redefine Classes and Vars to be tracked while the server is running.  It would seem a very small reach to also define a 'trigger' mechanism that would alert a DBA when usage patterns exceeded a preset threshold.

I welcome feedback from all on this proposal and the guidance from committee members who have much more database and standards experience.

Best regards!

--
                                          John ___________________________________________________________________
John Murtari                              Software Workshop Inc.
jmurtari@thebook.com 315.635.1968(x-211)  "TheBook.Com" (TM) http://www.thebook.com/