Hierarchical Queries

SQL queries against relational databases have the property that the results of executing a query is a relation (e.g. it looks like a table).    This is very nice from the perspective of relational database theory since it means relational operations are closed – they take relations as input and return relations as a result.   But from a practical perspective this has some disadvantages most notably that the data you require may not fit naturally into a single table.   If your data is organized hiearchically then you will either need to run multiple queries or else create queries that have duplicate data.

To take a simple example suppose your have a Customer Relationship Management system containing Customers and Contacts with each customer having multiple contacts.   If a user needs to obtain a set of customers and some of their associated contacts then using SQL we have a couple of alternatives for doing this.

1.   We might have a single query that duplicate the customer information on each contact.   Of course sending all that duplicate customer information over the network is going to hurt performance.

Select Customer.*, Contact.* from Customers Inner Join Contacts on Customers.CustomerID = Contacts.CustomerID  where <condition on customers> and <condition on Contacts>.

2.  We might have two queries duplicating the customer selection criteria in each query.

Select Customer.* from Customers where <condition on customers>

Select Contacts.* from Customers inner join Contacts on Customers.CustomerID = Contacts.CustomerID where <condition on customers>  and <condition on contacts>.

This second query could alternatively be written using nested queries as follows

Select Contact.* from (Select Distinct CustomerID From Customers Where <Condition On Customers>) As T1 Inner Join Contacts On T1.CustomerID = Contacts.CustomerID where <Condition On Customers>.

Various alternatives have been tried to get around this practically difficulty.   For example in ADO Microsoft created a difficult to use hierarchical query language and associated hierarchical record sets.   I may not have this exactly correct but the syntax in this example would be something like the following.

SHAPE {SELECT * From Customers Where <condition on customers>} APPEND  ({SELECT * From Contacts where <condition of contacts>} RELATE Customers.CustomerID To Contact.ContactID

Now that Microsoft has moved on to ADO.NET and data sets this I imagine this hierarchical query language will disappear from use as old ADO application are replaced.

In an end user query tool I was recently working on I tried a different approach to the problem.    Instead of having a single query I had a “query tree” where the query was organized into a hierarchy of simpler queries.   This was a visual tool so there wasn’t an actual query language but it worked as if you added a new “hierarchical join” construct to the language.  So you might express the above query as

(Select * from Customers Where <condition on customers>) Hierarchical Join (Select * from Contacts where <condition on contacts>) On Customers.CustomerID = Contacts.CustomerID

The following is a list of the various rules for defining a query tree that I used in the End User Query tool without trying to turn it into a language syntax.   Note that in the query tool all joins were predefined and could not be constructed on the fly.   The query tool returned results either in a database as separate tables for each subquery or in an Excel workbook with a separate worksheet for each subquery.

1.   A query tree is a set of flat queries organized into a hierarchy.   A table cannot appear twice in the same subquery but the same table can appear in different subqueries.

2.   The top query in the hierarchical query can optionally be empty in which case the subqueries are treated as independent queries.    This is useful if you want to group a batch of unrelated queries into a single request.

3.   A subquery can optionally be specified to have no output.   In that case it would still affect the records returned by child or parent queries.

4.   A top level query consists of a single initial table followed by a sequence of inner or outer joins.

5.   A subquery consists of a single initial hierarchical join to a table in its parent query followed by a sequence of inner or outer joins.   

6.   In addition to creating subqueries the user can insert summary queries into the query tree.   Summary queries have no tables in them and can only have a single child query. 

7.   Cross product joins are not allowed in a subquery.   So if you include a 1-N link in a query you can only add another 1-N link if it is linked to the N side of the previous 1-N link.

8.   Each subquery has a list of fields, a selection criteria and sort order.   You can specify that some fields are hidden in which case it can be used in selection critera or computations but won’t appear in the output.

9.   A field in a subquery can either be a constant expression, a field from a table in the subquery, a summary function applied to a field in one of its child queries, or a computed field calculated from other fields in the subquery.

Now that we’ve defined a query tree we need to define how to interpet it and turn into a regular SQL expressions.    The special cases can get tricky (for example how do handle nested summary queries) but some of the basic cases are described below.

 1.   For a regular child query you determine the key fields needed in the join expression for between the child and its parent query and create sql something like

Select <child query fields> from (Select distinct <key fields> from parent query SQL) as T1 on <join expression with parent>  <child query tables and join expressions> where <child query selection criteria>.

2.    In a parent query a summary field based on a child query field would create SQL something like the following.

Select Sum(Select childField from <child query join and select SQL>) As SumOfChildField, etc.

3.    If the selection criteria for a parent query involves summary fields of child queries this is handled by correlated subqueries something like the following.

Select <parent query fields> from parent tables where Sum(Select childfield from <parent table joined with child query> inner join <child table>  on <parent child join expression>  inner join <other child tables>) > 1000.

4.   Summary queries added to a hiearchy would be interpreted as

Select <summary fields> from <child query join sql> where <child query selection criteria> having <parent query select query>

Leave a comment