The In-Place Database Processing (IDP) Technology



Contents


Technical Overview

The Purpose and Advantages of IDP

Summary. The In-Place Database Processing (IDP) is an advanced database access technology developed at StatSoft to support high-performance, direct interface between external data sets residing on remote servers and the analytic functionality of STATISTICA products. The IDP technology has been developed to facilitate accessing data in large databases using a one-step process which does not necessitate creating local copies of the data set. IDP significantly increases the performance of STATISTICA; it is particularly well suited for large data mining and exploratory data analysis tasks. IDP technology also provides a security advantage in that data never leave the secure database (remain in the database at all times).

The source of the IDP performance gains. The speed gains of the IDP technology - over accessing data in a traditional way - result not only from the fact that IDP allows STATISTICA to access data directly in databases and skip the otherwise necessary step of first importing the data and creating a local data file, but also from its "multitasking" (technically, asynchronous and distributed processing) architecture. Specifically, IDP uses the processing resources (multiple CPUs) of the database server computers to execute the query operations, extract the requested records of data and send them to the STATISTICA computer, while STATISTICA is simultaneously processing these records as they arrive.

Compatibility with STATISTICA products

The IDP technology can be used with both desktop and enterprise versions of STATISTICA products and it is fully compatible with the Client-Server architecture of WebSTATISTICA (the requests can be made over the Web and data processed asynchronously by WebSTATISTICA Server computers connected to the (next-tier) database server computers which will execute the queries). IDP is also optimized to seamlessly integrate with STATISTICA Data Miner which supports multiple IDP data input channels.

Architecture and Programmability

The IDP technology is implemented around a COM object which wraps an instance of a Microsoft Active Data Object (ADO) Recordset object and implements a subset of the Spreadsheet COM interface in the STATISTICA Object Library. This works because all STATISTICA 6.0 Analyses access the source Spreadsheet data via the Spreadsheet interface. (Actually the InputSpreadsheet interface, which has a subset of the Spreadsheet interface methods. This InputSpreadsheet interface is normally hidden in the Object Browser but can be seen by right-clicking in Object Browser and selecting "Show Hidden Members".) Therefore, to a STATISTICA Analysis, the IDP looks just like a Spreadsheet. Indeed, advanced users of STATISTICA could wrap an InputSpreadsheet interface around any data source at all, and perform STATISTICA Analyses on it programmatically via the STATISTICA Object Model.

Behind the scenes, certain steps must be taken by the spreadsheet wrapper object to make Analyses work seamlessly. For instance, if an Analysis requires the number of cases in a Recordset before that information is known, then either a separate "count" query will be executed synchronously (i.e., the analysis must wait until the count query returns before continuing) and the result returned to the analysis, or some arbitrary upper bound on the case count will be returned immediately. This behavior is configurable on the IDP page of the STATISTICA options dialog. Also, if using a forward only cursor (see below) and the Analysis must make multiple passes through the data or access the data in random order, then any request for a previous case (row) forces the IDP to requery the database and advance the cursor forward to the requested case, since the cursor may not be scrolled backwards. The Analysis would simply wait until this process is completed and the requested data were provided to it.

In the IDP Type Library, there are two main interfaces. DBTable provides programmatic access to the IDP Document, much as the Macro, Graph, and Spreadsheet interfaces provide access to STATISTICA Macros, Graphs, and Spreadsheets. In addition to the standard document methods and properties (Visible, Activate, Close, etc) it provides access to all IDP specific options (cursor type, location, query string, etc.) Its read-only property "Spreadsheet" returns the Spreadsheet wrapper around the ADO Recordset.

The second interface is DBSpreadsheet. This interface is used internally by the IDP to create the Spreadsheet wrapper object, and could also by used by users writing their own macros or programs, although in most cases the DBTable interface is sufficient and will itself use a DBSpreadsheet object. This interface has two methods, Open and CreateNew. Open executes the supplied query and opens an ADO Recordset. It creates a Spreadsheet wrapper object and attaches the ADO Recordset to it, and returns this Spreadsheet object. CreateNew creates a Spreadsheet wrapper object which is not attached to any Recordset and therefore is not useable until you call its "SetRecordset" method to attach an ADO Recordset object of your own creation.

Frequently Asked Questions

How does the IDP work?

The IDP creates an object that implements the STATISTICA Spreadsheet COM interface and which wraps an instance of a Microsoft ADO Recordset object. Since all STATISTICA 6.0 Analyses access the source Spreadsheet's data via this interface, this wrapper object looks just like any other Spreadsheet to the Analysis.

Why does the IDP sometimes only display one case when I set "Preview the first n cases" to some number greater than one?

When using a forward-only cursor, the IDP does not advance the Recordset beyond the first case (row) because then those first n rows would be lost and when an Analysis requested that data, the query would have to be executed again. "Preview the first n cases" only applies when using a static cursor.

What is the purpose of the "case count" method chosen via the STATISTICA options dialog?

Depending on the options you have chosen, the exact number of cases may not be known yet when an Analysis requires that information. If this is the case, then one of two things will happen, depending on your setting for this option. Either a separate "count" query will be executed, or an arbitrary, user chosen upper bound will be returned to the Analysis.

When would the exact number of cases not be known?

In these situations, the exact case count is not known:

    a) Using a forward only cursor and have not advanced to the end yet
    b) Using a static cursor with asynchronous query and/or fetch and the query and/or fetch have not completed yet.

When using a static cursor with synchronous query and fetch, the case count will be known as soon as the query is executed.

What are the implications/advantages/drawbacks of the "Automatically determine the number of cases" option?

When the number of cases is not known and an analysis requires that information, then when this method is in effect a separate, 'count' query will be executed, and the analysis will have to wait on the result. If the query is complicated, this can be slow. In addition, if changes have been made to the database between the time you first executed your original query and when the 'count' query is executed, the results can be inconsistent; i.e., if someone else added or deleted records then the number of records in your cursor will not match the number of records reported to the analysis. The advantage is, if the query is not too complicated and the database has not been changed, this gives you an exact case count without having to resort to a slower cursor type that would always have that information.

What are the implications/advantages/drawbacks of the "Assume fewer than [x] number of cases" option?

This method avoids the overhead of running a separate query to determine the number of cases by always returning the same number for the case count, therefore it is much faster. If there are in fact more cases than this in the Recordset, the cases beyond this number will be ignored by analyses. If there are fewer cases than this, the "extra" cases that the analysis will expect to see but are not available will be treated as missing data. If there are many, many fewer cases in the Recordset than the chosen upper bound indicates, the Analysis will waste time attempting to process the non-existent cases.

What is a cursor?

A cursor is a data structure that stores the results of a query. The type of cursor determines the functionality available. Some cursors allow you to move forward through the results of your query, others allow you to move forward and backward. Common type of cursors are static, dynamic, forward-only, and keyset. The STATISTICA In Place Database Processor supports forward only and static cursors.

What is a static cursor?

A static cursor supports scrolling forward and backward thru the data, thus allowing random access to the data. This cursor type provides a "snapshot" of the result of your query - records modified, added to, or removed from the database after the cursor is populated will not be visible. A server side static cursor can place considerable strain on the database system. A client-side cursor is always static. If an analysis or other usage of the IDP requires random access or multiple passes thru the data, this type of cursor may be needed.

What is a forward only cursor?

The simplest type of cursor, it allows moving only forward thru the results of your query. After you move beyond a record, it is no longer available in the cursor. This type of cursor provides fast access to the data, and places the least stress on the database system. In the context of the IDP, if a user or analysis must make multiple passes thru or needs random access to the data, then the query must be re-run to return to a previous record. If records were modified, added to, or removed from the database in the meantime, this re-query can result in different data or a different number of records. If a particular analysis or usage of IDP require a single forward pass thru the data, this type of cursor will most likely be the fastest.

What is a client side cursor?

The cursor is maintained on the client (user's) machine, with the Microsoft ActiveX Data Objects (ADO) Cursor Engine. This means that all records returned from the query are copied to the user's local machine. A client side cursor is a static cursor.

What is a server side cursor?

The cursor is maintained on the database server. The number of records specified by the "cache size" will be copied to the user's local machine, while the rest are stored on the server. Whenever a record not in the local cache is requested, a new set of records will be copied from the server to the client machine. This type of cursor can place considerable strain on the database system, as it must store the results of all such queries.

Which type and location of cursor is best?

The answer to this question depends on many factors, such as how you intend to use the data, how often the data in the database you are using changes, etc, so there is no 'right' answer. In general, the fastest cursor is the server side, forward only type but if you need to make multiple passes or random access then the necessary requeries may considerably degrade performance, and the data may change between passes. A client side cursor with asynchronous query and asynchronous fetch offers very good performance, allows random access to the data, and the data will never change during the course of an analysis. Just remember that all records returned from the query end up on your local machine.

Can I use a Keyset or Dynamic cursor type with IDP?

These cursor types, which are available with ADO, are not supported via the user interface of the IDP. These cursor types offer features such as allowing the user to update the database, or to see changes made by other users. However, these features are not necessary for the IDP since its purpose is not to update a database. In addition, the additional features provided by these cursor types make them much slower than forward-only and static cursors. It is possible, however, to use this cursor type programmatically when using the IDP Object model.

What is "cache size"?

When using a server side cursor, this number of records is cached locally on the client machine. When records not in the local cache are requested, the ADO Recordset must fetch more data from the remote database.

What does "asynchronous query" mean?

The IDP does not wait on the query to complete before returning control to STATISTICA. Before the query completes, nothing is known about the resulting recordset; after it completes the number of fields (variables) is known, along with their names. Depending on the cursor type, the number of records (cases) may be known as well. After the query completes, individual records can be retrieved from the cursor. When using this feature, you can continue using STATISTICA or even start an analysis on this IDP before the query completes. Naturally, if you start an analysis before the query completes, the analysis itself will have to wait until the query completes before it can begin. For all but the simplest queries, it is recommended to use this option.

What does "asynchronous fetch" mean?

The "fetch" is the process of copying all records to the local client machine, therefore this option only applies when using a client side cursor. When using this type of cursor, behind the scenes ADO runs the specified query and opens an internal forward only cursor. It makes a complete pass through that cursor copying all records to the local machine. When this is done synchronously, all other operations are suspended until all records have been transferred to the client. When done asynchronously, the fetching is done in a separate thread of execution so that processing may continue. Records are available as soon as they are fetched, while requests for records that have not yet been fetched will "block" (wait) until they are retrieved. Thus when using this option, the IDP will not wait on all records to be copied to the local machine before returning control to STATISTICA. You can continue using STATISTICA or even start an analysis on this IDP before the records are all fetched. An analysis begun on a Recordset that is still fetching will only have to wait when it needs records that have not yet been fetched. Except for queries that will only return a few records, this option is recommended, as copying hundreds of thousands of records to your local machine synchronously will tie up STATISTICA until the fetching is completed.

Why does the "maximum number of records to return from query" option not always work?

This option is passed to ADO, which does not actually implement it but passes it on to the OLE DB provider. At this time, this feature is implemented for the OLE DB Provider for SQL Server, the SQL Server ODBC Driver, as well as the OLE DB Provider for Oracle and ODBC Driver for Oracle.

I want to programmatically create my own ADO Recordset object and attach it to an instance of an IDP so I can perform a STATISTICA Analysis on it. Is this possible?

Yes. Use the DBSpreadsheet method CreateNew, which returns a Spreadsheet object, and call that object's SetRecordset function to attach it to an existing ADO Recordset. Or use the DBTable interface's Spreadsheet property to get the Spreadsheet and set the Recordset on that.

I am writing STATISTICA Visual Basic code and accessing a database through the Spreadsheet interface provided by the IDP. Some of the Spreadsheet methods always seem to fail. Why?

Not every Spreadsheet interface method/property is implemented on the IDP Spreadsheet. Many of them make no sense or cannot be implemented. In general, you cannot call any Spreadsheet method or property to change any data in the Spreadsheet; the data is read only.