The Basic Control Spreadsheet

As discussed in the process overview, a data subsetting job starts with the creation of the Basic Control Spreadsheet. This simple spreadsheet defines a Basic Subset. It points the automated VIP workflow to the Source Database that will be subsetted, and specifies the Target Database and Schema.

To configure the Basic Control Spreadsheet, open the Basic Subset Control SpreadSheet that you copied into your Subset folder. The template starting spreadsheet is called "VIPsubsetStartPoint.xlsx" by default. You can rename it.

There are two mandatory sheets in the Basic Control Spreadsheet:

  1. The Configuration sheet

  2. The SQLCriteria sheet

There are two additional, optional sheets that can be created and populated at this point:

  1. FoundCriteria

  2. SoftKeys

Instructions for completing the four sheets and running a Basic Subset are provided in this subsection of the Knowledge Base. The next section instructions on running the Basic Subset Actions based on the Basic Control spreadsheet.


The Configuration Sheet

The first step in any data subset is to define the criteria for the data that you want to Subset, and the target location into that you want to move the subsetted data.

These initial parameters are set in the Configuration sheet.

The Configuration sheet must contain the following Control Parameters:

  1. Database Connection: The database that you will connect into, and any credentials necessary for the connection.

  2. Source Database: The database from which data will be subsetted.

  3. Source Schema: The schema or schemas from which data will be subsetted.

  4. Staging Database: The target database into which Subset will be copied. This can be the same as the Source Database, or can be different.

  5. Staging Schema: The schema or schemas into which the Subset will be copied.

Nb.: The connection must have authority to read from the Source Database and to create tables and indexes in the target database. The same connection must be able to insert into the target database.
There are three optional Parameters:

  1. Max Rows: This specifies the maximum size, in rows, of the subsetted data.

  2. Max Iterations: Subsetting can "iterate", re-running a defined Subset to hunt for additional data each time. The Subset job will end once the maximum number of iterations has been reached.

  3. Substitution Parameters: Substitution Parameters can be referenced in the SQL criteria specified under the SQLCriteria and FoundCriteria Up to five Substitution Parameters can be specified in the Configuration Sheet. They can then be referenced in the SQL Criteria and Found Criteria formulae. When "%1" is used in the formula, Substitution Parameter 1 will be used when that criteria is executed by the Subset Actions. When "%2" is included, Substitution Parameter 2 will be used. The same applies for "%3" (Substitution Parameter 3), "%4" (Substitution Parameter 4), and "%5" (Substitution Parameter 5).

Each Control Parameter is specified as a row the Configuration sheet. A blank, template Configuration sheet therefore appears as follows:

Control Parameter

Description

Notes

Example

Variable

Database Connection

The database that you will connect into, and any credentials necessary for the connection.

The format of this connection will vary by database and depends on your own environment.

server=MARTIN-MF424OK\SQL16;user id=sa;password=Password;

varConnectionString

Source Database

The database from which data will be subsetted.

 

SampleCommerce

varSourceDatabase

Source Schema

The schema or schemas from which data will be subsetted.

If more than one source schema is required supply a list delimited by a semi-colon.

dbo

varSourceSchema

Staging Database

The schema or schemas into which the Subset will be copied.

The staging database can be different to the source database, but both must be accessible to the supplied connection.

AdventureWorks

varStagingDatabase

Staging Schema

The schema or schemas into which the Subset will be copied.

If more than one staging schema is required supply a list delimited by a semi-colon.

subset

varStagingSchema

Max Rows

The maximum size, in rows, of the subsetted data.

Optional. If not set this is defaulted to 1,000,000

2000000

varMaxRows

Max Iterations

Subsetting can "iterate", re-running a defined Subset to hunt for additional data each time. You can set the following maximum number of iterations: 1, 2, 3, 4, 5. The Subset job will end once the maximum number of iterations has been reached.

Optional. If not set this is defaulted to 5

9

varMaxIterations

Substitution parameter 1

This will give a value which will be substituted in SQL Criteria or Found Criteria where "%1" is specified.

Optional.

Bikes

varSubParm1

Substitution parameter 2

The substitution value for when "%2" is used as part of the formulae in the SQLCRITERIA or FOUNDCRITERIA sheets.

Optional.

On Promotion

varSubParm2

Substitution parameter 3

The substitution value for when "%3" is used as part of the formulae in the SQLCRITERIA or FOUNDCRITERIA sheets.

Optional.

 

varSubParm3

Substitution parameter 4

The substitution value for when "%4" is used as part of the formulae in the SQLCRITERIA or FOUNDCRITERIA sheets.

Optional.

 

varSubParm4

Substitution parameter 5

The substitution value for when "%5" is used as part of the formulae in the SQLCRITERIA or FOUNDCRITERIA sheets.

Optional.

 

varSubParm5

An example, completed Configuration sheet therefore looks as follows:


The SQLCRITERIA Sheet - Setting the Basic Subset Criteria

The SQLCriteria Sheet provides the basic criteria that will drive your Subset.

Each criterion is specified as a row in the spreadsheet.

Each criterion is applied to a given table, that must be specified. Each row further requires a Table Owner. A Table Owner is an undelimited source owner or schema name.

The criterion must each be suitable for inclusion in an SQL WHERE Clause.

Each row can additionally specify whether the criterion will be active or not during the subset. This enables you to maintain a master set of criteria for a given database, quickly toggling between re-usable criteria.

Finally, you can set whether the criteria should include children or not.

The columns for each row in the SQLCriteria sheet are therefore as follows:

  1. Active: "Yes" or "No".

  2. Table_Owner: An undelimited source owner/schema name. For instance, the default SQL Server owner is "dbo".

  3. Table_Name: An undelimited source table name.

  4. Criteria: A criterion suitable for inclusion in a SQL WHERE clause, for example "[ProductName]='%1'"

  5. IncludeChildren: "Yes" or "No". This specifies whether Child tables will be included in the Data Subset.

A blank SQLCriteria sheet therefore appears as follows:

A complete criterion appears as follows:

This example criterion will subset all entries in the Dimension Table for Product Category that have a given EnglishProductCategoryName.

The EnglishProductCategoryName is specified as the Substitution Parameter 1 in the Configuration Sheet. This is denoted in the above criterion by '%1'

Below is a more complex example, demonstrating the use of Children columns.

In this example, there is a Fact table with multiple child Dimension tables. The Subset criterion specifies that the Subset will pull a specific Sales Order Number from the Fact table named "FactInternetTables". The Subset will additionally pull the data relevant to the subsetted records from 17 child Fact tables.


The FOUNDCRITERIA SHeet - Specifying any Found Criteria

Once you have completed the Configuration sheet and SQLCRITERIA sheet, you are ready to perform a Basic Subset.

However, two optional sheets of parameters can also be configured at this point: the FoundCriteria sheet and Softkeys sheets. This provides additionally flexibility and customisation in the definition of a Subset.

As described in the process overview, a Subset will run until either a maximum specified number of rows is created, a maximum number of recursions is reached, or the "Found Criteria" are fulfilled.

FoundCriteria are optional parameters that control when a subset will stop. They are specified in the Basic Control Spreadsheet, using the FoundCriteria sheet.

When a Subset "stops", no more rows will be added into the staging database.

If multiple FoundCriteria are supplied, Subsetting will stop when one FoundCriteria has been met.

The FoundCriteria sheet follows a similar format to the SQLCriteria sheet. Each FoundCriteria is specified in a new row. Each row includes the following columns:

  1. Active: "Yes" or "No".

  2. Table_Owner: An undelimited source owner/schema name.

  3. Table_Name: An undelimited source table name.

  4. Criteria: A criterion suitable for inclusion in a SQL WHERE clause.

  5. Met: After a Subset is run, the automation will update this column to say whether the criteria have been met or not. The resultant cell will contain "Yes" or "No".

A blank FoundCriteria sheet therefore appears as follows:

An example FoundCriteria sheet appears as follows:

In this example, the Subset will run until it finds a given SalesReasonName in the DimSalesReason table. This SalesReasonName is specified as Substitution Parameter 2 in the Configuration spreadsheet, specified here as '%2' in the criterion.

Another example FoundCriteria is

"SELECT COUNT( * ) FROM stagingDB.stagingSchema.Table_Name WHERE "[ProductName]='Bikes'"

This FoundCriteria specifies that a Subet will run until a ProductName called "Bike" is found in the DimSalesReason Table of the schema.


The SoftKeys Sheet - Manually Defining “Soft” Keys in the Subsetted Data

The Softkeys sheet is used to include "soft" Keys in the subsetted data. Soft Keys are Keys that are not implemented as hard coded constraints in the source database. They can be either Primary or Foreign Keys.

SoftKeys can be specified prior to running the Basic Subset. The definition of the Keys drives the formulation of the Subset Rules. Data Subsetting using Test Data Automation automatically formulates rules for creating a Subset that satisfies where the data satisfies the relationships defined by Primary and Foreign keys. The SUBSET action will then "Crawl" up and down the tables collecting data until it produces a data set that satisfies all the hard and soft relationships.

Note: Subset Rules are formulated to generate a set of data that fulfils the relationships in the Source Database. It therefore creates a data set where the data reflects the Primary and Foreign Key relationships. However, the core actions involved in running a Subset (GETMETADATA, PREPENV, BUILDMODEL, and SUBSET) do not implement these Keys in the Staging database. To implement them, you must perform Post-Subset actions to add Keys.

The Softkeys sheet contains the following columns:

  1. Active: Set to "Yes" or "No", this will either include or exclude a relationship.

  2. FK_Schema: This should be left blank when defining Softkeys.

  3. FK_Table: This should be left blank when defining Softkeys.

  4. FK_Columns: This should be left blank when defining Softkeys.

  5. PK_Schema: The source primary key schema

  6. PK_Table: Primary key table name

  7. PK_Columns: The column names for the primary key. For composite keys the column names are delimited by a semi-colon.

  8. FK_Name: Foreign key name.

An empty Softkeys sheet therefore appears as follows:

To define a "soft" primary key for your Subset, simply leave the FK_Schema, FK_Table and FK_Columns cells empty. Next, specify the source primary key schema, the primary key table name, column names, and foreign key name:

You can define multiple Softkeys for a subset. To toggle between including them or not in your subsetted data, set the Active column to "Yes" or "No".

TEST DATA AUTOMATION DATA SUBSETTING SUBSET JOB