How to Subset Data with VIP - Video Tutorial
The VIP Subset overview video is part of Curiosity’s Test Data Management playlist. It is a good place to start with VIP Subset, providing a broad overview of the process and concepts involved:
Benefits of Data Subsetting Using Test Data Automation
Test Data Automation offers data subsetting, a high-speed test data management utility for copying referentially intact slices of data.
Testing with small subsets of data is useful for a number of reasons. Data Subsetting from Test Data Automation enables:
Parallel testing and development. Multiple testers and developers can work in parallel from the same data, using isolated subsets. This helps avoid delays caused by cross-team constraints and avoids the further frustration of test data being edited, moved or deleted by another tester or developer.
Shorter test run times. Test run time is shortened, using smaller but representative sets of data that are quicker to run and produce less cumbersome results.
Faster, less costly masking. Masking is also quicker and more affordable, masking a smaller, representative data set instead of full-sized copies.
More efficient data extracts. Slow and cumbersome extracts of large data sets are avoided, instead using a high-speed utility to extract only as much data as is needed.
Quicker testing and development. Test and development teams spend less time searching for data, working instead from data sets just big enough to fulfil their exact needs.
A greater understanding of your data. Subsetting further enables easier data exploration, running exploratory subsets instead of experimenting with complex data joins.
Core Features of Data Subsetting
The core features of Data Subsetting from Test Data Automation include:
Table, query, and runtime parameters. Subset parameters are defined quickly and simply in Microsoft Excel. You can set up multiple queries and reference tables to use during the extract.
Auto-population of a full parameter spreadsheet. The Advanced Control Spreadsheet is created automatically from an initial set-up spreadsheet (the Basic Control Spreadsheet), accelerating the granular definition of Data Subsets.
High performance extracts. "Cascade" joins walk from one table to another for rapid extracts.
Recursive subsetting. You can select data recursively, gathering chains of data. Filtering rules further allow you to iterate until you have subsetted the perfect data set.
Targeted subsetting. You can subset only as much as you need, using Found Criteria or setting either a maximum number of rows or iterations. Alternatively, the Subset will end when the last iteration did not insert any additional rows.
Data de-duplication. Rows are not added to a Subset if they already exist in the target database, creating a smaller and leaner set of data.
Supporting utilities. Additional TDM utilities can be run during or after a Subset, including truncation and environment preparation.
On demand Subsets. Configured Subsets can be exposed to a shared web portal. Anyone with requisite permissions can then trigger and run a Subset on demand, increasing testing agility.
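The de-duplication behaviour described above can be pictured with a minimal, self-contained sketch. This uses Python and SQLite purely for illustration; the table and column names are invented, and this is a toy model of the idea, not VIP's implementation:

```python
import sqlite3

# Hypothetical source and staging databases; "customers" is an invented
# example table, not part of VIP's actual schema.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Grace"), (3, "Edsger")])

stg = sqlite3.connect(":memory:")
stg.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
stg.execute("INSERT INTO customers VALUES (1, 'Ada')")  # copied in an earlier pass

def copy_rows(rows):
    """Insert rows into staging, skipping any that already exist (de-duplication)."""
    inserted = 0
    for row in rows:
        exists = stg.execute("SELECT 1 FROM customers WHERE id = ?",
                             (row[0],)).fetchone()
        if not exists:
            stg.execute("INSERT INTO customers VALUES (?, ?)", row)
            inserted += 1
    return inserted

new = copy_rows(src.execute("SELECT id, name FROM customers").fetchall())
# Row 1 already exists in staging, so only rows 2 and 3 are inserted.
```

The same principle keeps iterative Subsets lean: re-running a Subset only adds rows the staging database does not already contain.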
How Does Data Subsetting Work?
Data Subsetting from Test Data Automation reads data from a source database and inserts it into a staging database. The connection details for the Subset are first provided in a Basic Control Spreadsheet, along with basic Subset criteria.
This Basic Control Spreadsheet defines a Basic Subset Job. An automated VIP workflow performs the Actions involved in a Basic Subset job, creating the staging database and any schema(s), and executing the Basic Subset. "Soft" Keys, Primary or Foreign, can also be defined manually at this point, along with any Found Criteria.
The Basic Subset Job generates a set of Subset Results, populated into an Advanced Control Spreadsheet. The Advanced Control Spreadsheet contains the Basic Subset Criteria, as well as Subset Rules and Additional Parameters that can then be toggled on or off. These additional parameters allow you to define an Advanced Subset Job. Herein lies the power of Data Subsetting with Test Data Automation: from a small number of user-defined parameters, it will model the database and provide a set of rules to build coherent Data Subsets that reflect the Source Database relationships. This significantly reduces the complexity and manual effort associated with subsetting.
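Purely as an illustration, the toggling of Subset Rules can be pictured as rows in a spreadsheet-like structure with an on/off flag. The column names here are invented for the sketch, not the Advanced Control Spreadsheet's real layout:

```python
# Toy stand-in for a sheet of Subset Rules: each row links a parent table to
# a child table and can be toggled on or off before the next Subset run.
# Table names are hypothetical examples.
rules = [
    {"parent": "customers", "child": "orders",      "enabled": True},
    {"parent": "orders",    "child": "order_lines", "enabled": True},
    {"parent": "customers", "child": "audit_log",   "enabled": False},  # toggled off: pulls in too much data
]

# The next Subset run would only walk the relationships left enabled.
active = [r for r in rules if r["enabled"]]
```

Disabling a rule prunes a branch of the data model from the next extract, which is how an over-large Subset is iteratively cut down to size.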
Each time a new Action is run, it will overwrite any relevant information in the Control Spreadsheet. This process allows you to define a data model and Subset Criteria rapidly and iteratively. You can work towards the ideal data set, or iteratively gain a greater understanding of your data.
The Control Spreadsheet and the VIP Server Controller are used to run and refine a Subset iteratively, working until you are happy with its results. Once the Subset Job has been defined, the Actions and associated variables can then be exposed in self-service web forms. This enables other users to trigger the Subset using a simple, self-service web portal.
Note: You can also parameterize actions in a Command Script and execute the actions from the command line rather than the VIP Server Controller. For this, a standard set of editable .CMD scripts are shipped with the VIP Server Controller installation.
During a Subset, the automation issues SQL as it crawls from table to table. Recursion means the automation returns to the beginning and repeats the crawl, running until the completion criteria are fulfilled.
Recursion and database "Crawling" means that the subsetted data retains the referential integrity of the original data: the automation will pull in all related data needed for a complete Subset, moving "up" and "down" child and parent tables. If too much data is produced, Subset Rules and tables can be toggled on or off in the Advanced Control Spreadsheet.
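The "crawling" idea can be sketched generically: given the foreign-key relationships, walk both "up" to parent tables and "down" to child tables to find every table a referentially intact Subset must touch. This is a toy model in Python with hard-coded, invented relationships; a real Subset reads them from database metadata:

```python
from collections import deque

# Hypothetical foreign-key relationships, mapping child table -> parent tables.
# In a real subsetting run these would come from the database's metadata.
foreign_keys = {
    "order_lines": ["orders"],
    "orders":      ["customers"],
    "payments":    ["orders"],
}

def related_tables(start):
    """Crawl 'up' to parents and 'down' to children from a starting table,
    collecting every table needed for a referentially intact subset."""
    children = {}
    for child, parents in foreign_keys.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen, queue = {start}, deque([start])
    while queue:
        table = queue.popleft()
        # Visit both parent tables (up) and child tables (down).
        for nxt in foreign_keys.get(table, []) + children.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Starting from "orders", the crawl reaches "customers" (parent) as well as "order_lines" and "payments" (children), which is why a Subset stays coherent rather than leaving orphaned rows.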
Actions used in Subsetting
Subsetting is performed using automated VIP workflows. Each high-speed workflow performs "Actions". The Actions and workflows are executed using the VIP Server Controller. Each Action is informed either by sheets in the Basic/Advanced Control Spreadsheet, or by information read from the Source Database. Some of this information can be user-defined; other information is generated by the Subset Actions.
The actions to run the Basic Subset are:
The GETMETADATA Action retrieves the metadata from the Source Database that is needed to run the Subset. It is a composite action, made up of three actions, which can be run together as a single "GETMETADATA" action or run separately. GETMETADATA is recommended for simplicity and speed; running the actions individually is valuable for closer analysis, learning, and debugging.
The PREPENV Action creates tables and indexes in the Staging Database.
The BUILDMODEL Action creates the rules to drive the Subset.
The SUBSET Action writes data to the Staging Database.
A Subset will run until one of the following completion criteria is fulfilled:
A maximum specified number of rows is reached;
A maximum number of recursions is reached;
The "Found Criteria" are fulfilled;
There are no more rows in the Source Database that match the Subset criteria. The Subset will stop recursing when no rows were added in the last recursion.
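A toy loop in Python illustrates how these four completion criteria might interact; the function and parameter names are invented for illustration and are not VIP's actual interface:

```python
def run_subset(fetch_batch, max_rows=10_000, max_recursions=50, found=None):
    """Toy recursion loop exercising the four completion criteria described
    above. fetch_batch() stands in for one recursion and returns the number
    of rows it added; 'found' is an optional predicate standing in for any
    Found Criteria."""
    total, recursions = 0, 0
    while True:
        added = fetch_batch()           # one pass of the crawl
        total += added
        recursions += 1
        if total >= max_rows:
            return "max rows reached"
        if recursions >= max_recursions:
            return "max recursions reached"
        if found and found(total):
            return "found criteria fulfilled"
        if added == 0:                  # last recursion inserted nothing new
            return "no new rows"
```

Whichever criterion trips first ends the Subset, so a run with no row or recursion limits simply terminates once a pass adds nothing new.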
This Basic Subset will generate the Advanced Control Spreadsheet, containing additional sheets. These sheets contain additional parameters and the automatically formulated Subset rules. The information in these sheets can then be used to perform Advanced Subsets. You can subset iteratively by toggling the Subset Rules, Tables and Relationships that will be used in the next Subset.
Actions used to perform iterative Subsets after the Basic Subset include:
DROP: Drops the tables registered by the TABLES action. You will only run this if something has gone wrong in a Subset, and you want to create a wholly new Data Subset.
PREPENV: If you drop the tables, you will need to run PREPENV again to re-create them before performing the next Subset.
TRUNCATE: Deletes data from the Target Database or Schema.
BUILDMODEL: Creates the rules to drive the Advanced Subset, based on the Control Spreadsheet.
SUBSET: Writes the new Data Subset to the Staging Database, based on the updated Control Spreadsheet.
Post-Subset Actions Introduction
Further Actions can be performed following a Subset. Each is again informed by information in the Control Spreadsheet and leverages an automated, pre-defined VIP Workflow. The post-Subset Actions are:
ADDPKEYS: Adds Primary Key Constraints to the Subset in the Staging Database.
ADDFKEYS: Adds Foreign Key constraints.
ADDKEYS: A composite action that performs both ADDPKEYS and ADDFKEYS, adding both Primary and Foreign Key Constraints to the Staging Database.
VALIDATEPKEYS: Checks that all Primary Keys are unique.
VALIDATEFKEYS: Checks whether Foreign Keys can be added to the data.
VALIDATEKEYS: A composite action that runs both VALIDATEPKEYS and VALIDATEFKEYS.
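The two validation checks can be expressed as plain SQL queries, sketched here against a toy SQLite database. The table and column names are invented, and this is only an illustration of the checks' logic, not VIP's internal SQL:

```python
import sqlite3

# Toy staging database with a deliberate referential-integrity problem.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 99);  -- 99 is an orphan
""")

def pk_is_unique(table, column):
    """VALIDATEPKEYS-style check: a candidate Primary Key column
    must contain no duplicate values."""
    dups = db.execute(
        f"SELECT {column} FROM {table} GROUP BY {column} HAVING COUNT(*) > 1"
    ).fetchall()
    return not dups

def fk_orphans(child, fk_col, parent, pk_col):
    """VALIDATEFKEYS-style check: every child value must exist in the
    parent column, otherwise the Foreign Key cannot be added."""
    return db.execute(
        f"SELECT {fk_col} FROM {child} "
        f"WHERE {fk_col} NOT IN (SELECT {pk_col} FROM {parent})"
    ).fetchall()
```

Here `pk_is_unique("customers", "id")` passes, while the orphaned `customer_id` of 99 would be flagged before any Foreign Key constraint is applied.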
Data edits: Various actions will edit the resultant data Subset:
DROPFKEYS: Drops Foreign Keys.
DROPPKEYS: Drops Primary Keys.
DROPKEYS: A composite action that runs DROPFKEYS and DROPPKEYS, dropping both Foreign and Primary Key Constraints.
TRUNCATE: Deletes the data from the target Database or Schema, allowing you to subset iteratively.
DROP: Drops the tables created by the PREPENV Action.
Many of these Actions performed post-Subset create a functioning data set that can be used during testing and QA. For instance, adding Primary and Foreign Keys incorporates the relationships in the data needed for many types of automated testing.
Knowledge Base articles are provided on configuring each relevant sheet in the Basic/Advanced Control Spreadsheet, and running the Action associated with them. The table below summarises all Actions available when subsetting using Test Data Automation:
Action | Description
GETMETADATA (composite of three actions) | Retrieves metadata from the source database.
 | Retrieves metadata from the source database.
 | Finds identity columns in the source database.
PREPENV | Creates tables and indexes in the Staging Database.
BUILDMODEL | Creates the rules to drive the Subset.
SUBSET | Writes data to the staging database.
ADDPKEYS | Adds Primary Key Constraints to the Data Subset.
ADDFKEYS | Adds Foreign Key constraints.
VALIDATEPKEYS | Checks that all Primary Keys are unique.
VALIDATEFKEYS | Checks whether Foreign Keys can be added to the data.
DROPFKEYS | Deletes Foreign Keys from the Data Subset.
DROPPKEYS | Deletes Primary Keys from the Data Subset.
TRUNCATE | Deletes the data from the target Database or Schema, allowing you to subset iteratively.
DROP | Drops the tables created by the PREPENV Action.
The Knowledge Base articles on data subsetting explain how to configure and perform the Actions associated with a Basic Subset. They then set out how to refine the Advanced Control Spreadsheet and perform iterative Subsets, before detailing the Actions that might be performed post-Subset.
The Data Subsetting section of the Knowledge Base does not provide instructions on exposing a Subset to a shared server or self-service test data web portal, as this process is generic to all Test Data Automation utilities.