Synthetic Data Generation

    Article summary

    This page shows the Test Data Activity for Synthetic Data Generation, a technique for generating new compliant data into an external database. In this tutorial, you’ll learn how to set up and configure the Synthetic Data Generation Activity.

    About

    Synthetic Data Generation is a critical process in database management that involves the creation of artificial (synthetic) data. Data can be modelled to mimic the characteristics and structure of real-world data, while introducing the notion of data coverage, to provide an enriched set of data. This technique often includes generating data with complex, interrelated entries into database tables and flat file formats. What sets this method apart is its capability to embed business rules within the generated data, allowing it to accurately replicate the patterns and behaviours found in actual enterprise software operations. As a result, the synthetic data can mirror real scenarios and complexities encountered in business processes, providing a realistic, yet privacy-preserving environment for tasks like software testing, machine learning model training, and simulation of data-driven applications. 

    This capacity to emulate real data with embedded business rules makes Synthetic Data Generation an invaluable tool for enterprise software development and testing, and for any application where privacy and data protection are of paramount importance.

    Tutorial

    Follow along with the video tutorial or read the written tutorial below where each of the steps is broken down and explained.

    Part one of the tutorial configures the Synthetic Data Generation activity in the portal.

    Part two walks through executing your synthetic data routines on your local machine, before covering publishing them into a self-service form. 

    Prerequisites

    For this data activity, you will need:

    Step 1 - Create Synthetic Data Activity

    Navigate to the Data Activity Dashboard and then click on the Synthetic Data tile.

    This will launch the wizard for creating data activities. In this section, you are required to provide specific details about the activity, including a mandatory name and description. After filling out the necessary information, click on the 'Next' button to proceed.

    In the next tab, we will navigate to the desired location to save the activity. Once this is selected, click the Finish button.

    Step 2 - Attach Database Connection

    Now we will attach the database connection that we want to generate data into.

    Under Add Components, choose the Attach Default Database Connection component.

    This will bring up a screen to select your Database Connection.

    Step 3 - Attach Database Definition 

    Next, attach the definition version to generate data for. Under Add Components, choose the Attach Definition Version component.

    Note: The definitions and versions must be associated with the database connection we configured in the previous step. Definitions that are created from other database connections will not execute.

    Step 4 - Create the Rule Set

    A rule set specifies the data generation functions to be applied to each column in selected tables of the database. Generation rules define the functions that will generate values and, in the end, create the synthetic data.
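    As a rough illustration of the idea, a rule set can be thought of as a mapping from each column to a function that produces its values. The sketch below is plain Python, not the product's own rule syntax; the table name, column names, and functions are all hypothetical.

```python
import random

# Hypothetical rule set: table -> column -> generation function.
# These names are illustrative only and are not the product's own functions.
def make_rule_set(rng):
    return {
        "CUSTOMERS": {
            "CUSTOMER_ID": lambda row_index: row_index + 1,
            "FULL_NAME": lambda row_index: rng.choice(["Ana Ruiz", "Ben Ode", "Chen Li"]),
            "CREDIT_LIMIT": lambda row_index: rng.randrange(100, 5001, 100),
        }
    }

def generate_rows(rule_set, table, count):
    """Apply each column's rule once per row to build synthetic rows."""
    column_rules = rule_set[table]
    return [
        {column: rule(i) for column, rule in column_rules.items()}
        for i in range(count)
    ]

rng = random.Random(42)  # fixed seed so repeated runs give the same data
rows = generate_rows(make_rule_set(rng), "CUSTOMERS", 3)
for row in rows:
    print(row)
```

    Each row is built by calling every column's rule once, which mirrors the way a rule set assigns one generation function per column.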

    Now we will use the attached Definition to create a new rule set. To do this, click on the drop down associated with the Definition component and select the action “Create New Rule Set”. Then, click the blue play button. 

    This will cause a pop up to appear in which the tables can be selected for generation. To select a table, click on the box next to the table name.

    The box next to the table name turns blue to show that the table has been selected. Multiple tables can be selected, including tables from different schemas: click the “Add tables” button when switching between schemas.

    After execution is complete, a new rule set will be attached to the activity. You may need to refresh the components section for the rule set to appear.

    Next, we will modify the default rule set to embed our synthetic data generation functions. Navigate to the component for the rule set. Change the dropdown action to Modify and then click the blue play button.

    This view allows the assignment of data generation rules to each column and the ability to turn off/on data generation for those columns. To edit the data generation rules, click on the field which contains the rule. Alternatively, click on the blue button to open the Data Editor dialog (which contains all the available synthetic data generation functions).

    Step 5 - Create the Data Generation Configuration and DLL

    In this step we’ll create a data generation configuration file and an associated DLL which can be consumed by VIP. We will open the configuration in VIP, where we can specify advanced configurations, parameters, and table relationships. The benefit of opening the configuration in VIP is that it allows us to test and run example generations directly from our local machine and tweak the configuration until we achieve suitable generation results.

    Firstly, we need to create the DLL. In the Configuration Rules component, select the Modify config and DLL action, then click the blue execute button.

    In the Job Parameters form that pops up, change the option to Update or Create the Definition Version DLL.

    Then click the execute button. This will create a log in the activity log at the bottom of the data activity and will add the DLL in the components section.

    The next step is to create the data generation configuration. We will repeat the above steps to launch the Job Parameters form, but this time we will change the option to Create or Update the Generator Configuration and click execute.

    In the logs of the activity, you will now find an entry for creating the configuration. The entry includes a .cfg file, which we will download.

    Download the file to a desired location. An example could be a subfolder in C:\VIPWork. The example in this page used the directory C:\VIPWork\OT_Customers.

    Step 6 - Loading Generation Configuration

    The next step is to view and edit the configuration within VIP. First, start the VIP application on your local machine. Once VIP is available, navigate to Test Data -> Edit Data Generation. Then, navigate to the .cfg file that was downloaded. In this example, the file is saved in C:\VIPWork\OT_Customers.

    This will open the configuration file and allow the user to configure the generation rules for each column in the table, along with relationships, parameters, and other advanced configurations.

    The columns of each database table will have the rules defined in the rule set in the portal.

    Step 7 - Navigating Tables and Columns

    The tables are organized according to the parent-child relationships found in the database definition. This allows the user to easily find the desired tables and understand how they are related.
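    Parent-child ordering matters during generation because child rows need keys that already exist in the parent table. The sketch below shows the general principle in plain Python, assuming a hypothetical CUSTOMERS parent table and ORDERS child table; it is not how VIP implements relationships internally.

```python
import random

def generate_customers(count):
    """Parent table: each row gets a unique primary key."""
    return [{"CUSTOMER_ID": i + 1, "NAME": f"Customer {i + 1}"} for i in range(count)]

def generate_orders(customers, orders_per_customer, rng):
    """Child table: every order points at an existing parent key."""
    orders = []
    for customer in customers:
        for _ in range(orders_per_customer):
            orders.append({
                "ORDER_ID": len(orders) + 1,
                "CUSTOMER_ID": customer["CUSTOMER_ID"],  # foreign key to the parent
                "STATUS": rng.choice(["Pending", "Shipped", "Complete"]),
            })
    return orders

rng = random.Random(7)
customers = generate_customers(2)            # parents are generated first
orders = generate_orders(customers, 3, rng)  # children reuse the parent keys
print(len(customers), "customers,", len(orders), "orders")
```

    Because every child row copies its foreign key from a generated parent row, the resulting data cannot contain orphaned orders.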

    Step 8 - Basic Editing of Data Generation Rules

    This step covers the basics of using VIP to generate data. The configuration screen allows the user to navigate to each database table and view the associated data generation rules.

    There are two options for editing the values which appear. We can either embed functions that generate the data directly when a generation runs, or add parameters whose values the user defines at run time in a self-service form.

    Synthetic Data Functions

    Synthetic data functions are methods which generate values into a specific column. Enterprise Test Data contains hundreds of functions out of the box for generating different data types and characteristics. The example below shows a table with generation functions applied.

    Double-clicking on a column opens a VIP window, which helps the user choose and preview data generation functions.

    Parameterise Values

    Along with being able to generate synthetic values into columns, a user can also expose a value as a parameter. This means that the value will be substituted into the column later. Parameters allow values to be user-defined in a self-service form, set in an automation script, or through modelling different values to create covered sets of values for specific parameters.

    To create a parameter, we can click on a cell and select ‘Create parameter’. This will expose the column as a parameter which can later be overridden by a user-defined value.
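    Conceptually, a parameter is a value that takes precedence over the column's default generation rule when one is supplied. The following sketch illustrates that behaviour in plain Python; the function name, column names, and status values are hypothetical and do not reflect the product's configuration format.

```python
import random

def generate_order(rng, parameters=None):
    """Build one ORDERS row; parameter values replace the default rules."""
    parameters = parameters or {}
    default_rules = {
        "ORDER_ID": lambda: rng.randrange(1000, 10000),
        "STATUS": lambda: rng.choice(["Pending", "Shipped", "Complete"]),
    }
    row = {}
    for column, rule in default_rules.items():
        # A supplied parameter wins; otherwise the column's rule generates a value.
        row[column] = parameters[column] if column in parameters else rule()
    return row

rng = random.Random(1)
print(generate_order(rng))                           # STATUS comes from the rule
print(generate_order(rng, {"STATUS": "Cancelled"}))  # STATUS comes from the parameter
```

    The same pattern covers all three uses mentioned above: the supplied values could come from a form, an automation script, or a modelled set of combinations.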

    Step 9 - Exporting the Flow

    The final step is to export the configuration to a VIP flow. The VIP flow is an executable workflow which writes the synthetic data into the database. We will export it and then run the workflow locally on our machine to trial the generation rules. This is a great way to debug the generation routine and iteratively tweak the generation rules until they produce the appropriate synthetic data set.

    Once the Rules for the Data Generation have been configured in VIP, press the Export Flow button to export the flow. 

    This will export a flow based on the configuration in VIP. During this step, it will show any compilation errors and provide a link which shows which section of the configuration is causing the error.

    If the machine that is running VIP has access to the database, you can click Run in VIP to test that the data generation flow creates data as expected. After a successful run you will see the data generated in your database.
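    One simple way to confirm that a run produced data is to compare row counts before and after it. The sketch below uses an in-memory SQLite database as a stand-in for the target database, and a plain insert as a stand-in for the generation run; the table and column names are hypothetical, and a real check would use the driver for your own database.

```python
import sqlite3

def count_rows(connection, table):
    """Return the number of rows in a table (table name must be trusted input)."""
    return connection.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Stand-in for the target database: an in-memory SQLite database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")

before = count_rows(connection, "orders")

# Stand-in for the generation run: insert three synthetic rows.
connection.executemany(
    "INSERT INTO orders (status) VALUES (?)",
    [("Pending",), ("Shipped",), ("Complete",)],
)
connection.commit()

after = count_rows(connection, "orders")
print(f"orders: {before} rows before, {after} rows after")
```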

    Step 10 - Uploading the Generation Routine to the Portal

    Once it has been confirmed that the generation is working as expected, it is time to synchronise the workflow back to the portal.

    Firstly, we will synchronise the rule set back to the portal. Click the Sync Rule Set button. If you navigate back to the rule set, you will see the generation rules are mapped to the configuration in VIP.

    Next, navigate back to the data activity in the portal. Click on the Upload a Flow action and then navigate to where the VIP flow is saved (this will be the file with the .vip extension).

    Step 11 - Creating the Submit Form

    A submit form is a reusable form which can be embedded into the self-service portal for future use. Whenever any user wants to create synthetic data, they can do so using the created form.

    Click on the Create Data Generation Submit Form.

    The following form will appear, allowing the user to customize the form they are creating.

    • The first box allows the user to enter the name of the submit form 

    • The second box allows the user to choose the group where the submit form will be located. The default is “Data Generation”. Alternatively, the user can update an existing process by using the drop down below.

    • The two check boxes enable optional features.

      • “Add in drop down selections for parameters linked to definitions with enumeration” fills the form's drop-downs automatically with the values stored in the definition.

      • “Include a field to override todays date in the submission form” adds a field to the submission form that lets the user override any value that would otherwise use today's date.
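    The date override option is easiest to understand as a substitution for "today" in any date-relative rule. The sketch below is a minimal illustration in plain Python; the function names, column names, and offsets are hypothetical and are not taken from the product.

```python
from datetime import date, timedelta

def resolve_today(override=None):
    """Return the overridden date if one was submitted, else the real current date."""
    return override if override is not None else date.today()

def order_dates(today):
    """Hypothetical date rules expressed relative to 'today'."""
    return {
        "ORDER_DATE": today - timedelta(days=7),
        "DUE_DATE": today + timedelta(days=30),
    }

# With an override, date-relative values are anchored to the submitted date.
print(order_dates(resolve_today(date(2024, 1, 31))))
# Without an override, they are anchored to the current date.
print(order_dates(resolve_today()))
```

    Anchoring generated dates to a fixed value in this way makes a generation run repeatable, which can be useful when testing date-sensitive logic.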

    Step 12 - Executing a Generate

    Once the form has been created you can change the drop down associated with the form to Execute and then press the blue execute button.

    This will reveal the self-service form for executing the data generation. Below is the form we have created in the tutorial.

    The form has an ‘Orders – Status’ input. This column was exposed as a parameter in our data generation configuration, so it can now be overridden in the self-service form. In addition, the Orders – Status drop-down has been filled with values automatically. These come from the enumerations defined in the data definition, because we checked the “Add in drop down selections for parameters linked to definitions with enumeration” box.
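    To make the link between enumerations and drop-downs concrete, the sketch below shows one way a form field could be derived from an enumeration and a submitted value checked against it. It is plain Python written for illustration; the data structures, function names, and status values are assumptions rather than the portal's actual implementation.

```python
# Hypothetical enumerations stored with a definition: parameter -> allowed values.
ENUMERATIONS = {
    "Orders - Status": ["Pending", "Shipped", "Complete", "Cancelled"],
}

def build_form_field(parameter, enumerations):
    """Describe a form field: a drop-down if the parameter has an enumeration."""
    options = enumerations.get(parameter)
    if options:
        return {"name": parameter, "type": "dropdown", "options": list(options)}
    return {"name": parameter, "type": "text", "options": []}

def validate_submission(field, value):
    """Accept any text for free-text fields, only listed options for drop-downs."""
    return field["type"] == "text" or value in field["options"]

status_field = build_form_field("Orders - Status", ENUMERATIONS)
print(status_field)
print(validate_submission(status_field, "Shipped"))  # True
print(validate_submission(status_field, "Unknown"))  # False
```

    A parameter without an enumeration falls back to a free-text field, which matches the behaviour you would expect when the check box is left unticked.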