Synthetic Data Generation
Overview
Document details | |
Purpose | To help you understand how to generate data in the Enterprise Test Data environment |
Audience | Anyone needing to generate synthetic data |
Requirements | Access to the Curiosity Dashboard. |
Additional Links | Video tutorial demonstrating the creation and use of the Data Generation activity |
The data generation activity creates fictitious data that you can use in your tests. It avoids the security problems of using customer data, allows coverage of all user stories (including edge cases, rare events and new scenarios), and means that data is always available for testing.
Terms and functions used
Database scan: A listing of the database structures such as schemas, tables and columns.
Database definition: A snapshot of the database, showing the relationships between database components such as tables, columns and keys.
Rule set: All the rules for generating new data in a data generation activity.
VIP flow: A routine that will generate the synthetic data on the server.
Steps to create and use a data generation activity
Connect to the database
If you do not already have a database connection, you can follow the steps here to create one: navigate to Data dictionary (1) → databases tab (2) and click the New Connection Profile button (3).
Scan the database
This process retrieves the schema of the database, so that it can be analysed in the next steps.
Navigate to Data dictionary (1) → databases tab (2) and click the newly created connection (3).
There are two ways to run the scan, as listed below.
Run Scan (Native)
This option runs the scan directly from where the Enterprise Test Data instance is hosted and opens the job details page in a new tab.
Run Scan (VIP Server)
Use this option if you are using a data agent to access the database. Clicking this option will display the ‘Select Server’ dialog box.
You can select the Server (1) to run the scan on from the drop-down list.
There are two Process (2) options and the next screen viewed will depend on the process selected:
‘Get Schema Metadata’: This will retrieve the schema data.
Note that you can schedule the job to run at a different time.
‘Schema report’: This will generate a report on the schema.
Once the scan is executed, the browser tab will switch to the job details page, where you can track the progress of the job.
View the scan
Once the scan job has completed, the resulting scan will be visible on the profile page for the database connection.
Click the scan to see more details about the database scan.
Click a schema name to see its list of tables, and click a table to see its columns.
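To make the idea of a scan more concrete, the sketch below lists the tables and columns of a small in-memory SQLite database using only the Python standard library. It illustrates the kind of metadata a scan collects; it is not how the platform performs its scan. The `customers` table and the `scan_schema` helper are hypothetical names chosen for the example.

```python
import sqlite3


def scan_schema(conn: sqlite3.Connection) -> dict:
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # PRAGMA table_info yields one row per column:
        # (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [(col[1], col[2]) for col in columns]
    return schema


# Hypothetical example table, created only so there is something to scan.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)

for table, columns in scan_schema(conn).items():
    print(table, columns)
```

The nested result (tables, then columns) mirrors the drill-down you see when clicking through a scan.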
Create a definition
This creates a snapshot view of the database.
Navigate to Data dictionary (1)→ Definitions tab (2) and click New Definition (3).
Then follow the steps on the page: Create a definition to add a database definition.
There are details on viewing a database definition in the article: Database Definition View Explained.
If you select your new definition on the Data dictionary → Definitions tab, then you will see a page where you can view:
tables (1) which you can navigate into
diagrams (2) of the tables, showing their relationships.
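As a rough illustration of what "relationships between tables" means, the following sketch reads foreign keys from an in-memory SQLite database with the Python standard library. It is an assumption-laden stand-in for the definition view, not the platform's implementation. The `customers` and `orders` tables and the `list_relationships` helper are hypothetical.

```python
import sqlite3


def list_relationships(conn: sqlite3.Connection) -> list:
    """Return (child_table, child_column, parent_table, parent_column) tuples."""
    relationships = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA foreign_key_list rows:
        # (id, seq, table, from, to, on_update, on_delete, match)
        for fk in conn.execute(f'PRAGMA foreign_key_list("{table}")'):
            relationships.append((table, fk[3], fk[2], fk[4]))
    return relationships


# Hypothetical two-table schema with one foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, last_name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id)
    );
    """
)
print(list_relationships(conn))
```

Each tuple corresponds to one line that a relationship diagram would draw between two tables.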
Set up a data activity to create synthetic data
Create the data generation activity
The Data Generation dialog can be started in two ways:
In the Enterprise Test Data section of the dashboard, there is a link for the Data Generation activity.
Alternatively, navigate to the Activity explorer (1) → choose a folder (2), or add a sub-folder using the + button by the folder name → then click the Add Activity drop-down (3) and choose data generation.
Either of these methods will display the Data Generation dialog box.
The first page of the dialog box allows you to name and describe the activity, and to choose the application and server to run it on.
Note that the ‘Location’ page is not displayed if you started the dialog through the activity explorer.
The summary page will show you the updates that will be made once you click save.
Once it is created, you have the option to open it directly, or to navigate to it in the activity explorer at a later time, to configure it.
Configure the data generation activity
Attach the Database Connector (1)
Attach the Database definition (2)
Run the action: ‘Create new rule set’ for the data definition, by clicking the blue arrow (3). This will contain all the rules for generating the new data.
Creating the rule set
This action will let you set which tables you are going to generate data for, using the New Rule Set dialog box.
The Details tab lets you set the name and description for the rule set.
The Configuration tab lets you set whether all tables will be used and whether id columns are active.
Then the Tables tab will allow you to select which tables the data generation will be based on.
Note that once you have selected one or more tables (1), you need to click the Add Tables button (2), in order to add them to the rule set.
When you create the rule set, you will need to generate an up-to-date model for the definition by clicking the regenerate button shown in the message on the dialog box. The job will open in a new browser window and, once it is complete, you can save your rule set.
Once you have created the rule set, you will need to configure it.
Configure the rule set
This action will let you set how data will be generated for each of the columns in the table.
You can update the rule set by clicking on the rule set within the data activity window.
On the rule set page, you can view the columns for a table, by clicking the arrow beside the table name. Then you can set up rules for each of the columns. For example, in the view below, the first_name column is set to:
RandomHelper.Faker.Name.FirstName("")
This will generate a random first name.
Note that the modeller will attempt to set up functions to create the data, but you can replace these as necessary.
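The rule expression above is evaluated by the platform. To show the underlying idea in isolation, here is a minimal Python sketch in which a column rule is simply a function that returns a fresh value each time it is called. The name list and the `first_name_rule` function are hypothetical and use only the standard library; they do not reproduce the platform's function.

```python
import random

# Hypothetical sample values; a real generator would draw from a much larger list.
FIRST_NAMES = ["Alice", "Bilal", "Chen", "Dana", "Elif"]


def first_name_rule(rng: random.Random) -> str:
    """Return one randomly chosen first name."""
    return rng.choice(FIRST_NAMES)


rng = random.Random(42)  # seeded so repeated runs give the same sequence
print([first_name_rule(rng) for _ in range(3)])
```

Passing the random generator in as an argument keeps the rule repeatable when a seed is fixed, which can help when debugging generated data.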
On each column row, there is a set of symbols that will respectively:
Open the data painter
Insert a function
Cast the type
Start an action based on the column
Additionally, there is a switch to activate the column. If it is set to Yes, data will be generated for the column.
The values can also be set to a variable in the User-defined Variables section of the Rule set page. For example, below, I have created a user-defined variable called LastName.
This was added by clicking the +Add button and filling in the New User-defined variable dialog, an example of which is below.
Note: The Form Parameter must be set to “Yes”, otherwise you may see an error when trying to create a submit form.
This can then be added to the Rules field for the relevant column. In the example below, the last_name column will be set to the value of the LastName variable which will have a default value and can be set at runtime, when you create the data.
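The behaviour of a variable with a default value that can be overridden at run time can be sketched in a few lines of Python. This is only an illustration of the concept: the `generate_row` function, the sample names and the default value "Smith" are assumptions made for the example, and only the `LastName` variable name comes from the text above.

```python
import random

FIRST_NAMES = ["Alice", "Bilal", "Chen"]  # hypothetical sample values

# User-defined variables and their defaults ("Smith" is an assumed default).
DEFAULT_VARIABLES = {"LastName": "Smith"}


def generate_row(rng, overrides=None):
    """Build one row; values in `overrides` replace the variable defaults."""
    variables = {**DEFAULT_VARIABLES, **(overrides or {})}
    return {
        "first_name": rng.choice(FIRST_NAMES),  # rule: random value
        "last_name": variables["LastName"],     # rule: value of the LastName variable
    }


rng = random.Random(1)
print(generate_row(rng))                          # uses the default LastName
print(generate_row(rng, {"LastName": "Okafor"}))  # LastName supplied at run time
```

The same split applies in the activity: rules that reference a variable stay unchanged, and only the variable's value differs between runs.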
Once you have set up the rules for generating the data, you should run the following actions on the Data activity page:
“Run validate and Preview” (1)
“Rebuild VIP flow on Server” (2)
Run validate and Preview
This is to confirm that the rules create data as expected. You can leave the values on the dialog as default and click execute.
This will open a new browser tab to show the running job and, when it is complete, the sample data will be visible in the results tab.
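Conceptually, a validate-and-preview step generates a small sample and checks it before any data is written. The Python sketch below shows one simple way to do that, by checking each generated value against an expected type. The rule set, the expected types and the `preview` function are all hypothetical; the platform's own validation may check different things.

```python
import random

# Hypothetical rule set: column name -> function producing a value.
RULES = {
    "id": lambda rng: rng.randint(1, 1_000_000),
    "first_name": lambda rng: rng.choice(["Alice", "Bilal", "Chen"]),
}
# Hypothetical expected Python type per column.
EXPECTED_TYPES = {"id": int, "first_name": str}


def preview(rng, sample_size=5):
    """Generate a small sample and collect any (row_index, column) type mismatches."""
    rows = [{col: rule(rng) for col, rule in RULES.items()} for _ in range(sample_size)]
    problems = [
        (index, col)
        for index, row in enumerate(rows)
        for col, value in row.items()
        if not isinstance(value, EXPECTED_TYPES[col])
    ]
    return rows, problems


rows, problems = preview(random.Random(7))
for row in rows:
    print(row)
print("problems:", problems)
```

An empty problem list means every sampled value matched its expected type, which is the point at which it makes sense to move on to building the flow.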
Rebuild VIP flow on Server
This will allow the server to generate the synthetic data. It will display the Rebuild VIP Flow on Server dialog box and when you click execute, it will open a new browser tab to display the job details.
Once it successfully completes, the flow will be added as a component.
Generate data
Create a submit form
In order to generate data, you need to create a submit form. This will allow anyone with appropriate permissions in your organisation to create the data.
On the Data generation activity page, click the “Create Data Generation Submit Form” action.
This will display the Create Data Generation Submit Form dialog box.
This is a job scheduling dialog box, so you can schedule the job to run at a different time, if needed, on the schedule tab.
You can generate a CSV file instead by changing the ‘Type of Submit form to Be Created’ dropdown.
You can also use this to update an existing submit form, by selecting the form in the “choose an existing process and Update it” field.
When you click execute, a new browser tab will open, so that you can see the job to create the form being processed.
Once the job successfully completes and you refresh the browser tab, you should see the submit form appear as a component in the data activity.
Run the submit form
To run the form, make sure the action for the form is set to execute, then click the blue arrow.
When you run the form, it will create a job which will create data in the database you have chosen.
You can change the number of records created, and any parameters that you exposed on the activity. In this example I just added LastName.
Click execute to run the job.
As this is a dialog to start a job, you can schedule it for later on the schedule tab.
Note that you can click </> to embed the form into a page on your intranet, or to use it in CI/CD or other server processes.
To confirm that the data is created, you can check the database table to ensure that the row was added. In this example:
I used Micro DB (1) in the Curiosity platform
Checked the database connection that I was writing data to (2)
Ran SQL to view the table data (3)
Which shows the row that I created has been added (4)
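If you prefer to script this check, the same verification can be sketched with Python's built-in `sqlite3` module. This is a self-contained illustration against an in-memory database, not a connection to Micro DB. The `customers` table, the inserted row and the `rows_with_last_name` helper are hypothetical stand-ins for the table your job writes to.

```python
import sqlite3


def rows_with_last_name(conn, last_name):
    """Return all rows in the customers table with the given last_name."""
    return conn.execute(
        "SELECT id, first_name, last_name FROM customers WHERE last_name = ?",
        (last_name,),
    ).fetchall()


# Hypothetical table standing in for the target of the generation job.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
# Stand-in for the row written by the job.
conn.execute(
    "INSERT INTO customers (first_name, last_name) VALUES (?, ?)", ("Alice", "Okafor")
)
conn.commit()

matches = rows_with_last_name(conn, "Okafor")
print(matches)
```

Filtering on the value you supplied at run time (here the last name) makes it easy to tell the newly generated row apart from data that was already in the table.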
Next steps
This document shows you how to add one table to a data generation activity. Adding two or more is similar and is detailed in the related article: Multi-Table Data Generation.
You can embed the data generation activity into a model, so that you can set up different test cases, for example different account types based on salary: Visual Modelling and Data Generation