Introduction to Data Masking
Introduction
What is Data Masking?
Data Masking is the process of masking or anonymizing a set of data that will be used as test data for an application under development. The primary motivation for masking data is usually to comply with data protection requirements. These requirements might be dictated by applicable privacy laws or might be governed by an organisation's internal policies. They often apply when production data refers to personally identifiable information (PII), as well as to commercially or financially sensitive information. As strict laws often prohibit the use/re-use of personal data, there is good reason to anonymize data before it is moved to test and development. Data Masking, a part of Test Data Automation* supports compliance in testing and development by removing sensitive data from data before it is moved to non-production environments.
*Test Data Automation (TDA) is Curiosity's complete test data solution set. Find out more about TDA here.
Data Masking Video
For a brief overview of the Data Masking using Test Data Automation, see:
Requirements and Installation
The required pre-requisites skills and experience for data masking with Test Data Automation are:
Knowledge of SQL and SQL Server Management software
Knowledge of Microsoft Excel
Knowledge of VIP
To use Data Masking from Test Data Automation, you will need an intermediate understanding of SQL. You will also need a good level of proficiency in Microsoft Excel, and an understanding of the database that you are masking. There are four pieces of prerequisite software for Data Masking:
You must be running a 64-bit Windows machine. Windows 10 is strongly recommended and is required for the full range of functionality.
The latest 64-bit version of Microsoft Excel.
The latest version of VIP. Information on prerequisites and installation is in this article.
The latest version of the VIP Execution Server. Prerequisites and installation can be found here.
Once you have installed the prerequisite software on a 64-bit Windows machine, you will require licensed access to the relevant TDM utilities. Contact your Curiosity representative or email info@curiosity.software if you need to request a license.
Supported database types
Data Masking from Test Data Automation supports the following database types:
MySQL
Postgres
SQL Server
Oracle
DB2
MariaDB
Note: In the interest of simplicity, this document uses an example MySQL Database, but Data Masking can be done with any of the Database formats listed above. If are using a Database type that is not supported yet, please ask us for a quote – info@curiosity.software.
The Requisite Files needed for Data Masking
Data Masking is one of Test Data Automation's high-speed utilities. As set out in the process overview, a generic VIP Workflow performs the Data Masking Actions used during Data Masking. These Actions are configured by the user in the Control Spreadsheet. Each Action is customisable, with a range of parameters that can be defined.
You therefore require a number of files to run a Data Mask. The Control Spreadsheet and VIP Server Controller are used to configure a Data Mask iteratively. If running Actions via The VIP Server Controller, this spreadsheet will be the primary artefact that you work with.
Note: You can also parameterize actions in a Command Script and execute the actions from the command line rather than the VIP Server Controller. For this, a standard set of editable .CMD scripts are used. These are shipped with the VIP Server Controller installation. To run the automated Actions, you also require a set of automated VIP workflows and the associated files.
The Files
The main points for documentation is to refer to the right file path for things – so in the initial setup the DDL to create tables and functions is now in:
C:\VIPTDM\SubsetCloneAndMask\EnvironmentPrep\DDL.
The preferred way to do masking is to use SQL Server by going to:
C:\VIPTDM\SubsetCloneAndMaskRun\Mask\SQLServer
where you will see the following files:
C:\VIPTDM\SubsetCloneAndMaskRun
| PutFlowsInMemory.cmd
+---Mask
\---SQLServer
| Executor.cfg
| Mask_1_Prep.cmd
| Mask_2_Mask.cmd
| VIPMaskSQLServer.xlsx
|
|+---EnvironmentPrep
| Mask_InsertLists.cmd
|
\---Output
The steps for Masking using VIPExecutor (recommended) are:
Run PutFlowsInMemory.cmd
Edit VIPMaskSQLServer.xlsx so that it has your connection details
Run Mask_1_Prep.cmd
Manually set up masking rules5. Run Mask_2_Mask.cmd
If for some reason you would like to run Masking from the Command Line (not the recommended method), please see Data Masking from the Command Line section.
All generated output files will go to an Output folder:
C:\VIPTDM\SubsetCloneAndMaskRun\Mask\SQLServer\Output
Mask_InsertLists.cmd is the one-off environment preparation step for masking
Note: Files in C:\VIPTDM\SubsetCloneAndMask should not as a rule be changed, once the initial environment preparation has been done (ie tables and functions created).
Folder C:\VIPTDM\SubsetCloneAndMask contains
encrypted VIP flows
dlls
control spreadsheet examples (the "starting point" for each function)
batch script examples
VIPExecutor config examples
Table and UDF DDL
Regex file for masking
Seedlist .csvs
It is not possible to change any of the files with the exception of the following:
the naming used in the Table and UDF DDL
the regexes used in masking (C:\VIPTDM\SubsetCloneAndMask\DataScannerRules.csv)
the seedlists
The following requisite files (for MySQL DB) are therefore needed to define and perform the Data Masking Actions.
A blank Control Spreadsheet that points the VIP workflow to the database that will be modelled and masked, and then configures the Masking Rules.
Note: The database used throughout this documentation is SampleCommerce which is like one that would be produced by an eCommerce site. It contains various orders, grouped by the customer who made the order. Each order has an order ID, and there are products associated with each order ID. Each product in turn has a supplier company associated with it.
If running Masking Actions via The VIP Server Controller, this is the primary or sometimes only artefact you will work with.
Automated workflows used to perform the Masking Actions. These are made up of:
a. Two TDM VIP workflows that execute the Masking Actions in C:\VIPTDM\SubsetCloneAndMask:
i. SQLMask.enc.vip
ii. SQLActions.enc.vip
b. The Dynamic Link Library (.DLL) file associated with the automated Mask workflows in C:\VIPTDM\SubsetCloneAndMask
i. VIPSubset.dll
c. The standard command script used to execute the Masking Actions. This script can be edited for each Action when running from the Command Line. You also can create and rename several scripts for several, easily re-usable Actions.
i. Mask_InsertLists.cmd in C:\VIPTDM\SubsetCloneAndMaskRun\Mask\MySQL\EnvironmentPrep
ii. Mask_1_Prep.cmd in C:\VIPTDM\SubsetCloneAndMaskRun
iii. Mask_2_Mask.cmd in C:\VIPTDM\SubsetCloneAndMaskRun
If running Masking Actions via The Command Line, you will edit and re-use these command scripts.
d. The PutFlowsInMemory.cmd and associated FlowCache.txt in C:\VIPTDM\SubsetCloneAndMaskRun. FlowCache.txt is used to tell the VIP Server Controller the location of VIP workflows to put "in memory". This improves the performance of the automation.
e. Folders containing additional files are in C:\VIPTDM\SubsetCloneAndMaskRun :
i. Output – This folder contains excel files showing pre and post masked values as well as the log files. This folder by default will be located in C:\VIPTDM\SubsetCloneAndMaskRun\Mask\SQLServer\Output. Note that this file path can be set as an argument value in the VIP Controller.
ii. DDL – This folder in C:\VIPTDM\SubsetCloneAndMask\EnvironmentPrep\DDL\MySQL contains four SQL scripts for Creating/Dropping and Tables and Functions:
MySQL_Functions_Create.sql
MySQL_Functions_Drop.sql
MySQL_Tables_Create.sql
MySQL_Tables_Drop.sql
Additionally, the SQLServer folder contains
VIPFunctions – contains scripts to create tables and user-defined functions used in masking (see Initial Setup below)
TestDB – contains SQL scripts to create and populate a simple example database.
iii. Seedlists* – This folder contains comma delimited files with lists of values for use in masking. For example LastName.csv contains a randomly generated list of 1,000 last names. Seedlist files can be found in C:\VIPTDM\SubsetCloneAndMask\EnvironmentPrep\Seedlists
*Seedlists : a list or set of lists containing members of a list type that will be used as source/original data for a process such as masking.
Note: All outputs (log file, MaskReport.txt, audit .csvs…) go to an Output folder: –C:\VIPTDM\SubsetCloneAndMaskRun\Mask\SQLServer\Outputas setup by logDir="C:\VIPTDM\SubsetCloneAndMaskRun\Mask\SQLServer\Output"
Files will be provided through a downloadable installer package. Instructions will be provided for how to unpack into the man folder etc.
Housekeeping tip: It is recommended that you create a folder for each individual Data Masking task. Maintaining individual folders per Masking enables better management of your TDM artefacts. It helps to avoid them being edited or lost, and also enables you to quickly and easily repeat TDM jobs in future.
You might additionally incorporate a file structure where TDM tasks are stored under folders for individual Databases or Schemas
Data Masking List of Steps
TEST DATA AUTOMATION DATA MASKING