Data Probability Analysis
This page shows the Test Data Activity for Data Probability Analysis, a technique to analyse and discover relationships in data. In this tutorial, you'll learn how to set up and configure the Data Probability Analysis Activity.
About
The Data Probability Analysis Activity is an exploration tool that analyses data sets for different combinations of values. The result is useful for identifying patterns that exist in your data. The activity uses the Apriori algorithm, a machine learning technique, to discover relationships. The scan looks for potential relationships between data states to see which combinations occur together and which do not, indicating the potential presence of business logic. The probability analysis result can then be used to validate additional data sets, as well as an input for generating synthetic data.
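To make the idea of "combinations that occur together" concrete, here is a minimal Python sketch. It is a brute-force co-occurrence count rather than the Apriori algorithm itself, and it is not the activity's actual implementation. The function names and the sample rows are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probabilities(rows, group_size=2):
    """Map each group of `group_size` column=value items to the fraction
    of rows in which the whole group occurs together."""
    counts = Counter()
    for row in rows:
        items = sorted(f"{col}={val}" for col, val in row.items())
        for combo in combinations(items, group_size):
            counts[combo] += 1
    return {combo: n / len(rows) for combo, n in counts.items()}


def confidence(rows, antecedent, consequent):
    """Fraction of rows matching `antecedent` that also match `consequent`."""
    col_a, val_a = antecedent
    col_c, val_c = consequent
    matching = [r for r in rows if r.get(col_a) == val_a]
    if not matching:
        return 0.0
    return sum(1 for r in matching if r.get(col_c) == val_c) / len(matching)


# Hypothetical sample data, for illustration only.
rows = [
    {"CustomerType": "Retail", "Discount": "No"},
    {"CustomerType": "Retail", "Discount": "No"},
    {"CustomerType": "Wholesale", "Discount": "Yes"},
    {"CustomerType": "Wholesale", "Discount": "No"},
]

probs = cooccurrence_probabilities(rows)
print(probs[("CustomerType=Retail", "Discount=No")])                      # 0.5
print(confidence(rows, ("CustomerType", "Retail"), ("Discount", "No")))   # 1.0
print(confidence(rows, ("CustomerType", "Retail"), ("Discount", "Yes")))  # 0.0
```

In this sample, every Retail row has Discount=No, so that pair always occurs together, while Retail and Discount=Yes never do. Both extremes are the kind of pattern that may point to business logic.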
Tutorial
Follow along with the video tutorial, or read the written tutorial below where each of the data probability analysis steps is broken down and explained.
Prerequisites for Data Probability Analysis
For this data activity, you will need:
- A CSV file of the dataset to be analysed. We recommend exporting your data to CSV format.
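Before uploading, it can help to confirm that the exported file is well formed. The sketch below uses Python's standard `csv` module to check that every record has the same number of fields as the first row. It assumes the first row is a header, which is our assumption and not a documented requirement of the activity. The `check_csv` helper is hypothetical.

```python
import csv
import io


def check_csv(text):
    """Return (header, row_count); raise ValueError on a ragged record."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if not header:
        raise ValueError("CSV has no header row")
    row_count = 0
    # Record 1 is the header, so data records start at 2.
    for record_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(
                f"record {record_no}: expected {len(header)} fields, got {len(row)}"
            )
        row_count += 1
    return header, row_count


# Hypothetical sample export, for illustration only.
sample = "CustomerType,Discount\nRetail,No\nWholesale,Yes\n"
print(check_csv(sample))  # (['CustomerType', 'Discount'], 2)
```

For a file on disk, read it with `open(path, newline="")` and pass its contents to the same function.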
Step 1 - Create Probability Analysis Activity
The first step in scanning your data is to create a new data probability analysis activity. Navigate to the Data Activities dashboard, then select the 'Data Probability Analysis' activity.
This will launch the wizard for creating data activities. In this section, you are required to provide specific details about the activity, including a mandatory name and description. After filling out the necessary information, click on the 'Next' button to proceed.
After entering the activity details, you must select a location to save the data activity. Once you have chosen a location, click the 'Finish' button to complete the wizard.
A new data activity will appear, ready for configuration.
Step 2 - Probability Analysis Configuration
When a Probability Analysis activity is created, a default configuration is set. It specifies the different settings within the data activity. You will want to edit this default configuration to suit your needs. To do this, click the edit button in the top right-hand corner.
We will start by setting the property parameters in the configurations section. Click the Edit button to open the configurations tab.
The following property parameters can be configured to suit your needs:
Group Count | The number of columns to analyze for a match. For example, a value of “2-5” will analyze the range from 2 to 5. A value of “2,3” will perform an analysis on columns 2 and 3. |
Limit to Items | Restricts the analysis to a specific value, for example “CustomerType=Retail”. |
How many to analyze | The number of rows to process in the dataset. |
Must-occur confidence limit | If the analyzed probability is above this value, the relationship is reported as “Must Occur”. |
Report the full analysis | If the value is True, the final report will show all potential relationships. If the value is False, it will show only the most likely relationships. By default, this is set to False. |
Columns to analyze | Determines how many columns, counted from the start of the dataset, will be processed. When left empty, all columns will be analyzed. Alternatively, a rule set can be used to select the columns to be analyzed. |
Max time to run | The maximum time to run the analysis, in seconds. |
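The value formats in the table above can be illustrated with a short Python sketch. It only shows how strings such as “2-5”, “2,3” and “CustomerType=Retail” might be broken down into their parts. How the activity itself parses and interprets these values is not shown here, and both helper functions are hypothetical.

```python
def parse_group_count(spec):
    """Expand a spec such as "2-5" or "2,3" into a sorted list of integers."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            low, high = (int(p) for p in part.split("-", 1))
            if low > high:
                raise ValueError(f"invalid range: {part!r}")
            values.update(range(low, high + 1))
        else:
            values.add(int(part))  # int() raises ValueError on bad input
    return sorted(values)


def parse_limit_item(spec):
    """Split a "Column=Value" spec into a (column, value) pair."""
    column, sep, value = spec.partition("=")
    if not sep or not column.strip():
        raise ValueError(f"expected Column=Value, got {spec!r}")
    return column.strip(), value.strip()


print(parse_group_count("2-5"))                 # [2, 3, 4, 5]
print(parse_group_count("2,3"))                 # [2, 3]
print(parse_limit_item("CustomerType=Retail"))  # ('CustomerType', 'Retail')
```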
Step 3 - Upload CSV file
Now that the properties are set, we can begin the data probability analysis. You will need a CSV file to upload into TDA. Under the Actions section, select 'Upload File to be analyzed to Server'. Select the input field and upload the file to the activity.
Step 4 - Run Data Probability Analysis
When the file has been uploaded, it will show in the Components section.
From here, we can execute our analysis. Click Run, and then select your server to begin the job.
Step 5 - Review Results
Once the job has finished, a new component will appear that contains the results in a new CSV file. Select the download action from the dropdown to download the file.
This will process the download request and provide an option to download a zip folder containing your new report.
The CSV contains the following analysis in the first four columns:
Probability | The probability that this relationship occurs in the dataset. |
Frequency | How many times the n selected columns occur together in the data set. The Must Occur value is evaluated based on this field. |
Combination Search | This value is based on what is specified when generating a paired set. |
Possible Rule | The relationship discovered in the data, based on the configured confidence limit. |
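Once downloaded and unzipped, the report can also be filtered programmatically. The sketch below is a minimal Python example that keeps only the rules at or above a chosen probability. It assumes the CSV header names match the column names listed above and that the Probability column holds a plain number. Both are assumptions, so check them against your own report. The sample text is made-up placeholder data, not real output from the activity.

```python
import csv
import io


def rules_above(report_text, threshold):
    """Return the 'Possible Rule' values whose probability is >= threshold."""
    reader = csv.DictReader(io.StringIO(report_text))
    selected = []
    for row in reader:
        # Assumed header names; adjust them to match your report.
        if float(row["Probability"]) >= threshold:
            selected.append(row["Possible Rule"])
    return selected


# Placeholder data for illustration only, not real output from the activity.
sample_report = (
    "Probability,Frequency,Combination Search,Possible Rule\n"
    "1.0,2,search A,rule A\n"
    "0.5,1,search B,rule B\n"
)

print(rules_above(sample_report, 0.9))  # ['rule A']
```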
Here is an example of a final CSV report: