There is a pre-requisite to use a database: data. These data are stored according to the specification of a data model that is designed from the modeling generated from the analysis of the storage requirements. For example, a system that controls the access of employees in a company that replaces the old timecard will have data about people, times, dates, authorization, etc.
Most models of data is filled through the use of an interface system that allows the capture of input performed by manual typists/users or automatically performed by sensors or specialized equipment. But what to do when it is necessary to fill the database with simulated data? This situation is very common during validations, approvals, performance testing, exportation of the database and other situations. Unfortunately, the available databases in the market do not have adequate resources and tools to generate simulated data suitable for real data models.
Based on this scenario, this article will present some open-source alternatives to allow the generation of simulated data, also known as mass screening. The article also shows how to populate data in tables of three partial data models that in order to illustrate the needs of simulated data. To task of generate simulated data is crucial for any professional who works with developing and managing the database, since data generation is closely linked with the process of testing and approval.
This last part present two examples of simple data models that require simulated data. The models contain relationships with cardinalities 1:N and N:M and are discussed in the context o data generation.
Example 2: Relationship with the 1:N cardinality
This example discuss how to generate simulated data for two tables whose relationship has the cardinality 1:N. The tables chosen are VEHICLE and VEHICLE_USE belonging to a model that tracks information about the use of vehicles for a fleet management system in a big company. Figure 1 shows the tables VEHICLE_USE and VEHICLES related by foreign key and primary key constraints on the column ID_VEHICLE placed on both tables.
Figure 1. Tables VEHICLE_USE and VEHICLES in the project database to control fleet usage.
The generation of simulated data for these two tables must be made in two steps: first it is necessary to generates the data for the table VEHICLES and then insert the data in the table VEHICLE_USER. In this way we avoid generating data that are inconsistent with the constraint imposed by the foreign key.
The first step for generating data for the table VEHICLES is set up a generator to auto-increment the column ID_VEHICLE as this will cause no duplication of value for the primary key. Moreover, it is worth to have generators that are based on lists of values for the columns MANUFACTURER, MODEL_YEAR, COLOR, INSURANCE_COMPANY, and TYPE_VEHICLE. The column NUMBER_INS_POLICY must be generated in accordance with the format used by insurance policy numbers. The values for the column PURCHASE_DATE must be created from generators that accepct a start date and a end date with the exception of specific dates where the purchase can not be made (weekends and holidays). The other columns can be generated typically by regular expressions and random values .
With the data generated for the VEHICLES table now we must must fill up the table VEHICLE_USE. Sequential values automatically incremented can be placed in the column ID_USE. It is required to get a list with the values previously registered in the table VEHICLE and only then choose amont them to insert data in the column ID_VEHICLE of the table VEHICLE_USE. Other than this detail, the table VEHICLE_USE has no restrictions other than the following rule to the column value DATE_RETURN: the return date values must be greater than the value of the column DATE_GET because the returning date of the vehicle indicates when the vehicle was being used as a benefit. This constraint is easily specified when a range of start and end dates are defined in the tools.
Example 3: Relationship with the M:N cardinality
This example features the classic M:N relationship between a table that store orders and a table that store products. This type of relationship requires a third additional table to relate which products belong to which orders. The orders table is simple and it's called ORDERS, the products table is called PRODUCTS table and the relationship between then is made by a table called ORDER_ITENS. Figure 2 shows these three tables, their columns and relationships.
Figure 2. Classic parent-child relationship between the tables ORDERS, ORDER_ITENS and PRODUCTS.
To create simulated data for the tables in this type of relationship is necessary to separate the generation of simulated data in three steps: first you must generate the data to PRODUCTS, then the ORDERS and finally put the data in table ORDER_ITENS in order to comply with its two foreign keys placed in the columns ID_ORDER and ID_PRODUCT.
The generation of data for the PRODUCTS table has no major difficulties. It takes a sequential number generator self-countable for the column ID_PRODUCT, and generators for simple description (PROD_DESCRIPTION) and PROD_PRICE. The most difficult step is the use of lists of values from one file to full up the description columns and also the specification of numerical values with decimal places to the maximum and minimum prices. The ORDERS table is also simple and requires a sequential number generator for the ID_ORDER and a random value generator for the dates in the ORDER _DATE column.
To create the relationship between the two tables and populate the ORDER_ITENS table one must respect two rules: the foreign keys and also the primary key. The foreign keys can be generated from the list of values already created in the ORDERS and PRODUCTS tables without problems. But to guarantee that there will be no repetition is something a little more complex to be done with the tools to generate data. In this situation it is recommended to first generate the data with possible duplicated values and then perform a simple post-processing with a DELETE statement that search for duplications and removes these values before creating the primary key in the table ORDER_ITENS.
This article presented a discussion and examples of how to generate simulated data to a database. The generation of simulated data is important in various situations such as testing, approval, and transfers of the database without disclosure of sensitive data and others.
The first part of the article covered the theoretical aspects that should be considered prior to generating data. We discussed the need for representation of data, the quality and quantity, the environment, the implications and technical aspects involved in this type of task such as import options, growth, and purging the transaction log data.
The second part discussed some alternatives to simulated data: how to work with custom scripts, some ready databases and presentation of two open source tools that can be used to automate the generation of simulated data: GenerateData and Spawner
The third part presented the dgMaster, a more complex Java tool used to perform complex generation of data. We also talked about a simple table and how to proceed when generating data for this table.
Finally, this last part presents two examples where it may be necessary to generate simulated data: two tables with a relationship with cardinality 1:N and three tables involved in a relationship with cardinality N:M. The main details about the columns and specific types of data were discussed in order to generate the specific data in each data model.
To see part three, go to: http://mrbool.com/p/Generating-data-for-a-simple-table-Populating-the-database-Part-3/22940