A database has one prerequisite: data. The data is stored according to a data model, which is designed from the analysis of the storage requirements. For example, a system that replaces the old timecard to control employee access in a company will hold data about people, times, dates, authorizations, etc.
Most data models are populated through an interface that captures input typed manually by users or collected automatically by sensors or specialized equipment. But what should we do when the database must be filled with simulated data? This situation is very common during validation, approval, performance testing, export of the database, and other situations. Unfortunately, the databases available on the market do not provide adequate resources and tools to generate simulated data suitable for real data models.
Based on this scenario, this article presents some open-source alternatives for generating simulated data, also known as mass screening. The article also shows how to populate the tables of three partial data models in order to illustrate the need for simulated data. The task of generating simulated data is crucial for any professional who develops and manages databases, since data generation is closely linked to the process of testing and approval.
In this third part we are going to see one more tool used to generate simulated data and one simple example of a table that requires simulated data.
The open source tool dgMaster (http://dgmaster.sourceforge.net/), developed in Java, is more complete than the GenerateData and Spawner tools. It features dozens of automatic value generators, supports custom data types and regular expressions, integrates directly with any database that has a JDBC driver, can be executed from the command line, allows generators to be set up by editing XML files, and can rely on any function written in Java as an arsenal for creating data generators. Figure 1 shows the graphical interface of dgMaster.
Figure 1. Graphical interface of the dgMaster data generation tool.
Examples of data generation
To make it easier to understand the need for data generation and how to do it, this article will present three examples of data generation using existing data models. First we will see how to generate data for a table without relationships. Next is an example of how to generate data for a table in a relationship whose cardinality is 1:N. Finally, we will show how to generate data for three tables in a relationship whose cardinality is N:M.
Example 1: Table without relationships
This example uses a table without relationships. The table selected is called APARTMENTS and is used to represent the residents of a building. Figure 2 shows the table and its columns.
Figure 2. Apartments table for a database that stores apartment owners' data.
This table has several fields with specific rules. The column SSN_OWNER (Social Security Number of the apartment's owner) is the most complex to generate, since a social security number must follow a specific format rule. Therefore, it is recommended either to use an existing list of Social Security Numbers stored in a file, or to rely on an external Java program that contains the logic required to create test SSNs and plug it into dgMaster as a custom data generator. The same principle can be applied to the column OWNER_NAME, drawing each value from a list of full names (name + surname). The column TOWER will probably have few values, such as North Tower, South Tower, etc. In this case it is recommended to build the list of possible values manually. The tools discussed in the previous section offer an option in their interfaces to enter a list of values, and during data generation a value is chosen from that list at random. For the telephone columns (PHONE_NUMBER1 and PHONE_NUMBER2) it is recommended to use a mask that matches the required phone format, for example the mask pattern (99) (999) 9999-9999, where 9 means any digit between 0 and 9. The other columns do not have much complexity in their values and can be filled by the automatic generators of the tools.
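To make the three techniques above more concrete (picking from a list of values, expanding a mask, and producing a formatted test SSN), the following is a minimal sketch in Python. It is not how dgMaster works internally. The value lists, the function names expand_mask, fake_ssn and generate_row, and the choice of area numbers 900-999 for the SSN are all assumptions made for illustration. The 900-999 range is picked here only so the values are easy to recognize as test data, and the rule should be adjusted to whatever format your model really requires.

```python
import random

# Hypothetical value lists; in practice they would be loaded from files.
TOWERS = ["North Tower", "South Tower"]
FIRST_NAMES = ["Ana", "John", "Maria", "Peter"]
SURNAMES = ["Silva", "Smith", "Costa", "Jones"]

# Mask taken from the article: every '9' stands for one digit from 0 to 9.
PHONE_MASK = "(99) (999) 9999-9999"


def expand_mask(mask, rng):
    """Replace each '9' in the mask with a random digit; copy all other characters."""
    return "".join(str(rng.randint(0, 9)) if ch == "9" else ch for ch in mask)


def fake_ssn(rng):
    """Build a test value in the AAA-GG-SSSS layout.

    Assumption: the area part is drawn from 900-999 so that generated values
    are easy to spot as test data. This is a placeholder rule, not a full
    implementation of real SSN validation.
    """
    area = rng.randint(900, 999)
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area}-{group:02d}-{serial:04d}"


def generate_row(rng):
    """Return one simulated row for the columns of APARTMENTS named in the text."""
    return {
        "SSN_OWNER": fake_ssn(rng),
        "OWNER_NAME": f"{rng.choice(FIRST_NAMES)} {rng.choice(SURNAMES)}",
        "TOWER": rng.choice(TOWERS),
        "PHONE_NUMBER1": expand_mask(PHONE_MASK, rng),
        "PHONE_NUMBER2": expand_mask(PHONE_MASK, rng),
    }


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so a test run can be repeated
    for _ in range(3):
        print(generate_row(rng))
```

Passing a seeded random.Random instance to every function keeps the generated data reproducible, which helps when a failed test has to be repeated with the same rows.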
This article presented a discussion and examples of how to generate simulated data for a database. Generating simulated data is important in several situations, such as testing, approval, and transferring a database without disclosing sensitive data.
The first part of the article covered the theoretical aspects that should be considered prior to generating data. We discussed the need for representative data, its quality and quantity, the environment, and the implications and technical aspects involved in this type of task, such as import options, growth, and purging of the transaction log.
The second part discussed some alternatives for obtaining simulated data: how to work with custom scripts, some ready-made databases, and two open source tools that can be used to automate the generation of simulated data, GenerateData and Spawner.
This third part presented dgMaster, a more complete Java tool for complex data generation. We also looked at a simple table and how to proceed when generating data for it.
In the next article we will finalize the examples and conclude this series.
To see part two, go to: http://mrbool.com/p/Existing-data-sets-and-tools-Populating-the-database-Part-2/22899