There is a pre-requisite to use a database: data. These data are stored according to the specification of a data model that is designed from the modeling generated from the analysis of the storage requirements. For example, a system that controls the access of employees in a company that replaces the old timecard will have data about people, times, dates, authorization, etc.
Most models of data is filled through the use of an interface system that allows the capture of input performed by manual typists/users or automatically performed by sensors or specialized equipment. But what to do when it is necessary to fill the database with simulated data? This situation is very common during validations, approvals, performance testing, exportation of the database and other situations. Unfortunately, the available databases in the market do not have adequate resources and tools to generate simulated data suitable for real data models.
Based on this scenario, this article will present some open-source alternatives to allow the generation of simulated data, also known as mass screening. The article also shows how to populate data in tables of three partial data models that in order to illustrate the needs of simulated data. To task of generate simulated data is crucial for any professional who works with developing and managing the database, since data generation is closely linked with the process of testing and approval.
In this second part we are going to see some alternatives to generate simulated data including the use of existing datasets, the creation of custom scripts to insert data and two open source tools freely available to work with simulated data.
Alternatives to generate simulated data
A developer who works with databases or programming has several alternatives to assist the process of generating simulated data. This section describes the use of existing datasets, the creation of custom scripts and three free software tools that automate the process of generating simulated data.
The generation of simulated data should be only considered after checking the possibility of obtaining an existing dataset filled with the required data. The popularization of the Internet brings out new possibilities to find different dataset filled with information ready to be used at no additional cost. However, these datasets have their own structure and are usually very specific, which means that you must perform some kind of adaptation to take advantage of these data. Table 1 presents a listing of some URLs where you can entire datasets or files with different data types. Note that the data presented by the sources are in different formats such as XML, CSV, SQL, and loose files.
Posts, comments and other test data for the WordPress CMS
IP addresses by country
Table with the IP addresses associated with each country
Blacklist domains and emails
Black list of email addresses and domains associated with spam
Complete dataset with list of artists, albums, tracks, etc.
List of images for computer vision (faces, fingerprints, natural scenes, medical imaging, etc.).
Several backups of all comments, pages, and historical change in the Wikipedia
Sakila Sample Database
Complete database for video rental. Used in multiple instances of MySQL and PostgreSQL
Dell DVD Store Database Test Suite
Flexible database for video rental. Used in tests of DELL hardware
OLAP database on sales of food products
Northwind e Pubs
Sample databases for SQL Server 2000 (sales of products and bookstore)
SQL Server 2005 sample database (sales of sports goods)
Information on the database of human resources used in the examples of Oracle. The scripts are in the package installation of Oracle
Table 1. List of some datasets available on the internet
In addition to the databases listed in Table 1 there are several other ways to get important data. There are even companies that specialize in finding, cleaning and proving specific dataset for sale.
The first action taken by many developers and DBAs when it comes to fill up the database with simulated data is to create a custom script. This alternative is also called ad hoc because it is completely dependent on the model and need to be tailored. Either way, filling a script to mount the database requires planning and is more powerful alternative in terms of flexibility. This script should contain some fixed values as well as instructions to generate with random number in order to simulate the range of values . For example, suppose the script programmed in the Oracle''s PL/SQL programming language presented in Listing 1. This script is intended to fill the ORDERS table shown in Figure 1 with simulated data.
Figure 1. A simple table called Orders for the sample custom script.
Listing 1. Script for an example of generating simulated data in Oracle.
/* INSERT 50 SIMULATED ORDERS */
X INTEGER := 1;
V_DATE DATE := trunc(trunc(sysdate,''mm''),''dy'');
VALUE_CUSTOMER INTEGER :=0;
VALUE_DELIVERYCOMPANY INTEGER := 0;
WHILE X <= 50 LOOP
VALUE_CUSTOMER := 101+ABS( DBMS_RANDOM.RANDOM MOD 10);
VALUE_DELIVERYCOMPANY := ABS( DBMS_RANDOM.RANDOM MOD 3) + 1;
INSERT INTO ORDERS VALUES(X, VALUE_CUSTOMER , VALUE_DELIVERYCOMPANY,15,V_DATE,5);
V_DATE := V_DATE + ABS( DBMS_RANDOM.RANDOM MOD 10) + 1;
X := X+1;
The script in Listing 1 adds 50 new orders to the ORDERS table. The value of the x variable is a counter that is incremented at each step of the WHILE loop and is used to fill the column value idOrder. The idCustomer column is calculated as follows: a random number between 0 and 9 is added to the fixed value 101 so a new value between 101 and 110 is stored in the idCustmer column, as this is the range of values that represents clients registered in another table (not shown in Figure 1). This technique is used to prevent violations of the foreign key. Similarly, the value for column idDeliveryCompany is generated by a random number between 1 and 3. The value is fixed for the column idSalesPerson (15), the date is fixed (current date) and the number of items in the order is also fixed (5).
The script in Listing 1 contains only one example of how you can generate simulated data from a custom script. It is worth using random generated values when it is required to work with ranges of values to avoid violating foreign keys.
Tools for Data Generation
As mentioned in this article previously, there are several tools that generate simulated data. These tools have options to specify the data type of the columns, the formation rule of the data, lists of values , odds, export formats and the amount of rows to be generated. This article will present two open source tools that can assist the reader when there is a need to generate simulated data: GenerateData, Spawner. The next part will present the third tool: dgMaster.
The GenerateData is a simple tool for generating data that can be accessed online at the address http://www.generatedata.com. Figure 2 shows the interface for this data generation tool.
Figure 2. The interface of the GenerateData tool.
The GenerateData has two usage modes: on the Web site or by the download and installation of its source code in PHP+MySQL. The difference is simple: when using the GenerateData online one can only generate 200 lines, while the installation version has no such limitation. At the interface of this tool, shown in Figure 1, we can choose the export format (line Result Type), the type of data used for the country such as city, ZIP and others (line Country-specific data) among other options for values. You can also specify multiple columns, data types and formats that include auto incremented values , alpha-numeric, and custom lists that use masks to create a data format, e.g. XXX-9999. In this mask an X represents any alphanumeric value and a 9 represents a numeric digit (0-9).
The Spawner (http://sourceforge.net/projects/spawner/) is an open source desktop tool developed in Object Pascal. Figure 3 shows its interface for data generation.
Figure 3. The interface of the Spawner tool.
The Spawner has all the features of GenerateData without any limitation on the number of rows that can be generated. Moreover, it can also generate CSV files, SQL, and automatically populate a MySQL database. One of the interesting features of this tool is the capacity to generate values from a list of other values stored in a text file, something that the GenerateData tool does not provide. This option allows you to customize the type of data that can be generated, especially when dealing with foreign key constraints. In some preliminary tests the Spawner had the best performance for the generation of large amounts of data when compared to the GenerateData and dgMaster tools.
This article presented a discussion and examples of how to generate simulated data to a database. The generation of simulated data is important in various situations such as testing, approval, and transfers of the database without disclosure of sensitive data and others.
The first part article covered the theoretical aspects that should be considered prior to generating data. This second part discussed some alternatives to simulated data: how to work with custom scripts, some ready databases and presentation of two open source tools that can be used to automate the generation of simulated data: GenerateData and Spawner.
To see the first part, access: http://mrbool.com/p/Populating-the-database-Part-1/22850