Free Online Courses for Software Developers - MrBool
× Please, log in to give us a feedback. Click here to login
×

You must be logged to download. Click here to login

×

MrBool is totally free and you can help us to help the Developers Community around the world

Yes, I'd like to help the MrBool and the Developers Community before download

No, I'd like to download without make the donation

×

MrBool is totally free and you can help us to help the Developers Community around the world

Yes, I'd like to help the MrBool and the Developers Community before download

No, I'd like to download without make the donation

Existing data sets and tools – Populating the database – Part 2

See this article a discussion of alternatives to fill up the database with test data

sql-PopulandoSQL_Traduzido_part_2_v1 (1)test.doc

There is a pre-requisite to use a database: data. These data are stored according to the specification of a data model that is designed from the modeling generated from the analysis of the storage requirements. For example, a system that controls the access of employees in a company that replaces the old timecard will have data about people, times, dates, authorization, etc.

Most models of data is filled through the use of an interface system that allows the capture of input performed by manual typists/users or automatically performed by sensors or specialized equipment. But what to do when it is necessary to fill the database with simulated data? This situation is very common during validations, approvals, performance testing, exportation of the database and other situations. Unfortunately, the available databases in the market do not have adequate resources and tools to generate simulated data suitable for real data models.

Based on this scenario, this article will present some open-source alternatives to allow the generation of simulated data, also known as mass screening. The article also shows how to populate data in tables of three partial data models that in order to illustrate the needs of simulated data. To task of generate simulated data is crucial for any professional who works with developing and managing the database, since data generation is closely linked with the process of testing and approval.

In this second part we are going to see some alternatives to generate simulated data including the use of existing datasets, the creation of custom scripts to insert data and two open source tools freely available to work with simulated data.

Alternatives to generate simulated data

A developer who works with databases or programming has several alternatives to assist the process of generating simulated data. This section describes the use of existing datasets, the creation of custom scripts and three free software tools that automate the process of generating simulated data.

Existing datasets

The generation of simulated data should be only considered after checking the possibility of obtaining an existing dataset filled with the required data. The popularization of the Internet brings out new possibilities to find different dataset filled with information ready to be used at no additional cost. However, these datasets have their own structure and are usually very specific, which means that you must perform some kind of adaptation to take advantage of these data. Table 1 presents a listing of some URLs where you can entire datasets or files with different data types. Note that the data presented by the sources are in different formats such as XML, CSV, SQL, and loose files.

Source

Description

URL

CMS WordPress

Posts, comments and other test data for the WordPress  CMS

http://migre.me/27Z96

IP addresses by country

Table with the IP addresses associated with each country

http://migre.me/27Z9U

Blacklist domains and emails

Black list of email addresses and domains associated with spam

http://migre.me/27ZaA

Music dataset

Complete dataset with list of artists, albums, tracks, etc.

http://migre.me/27Zbi

Image bank

List of images for computer vision (faces, fingerprints, natural scenes, medical imaging, etc.).

http://migre.me/27ZbD

Wikipedia dataset

Several backups of all comments, pages, and historical change in the Wikipedia

http://migre.me/27Zev

Sakila Sample Database

Complete database for video rental. Used in multiple instances of MySQL and PostgreSQL

http://migre.me/27Zfv

Dell DVD Store Database Test Suite

Flexible database for video rental. Used in tests of DELL hardware

http://migre.me/27Zge

Foodmart

OLAP database on sales of food products

http://migre.me/27ZgN

Northwind e Pubs

Sample databases for SQL Server 2000 (sales of products and bookstore)

http://migre.me/27ZiM

Cycle Gears

SQL Server 2005 sample database (sales of sports goods)

http://migre.me/27Zjq

Human Resources

Information on the database of human resources used in the examples of Oracle. The scripts are in the package installation of Oracle

http://migre.me/27ZjP

Table 1. List of some datasets available on the internet

In addition to the databases listed in Table 1 there are several other ways to get important data. There are even companies that specialize in finding, cleaning and proving specific dataset for sale.

Custom script

The first action taken by many developers and DBAs when it comes to fill up the database with simulated data is to create a custom script. This alternative is also called ad hoc because it is completely dependent on the model and need to be tailored. Either way, filling a script to mount the database requires planning and is more powerful alternative in terms of flexibility. This script should contain some fixed values   as well as instructions to generate with random number in order to simulate the range of values  . For example, suppose the script programmed in the Oracle''s PL/SQL programming language presented in Listing 1. This script is intended to fill the ORDERS table shown in Figure 1 with simulated data.



Figure 1. A simple table called Orders for the sample custom script.

Listing 1. Script for an example of generating simulated data in Oracle.

/* INSERT 50 SIMULATED ORDERS */

DECLARE

  X INTEGER   := 1;

  V_DATE DATE := trunc(trunc(sysdate,''mm''),''dy'');

  VALUE_CUSTOMER INTEGER :=0;

  VALUE_DELIVERYCOMPANY INTEGER := 0;

BEGIN

 

  DBMS_RANDOM.INITIALIZE(10);

  WHILE X <= 50 LOOP

   

    VALUE_CUSTOMER        := 101+ABS( DBMS_RANDOM.RANDOM MOD 10);

    VALUE_DELIVERYCOMPANY := ABS( DBMS_RANDOM.RANDOM MOD 3) + 1;

   

    INSERT INTO ORDERS VALUES(X, VALUE_CUSTOMER , VALUE_DELIVERYCOMPANY,15,V_DATE,5);

 

    V_DATE := V_DATE + ABS( DBMS_RANDOM.RANDOM MOD 10) + 1;

    X      := X+1;

  END LOOP;

 

 

  DBMS_RANDOM.TERMINATE;

END;

.

/

The script in Listing 1 adds 50 new orders to the ORDERS table. The value of the x variable is a counter that is incremented at each step of the WHILE loop and is used to fill the column value idOrder. The idCustomer column is calculated as follows: a random number between 0 and 9 is added to the fixed value 101 so a new value    between 101 and 110 is stored in the idCustmer column, as this is the range of values   that represents clients registered in another table (not shown in Figure 1). This technique is used to prevent violations of the foreign key. Similarly, the value for column idDeliveryCompany is generated by a random number between 1 and 3. The value is fixed for the column idSalesPerson (15), the date is fixed (current date) and the number of items in the order is also fixed (5).

The script in Listing 1 contains only one example of how you can generate simulated data from a custom script. It is worth using random generated values   when it is required to work with ranges of values   to avoid violating foreign keys.

Tools for Data Generation

As mentioned in this article previously, there are several tools that generate simulated data. These tools have options to specify the data type of the columns, the formation rule of the data, lists of values  , odds, export formats and the amount of rows to be generated. This article will present two open source tools that can assist the reader when there is a need to generate simulated data: GenerateData, Spawner. The next part will present the third tool: dgMaster.

GenerateData

The GenerateData is a simple tool for generating data that can be accessed online at the address http://www.generatedata.com. Figure 2 shows the interface for this data generation tool.


Figure 2. The interface of the GenerateData tool.

The GenerateData has two usage modes: on the Web site or by the download and installation of its source code in PHP+MySQL. The difference is simple: when using the GenerateData online one can only generate 200 lines, while the installation version has no such limitation. At the interface of this tool, shown in Figure 1, we can choose the export format (line Result Type), the type of data used for the country such as city, ZIP and others (line Country-specific data) among other options for values. You can also specify multiple columns, data types and formats that include auto incremented values  , alpha-numeric, and custom lists that use masks to create a data format, e.g.  XXX-9999. In this mask an X represents any alphanumeric value and a 9 represents a numeric digit (0-9).

Spawner

The Spawner (http://sourceforge.net/projects/spawner/) is an open source desktop tool developed in Object Pascal. Figure 3 shows its interface for data generation.



Figure 3. The interface of the Spawner tool.

The Spawner has all the features of GenerateData without any limitation on the number of rows that can be generated. Moreover, it can also generate CSV files, SQL, and automatically populate a MySQL database. One of the interesting features of this tool is the capacity to generate values   from a list of other values   stored in a text file, something that the GenerateData tool does not provide. This option allows you to customize the type of data that can be generated, especially when dealing with foreign key constraints. In some preliminary tests the Spawner had the best performance for the generation of large amounts of data when compared to the GenerateData and dgMaster tools.

Conclusion

This article presented a discussion and examples of how to generate simulated data to a database. The generation of simulated data is important in various situations such as testing, approval, and transfers of the database without disclosure of sensitive data and others.

The first part article covered the theoretical aspects that should be considered prior to generating data.  This second part discussed some alternatives to simulated data: how to work with custom scripts, some ready databases and presentation of two open source tools that can be used to automate the generation of simulated data: GenerateData and  Spawner.

To see the first part, access: http://mrbool.com/p/Populating-the-database-Part-1/22850



Mauro Pichiliani has the Master of Science degree on collaborative systems by the Aeronatics Institute of Technology (ITA) in Brazil. He is a specialist on database technologies with more than 8 years of experience on the industry...

What did you think of this post?
Services
[Close]
To have full access to this post (or download the associated files) you must have MrBool Credits.

  See the prices for this post in Mr.Bool Credits System below:

Individually – in this case the price for this post is US$ 0,00 (Buy it now)
in this case you will buy only this video by paying the full price with no discount.

Package of 10 credits - in this case the price for this post is US$ 0,00
This subscription is ideal if you want to download few videos. In this plan you will receive a discount of 50% in each video. Subscribe for this package!

Package of 50 credits – in this case the price for this post is US$ 0,00
This subscription is ideal if you want to download several videos. In this plan you will receive a discount of 83% in each video. Subscribe for this package!


> More info about MrBool Credits
[Close]
You must be logged to download.

Click here to login