Using a database has one prerequisite: data. This data is stored according to a data model designed from the analysis of the storage requirements. For example, a system that controls employee access in a company, replacing the old timecard, will store data about people, times, dates, authorizations, and so on.
Most data models are populated through an interface system that captures input typed manually by users or produced automatically by sensors or specialized equipment. But what should we do when the database must be filled with simulated data? This situation is very common during validation, acceptance testing, performance testing, database migration, and other scenarios. Unfortunately, the databases available on the market do not provide adequate resources and tools to generate simulated data suitable for real data models.
Based on this scenario, this article presents some open-source alternatives for generating simulated data, also known as a test data mass. The article also shows how to populate the tables of three partial data models in order to illustrate the need for simulated data. The task of generating simulated data is crucial for any professional who develops or manages databases, since data generation is closely linked to the testing and acceptance process.
In this first part we discuss the planning required to generate simulated data, focusing on the types of simulated data and the context in which this data is used.
Planning for data generation
Before we start discussing how to generate simulated data, it is necessary to consider some important points about this job. These include the need for testing, the quality and quantity of the data, the environments, and the implications and technical aspects involved in this type of task.
Data generation typically takes place during the testing phase of a database or during the creation of a newly developed system. The testing field is far more advanced than it was a few years ago: today we have agile methodologies, test-driven development, continuous integration, specialized management tools for automatic test generation, different types of tests (unit tests, black box, white box, etc.), and other practices that show how to employ different types of data for verification, validation, system acceptance, and requirements compliance.
An important part of database testing is data generation. Its importance lies in the representativeness of the data used in tests, i.e. how closely the data resembles the data of a real situation. This need to represent reality matters because what is being tested should be evaluated as accurately as possible against what can actually happen in a production environment, rather than against data created by a tester just to complete the task quickly.
To illustrate the lack of commitment to generating realistic data: I have often found developers and testers who do not care about the quality and quantity of data and therefore provide values completely out of context when creating test data. This means customer names that do not represent reality, absurd addresses and street types, invalid dates, codes out of format, values outside the predetermined ranges, unrealistic prices, and other violations of reality. For example, I once found a customer address with the following data: "Milky Way, second star to the right" instead of something like "Street X, 25." It is also common to find situations where the user enters just a dot (.) or an empty string ("") to fill a required text field. This attitude toward test data shows no concern for whether the data represents reality. This lack of representativeness can contaminate a test and cause it to complete successfully when, with real data, a failure would occur. The reverse can also happen: a test with unrepresentative data may fail while the same test with real data would complete successfully.
If on the one hand we find a lack of data representativeness in some situations, on the other hand there is sometimes the need to use specific statistical models to fill the database. For example, consider a database that stores the arrival times, service times, and waiting times of customers in queues waiting to be served by a cashier. Depending on the rigor of the test, it may be necessary to generate data that follows the parameters of a Poisson or Normal distribution (see DevMan Notes 1 and 2). The need to generate statistically representative data is common in simulations, scientific applications, and deeper academic analysis, and it is important to know that there are mathematical models to generate data that simulates and predicts the characteristics of real events. The focus of this article is on generating a test data mass for a particular moment, not on formal tests that require fully statistically correct data.
DevMan Note 1. Poisson distribution
In probability theory and statistics, the Poisson distribution is a discrete probability distribution. It is a probabilistic model used in the study of a large number of observable phenomena, such as telephone calls per unit time, defects per unit area, accidents per unit time, arrivals of supermarket customers per unit time, blood cells visible under a microscope per unit area, and particles emitted by a radioactive source per unit time. The distribution was discovered by Siméon-Denis Poisson (1781-1840). A Poisson random variable counts the number of discrete occurrences (sometimes called "arrivals") during a time interval of a given length.
DevMan Note 2. Normal Distribution
The normal distribution is a statistical distribution also known as the Gauss or Gaussian distribution. It was developed by the French mathematician Abraham de Moivre. This distribution describes a range of physical and financial phenomena and is widely employed in inferential statistics. It is entirely described by two parameters, the mean and the standard deviation: knowing these two parameters, it is possible to determine any probability in a normal distribution. To verify whether a set of values follows a normal distribution, one obtains the mean and standard deviation of the set and applies a statistical test with a certain error rate.
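The statistical generation described above can be sketched with Python's standard library alone. The sketch below relies on the fact that in a Poisson arrival process the time between consecutive arrivals follows an exponential distribution, and uses a Normal distribution for service times. All rates and parameters are made-up illustration values, not figures from the article.

```python
import random
from datetime import datetime, timedelta

# Hypothetical parameters: customers arrive at an average rate of
# 2 per minute (Poisson process); service time follows a Normal
# distribution with mean 90 s and standard deviation 20 s.
ARRIVAL_RATE_PER_MIN = 2.0
SERVICE_MEAN_S = 90.0
SERVICE_STDDEV_S = 20.0

def simulate_queue(n_customers, start, seed=None):
    """Generate (arrival_time, service_seconds) rows for a queue.

    In a Poisson process the inter-arrival times are exponential,
    so we accumulate samples drawn with random.expovariate.
    """
    rng = random.Random(seed)
    rows = []
    t = start
    for _ in range(n_customers):
        # Exponential inter-arrival time, in minutes.
        t = t + timedelta(minutes=rng.expovariate(ARRIVAL_RATE_PER_MIN))
        # Normally distributed service time, clamped to a minimum
        # of 10 seconds to avoid impossible (negative) values.
        service = max(10.0, rng.gauss(SERVICE_MEAN_S, SERVICE_STDDEV_S))
        rows.append((t, round(service, 1)))
    return rows

for arrival, service in simulate_queue(5, datetime(2012, 4, 2, 9, 0), seed=42):
    print(arrival.strftime("%H:%M:%S"), service)
```

Rows produced this way can then be inserted into the queue table of the model under test instead of being printed.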
Another important aspect to be considered before generating test data is the configuration of environments. It is very common for companies to separate environments when it comes to testing and acceptance. In general, there are development, testing/certification, and production environments. The production environment contains the official data used daily. The development and testing/acceptance environments contain only a fraction of the official data, or even a completely different data set. The recommendation is that the synchronization of data between environments for development and testing/acceptance activities be made with data as close to reality as possible, according to what is being tested. The details of when and how to perform synchronization and data migration between these environments are outside the scope of this article.
The representativeness of the generated data is one of the implications that must be taken into consideration, but there are others, such as the fact that artificial data inflates the database size unnaturally. While this swelling may be useful for testing scalability, performance, fault tolerance, and high availability solutions, it can also create unexpected situations for which there is no contingency, such as the need for additional devices and resources (extra hard drives, increased bandwidth, server memory expansion, and so on). There are also implications regarding confidentiality and security when the generated data is too faithful to the official figures. For example, if the data generation process creates monthly salary values for managers between $5,000.00 and $6,000.00 based on the real average salary of managers, someone looking at the data may take it as true and disseminate sensitive information unscrupulously, creating a very delicate situation.
From a practical standpoint, there are technical aspects that should be considered before generating data. Perhaps the most important is the availability of tools and resources for data generation. The four major databases currently used in commercial applications and on the Web (SQL Server, Oracle, MySQL, and PostgreSQL) have tools to import/export data, features for interoperability with other systems, drivers that connect to different data sources, and resources for transformation, joining, splitting, and all kinds of data manipulation via graphical interfaces, text utilities, or SQL commands. However, these four databases do not have adequate support for data generation, on the assumption that this type of task can be accomplished with scripts written in the SQL dialect they provide. Because of this gap, several companies have developed specialized tools to generate simulated data. The next section of this article presents and details three of these tools.
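Before reaching for a specialized tool, the script-based approach the vendors assume can look like the minimal sketch below: it emits INSERT statements with plausible (rather than absurd) values. The table name, column names, sample names, and ranges are all hypothetical illustration values.

```python
import random

# Made-up sample pools; a real script would use larger, locale-
# appropriate lists so the generated rows look realistic.
FIRST_NAMES = ["Ana", "Bruno", "Carla", "Diego", "Elena"]
LAST_NAMES = ["Silva", "Souza", "Pereira", "Costa", "Almeida"]
STREETS = ["Rua das Flores", "Av. Brasil", "Rua XV de Novembro"]

def fake_customer(rng):
    """Build one plausible customer row: a realistic name, a valid
    street-style address, and a value inside a predetermined range."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    address = f"{rng.choice(STREETS)}, {rng.randint(1, 999)}"
    credit_limit = round(rng.uniform(500.0, 5000.0), 2)
    return name, address, credit_limit

def insert_statements(n, seed=None):
    """Emit INSERT statements for a hypothetical Customer table."""
    rng = random.Random(seed)
    stmts = []
    for _ in range(n):
        name, address, limit = fake_customer(rng)
        stmts.append(
            "INSERT INTO Customer (name, address, credit_limit) "
            f"VALUES ('{name}', '{address}', {limit});"
        )
    return stmts

for stmt in insert_statements(3, seed=1):
    print(stmt)
```

The point of the sketch is the discipline, not the code: every generated value stays inside the format and range the column actually expects.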
Another important technical aspect to consider is the database's internal transaction log, since generating simulated data leads to the insertion or modification of rows in tables. This data manipulation consists essentially of INSERT or UPDATE statements that are recorded in the database's transaction log files before actually being written to its data files. The process is specific to each relational database, but it generally follows the flow MEMORY → LOG FILE → DATA FILE. Because of this sequence, we need to be aware of resource consumption during bulk imports and the manipulation of large amounts of data. For example, SQL Server has a setting called the Recovery Model (see DevMan Note 3). This option can be changed when performing bulk imports so that the SQL statements generated by the import are only partially logged, making the import faster and generating less overhead. Other databases have similar options and recommendations to make imports faster without consuming too many resources.
DevMan Note 3. Recovery Model
The recovery model is a database option in SQL Server that determines how the transaction log behaves when data changes. The three possible values for the recovery model of a SQL Server database are Simple, Bulk Logged, and Full.
In the Simple recovery model, the transaction log automatically discards completed transactions to keep the log from growing too fast. This model does not allow transaction log backups, so it is recommended only for small or test databases.
When using the Bulk Logged recovery model, bulk data imports do not take up much space in the transaction log, so the log will not fill up quickly when a considerable amount of data is being manipulated. This recovery model is used when there is a need to modify a large amount of information at once. In this model, the DBA is responsible for cleaning up the transaction log.
In the Full recovery model, all information is stored in the transaction log, and the DBA is responsible for periodically cleaning the transaction log manually. This model should be used in production databases when the recovery strategy requires different types of backups.
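The cost of logging every statement individually can be illustrated without SQL Server at all. The sketch below uses SQLite (not SQL Server, so the mechanism differs from the Recovery Model) only to show the general principle behind bulk loading: committing once per row forces the database to harden the log for every statement, while wrapping the whole load in a single transaction batches that work.

```python
import sqlite3

def bulk_insert(rows, per_row_commit):
    """Insert rows into a throwaway table, either committing after
    every row or inside one enclosing transaction."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER, val TEXT)")
    if per_row_commit:
        for r in rows:
            conn.execute("INSERT INTO t VALUES (?, ?)", r)
            conn.commit()              # one log flush per row
    else:
        with conn:                     # single transaction for all rows
            conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return count

rows = [(i, f"row {i}") for i in range(10_000)]
print(bulk_insert(rows, per_row_commit=False))
```

On a disk-based database the single-transaction variant is dramatically faster, which is the same motivation behind switching SQL Server to Bulk Logged during a large import.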
Perhaps the main practical aspect the professional faces when planning to generate simulated data is the data model itself. A data model is created through the understanding and specification of the storage requirements and the implementation of the model in a database. It is common to find relatively large data models with dozens of tables, each with several columns, different relationships, and all kinds of data types. These models present a large obstacle to data generation, since it is necessary to take into account the insertion order of the tables, the data types, non-conventional value ranges, attributes, flags, constraints, declarative referential integrity, and other specifics of the model. When the data model is very complex, one should consider generating the data through the application or system rather than interacting directly with the database. Nowadays there are testing applications that can drive existing systems and act as a real person would to simulate manual data entry. The open source test suite Selenium (http://seleniumhq.org/) is one example of this type of application.
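The insertion-order and referential-integrity concerns above can be made concrete with a minimal sketch, again using SQLite for illustration. The table and column names (Department, Employee) are hypothetical: parents must be inserted before the children that reference them, and the foreign key constraint rejects any child pointing at a nonexistent parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce FK constraints
conn.executescript("""
    CREATE TABLE Department (
        dept_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    );
    CREATE TABLE Employee (
        emp_id  INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        dept_id INTEGER NOT NULL REFERENCES Department(dept_id)
    );
""")

# Parents first...
conn.executemany("INSERT INTO Department VALUES (?, ?)",
                 [(1, "Sales"), (2, "IT")])
# ...then the children that reference them.
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?)",
                 [(10, "Ana", 1), (11, "Bruno", 2)])
conn.commit()

# A child row with a missing parent violates referential integrity.
try:
    conn.execute("INSERT INTO Employee VALUES (12, 'Carla', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

A data generation script for a real model must topologically order the tables by their foreign keys before inserting, exactly as the two executemany calls do here in miniature.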
Finally, the DBA should not forget to plan some way to remove the simulated data that was entered into the database for testing. This requires a purge that separates what was simulated from what is real data. Again, how to perform this purge operation is outside the scope of this article.
This article presented a discussion of how to generate simulated data for a database. Generating simulated data is important in various situations, such as testing, acceptance, and transfers of the database without disclosing sensitive data, among others.
This first part covered the details involved in planning data generation and the context in which this type of data must be handled. The article also presented some issues that must be addressed when working with simulated data.
In part two, we will talk about alternatives for generating simulated data.