Every professional that work with databases such as a programmer, a DBA, a system administrator or an analyst should have the ability to solve unexpected problems that appears during a normal day at work. This ability, also known as troubleshooting, is very important to maintain the computing system and applications up and running in order to support the enterprise business.
Based in this scenario, this article presents a discussion regarding the attitudes that can be taken when an unexpected problem arises in the database during normal daily operations. This discussion is presented in the format of questions and possible alternatives that should be chosen in order to solve the proposed problematic situation. The alternatives available focus on operational aspects, behavioral actions and technical procedures needed to solve the issue.
To know the best option among many in problematic situation is one of most important and appreciated skills that IT professionals should have specially when working with databases and mission critical applications. The benefits of having these skills include the trustworthy of the professional, the gain of advanced technical knowledge, the feasibility to justify the purchase of new resources and also autonomy to work alone without constant supervision.
In this first part we are going to present the context on which we will present the problematic situations and a few questions about it.
Before the presentation of the questions that explore daily problems it is necessary to define the context in which we are discussing the problems. Let's take a medium size corporation that contains 100 to 150 employees. This corporation has an IT department which staff is composed by developers (programmers, analysis and web designers), system administrators, support professionals, and only one DBA (Database Administrator). This company uses several system such as ERPs, access control system, intranet, and a corporate web portal that deals with B2C (Business to Consumer) and B2B (Business to Business) transactions. From an organizational standpoint there is a manager that coordinates the internal development and is responsible for third-party software packages that must be customized and deployed. There is also another manager that is accounted for the management of the other technical staff: support, infrastructure, external suppliers and other related areas. On top of these two managers there is an operational director that report to the higher levels of the corporate pyramid hierarchy.
Focusing in the technology area, the company employs several different databases with varied sizes and characteristics stored in SQL Server, Oracle, MySQL, and PostgreSQL as well as development technologies such as Java, .NET, PHP, Visual Basic 6.0, Python, and others. There are local servers in the company own datacenter as well as remote servers located in a remote thirty-party datacenter. The software platform is a mix of open source and proprietary system from different software vendors and hardware suppliers.
The company's DBA, I mean you, has an intermediary level (between senior and junior) and is used to the environment, the system, and the applications that access the databases. Also, the DBA is well known and is capable to perform the database maintenance task that must be done in the databases on a daily basis, such as a backups and restores, importation and exportation of data, database objects' creation, set permission to users and objects, and others. However, due to the disorganized way that some applications were developed, acquired, and implanted and also because of the diversity of the technologies mixed together it is often common to deal with emergencies, i.e. unexpected situations that must be solved as soon as possible by the DBA.
Besides that, it is also important to evidence that the DBA made a deal to the company in which he/she must provide support out of normal business hours as soon as a problem in the database happens. This kind of work has a differentiated payment deal that was agreed by both the DBA and the company.
Situation 1) The DBA arrives at work, turn on its computer and find out that three databases in two different SQL Servers are marked as suspect thus are offline at the exact time that the DBA note that all employees are arriving at work. Since the databases are offline the application that is responsible to grant access to the company (open doors, register arrive times, etc) does not work. What should the DBA do in this scenario?
a) Does not say a word about to anyone, stays cool and wait for someone to open an urgent support request to a developer. Only then takes some action to look at the suspected databases.
b) Get some backups ready and do not take any actions before receive an urgent support call from some user or programmer. Upon receiving the call spend more time trying to understand why some databases became suspect than the time required to get the databases on-line again.
c) Try to solve the problem and get the databases online as soon as possible without anyone realizing what happened.
d) Call the manager and the developers responsible for the system about this urgent problem in the database. Provide a time estimation to solve the issue and digs into the problem in order to solve it as quick as possible.
Situation 2) The clock ticks towards the end of the work shift for all the employees. Tired from the day's work, all the IT staff waits eagerly for the shift's end in order to enjoy the next hours of daily rest. Everything is fine until a heavy rain starts to follow and the power is down in the server's room. The failsafe mechanism (no-break) for the servers does not take place and the servers are shut down. After the powers restoration the MySQL error log show that some data files are corrupted. The company's blog and part of the enterprise portal is offline. Everyone looks at the DBA. What should he/she would do?
a) Complain with the developers about the low quality of the hardware, talk about the server's power source with the infrastructure guys and go look for the last database backup. Do not mention about the possible loss of data due the backup strategy.
b) Get very upset and take a position where it is not possible to work in an environment where and equipment that suppose to work in the critical moment just fail. Deny to solve the problem in the same day and go back to home arguing that he/she will only work in the database after the power source failback system will be solved permanently, which means in the next day.
c) Talks with the infrastructure staff to insert some user friendly error message in the blog and in the corporate blog. Notifies the manager about the situation and explain about the possible loss of data due the backup strategy. Ask for at least one of the developers to work together while he/she is solving the issue and work as fast as a DBA can to make the database online again with all its data.
d) Do not say a word to anyone and go back to home. Wait some developer to be notified and only then try to access remotely the server and start the restore process of the affected database without working too hard in order to gain some extra hours in his/her hours account.
Situation 3) During a discussion with a programmer regarding where some business logic should be implemented the DBA suggested the creation of a stored procedure according with what needed to be accomplished. The programmer insisted to use the programming language of the application to implement the business logic. The manager decided to implement the feature in the application. For a while the solution worked great, however while the programmer were on vacation the DBA conducted a database migration at night and note that some data were written wrong by the application. These wrong data will be used to make an important report in the next morning. How you, as the DBA would handle this situation at dusk?
a) Quickly create a new database with the right data, change the database connection in the application and go home. Make a paper note to change the database connection after the creation of the report in order to redirect the application to the database with the wrong data.
b) Don't say a word and allow the report to be generated with the wrong data in the next day since the programmer is the only one to blame because he/she does not followed your recommendation to use a stored procedure for the implementation of the business rule.
c) Try to contact the developer at nigh (or who is responsible for the application during the programmer's vacation) and indicate that here is a problem in the source code and with the data stored in the database. To sum up, pass the problem along and does not take responsibility for the mistake made.
d) Create a copy of the database correcting the data as far as you can get. Do not change the connection of the application with the database. Go home and in the next day talk with the manager and with the responsible programmer about the workaround and offer as an alternative the copied database you created with the right data.
Situation 4) After giving an internal training to the programmers and analysts of the IT department that covered the use of transactions, the DBA allowed access to potential dangerous commands in the development and testing environment, but not on the production environment. Two hours after the end of the training one analyst calls the DBA and says that he/she does not understand why some data is missing after the statements presented in Listing 1 were executed in a SQL Server database.
Listing 1. Script with the instructions executed by the analyst.
SET IMPLICIT_TRANSACTIONS OFF
DELETE FROM PRODUTOS WHERE REGIAO = ‘SP’
IF (@@ERROR <> 0)
Other programmers start to complain to the DBA about problems in the test environment arguing that they cannot work anymore. Which of the following attitudes will you take in this scenario?
a) Think to yourself that teaching the developers how to work with transactions in the database was a mistake. Reprehend the analyst, solve the problem in the test environment by restoring a backup and revoke all the grants to potentially dangerous commands.
b) Keep calm, try to restore only the data erased in the table and say to the analyst take more care when working with some commands.
c) Say the following to the analyst: "Strike one. You got more two chances before I take all you grant away". Make the analyst apologize to all the staff in the department and say that he/she must copy from the production environment the data that he erased in order to fix the test environment.
d) Make some excuse to all the IT staff, solve the issue by copying the missing data from the production environment and say to the analyst that he/she is in debt with the DBA.
Situation 5) The periodic SQL Server job that creates the daily backups was named as "666" by the not-so-funny previous DBA. For one of the databases the DBA used a strategy with full and transactions logs backups. One example of the strategy used by this job is showed in Figure 1.
Figure 1. Backup strategy that make a full database backup each hour with transaction log backups between them.
However, on a sunny saturday afternoon the company was working normally while the DBA was having some time off on a park trip. There was a problem at 12:50 that corrupted the entire database and the DBA became aware of this problem at 14:00. What should the DBA must do to correct the problem as soon as possible?
a) Pretend that he/she did not received the message regarding the problem and keep walking in the part leaving the problem to be solved in the next Monday.
b) The DBA should restore only the 12:45 backup and lost 1 hour and 15 minutes of transactions in the database.
c) The DBA should restore only the 13:00 backup, go back searching from the point where the problem appeared and stop the recovering process immediately before that moment.
d) The DBA should restore the following backups in this sequence: 12:00, 12:15 and 12:45 and lost 5 minutes of transactions in the database plus what is in the database until the start of the recovery process.
This first part of the article presented some unexpected situations that may happen during the daily work of a DBA. These situations were explained with some possible attitudes that can be taken to correct them. Next, each situation was commented in order to guide the DBA behavior when an emergency happen.
It is important to remember that the DBA must resist to its primary instincts that lead to the fast solving of the problem. Each scenario presents a different situation that must be analyzed according with the rule and procedures of the company. Also, it is recommended not to panic, always focus on the problem and the circumstances and be responsible by the problem in the first place.
Each unexpected situation can improve the DBA career since know how to dealt with unexpected problems is a major skill for every professional that work with databases. Besides, the expertise obtained while solving daily problems is a characteristic appreciated by the companies and can generate good benefits such as the trust of superiors, the highlight of the professional among others, and the possibility to be a strong candidate for the next promotion regarding a leadership position in the company.