Ontrack Data Recovery

Data Recovery for NetApp Systems

 

The fear of losing electronic data, or of failing to keep up with the data transfer bandwidth that users and applications demand, is a common theme in storage manufacturers' messaging. The return on investment for today's storage equipment is undoubtedly high, as new technology creates new efficiencies for users.

Despite impressive and complex methods of storing the data files that users create, data loss failures happen every day. Some of these failures happen on a small scale, with only one or a handful of users being affected. Other data disasters impact departments or entire organisations. During the chaos the IT team is trying to control, at some point someone asks, "Why did this fail? I thought this was protected…"

Introduction of an Alternative Storage Architecture

NetApp is one of a handful of companies that provide an alternative to the common Direct Attached Storage (DAS) model, that is, a server box with local storage made available to users via network protocols. In 1994, NetApp (then known as Network Appliance) presented to the USENIX community its design for a consolidated computer storage system that blended the operating system, networking, and storage mechanisms into one unit, a "network appliance." This brought a new concept to the IT industry. Instead of treating the operating system, hardware, and storage file system as separate components, each requiring its own management, the appliance concept promised data storage with the ease, simplicity, and reliability of a stove, refrigerator, or coffee maker: just plug it in and it works. The appliance would be easy to manage and operational costs would be controlled, which satisfied both IT and corporate management goals.

The system relies on a number of previously established computer data storage concepts:

  • NFS Protocols – Developed by Sun Microsystems, Inc.
  • RAID Storage – Developed by D. Patterson, G. Gibson, and R. Katz; University of California, Berkeley
  • UNIX operating system – Developed by AT&T and many others
  • Berkeley Fast File System – Developed by the Department of Electrical Engineering and Computer Science, University of California, Berkeley
  • Episode file system – Developed by TransArc Corporation

The NetApp filer appliance relies on a proprietary operating system, Data ONTAP, and a new file system, WAFL (Write Anywhere File Layout); both are of primary importance to the device's success and reliability. In the WAFL file system, meta-data (key file system data that describes the file and the logical location of its stream of data) is stored within the file's data stream. The file system uses the UNIX-style inode as a descriptor of this meta-data, yet instead of keeping the inode in a special area as most UNIX-based file systems do, the inode and the data stream compose one object within the volume. This fits nicely with WAFL's file operation methodology of being able to 'write anywhere.' The file system purposely fragments files to maintain high performance. Additionally, the operating system schedules these file system writes to conform to the RAID configuration of the storage pool. In this manner, reads and writes can be optimised and performance penalties can be minimised.

NetApp's file system also supports a copy-on-write (COW) technique that captures the state of the data at a specific point in time; NetApp calls these captures "snapshots." Snapshots are scheduled within the Data ONTAP system so that only the low-level blocks that change within a file's data stream are recorded. Imagine a file that consists of 1,000 blocks, with each block storing 4,000 bytes (a 4 MB file). This file is occasionally opened and updated throughout the work day. With the snapshot technology, only the areas of that file that changed would be duplicated at the scheduled time. So if a user accidentally changed something within the file or deleted some of the data, the System Administrator could go back to the most recent snapshot taken before the data loss and restore that file. Since the snapshots are part of the WAFL file system and not separate files unto themselves, storage efficiency and optimised data management are big wins.
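The block-level behaviour described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not WAFL or Data ONTAP code: the class ToyVolume and all of its methods are hypothetical names invented for this illustration, and the block size and count are simply the figures from the example. A snapshot here is just a frozen copy of the file's block map, and every write goes to a fresh block so that older snapshots keep pointing at the old data.

```python
BLOCK_SIZE = 4_000   # bytes per block, as in the example above
NUM_BLOCKS = 1_000   # 1,000 blocks x 4,000 bytes = a 4 MB file


class ToyVolume:
    """Hypothetical block store with point-in-time snapshots (not WAFL itself)."""

    def __init__(self):
        self.blocks = {}      # block id -> bytes
        self.next_id = 0
        self.live = []        # the live file: an ordered list of block ids
        self.snapshots = {}   # snapshot name -> frozen copy of the block-id list

    def _store(self, data):
        block_id = self.next_id
        self.blocks[block_id] = data
        self.next_id += 1
        return block_id

    def create_file(self, num_blocks):
        self.live = [self._store(bytes(BLOCK_SIZE)) for _ in range(num_blocks)]

    def snapshot(self, name):
        # Only the block map is copied; no file data is duplicated.
        self.snapshots[name] = list(self.live)

    def write_block(self, index, data):
        # New data always goes to a fresh block, so any snapshot that still
        # references the old block keeps seeing the old contents.
        self.live[index] = self._store(data)

    def restore(self, name):
        self.live = list(self.snapshots[name])

    def read_block(self, index):
        return self.blocks[self.live[index]]

    def blocks_unique_to_live(self, name):
        # Blocks the live file uses that the named snapshot does not.
        return len(set(self.live) - set(self.snapshots[name]))


vol = ToyVolume()
vol.create_file(NUM_BLOCKS)
vol.snapshot("09:00")
for i in (3, 500, 999):                     # the user edits three blocks
    vol.write_block(i, b"x" * BLOCK_SIZE)

changed = vol.blocks_unique_to_live("09:00")
print(changed, "of", NUM_BLOCKS, "blocks changed since the snapshot")
vol.restore("09:00")                        # roll the file back
```

In this model the snapshot costs three extra blocks rather than a second 4 MB copy, and restoring is only a matter of switching the live block map back to the saved one.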

Additionally, newer NetApp devices employ granular logging of file system changes, so consistency is maintained by means of a transactional system built into the file system. These transactions are stored in non-volatile or flash memory. In case of an unexpected system shut-down, the file system is able to determine exactly which file system operations were not completed, and then make the necessary updates to the WAFL meta-data. Because the log is kept in a separate storage area, reliability is increased and the need for extensive file system checks is reduced, which gets the system up and accessible as soon as possible.
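The idea of replaying logged operations after an unexpected shut-down can be illustrated with a minimal sketch. It assumes a simple intent log that is written before the meta-data is updated; ToyJournaledStore and its methods are hypothetical names, and the real Data ONTAP logging format and recovery procedure are not described in this article.

```python
class ToyJournaledStore:
    """Hypothetical model: a meta-data table plus a separate intent log."""

    def __init__(self):
        self.metadata = {}   # stands in for on-disk file system meta-data
        self.journal = []    # stands in for the non-volatile log area

    def log(self, key, value):
        # Record the intended change in the log before touching meta-data.
        self.journal.append((key, value))

    def apply_logged(self, count=None):
        # Apply up to `count` logged operations, then drop them from the log.
        n = len(self.journal) if count is None else count
        for key, value in self.journal[:n]:
            self.metadata[key] = value
        del self.journal[:n]

    def recover(self):
        # After a crash, only operations still in the log need replaying.
        replayed = len(self.journal)
        self.apply_logged()
        return replayed


store = ToyJournaledStore()
store.log("a.txt", 1)
store.log("b.txt", 2)
store.log("c.txt", 3)
store.apply_logged(1)        # one operation completes, then power is lost
replayed = store.recover()   # on restart, the remaining two are replayed
print(replayed, store.metadata)
```

Because the log identifies exactly which operations were outstanding, recovery touches only those entries instead of scanning the whole file system.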

Data Disasters—When the Unexpected Happens

Data loss disasters usually follow unexpected events. Even planned events can have unexpected outcomes due to unforeseen contingencies, human error, or faulty hardware. Disasters can be compounded by 'knock-on' effects or failures. In short, data disasters happen at the wrong time, with the potential for devastating results if not contained quickly.

Despite the best of hardware and software technology, there are a number of abstraction layers that make up the physical and logical storage process, and at any of them a minor failure can cascade into a full-blown data loss 'ground zero.' Data loss happens at one or more of these layers:

  • Physical storage layer: this involves the individual hard disks; failure can happen at either the electronics level or the magnetic medium level.
  • Logical Unit Number (LUN) layer: where the physical devices are grouped into storage units, such as in the case of RAID storage arrays; failure happens when the storage array configuration is lost or the hardware controlling that configuration malfunctions.
  • Logical volume management (LVM) layer: where numerous LUNs are configured into 'storage pools,' or sections of the LUNs are grouped into volumes that are presented to the operating system as the storage available for use; failure happens when this configuration is damaged or when the previous two layers are unavailable, so that the LVM cannot operate correctly.
  • File system layer: where the formalised union of meta-data and file data streams is made. This is the hierarchical representation of data file organisation that users and applications rely on to read from or write to the storage system; failure happens when the meta-data is corrupted or the data streams are no longer accessible.
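The dependency between the layers listed above can be sketched as a simple ordered check. This is an illustrative model only: the layer names are taken from the list, while the functions first_failed_layer and unreachable_layers and the health dictionary are hypothetical, and a real diagnosis involves far more than a per-layer flag.

```python
# Layers ordered from the bottom of the stack to the top.
LAYERS = [
    "physical storage",
    "logical unit number",
    "logical volume management",
    "file system",
]


def first_failed_layer(health):
    """Return the lowest layer reported as failed, or None if all are healthy.

    `health` maps a layer name to True (working) or False (failed);
    a layer missing from the mapping is treated as failed.
    """
    for layer in LAYERS:
        if not health.get(layer, False):
            return layer
    return None


def unreachable_layers(health):
    """Every layer at or above the first failure is effectively unavailable."""
    failed = first_failed_layer(health)
    if failed is None:
        return []
    return LAYERS[LAYERS.index(failed):]


health = {
    "physical storage": True,
    "logical unit number": False,   # e.g. the RAID configuration was lost
    "logical volume management": True,
    "file system": True,
}
print(first_failed_layer(health))
print(unreachable_layers(health))
```

In this example the volume manager and file system are themselves intact, yet they cannot be reached until the layer beneath them is repaired, which is why recovery work proceeds from the bottom of the stack upwards.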

All of these layers are present within the modern storage system, and the level of complexity grows with NetApp devices due to their integrated design. The challenge is to gain access to the final layer of data storage after one or more of the preceding layers have failed. Additionally, if all of the storage layers are operational but human error occurs, or system redundancy (such as the snapshot technology) is not configured correctly and a disaster happens, then it may be necessary to contact the Ontrack Data Recovery division of Kroll Ontrack to recover the critical data files.

Over the past decade, most NetApp recoveries performed by Ontrack Data Recovery engineers have involved individual storage device failures: one or two hard disk drives in the RAID layer fail, thereby disabling the built-in redundancy. Recovery experts in the Ontrack Data Recovery clean room labs have a high success rate in getting the failed hard disk drive operational and extracting its contents. The extracted data is then written to a drive similar to those used in NetApp devices, and the replacement drives are returned to the system. Typically, enough of the extracted data is present for the other data storage layers to become operational again, with only a small impact on file integrity.
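Why a drive failure can disable redundancy is easiest to see with generic single-parity XOR, the scheme behind the classic parity RAID levels. The sketch below is an assumption-laden illustration, not NetApp's RAID implementation: the function xor_blocks and the three eight-byte "disks" are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data disks holding one 8-byte stripe unit each.
data_disks = [b"NETAPP__", b"RECOVERY", b"EXAMPLE_"]
parity = xor_blocks(data_disks)   # what a parity disk would store

# Disk 1 fails: XOR of the surviving data disks and the parity rebuilds it.
survivors = [data_disks[0], data_disks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_disks[1])
```

With a single parity block per stripe, one missing disk can be recomputed from the rest, but a second simultaneous failure leaves two unknowns and only one equation. At that point the array cannot rebuild itself, and the failed drives have to be made readable again before the upper layers can be brought back.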

In other cases involving hard disk drive firmware upgrades, engineers worked with the system area of the hard disk drives to restore critical operational information and get the storage device working again. The entire unit would then work long enough for the target data to be copied off, again with minimal impact on file integrity.

Recovery after a Failure – A Case Study

Over the past six months, a new opportunity presented itself. As companies experience the effects of the worldwide economic downturn of 2008/2009, many IT staff responsibilities are being consolidated among a handful of employees. This can increase the likelihood of human error as the root cause of a disaster. In one case, the sheer volume of snapshots forced IT staff to manage with only a small window of point-in-time backups. This created a vacuum when the delta between live data and backup data exceeded the storage capacity of the appliance. To maintain availability, key snapshot data was deleted from the system. Then a data disaster struck the storage system.

During a file's life cycle (see figure 1), there is a constant need for recoverability. When that recoverability was removed in the case above and a disaster occurred that fell outside of the snapshot recovery point, the only resolution was to contact an expert in data recovery techniques. In terms of the storage layers described above, this particular data loss happened at the uppermost level: the file system layer.

Ontrack Data Recovery engineers worked to develop solutions to repair the file system's meta-data so that the target data would again be accessible. This process requires an in-depth understanding of file system allocation methods. The end goal was to make this client's data accessible again, and the Ontrack Data Recovery team succeeded in developing a solution that met the time frame and data integrity requirements of the data owners.

With this innovation, Ontrack Data Recovery can now offer solutions for NetApp file systems where the WAFL snapshots have been purged from the storage pool or entire volumes have been destroyed. In addition, specific recovery techniques can verify the file system and return it to a consistent state. This is very important when evaluating the success of the recovery and the extent of potential data corruption.

This capability, along with electronic and magnetic storage expertise, RAID rebuilding expertise, and an entire staff of dedicated, experienced engineers who work to get the best recovery possible, is what sets Kroll Ontrack apart from other recovery companies.