Data Warehousing: Strategies, Technologies, and Techniques by Rob Mattison McGraw-Hill 485 pages ISBN 0-07-041034-8
The title of this book might lead you to believe that it will walk you through the creation of a data warehouse, or possibly serve as the utilitarian reference book that sits close at hand as you build or maintain a data warehouse. Such is not the case. While the book does not contain the implementation details a reference book would normally contain, it is filled with valuable overview information and broad-brush approaches to data warehouse problems.
A data warehouse, for those unfamiliar with the term, is defined by the author as a database that:
The majority of the applications and examples in the book show data warehouses to be systems designed for the collection of large amounts of legacy data from disparate systems, often as part of an attempt to integrate data company-wide. A good amount of time is spent just looking at when data warehousing is actually necessary. To justify the time and effort of development, substantial new functionality must be added. "Nobody needs a system that does what an earlier generation of system did, using the latest and greatest Windows-based, mouse-driven screen," Mattison says. "It provides no new value to the company and, therefore, does not make economic sense."
In "Data Warehousing," Mattison attempts to present the concepts of data warehousing and data mining in as much detail as possible, keeping the terminology consistent and the content manageable. For a manager or non-computer professional, the book meets that goal. "Hands-on" computer people will be frustrated by the lack of low-level technical information.
As an example, a manager would be pleased to discover the discussion in Chapter 12 of the skills required of the personnel working on various parts of a data warehouse development project, and would appreciate the discussion throughout the book of political issues. A systems analyst, on the other hand, would be disappointed at the lack of discussion of the relative merits of Unix vs. Windows NT and TCP/IP vs. NetWare. There is no discussion of physical storage media (e.g., CD-R, RAID, WORM) at all.
In fact, where vague attempts are made to discuss such issues, they are often misleading. An example occurs in Chapter 8, "The Physical Infrastructure." Mattison points out the scarcity of data mining tools for Unix, and then in the same sentence states that "the vast majority of the products run on Windows, OS/2, Macintosh, and X-Term environments." Last time I checked, X-Term applications do, indeed, run on Unix.
To be fair, I must point out that the last third of the book, dedicated to the subject of data mining (recovery of information from a data warehouse), delves into a great deal of technical detail, right down to the electro-chemical reactions of neurons in Chapter 16, "Neural Networks and Business Data Systems."
Mattison points out in his introduction that the structure of this final section differs radically from the rest of the book in that it is made up of chapters by guest authors discussing specific products or projects, all on the data mining end of the equation rather than the data warehouse itself.
In addition to the neural networks mentioned above, the data mining section also contains chapters on statistical analysis, multidimensional analysis, visualization, and data warehousing on an enterprise intranet. Personally, I found the chapter on enterprise intranets pertinent, and all too short.
Chapter 20, "Prediction from Large Data Warehouses," contains a highly enlightening discussion of what today's tools are capable of doing with the data. It examines some sample databases, simple enough to grasp immediately, and shows how automated analysis programs go about generating easy-to-understand rules rather than abstruse statistical analyses. In the sample database of irises, for example, the program generated a rule stating that, with the given data, if an iris has petals within a specific range of lengths and widths, then there is a 98% certainty it is an Iris versicolor (as opposed to some other species). Another example identifies a specific machine operator as generating the majority of a specific failure mode.
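For readers who want a feel for what such a rule amounts to, here is a minimal sketch in Python of how the "certainty" of one candidate rule could be measured. It is not the method used by the tools in Chapter 20; the Flower record, the rule_confidence function, the measurements, and the ranges are all made up for illustration, and a real tool would search for good ranges rather than being handed them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flower:
    petal_length: float  # centimeters
    petal_width: float   # centimeters
    species: str


# Made-up measurements for illustration only; not data from the book.
SAMPLE = [
    Flower(1.4, 0.2, "setosa"),
    Flower(1.5, 0.3, "setosa"),
    Flower(4.2, 1.3, "versicolor"),
    Flower(4.5, 1.5, "versicolor"),
    Flower(4.7, 1.4, "versicolor"),
    Flower(4.9, 1.7, "virginica"),
    Flower(5.8, 2.2, "virginica"),
    Flower(6.0, 2.5, "virginica"),
]


def rule_confidence(rows, length_range, width_range, species):
    """Score the rule 'petal length and width in range => species'.

    Returns (rows covered by the rule, fraction of those that are `species`).
    """
    lo_len, hi_len = length_range
    lo_wid, hi_wid = width_range
    covered = [
        r for r in rows
        if lo_len <= r.petal_length <= hi_len
        and lo_wid <= r.petal_width <= hi_wid
    ]
    if not covered:
        return 0, 0.0  # the rule matches nothing, so it has no support
    hits = sum(1 for r in covered if r.species == species)
    return len(covered), hits / len(covered)


# Hypothetical ranges, chosen by hand for this toy sample.
covered, confidence = rule_confidence(
    SAMPLE, (4.0, 5.0), (1.0, 1.8), "versicolor"
)
print(f"rule covers {covered} rows, confidence {confidence:.0%}")
```

On this toy sample the rule covers four flowers, three of them versicolor, so it is right 75% of the time. The 98% figure quoted in the book would be the same kind of ratio computed over its own sample database.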
This book has no glossary, a serious shortcoming with a subject this complex, especially when one of the author's stated goals is to develop the reader's vocabulary. A basic knowledge of the subject matter is assumed. Terms like SMP, MPP, IT, and Oracle are used without definition. Some of the terms used in data warehousing are quite difficult to define (Mattison spends the entire first chapter defining what a data warehouse is), but that's no excuse for omitting such a basic tool as a glossary.
The index is a paltry seven pages in length, lacking entries for many topics crucial to the subject, including archiving, backup, disk, operating systems, optical storage, Oracle, RAID, redundancy, tape, Windows, and even Unix. Most of these topics are addressed within the book, which is fine for someone who will read it and set it aside. In a reference book, however, topics missing from the index might as well not exist.
Other missing elements include a bibliography and suggested reading list. Mattison covers an immense amount of material in 485 pages, and readers interested in more detail on a specific topic would benefit from some tips about where to find the information.
The book is sprinkled liberally with drawings, charts, and graphs. Unfortunately, many of the illustrations seem to be spurious afterthoughts, even containing basic errors like the one in Chapter 2 that connects purchasing to shipping rather than receiving. After struggling to interpret several of the illustrations, I came to the realization that they were illustrating the concept of boxes connected by lines, and that the specific labels on the boxes and lines were not intended to carry any particular meaning.
"Data Warehousing" clearly demonstrates Mattison's prodigious knowledge of his subject matter. A great deal of work and thought has obviously gone into developing the content and tools, among them a wonderful set of checklists in Chapter 11 that guide you through the planning and estimate stage. His writing style is clear and as light as one can expect with a topic this heavy.
If, however, you are looking for a technical treatise that will point you toward specific hardware/software platforms (or even explain their relative merits), "Data Warehousing" will leave you disappointed. The same will be true if you attempt to use it in random-access mode, looking up specific topics when you need them. There is a lot of information in the book that's tough to find.
If you are not familiar with data warehousing, and would like a detailed overview of the concepts, issues, problems, and design strategies, this is your book. It will be a valuable aid for evaluating the people and strategies you will need to launch your own data warehousing project.