In 1994, I attended COMPASS and heard a presentation of "Using Formal Methods To Derive Test Frames In Category-Partition Testing" by Paul Ammann and Jeff Offutt. It was an interesting experience to hear a formal presentation of the method I had been using for about six years prior to that. When I spoke to the presenter (I apologize for forgetting which of the authors it was), he said that the Category-Partition method was essentially the result of interviewing practitioners and writing up their methods formally. At that time, I intended to write a response paper about my experience with the method and why it had worked as well as it had. Four years later, here it is.
Between 1986 and 1993, I worked for a mainframe manufacturer in a group whose name on the org chart was something like Architectural Compatibility CPU Diagnostics Development. Informally, we were the BUP team. The tests that we developed were called BUPs, for reasons that are lost in the mists of time.
A BUP, by definition, is a small test program written in assembly language and intended for use in hardware simulation. Each test needed to be small not only in its memory footprint, but also in the number of CPU cycles it used. Because our target run-time environment was the simulator, the entire test had to fit in the cache and run in a reasonable amount of time at speeds in the neighborhood of 1000 cycles/second.
The design specifications that our team received came from IBM. These specifications described the CPU architecture in precise language. The IBM mainframe architecture includes not only the instruction set, but also several environments (virtual memory modes, privilege modes, etc.) that interact with the instruction set.
It is a critical point that the nature of the tests was well defined and limited by their purpose. The number of testing actions in each test was limited by the target run-time environment. The specifications were not in flux because the requirements were set by our competitor's product, which was either already shipping or nearly so. Our job was to ensure that our company's product matched IBM's exactly at the architectural level.
Even the coding style of the tests was not in question. Our error reporting mechanism was to halt the test as soon as possible after detecting a problem. This was partly to save time on the simulator and partly to preserve the state of the machine. The engineer running the test would then read the assembly code listing to see what the problem was. Usually the engineer running the test was not the author of the test, so we had to ensure high code readability.
With so few degrees of freedom in the test development process, we had an ideal environment for the Category-Partition method.
The Category-Partition method provides a way to quickly translate a design specification into a test specification. In outline, it proceeds in the following steps; the paragraphs that follow describe how the BUPs team applied them.

1. Analyze the specification and divide it into functional units that can be tested separately.
2. For each functional unit, identify the categories: the parameters and environmental conditions that affect its behavior.
3. Partition each category into choices, the distinct values or conditions worth testing.
4. Determine the constraints among choices, ruling out combinations that cannot occur or do not need to be tested.
5. Write a test frame (a detailed test specification) for each remaining combination of choices.
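To make the mechanics concrete, the short Python sketch below enumerates test frames for a hypothetical branch-instruction feature. The categories, choices, and the single constraint are invented for illustration and are not taken from the real BUPs specifications; the point is only to show how a handful of categories turns into an explicit list of frames.

    from itertools import product

    # Hypothetical categories and choices for a branch-instruction feature.
    # Each category is partitioned into the distinct choices worth testing.
    categories = {
        "instruction":  ["branch taken", "branch not taken", "branch and link"],
        "address_mode": ["24-bit addressing", "31-bit addressing"],
        "paging":       ["target page resident", "target causes page fault"],
        "privilege":    ["problem state", "supervisor state"],
    }

    # A constraint removes combinations that cannot occur or need no test.
    # Here we assume (purely for illustration) that a branch that is not
    # taken never references its target page, so it cannot page-fault there.
    def allowed(frame):
        return not (frame["instruction"] == "branch not taken"
                    and frame["paging"] == "target causes page fault")

    # Cross the choices to produce candidate frames, then apply the constraint.
    names = list(categories)
    frames = [dict(zip(names, combo)) for combo in product(*categories.values())]
    frames = [f for f in frames if allowed(f)]

    print(len(frames), "test frames")   # 20 of the 24 raw combinations survive
    for frame in frames[:3]:            # show a few of them
        print(frame)

In BUPs terms, each surviving frame would correspond to one or more test actions, with several related frames typically bundled into a single BUP.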
Using this method, the obvious tests could be enumerated quickly and completely, leaving more time to think about subtler issues. The first benefit was that we achieved fairly uniform coverage across a large problem space. The matrix also clarified in our minds where test code could be reused, because we were explicitly looking for similarities and differences between the functions under test.
By doing the analysis involved in creating the matrix, the boring (i.e., obvious) cases were listed quickly, and the interesting (i.e., subtle) cases became apparent. By designing the coverage quickly, it was possible to hold the design in our heads and see the coverage holes created by the partitioning. No matter how carefully you create your matrix, by placing the problem in an orthogonal view you are forcing the object under test into that view. This may not reflect reality. There are always a few special cases that can only be treated as exceptions. By doing an essentially mechanical process first, we were able to use a second pass to think carefully about the exceptions.
The BUPs process was highly predictable. Like any good process, it survived massive changes in personnel. None of the people who developed this process are still on the team, and few are even still at the company, but last I heard, the process is alive and well. [Note: Amdahl shut down their engineering organization late in 2000. The last BUPs engineers left at that time.]
When a new specification arrived, the two or three most senior team members would each read the document, then meet to create a list of tests by title using steps 1-3 listed above. We had all been using the process for so long that the matrix itself was seldom written. The conversation was typically along the lines of "looks like these three go together, that's one, and these two go together, then we'll need an environment test for all five. OK, that's three BUPs for this bunch." A typical test title would be something like "Branches and page faults."
From the list of titles, we could produce a schedule. This is where the restricted format of the tests comes into play. Because the tests were limited in format, a BUP was both a unit of coverage and a unit of work. From measurement, we had determined the average amount of time to produce one BUP. Using that magic multiplier, we could produce test schedules for a small specification in a matter of days. During the one major architectural change in my tenure on the team, the BUP development was a 3 person-year effort. A total of 14 people worked on the project, with no more than 8 on the team at any time and under a series of different managers, and the project finished within 2 weeks of the original 9-month schedule. What should have been utter chaos functioned perfectly. I would like to believe that it was because we understood our process completely.
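As a rough illustration of the estimating arithmetic, here is a minimal sketch. The test count and team size are hypothetical; only the two-week-per-BUP constant, discussed below, comes from our measurements.

    # Hypothetical schedule estimate for a small specification.
    # Only WEEKS_PER_BUP reflects the measured average discussed in the article;
    # the other numbers are invented for illustration.
    BUP_TITLES = 40        # tests identified by steps 1-3 (hypothetical)
    WEEKS_PER_BUP = 2      # measured average: sign-up to checked-in, debugged code
    ENGINEERS = 5          # engineers available to sign up for tests (hypothetical)

    effort = BUP_TITLES * WEEKS_PER_BUP      # 80 person-weeks of work
    calendar = effort / ENGINEERS            # about 16 calendar weeks
    print(f"{effort} person-weeks, roughly {calendar:.0f} weeks of schedule")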
Once the schedule was set, the list of tests was made available to the team. Each engineer signed up for a test and wrote a detailed specification. This corresponds to steps 4 and 5 above. The specification was then reviewed both for intent and to coordinate coverage with other engineers' work. The engineer would then write and debug the test and present the code for review.
The process defined above creates a uniform view of the problem space across many projects. Because the team worked on essentially the same architecture for many years, we could learn effectively from what had worked in the past. The BUPs were only the first set of architectural tests to run against the design. This allowed us to use the defects found by later test methods to measure how effective BUP-based testing had been. Because hand coding each test action was relatively expensive compared to other test methods, we used defect counts by test type to design coverage splits between test types, as well as to check that coverage existed where we thought it did. This effort could not have been contemplated if the coverage goals had not been well understood to start with.
The restricted environment and the resulting processes meant that, for planning purposes, we could multiply the number of tests by a constant and know the amount of effort. From the time an engineer signed up to write a test to the time the debugged code was checked in was 2 weeks on average. We knew this primarily from measurement. Two weeks per BUP was a magic number, and as such it was sometimes the target of suspicion. How did we know it took two weeks? After all, any engineer could easily complete one BUP in a day. Why did the 2-week number hold true across so many projects and for a team that was constantly changing players?
The limits introduced by the run-time environment created a simplifying assumption about the amount of work per test. As a result, the time required to write each program climbed only slowly with feature complexity: the complexity of the feature under test drove up the number of tests written far more than it drove up the complexity of any single test.
As with any skill, the time to write any one program dropped with the experience level of the engineer. But clearly, tests cannot be written in zero time. A junior team member would be slower than an experienced member in creating the detailed test specification, but in general, hours of coding were constant. Much of the time lost by junior team members was in debugging, and for the most part this could be counteracted by training. Our debug environment usually involved running from the machine console on a known "good" machine. Because of the level of our testing, we had to have the machine to ourselves. Stand-alone mainframe time was at a premium, even at a mainframe manufacturer. But our tests were so small that engineers could often share a one-hour slot. This was a perfect opportunity to train new members. This combination of factors meant that we could hire new college graduates, train them, and break even in about three months.
The interaction of the slowly climbing time-versus-complexity curve and the slowly falling time-versus-experience curve turned out to be roughly constant: for any average team working on an average architectural feature, the time to produce each test could be treated as a constant. There were very few cases in which the number was higher or lower. A notable case where the number was lower was a vector co-processor. Its instruction set, which performed mathematical matrix operations, was much more orthogonal than a typical feature, so we were able to significantly reduce the average development time by creating a library of test routines.
I was privileged to work on a project where the Category-Partition method not only worked, but worked extremely well. I consider that time to have been a very valuable learning experience. Since then, I have not found another situation where it applied so perfectly. I have attempted in this article to call out the key reasons the method was so well suited to the environment and produced remarkably consistent results.