The Art of the Spreadsheet. Copyright 2008 John F. Raffensperger

1. Why is spreadsheet style important?
2. Make your spreadsheet read from left to right and top to bottom.
3. Omit unneeded bytes.
4. Omit unneeded sheets.
5. Organize blocks with care.
6. Attend to blank space.
7. Omit unneeded cells.
8. Format with caution.
9. Show all the information.
10. Spreadsheet errors.
11. How to audit a spreadsheet.
12. Suggestions for operations researchers.
13. Teaching the art of the spreadsheet.
Appendix. Checklist for a spreadsheet.

7. Omit unneeded cells.

All cells in a spreadsheet, sometimes even blanks, require the reader's interpretation. The goal is to focus the reader on the critical information. Do this by avoiding noise, by writing concisely.

Unnecessary cells take up the reader's precious time, time that the writer wants the reader to spend on understanding the business logic. Unnecessary cells can contain mistakes, add visual and logical clutter, add to the bulk of the file, and confuse the reader. They require another label, too, compounding the problem.

Eliminate spurious cells.


A spurious cell is one that simply points to another cell, such as the formula "G8 = Input!G15". A spurious cell has one precedent. If its precedent is on the same sheet, it can be discovered by examining the precedence tree.

Various bad habits motivate the use of spurious cells, but usually the problem is from long arcs of precedence. A writer wants to remind the reader of data from another section; later formulas refer to the spurious cell rather than the original data.

Here, the writer did not format cell D8 as its source, cell B24, is formatted. He should format the spurious cell properly, if he is going to use it. Also, there is another label to be inserted, yet another element of screen noise to distract the reader, and this second label should be consistent with the first label. So we have evidence that spurious cells add work for the writer.

If this were text, an editor would not think twice about deleting the redundant verbiage. In mathematics, the extra variable would be substituted out algebraically with no hesitation. In graphics, the duplicated visual would be erased.

The reader should see a piece of information in exactly one place. To many spreadsheet writers, this means only that the reader should not have to enter the same constant input in two places. Every value should be in exactly one place, which is where it is entered or calculated. If it appears a second time as a spurious cell, the reader is first confused, then after realizing the duplication, he is compelled to compare, and then he is left wondering why this new variable is important. Spurious cells are repetitive information that should be deleted.

Erase dangling cells.

A dangling cell is a numeric cell without dependents, a calculation not used anywhere else in the spreadsheet. Usually, a spreadsheet contains a “bottom line,” such as an objective function or balance sheet total profit. We will call this the last numeric cell.

Any cell that is not on the complete precedence tree for the last numeric cell must be dangling.

Most of the time, a dangling cell is like an unused variable or unneeded decoration. Whereas a spurious cell is like repetitive text, a dangling cell usually refers to irrelevant text.

A spreadsheet's last numeric cell must be dangling, too, and we obviously want to keep it. Here, we see dangling the objective function, Preference total, for a version of our model Assign.xls. Upon clicking the Trace Dependents button, no arrows appear, because this is our last numeric cell, the “bottom line,” the total utility of the ex-presidents' schedules.

A dangling cell may be required for other software, such as an optimization algorithm. Here, the cells in row 49 have no dependents. However, they are required to model the linear programming constraints, and they are referenced by the What'sBest! solver. Excel's Solver may use dangling cells in the same way. This is a compelling reason to leave them in, though none is the last numeric cell.

Besides the last numeric cell and cells required for proper modeling of simultaneous equations, there are four main types of dangling cells: the unused input, the validation formula, the useless intermediate calculation, and the interpreted output. These four types of dangling cells can usually be erased.



An unused input is a relic from an earlier version or an error in the model. Either way, the spreadsheet needs more work. (There is one more possibility - the input is used in a formula hidden somewhere, such as Solver or Visual Basic. The reader has no way to tell except by opening up and very carefully inspecting all the Solver dialog boxes, and checking the Visual Basic code.) Here, cell G15 - a constant - has no dependents. Why is it in the spreadsheet? Most likely, one of two things has happened. Case one: the writer thought he would need it, but didn't, then forgot about it. Case two: the writer has intended some formula to refer to G15, but the formula instead refers somewhere else. Either way, an unused input needs help.

Authors cited earlier recommended formulas to check the validity of input and output. These extra formulas are dangling by definition. Temporary validation formulas are useful for debugging, but reduce readability and can contain errors themselves (like the benign one here). Use them in drafts, not in the final spreadsheet.

Writers insert useless intermediate calculations, thinking the reader “will want to see this subtotal, just for their information.” The same logic compels the spurious cell. But the writer should instead drive home the main point rather than display trivial side matters. Let the reader put these in as he wishes. Here, “TOTAL” is not used elsewhere; it has no dependents. We click Trace Dependents, and no blue arrows appear, so cell L7 is a useless intermediate calculation. Once we erase it, we do not need the label either. Removing two of 17 items from rows 6 and 7 results in 11.7% fewer objects that the reader must read. We communicate the same information in 88.3% of the time needed before.

Interpreted output is like a comedian repeating the punch line, hoping for an extra laugh. For example, consider this blemish.

There are many nasty things to say about this formula. Seventy-four characters to interpret one number Input!K33, which is the real output. (Why the true output is on the input sheet is anyone's guess.) It rounds the answer to the nearest million, so the value is no longer correct. The writer assumes the reader is an idiot who cannot understand significant digits. If Input!K33 is negative, this formula shows nothing - it hides itself. It gets worse. We shall see later that if Input!K33 is negative, then a cell currently hidden below this one pops up. The writer should have simply labeled Input!K33 with constant text as “Surplus (loss),” then formatted the number as dollars. The real output was available in Input!K33, but was obscured here in I32 by a forced interpretation.

Do not mix a model's formulas with its documentation. Avoid labels containing formulas. Use formulas only for the model, and use concise constant text for labels.

Eliminate relics

A relic is anything no longer needed, but appears because the writer has not cleaned it up. Relics confuse the reader.

Here, we find a relic, far below the apparent bottom of a big spreadsheet. The writer probably just wanted to calculate 10,000*2,700, then forgot to erase it.

To change the memory allocated in Excel, the file must be saved.

A simple way to look for spreadsheet relics is to press the End key, then the Home key. Alternatively, you can use Edit, Go To, Special, Last cell. This moves the cursor to the lowest and furthest right cell for which computer memory has been allocated. If this last cell is not the bottom right cell of the intended spreadsheet, then there may be relics.

Most of the time, extra rows and columns contain old formats. A fast way to determine whether these apparently blank rows or columns affect the spreadsheet is to delete them. Do this by selecting the apparently blank rows or columns, and deleting them with Edit Delete Row or Edit Delete Column. If #REF! errors appear, undo the changes and find the problem; it is likely that formulas perversely refer to blank cells.

Assign.xls contains relics of earlier versions. In the Model sheet, all the data is in columns A through J. However, width was adjusted of columns far to the right, in AV and AX. In fact, in the original version, pressing End and Home moves the cursor to cell IT22, far away from the apparent last cell in L18. Fortunately, these relics are just old formats.

Nest and erase formulas where appropriate

In nest and erase, a formula in a cell is substituted for that cell's address elsewhere. Nest and erase is the same as substituting out variables in equations. This technique can reduce the complexity and size of a spreadsheet significantly.

 B18=B11 - B16
B19=B2 - B18
 B19 = B2 - B11 + B16

Here, the writer used nest and erase to eliminate the formula in cell B18. The number of terms has been reduced from four to three. The label in A18 could be erased. Cell B19 is more likely to depend directly on constants rather than other formulas.

Nest-and-erase is especially appropriate when a formula in cell x has only one dependent, cell y, that is, when x is a spurious cell. The formula in x can be copied and substituted for the address of cell x in cell y.

David Freeman, “How to make spreadsheets error-proof,” Journal of Accountancy, May, 1996, vol. 181, no. 5, pp. 75-77.
D. Mather, “A framework for building spreadsheet based decision models,” Journal of the Operational Research Society, 1999, vol. 50, pp. 70-74.
Raymond R. Panko and Ralph H. Sprague, Jr., “Hitting the wall: errors in developing and code inspecting a ‘simple’ spreadsheet model,” Decision Support Systems, vol. 22, no. 4, April, 1998, pp. 337-353.
Hans G. Daellenbach, Systems and Decision Making, 1994, John Wiley & Sons Ltd., Chichester, England, p. 132.

Nest-and-erase can be tedious work, but often produces a pleasant surprise of gathering and canceling terms. The result is a clean spreadsheet. In spite of this, eliminating cells goes against the common wisdom. Authors such as Freeman (1996) and Mather (1999) consider extra cells a virtue: complicated formulas should be separated into multiple cells. In fact, Mather says

The choice of intermediate variables is subjective, but, as long as the processing is accurate, the actual number of these variables is not critical, and it is always better practice to use more rather than less. This is because breaking down the processes into smaller steps makes the programming easier to follow, easier to debug and overall reduces the need to use complex formulae, thus reducing possible sources for errors.

Mather cannot possibly be correct - “it is always better practice to use more rather than less.” If Panko's studies are correct, more cells increase the probability of error. No art prefers verbosity. Hans Daellenbach (1994) wrote of mathematical modeling, “A good model is a model that is as parsimonious as possible in terms of the variables/aspects included. In other words, it should be simple.”

Spreadsheet writers often say they want to leave in “interesting” intermediate cells. Unfortunately, “interesting” to the writer is usually noise to the reader. A more compelling reason to leave in intermediate cells is to check them as the spreadsheet is being written. Later, however, as the spreadsheet is edited, these cells should be substituted out.

When should this process of nest and erase should stop? It should stop when the formula in the dependent cell y begins to lose readability. The difference of opinion lies in when the loss of readability begins. It is true that longer formulas are less readable, but the cost of a shorter formula can be more cells and a longer precedence tree, which may be less readable.

Ross James, conversation, 1998, Christchurch, New Zealand.

One rule of thumb is that a formula may be considered too long if it wraps in the formula box. If the formula is highly irregular, then it makes sense to end the formula before it wraps to the next line. However, if the formula is regular, if it has repeating terms that reference blocks of the same size, then wrapping is less relevant to readability.

A better rule is to minimize the total number of characters in all formulas in the spreadsheet. Two or three somewhat complicated formulas will be easier to understand (and simplify) than several dozen simpler cells. Also, a complicated formula can sometimes be converted to a Visual Basic function. The fewer the formula characters in the spreadsheet, the lower the probability of error.

Consider the psychology of a spreadsheet with a half-dozen formulas, but these few are complicated. The writer and reader will focus intently on those few formulas, to make sense of them. Conciseness gives a strong signal, and puts our attention on what really matters. If the formulas are broken up into dozens of cells, few of those cells will get much attention.

Tufte gave a useful (and occasionally amusing) measure of the quantitative value of a graphic. His formula is

Tufte, Edward, Envisioning Information, 1990, Graphics Press, Cheshire, Conn., pp. 162, 168.

data density of a graph = (number of entries in the data matrix)/(area of the data graphics).

And what he said of it was:

Maximize data density and the size of the data matrix, within reason.

Why? Why should data density be maximized? The reason is to strengthen the signal. A signal is stronger when given concisely, in a small visual area, than when given verbosely, spread over a wider visual area.

Since a spreadsheet is a mathematical model, we can algebraically determine the minimum number of numeric cells required in the spreadsheet. Then a more accurate numeric density would be given as:

numeric density of a spreadsheet= (fewest numeric cells required)/(number of numeric cells).

Generally, the higher the numeric density, the stronger the signal.

 

 

 

The colorful 225-cell model here could be reduced to a spreadsheet with one formula. However, this cell would have many long repeated terms. We can reduce the total number of formula characters by adding cells, when the original formulas have long terms that repeat:

K5 = if (really-long-term1 > term2, really-long-term1, term4)

can be shortened with two cells:

K4 = really-long-term1

K5 = if (K4 > term2, K4, term4)

To lower the total number of formula characters, the new spreadsheet has 4 formulas and the 25 inputs. The longest of these is displayed, and it is non-trivial. None of the other formulas wraps. So the data density of the original is less than 29/225 = 0.129. The new spreadsheet fits on one screen, viewed at 90%.

This process of reducing the spreadsheet uncovered a variety of errors, including an unused input. Examination of the four formulas brought to mind many questions that probably could not be answered confidently by the writer.

The new spreadsheet looks kind of boring, doesn't it? Which spreadsheet is most likely to contain an error? Let's ignore inputs, since we already see the error of the unused input. The original spreadsheet had 25 inputs and 225 numeric cells, so there were 200 formulas. At a 1% error rate, the probability of a correct model with 200 formulas is 0.99200 = 0.366. For the new spreadsheet, the probability of a correct model with 4 formulas is 0.994 = 0.96.

You can supercharge nest and erase with the sumproduct() formula. It is amazing, sometimes, how many cells can be eliminated with this function. Also, sumproduct() tends to produce readable formulas.

In the first sheet here, cells in column F contain trivial formulas, F5=B4*B5, F6=B4*B6, F7=B4*B7, F8=B4*B8. Similarly for columns G and H. Column I simply sums across columns F, G, and H.

The five formulas are the constraints and objective function of a linear program. Based on Hans G. Daellenbach, Systems and Decision Making, 1994, John Wiley & Sons Ltd., Chichester, England, p. 347.

In the second sheet, these trivial formulas have been replaced with sumproduct(), eliminating 20 of the 43 cells, about 46% of the spreadsheet. There is less to look at, less to check, less to figure out. We see that this model needs only 5 formulas.

The data density of the top model is 27/43 = 0.628. The data density of the bottom model is 1.00, a stronger signal.

 

 

The dot product is sometimes confused with the vector (or cross) product, a×b=c, where c is a vector orthogonal to a and b, and c has length |a|·|b|·sin (angle ab).

One possible drawback to using sumproduct() is that your reader may not understand it, because so few people know about it. If you are not acquainted with it, push the help button! To an engineer, mathematician, or operations researcher, sumproduct() is just a dot (or inner) product of two vectors a and b, written mathematically ab. But many professed experts in spreadsheets still do it the long way, one product and sum at a time.

Here is another example of nest and erase. Cells J37, J38, and J39 contain the following (block corners bold):

J37: =SUMPRODUCT(C34:I34,C37:I37)

J38: =SUMPRODUCT(C35:I35,C38:I38)

J39: =SUMPRODUCT(C36:I36,C39:I39)

Note that in this version of Assign.xls, formulas are formatted gray, everything else is constant.

Since we intend to erase J37, J38, and J39, we should check their dependents. What other cells will be affected by erasing these?

The precedence tree shows that each of these cells have only one dependent - the “Preference total” in cell C44. Note that “Preference total” contains four sums, one for each ex-president.

Cells J37, J38, and J39 could be nested into one cell, say, J39.

J39: =SUMPRODUCT(C34:I36,C37:I39)

We can then erase J37 and J38. We do the same substitution and erasures with the other blocks. “Preference total” must be rewritten:

C44: =J9+J19+J29+J39

But now we can nest J9, J19, J29, and J39 directly into the “Preference total” cell C44. This allows us to erase 12 cells.

All constant precedents of this cell may be viewed in one click. Hence, it will be easier to debug and edit. While long, this formula is readable because it is regular. Before nest and erase, to understand “Preference total” and to discover the relationship to Assignments and Preferences, the reader had to trace back through twelve intermediate cells, quite a lot of work. Unneeded intermediate cells create unneeded complexity, and cause the reader unneeded cognitive effort.

After nest and erase, the reader makes the connection directly. Fewer cells, a lower probability of error, a stronger signal, a more emphatic context.

 

 

Simplify formulas

The following table summarizes issues of formula readability.

Less readable More readable
Irregular terms reference various ranges. References appear more than once. Similar terms repeat and reference similar ranges.
References are ordered randomly in the formula. References read left to right in row and column order.
Well-known formulas are made cryptic by separation into multiple cells. Well-known formulas (e.g. the quadratic formula) are in a single cell.
Division appears randomly. Division generally appears at the end of a term.
Formulas contain constants. Formulas reference other formulas or blanks. Formulas reference constants.
The formula has unnecessary parentheses and spaces; the formula can be simplified. The formula has the fewest characters necessary.

For example, the following formula (not from Assign.xls)

C7: C6*(A4) + A6*C6 + (C6*A5)

could be simplified to

C7: C6*(A4+A5+A6)

or even

C7: C6*SUM(A4:A6)

which will change automatically if rows 4 to 6 were to have a new row inserted. The shorter formula is more readable partly because the information fits in a smaller eye span.

Simplifying a formula requires knowledge of the order of precedence for arithmetic operators. The formula

(C7/A8)*A7

could be simplified to

C7/A8*A7

because multiplication and division are commutative. However, the reader may wonder if the writer meant

C7/(A8*A7)

which is not the same. So put division last. Division at the end also tends to reduce the number of characters required.

C7*A7/A8

Here is a puzzle:

-2^4

Is this -32 or 32? Negation has higher precedence than exponentiation, so the result is 32. The danger comes not when we do not know the answer, because we will put a parentheses around the -2 to be sure. The danger comes when we do, because we will not put a parentheses around the -2, but then our reader may not be convinced we have done the right thing.

It is a good habit to use parentheses when we are not sure of the precedence of the arithmetic operators, within reason. Unfortunately, extraneous characters can accumulate to make the formula long and hard to understand. Use the fewest characters necessary to write the formula correctly, within reason.

Here is the order of precedence for the common operators:

() grouping with parentheses
- negation (as in -1)
^ exponentiation
* and / multiplication and division
+ and - addition and subtraction
& string concatenation
=, <, >, <=, >=, <> comparison

A telltale zero often suggests that an if() has been used when max() or min() could be used. These are typically of the form, if(a>b, a-b, 0). Observe that

if(a>b, a-b, 0)

is the same as

if(a-b>0, a-b, 0)

which makes it a bit clearer that we only need

max(0, a-b)

but this is hard to see if a and b are monstrous. For example, consider the double whammy

if(A64*B29*(A62-A63)/B28>B26+B27, A64*B29*(A62-A63)/B28-B26-B27, 0)

The telltale zero at the end encourages the spreadsheet editor to take a closer look. Note that the first term

A64*B29*(A62-A63)/B28>B26+B27

can be rewritten as

A64*B29*(A62-A63)/B28-B26-B27>0.

The left side of this inequality is the same as the “then” clause of the if() statement.

Following the bad old advice, “break complicated formulas into multiple cells,” we might to put the big term “A64*B29*(A62-A63)/B28-B26-B27” in a separate cell.

The if() statement would then reduce to something like:

if(B65>0, B65, 0)

But now we see what is really happening. Instead of creating the spurious cell, the original double whopper could be reduced to

max(0, A64*B29*(A62-A63)/B28-B26-B27)

or better yet,

max(0, (A62-A63)*A64*B29/B28-B26-B27)

which puts terms in an order that is closer to their order of appearance in the spreadsheet. Most importantly, using max() instead of if() allows the huge term to be typed just once, so errors are easier to avoid.

Put the short term “0” at the beginning of the formula, where it will be noticed immediately, rather than at the end of the formula, where the reader must search for it:

max(0, (A62-A63)*A64*B29/B28-B26-B27)

is easier to read than

max((A62-A63)*A64*B29/B28-B26-B27, 0).

Keep cells one type

Here is a horrible formula:

=if(B19="", 0, if($C$34=1, $C$47*B19, $C$41*B19))

B19 seems like it should be numerical, because it shows up getting multipled by either $B$47 or $B$41. But the formula first tests it for blankness. The writer probably wanted the reader to enter values in column B, and was perhaps trying to save the reader the effort of entering a zero in column B. Or perhaps the writer was trying to save the reader the effort of seeing the zero. Anyway, if he had just allowed B19 to be zero, he wouldn't have need to test for blank. The formula could have read as follows:

=B19*if($C$34=1,$C$47,$C$41)

Here is the reverse case:

I14=if(F14=0, "", H14/$E$63)

It turns out that F14 contains a sum() formula, so we cannot write =F14*H14/$E$63. It also turns out that I14 is a percent. So the formula is intended simply to show blank, perhaps to look tidier. Unfortunately, every cell that references I14 must first test to see whether it is blank, because 1 + "" = #VALUE, when we really want it to be zero. The writer should have kept both clauses numerical:

I14=if(F14=0, 0, H14/$E$63)

This small change will make later formulas much easier to write. But the real amusement comes when we read backwards to track down H14:

I14=if(F14=0, 0, H14/$E$63)
H14=if(F14=0, 0, G14)
G14=F14
F14=sum(B14:D14)

So really H14=if(F14=0,0,F14). And that makes I14=if(F14=0, 0, F14/$E$63)=F14/$E$63. Columns G and H can be deleted. For that matter, if we only need the percentage, we could delete columns F, G, and H, nesting the sum directly into column I:

I14=sum(B14:D14)/$E$63

Here is one that is even more egregious:

=if((G16+C17)>0, (G16+C17)/($B$26*0.5), "0")

This is an inventory control model, and G16+C17 should never be negative, and the writer has correctly tested for it. But if G16+C17 is negative, the result is a string that looks like a zero! This is bad form. We can improve this wart as follow:

=2*max(0,G16+C17)/$B$26

Excel can help you a bit here. Make sure that transition formula evaluation is not on (In Excel, see Tools, Options, Transition). Then if you add “Cat + 3”, you will see a #VALUE! error.

  

Do not put constants in formulas. Make formulas reference constants.



The rule “Don't put constants in formulas” is a standard adage of all the spreadsheet literature. Most people understand that data in a formula hides information rather than exposes it. “You're not showing your assumptions,” is what they will correctly tell you.

Constants in formulas sometimes show that the writer had a problem that he did not know how to solve. Here's an example. The writer knew something was wrong here. In an earlier version of this formula, he noticed that if AJ2 were 0, the formula produced a divide-by-zero error. So the writer added this kludge of 0.000001 to make it work. AJ2 is in the numerator and the denominator, so AJ2 has no effect whatsoever as long as AJ2 is a nonzero number. He got himself into trouble because he did not bother to simplify the formula. In fact, the formula can be simplified significantly. After that, the next thing to do would be to remove the constants 12 and 13 from the simplified formula; it is much easier at this point. But the monster arose mainly because the writer did not simplify the formula. An accountant may recognize this calculation as a conversion from months to 4-week accounting periods.


Here is a different way to say “Don't put constants in formulas”: Parameterize your model. That means that, as much as possible, aspects of the model can be easily changed. For example, models that depend on time might be designed with a column for each hour. If we “hard-code” the change in time period as one hour, then we will have a hard time if we want the model to have a different time interval. However, if we parameterize the time interval, the model is easier to change.

Earlier, I wrote that a spreadsheet should have short arcs of precedence. Here is a new suggestion, more controversial, which follows from that: Try to make formulas reference constants. This is almost the same as “Don't put constants in formulas,” but not quite. Try to write the spreadsheet so that each formula directly references constants, within reason. Here, the reader can immediately see how the source data - the “assumptions” - for a cell relate to the cell.

The worst possible way to mix formulas and constants is to overwrite a formula with a constant. This can be done accidently in editing, or on purpose by a well-intentioned clerk. The clerk sees what seems to be a wrong number, and wanting to make the numbers correct, the clerk types over the formula with a constant. This can actually improve the spreadsheet in the short run, with those given numbers, but the whole idea of a spreadsheet being a model that can be modified is lost. So authors are tempted to lock and password protect sheets, which is okay in small doses when the audience is not spreadsheet literate, but bad the rest of the time. A better solution is to format formulas differently than constants.