The Art of the Spreadsheet. Copyright 2008 John F. Raffensperger

1. Why is spreadsheet style important?

2. Make your spreadsheet read from left to right and top to bottom.

3. Omit unneeded bytes.

4. Omit unneeded sheets.

5. Organize blocks with care.

6. Attend to blank space.

7. Omit unneeded cells.

8. Format with caution.

9. Show all the information.

10. Spreadsheet errors.

11. How to audit a spreadsheet.

12. Suggestions for operations researchers.

13. Teaching the art of the spreadsheet.

Appendix. Checklist for a spreadsheet.

A block is a set of cells organized into a rectangle and usually set apart from other blocks by blank space. Hopefully, these cells are related. Not so hard, right? In fact, designing good blocks and arranging them wisely are key skills in the art of the spreadsheet.

I usually draft a spreadsheet, and then try different ways of setting out the blocks to see which is best. The goals here are predictability, scalability, and compactness. The essential tools are joining, sorting, and transposition.

The ideal is often a single table. Where possible, have a single table, as in a list. The structure is so simple that your reader will immediately grasp it. Many times, writers mistakenly break up data that belongs together, and this causes them all sorts of trouble later. The solution is to join the tables.

Simply put, to spread data over an already-available dimension means consolidating cells that use the same set of labels.

Tufte said, “In flatland, after all, every opportunity to spread additional data over an already-available dimension must be cherished.” Later, he wrote, “Constancy of design puts the emphasis on changes in data, not changes in data frames.” Still later in the same book, he wrote, “Micro/macro designs enforce both local and global comparisons and, at the same time, avoid the disruption of context switching.”

A “dimension” in a spreadsheet is often denoted by a label that describes a row or a block. A data frame is the block itself, as defined by the labels. By aligning data types consistently and carefully, the reader will know where to find information, without your even needing to tell him directly.

Here is a cyborg's approach to spreadsheet layout. The next spreadsheet models a neural network, the exclusive-or (XOR) recognition problem. Here is the cyborg's problem: If exactly one of the input bits is 1, then the output neuron should return 1. If both input bits are 0, or if both input bits are 1, our cyborg's output neuron should return 0. Ordinary sorts of analysis like linear regression fail with the XOR recognition problem, so we are using something more interesting.

To understand the issues of layout, the main technical point you need about the spreadsheet is that we have two input bits and five neurons. Each neuron has a threshold, two weights, and an output. The input bits feed into the two input neurons. The two input neurons feed into the hidden layer neurons. The hidden layer neurons feed the output neuron.

“Learning” for the neural net means finding values for the thresholds and weights that produce the right output. If Cy returns the wrong output, then he has the wrong thresholds and weights in his neurons, and needs more training. Here, the writer has already found good values for the thresholds and weights.

Above, the writer has a carefully-spaced block for each neuron. To show the relationships between neurons, he has added arrows with Excel's drawing tool. Tufte's “data frames” are the neurons. The dimensions are “neuron,” “Threshold,” “Weight 1,” “Weight 2,” and “Output.”

Can you quickly see which neuron above has the largest value for Weight 1? After some searching, we find the neuron in the lower right.

Conway and Ragsdale (1997) wrote, “A design that results in formulas that can be copied is probably better than one that does not.” A design that can be copied will always save you time in the end, and the repetitive and consistent structure will tend to be easy to read.

Adding neurons will be a lot of work with the above spreadsheet. It will be rather difficult to copy the formulas for each layer of neurons and get them to align properly. The graphic arrows also make rearranging the spreadsheet very difficult. If the writer decided he wanted a big neural net, with many layers, he would have quite a job drawing all the boxes and arrows, and it would take up an immense area on the spreadsheet.

Let's try rearranging this spreadsheet, so that all the dimensions are aligned in the same way. All dimensions are aligned in the same way, across the same direction. The arrows are gone. It is more compact - 5 columns by 10 rows, rather than 9 columns by 11 rows. It looks less cyborg-like.

The first cyborg spreadsheet is most appropriate for teaching neural nets. Without the arrows, the reader may at first find it less clear that the input neurons feed into the hidden neurons. But it also puts the model in perspective - our cyborg is just a spreadsheet with 17 constants and 5 formulas. It will be easier to add layers of neurons, if we wish, because we can insert and copy pairs of rows. One could imagine a big neural net as simply a large table. For a real-world problem that requires scalability, the second one would be much easier to use.

Finally, it is much easier to compare thresholds and weights for each neuron, because they are adjacent. In the first neural net spreadsheet, the values of thresholds and weights were not adjacent, so it is more difficult to compare these values. We can see which neuron has the largest value for Weight 1 by vertically scanning a short list.

The venerable *Chicago Manual of Style* gives this prescription:

The left-hand column of a table is known as the *stub...*There is no hard and fast rule about
placing one type of variable (dependent or independent) in the stub and the other in the column headings, but once the
choice is made, it should be applied consistently to all the tables in the same series.

Consistency within the stub is also important. Items that are logically similar should be treated similarly: Authors, Publishers, Printers, *not* Authors, Publishing concerns, Operates print shop. In a series of tables, the same item should always bear the same name in the stub: the Union of Soviet Socialist Republics, for instance, should not appear
as USSR in one table and Soviet Union in another.

So it is not only the layout which should be consistent, but also the labels. If you keep related data together on one sheet, then you need only one label for each data type, and you will not have a problem of inconsistency.

Here is an example we saw earlier. On SCREEN 2, the information is primarily sorted on the dimension of Name. On this new SCREEN3, they are sorted on the dimension of Shift. Combining onto a single sheet would encourage the writer to sort the information along a single dimension. The labeling is inconsistent because it is not in the same order and because it is not repeated. Most likely, the writer would not do this, because the inconsistent blocks will make formula creation a headache.

There are four common signs for improving layout of data types: label repetition, time, formula transposition, and concreteness of data type.

Here is a simple example of label repetition. In SCREEN 2 below left, the days of the week labels take up almost a fifth of the spreadsheet. By deleting the redundant labels in rows 7, 12, and 17, the data fits into a smaller eyespan. For a larger spreadsheet, more real information is raised from beneath into a more compact view.

Label repetition suggests that the writer could consolidate blocks. Here we see repeated names, REAGAN, BUSH, FORD, NIXON. The spreadsheet could be reorganized by name, so each name appears only once. The duplicate labels for the days just waste space.

Tufte, Edward,

Tufte wrote, “Maximize the data-ink ratio, within reason,” and “Erase non-data-ink, within reason.” By deleting the labels in rows 7, 12, and 17, and then moving the tables closer together, we would leave room for more information on the first screen, improving the data-ink ratio.

The original Assign.xls is structured abstractly, as a linear program, with sets of decision variables (the ASSIGNMENTS blocks) and sets of constraints. The objective function, Preference Total, is at the top (with the range name WBMAX), as an operations researcher would write a linear program. The Model sheet has the staffing constraints and right hand sides. The Assignments sheet has the decision variables. The Preferences sheet contains the objective coefficients. The Constraints sheet has the work capacities. So the spreadsheet is organized in the abstract manner of a linear program, a mathematical model used to solve business problems.

A client is unlikely to want to use the spreadsheet in this abstract way. The client instead will probably think about ex-presidents. What is Bush's assignment this week? How can I add a new ex-president? Structured by decision variables, preferences, and constraints, the client must copy three sections separately.

A concrete data type, such as “ex-president”, is probably a better structure than an abstract data type, such as “data” or “constraints”. If the spreadsheet is organized abstractly, then what can be copied? You will have to do a separate copy operation for each abstract type, such as “data,” and “constraints.” Try to make your primary data type concrete, tangible, and meaningful for the business problem.

If you really want to use a computer programming paradigm for spreadsheets, at least use the modern programming paradigm - object orientation. In object-oriented code, the program is organized by objects that describe real-world objects. Each object has properties and behaviors that reflect the properties and behaviors of the real-world objects.

If Assign.xls were arranged by ex-president, we have a spreadsheet organized by a concrete data type. To add another ex-president, there is only one block to copy. The amount of visual space used by Assign.xls is now less than two screens, better than the original four sheets.

Also, the context of ex-presidents has been consolidated and visually solidified. The reader can now guess much of what is off-screen without moving there.

Spreadsheets often have time as an element in their structure. Each row is a day, or each column is a year, etc. If you have a spreadsheet with records that vary by time, try to use time as the unifying element to have a single table.

When a spreadsheet requires more than one table, stack unrelated blocks vertically (like a web page). Less preferable
is a horizontally (like ticker tape) arrangement. Worst of all is a bulletin board. The bulletin board arrangement, like
the one here, looks like the “for rent” board at the student union. It's almost impossible to find anything. Maybe
the reader could use the search function, if he knew a word to search for. Information one screen down *and* one
screen right is hard to find, unless it is part of a table that starts on the far left. A bulletin board layout requires
more keystrokes to navigate. Row and column operations will affect other blocks, so changes require moving cells, which
is time consuming and error prone.

So avoid the bulletin board structure.

We read by paging down, as in a web page, not by wandering randomly across a huge field. Aligning table (and graphs) vertically, at the far left of the sheet will make information easier for you and your reader to find.

Tufte put it like this:

Tufte, Edward,The principle of data/text integration is

Data graphics are paragraphs about data

and should be treated as such.

Words, graphics, and tables are different mechanisms with but a single purpose - the presentation of information. Why should the flow of information be broken up into different places on the page because information is packaged one way or another?... Imagine if graphics were replaced by paragraphs of words and those paragraphs scattered over the pages out of sequence with the rest of the text...

Neither row nor column deletion disturbs a stair step structure. The stair step structure is a block diagonal matrix with blanks off the diagonal, usually with blocks for input, calculation, and output, like the one here. An unrelated block to the right of another block is bad. A block to the right of a blank is worse. A blank on the left suggests blank all the way across. The outdated stair steps prevent a single type of error at the expense of readability, just as multiple sheets do today.

Try to avoid transposing cells with formulas. Spreadsheets constructed this way are particularly error-prone.
Transposing with formulas makes construction and modification difficult, and is a loud sign that data types in the
spreadsheet are misaligned. It is *very* hard to get all the formulas right for this structure. Each one must be
typed by hand.

Notice the blank row inserted in the table here. Why didn't the writer just enter the original data vertically, the way he needed, it in the first place? Furthermore, this confuses the reader: will this data type read across or will it read downwards? The writer should settle on a convention for the alignment of this data type, and stick to it.

If “names” are listed in a column in one place, then list them in a column everywhere. This gets the reader oriented to viewing the spreadsheet in a consistent way. Break this rule to confuse the reader and waste space with extra labels.

If you *must* transpose (and it happens sometimes), use Excel's array feature and the Transpose() function. You may even find that instead of Transpose(), you can use matrix multiplication (Mmult()). With either Transpose() or Mmult(), you will not need to enter a long list of addresses by hand, and the function name makes the transposition explicit to the reader.

Think about which orientation has the most elements, and make that the vertical orientation, so summary operations are at the bottom. Similarly, if one dimension is fixed and another dimension is variable, place the fixed variable across and the variable dimension downwards.

If the data are in a horizontal list, summary operations will tend to be on the far right. Most likely, your reader will have to page right to find your Big Conclusion, if he knows to look for it over there.

If the data are in a vertical list, then summary operations will tend to be at the bottom. Now, the reader will find the Big Conclusion at the bottom, where it can easily be put on the left, near the beginning of the line, where the reader will see it first.

This follows how we read, left to right and top to bottom. The repetitive information should organized downwards, so the reader views it by paging down. The reader expects to see the table structure across the top, in the column headings.

Excel has 65,536 rows but only 256 columns. If the primary data type is downwards, you are less like to run out of room.The “let me page down to see it” preference has become even stronger with the Internet. A web page that requires the reader to scroll sideways is inconvenient, but paging down is easy. Furthermore, you can put long labels in the A column, which can be widened, and still allow the other columns to have consistent width. It works better and looks better.