Wednesday, February 14, 2018

Embedding the design into your code is pixie dust for code quality

Programming seems to become more about memorization every day, with advocates pushing for memorizing lists of design patterns and refactorings and the difference between “parameter coupling” and “invocation coupling.”

Forget all about that. There’s much more to gain by having general principles which apply to many situations. In this post, I’m going to show you one of my favorite software design principles in action, the Embedded Design principle. The Embedded Design principle, which I briefly introduced in a previous post, states that you should always code in a way that makes the design apparent. It’s like pixie dust in its ability to make code more readable, loosely coupled, and impervious to bugs. It’s also hard to grasp, because programmers rarely see a program’s design in first-class form. So, in this post, I’ll show you an example.

Conveniently, a principle can be used as a “machine” for generating examples that violate it. In software design, these examples always end up being code that looks reasonable to the untrained eye, but will spell doom and gloom — often in the form of hilarious bugs — for any team that tries to build on it. Last week, one of my coaching clients asked me how this principle applies to caching. I turned on my machine and it produced this:

The Problem

Meet Bob. Bob is an engineer at He’s only been there a few months, but he has a project that will affect everyone: he’s going to create the website’s stats page! Now everyone can see just how engaged AwesomeSauce’s users are.

Bob starts coding. There are three data items, and each one will print either a computed or cached value. Bob ended up using one chunk of code for each data item.

public void displayStats() {
  if (lastCachedTime <= lastMidnight()) {
    numUsers = countUsers();
    lastCachedTime =;
    print(Total Users:  + numUsers);
  } else {
    print(Total Users:  + numUsers);

  if (lastCachedTime <= lastMidnight()) {
    numArticles = countArticles();
    lastCachedTime =;
    print("Articles written: " + numArticles);
  } else {
    print("Articles written: " + numArticles);

  if (lastCachedTime <= lastMidnight()) {
    numWords = countWords();
    lastCachedTime =;
    print("Words written: " + numWords);
  } else {
    print("Words written: " + numWords);

So simple! As the stats page gets bigger, it will be easy to add more by repeating the pattern. Bob sends it to Charlie for review.

Now, one thing about Charlie is that he’s not me. You see, I’d probably flip out and tell Bob about all the bad things that will happen because of this code, and give him a lecture on the Embedded Design Principle and a link to this blog post (which conveniently already has this code). Charlie, however, just takes a sip of coconut water, and says “There’s a bug over there because you re-used lastCachedTime. Otherwise, looks good. Ship it!”

- if (lastCachedTime <= lastMidnight()) {
+ if (lastCachedTimeUsers <= lastMidnight()) {
     numUsers = countUsers();
-    lastCachedTime =;
+    lastCachedTimeUsers =;

< similar for articles and words >

It’s 6 months later. has been growing faster than the national debt. AwesomeSauce needs more frequent updates to the stats page to show this. Team lead Denise decides to double the refresh rate, and Charlie does it.

- if (lastCachedTimeUsers <= lastMidnight()) {
+ if (lastCachedTimeUsers <= lastMidnight() || lastCachedTimeUsers <= lastNoon()) {
    numUsers = countUsers();

Bob is displeased. They had previously decided all stats should refresh simultaneously; see how the mockup just has one “Last updated” time. “Oops,” says Charlie, and together they merge the if-statements to prevent a future mistake.

public void displayDashboard() {
  if (lastCachedTime <= lastMidnight() || lastCachedTime <= lastNoon()) {
    numUsers = countUsers();
    numArticles = countArticles();
    numWords = countWords();
    lastCachedTime =;

Another year passes. AwesomeSauce has been taking off like Bitcoin. Their stats page is now an information feast, with dozens of items. Time to start simplifying! Surveys show that no-one is actually wondering about the count of words, so they take it down:

- print(numWords);

AwesomeSauce has hit a jackpot! Writers are making 10,000 posts per day. One day, an AwesomeSauce engineer notices they have a big slowdown twice a day, at noon and midnight. After a full day of sleuthing, he discovers that it has something to do with how, at cache refresh time, the stats page is taking a full three minutes to load. With horror, he discovers the call to countWords(), still living, though its results are unused.

This was not merely an instance of “mistakes happen.” This was a heavy price paid from badly-designed code. Everything from the first bug to the site slowdown could have been prevented by structuring it better, and the way to structure it better is a simple application of the Embedded Design Principle.

Nipping it in the Bud

Standard disclaimer: When reading software design advice, always imagine the examples given are 10x longer. Overengineering is bad. But, if you’re not sure whether applying a technique to your code would be overengineering, error on the side of doing it. Abstract early.

In the first version of the code, the computation, caching, and display of each statistic were all independent knobs, 3*N in total, where N is the number of stats to display. Merging the if-statements combined the N caching behaviors into one knob, bringing the total down to 2*N+1 knobs. Nonetheless, this still made it possible to compute a stat without displaying it. This was the rope that hung AwesomeSauce with the twice-daily slowdown.

This example was extreme, but ones like it are real. A static analysis researcher I know once visited eBay. She was shocked to discover that the checker that excited the engineers the most was one of the most shallow: detecting unused variables. Each unused computation eliminated was a database call saved, and a nibble taken out of server costs.

Conceptually, the set of stats and their caching behavior should be only N+1 knobs: each stat as a whole can be turned on or off, and the caching behavior can be changed. Achieving this will be a non-local change, and won’t be obvious by just thinking about what the code does, thinking at Level 2. Instead, we have to look at the design:

This diagram gives a rough picture of concepts used in the design of the program, showing how the concept of a dashboard stat is instantiated multiple times, and how the concept of each stat is built into the code. This diagram is far from complete; it omits, e.g.: the choice of the English language. But, with it, we can see several things.

The same caching behavior is implemented multiple times, which made the bug in Bob’s first version possible.

The code is indistinguishable from an alternate design where each stat has its own caching policy. This is another example of why design recovery is in general impossible.

And finally, each dashboard stat gets expressed in multiple independent lines of code, each its own “knob.” Hence, the slowdown that plagued AwesomeSauce.

All can be fixed — and the code rendered beautiful — if we take all those many arrows emanating from those boxes on the right, and reduce them down to one per box, one line of code per low-level concept.

This is the true version of the “Don’t Repeat Yourself” (DRY) principle.


In Zen and the Art of Motorcycle Maintenance, Robert Piersig illustrates these higher-level concepts, which he calls forms, in the world of mechanics. A motorcycle is a power assembly and a running assembly; the power assembly can be divided into the engine and the power-delivery assembly; and so on. A mechanic that’s testing the spark plugs or adjusting a tappet is really working on these higher-level forms. And yet you cannot walk into a motorcycle parts shop and ask for a power assembly: these concepts only exist in the mechanic’s head.

But we’re programmers, and we have tools that translate our thoughts into CPU instructions. To a certain extent, we can actually build these forms directly into our code. Let’s create a value representing a DashboardStat.

public interface DashboardStatComputation {
  public Object compute(); // in real code, use a more specific type

public class DashboardStat {
  private DashboardStatComputation computation;
  private String textLabel;
  // constructors and getters

In the design, we already had a concept of a dashboard stat: it’s a computation and a label. Now that concept is in the code too. We can go one level further, and also put the collection of dashboard stats into the code too:

public class Dashboard {
  private List<DashboardStat> stats;
  private Time lastComputedTime;
  private Map<DashboardStat, Object> curValues;
  // constructors and getters
  /** Recomputes all statistics if cache is stale */
  public Map<DashboardStat, Object> getCurrentValues() {  }

With these higher-level concepts turned into code, Bob’s original dashboard page can now be written declaratively. This code is just too simple to fail.

// Initialization (slightly simplified from actual Java)
dash = new Dashboard(list( new DashboardStat(countUsers,  Total Users)
                         , new DashboardStat(countArticles, Articles written)
                         , new DashboardStat(countWords,  Words written));

// Use
public void displayDashboard() {


More (or less) Embedded Design

You will never turn your program into a pure expression of the design, at least not unless you’re a descendant of Zohar Manna. You are always making tradeoffs on how far to go down this spectrum. Here are some other choices for applying the Embedded Design principle to this example:

  • If this example refactoring was too much, there’s a simpler compromise. Don’t have a DashboardStat value; just have a DashboardData which hardcodes each of the computations. This is much less overhead than the way I did it above, but it still dictates that all stats should be computed in an all-or-nothing fashion. 
  • Extract the check for whether to recompute into a shouldUpdate method. Now your decisions about cache timing are reified in a separate chunk of code. To go one step further, the caching policy could be parameterized. 
  • Might you one day want to serve the website in multiple languages? Wrap each string in a call to an internationalization function. For now, this can just be the identity function, but doing so makes the choice of language explicit in your code, and marks all the text in your code which corresponds to a “translatable string” concept. (Another example of how, at the design level, two different identity functions are not identical.) 

Wrapping Up

Many programmers will get a sense that something was off about the first example. Every programmer I’ve shown this example to knew they should merge the if-statements and factor out the prints. But the ones who knew they should refactor out a “dashboard stat” value only did so by flash of insight, and couldn’t articulate how they came up with it. But by instead thinking about how the code is derived from the design, this refactoring becomes just a straightforward application of the Embedded Design Principle. Finding a good structure for the code was easy once we saw the structure of the design.

So always think beyond the code. Ask yourself about the concepts of your program, and the values that define them. Like Piersig, this is what turns you from the mechanic turning screws into the engineer and artist building something to be proud of.

But what if you didn’t have a clear picture of how this code was derived, and you want to just be able to look at the code and realize there’s a deeper concept that should be extracted? There is a way to do that, and, for that, you’ll have to wait for my next post.

Like this post? You may be interested in my Advanced Software Design Web Course or Weekend Intensive.


Thanks to Alex Reece, Elliott Jin, and Mimee Xu for comments on earlier drafts of this post.

Monday, January 22, 2018

The Benjamin Franklin Method of Reading Programming Books

Let’s face it, programming books suck. Those general books on distributed systems or data science or whatever can be tomes for a lifetime, but, with few exceptions, there’s something about the books on how to write code in a language/framework/database/cupcake-maker, the ones with the animal covers and the cutesy sample apps, they just tend to be so forgettable, so trite, so….uneducational.

I think I’ve figured out why I don’t like them, and it’s not just that they teach skills rapidly approaching expiration. It’s their pedagogical approach. The teaching algorithm seems to be: write these programs where we’ve told you everything to do, and you’ll come out knowing this language/framework/database/cupcake-maker. Central to these books are the long code listings for the reader to reproduce. Here’s an example, from one of the better books in this category

class User < ApplicationRecord
  attr_accessor :remember_token
  before_save { = email.downcase }
  validates :name,  presence: true, length: { maximum: 50 }
  VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
  validates :email, presence: true, length: { maximum: 255 },
                    format: { with: VALID_EMAIL_REGEX },
                    uniqueness: { case_sensitive: false }
  validates :password, presence: true, length: { minimum: 6 }

  # …another 30 lines follows...

Traditionally, there are two ways to study a page like this:

  1. Type out every line of code 
  2. Copy+paste the code from their website, maybe play around and make small changes 

Approach #1 is a method that, like a lecture, causes the code to go from the author’s page to the reader’s screen without passing through the heads of either. The second is like trying to learn how to make a car by taking apart a seatbelt and stereo: you’re just toying with small pieces. Neither is a sound way to learn.

If you had an expert tutor, they wouldn’t teach you by handing you a page of code. Still, these books are what we have. How can we read them in a way that follows the principles of learning? Read on.

Mental Representations

According to K. Anders Ericsson in his book Peak, expertise is a process of building mental representations. We can see this because expert minds store knowledge in a compressed fashion. Musicians can memorize a page of music far faster than a page of random notes. Expert chess players told to memorize a board position will do much better than amateurs, but, when they make a mistake, they’ll misplace whole groups of pieces.

This is possible because music and chess positions have structure that makes them look very different from a page of random notes or a random permutation of pieces. Technically speaking, they have lower perplexity than random noise. So, even though there are 26 letters in the English alphabet, Claude Shannon showed that the information content of English is about 1 bit per letter: given a random prefix of a paragraph, people can guess the next letter about half the time.

This is why a programmer skilled in a technology can look at code using it and read through it like fiction, only pausing at the surprising bits, while the novice is laboring line-by-line. This is also why a smart code-completion tool can guess a long sequence of code from the first couple lines. With a better mental representation, understanding code is simply less work.

(How do these mental representations work? My officemate Zenna Tavares argues they are distribution-sensitive data structures.)

This is exactly what’s missing from the “just type out the code” approach: there’s nothing forcing your mind to represent the program as anything better than a sequence of characters. Yet being able to force your mind to do this would mean being able to learn concepts more rapidly. Here’s a 200 year-old idea for doing so.

The Benjamin Franklin Method

I don’t know what’s more impressive: that Benjamin Franklin became a luminary in everything from politics to physics, or that he did this without modern educational techniques such as schools, teachers, or StackOverflow. As part of this, he discovered a powerful method of self-study. I’ll let him speak for himself (or go read someone else’s summary).

About this time I met with an odd volume of the Spectator. It was the third. I had never before seen any of them. I bought it, read it over and over, and was much delighted with it. I thought the writing excellent, and wished, if possible, to imitate it. With this view I took some of the papers, and, making short hints of the sentiment in each sentence, laid them by a few days, and then, without looking at the book, try'd to compleat the papers again, by expressing each hinted sentiment at length, and as fully as it had been expressed before, in any suitable words that should come to hand. Then I compared my Spectator with the original, discovered some of my faults, and corrected them.

—Benjamin Franklin, Autobiography

This process is a little bit like being a human autoencoder. An autoencoder is a neural network that tries to produce output the same as its input, but passing through an intermediate layer which is too small to fully represent the data. In doing so, it’s forced to learn a more compact representation. Here, the neural net in question is that den of dendrons in your head.

K. Anders Ericsson likens it to how artists practice by trying to imitate some famous work. Mathematicians are taught to attempt to prove most theorems themselves when reading a book or paper --- even if they can’t, they’ll have an easier time compressing the proof to its basic insight. I used this process to get a better eye for graphical design; it was like LASIK.

But the basic version idea applied to programming books is particularly simple yet effective.

Here’s how it works:

Read your programming book as normal. When you get to a code sample, read it over

Then close the book.

Then try to type it up.

Simple, right? But try it and watch as you’re forced to learn some of the structure of the code.

It’s a lot like the way you may have already been doing it, just with more learning.


Thanks to Andrew Sheng and Billy Moses for comments on previous drafts of this post.

Monday, January 15, 2018

The Design of Software is A Thing Apart

Big up-front planning phases are out. Rapid iteration is in. With all this movement towards agile, it’s increasingly tempting to throw out the idea of having a separate design doc for software in favor of just getting started and having self-documenting code.

And that is a fatal mistake.

Last week, I explained that we speak about program behavior at one of three levels: executions, code, and specification. Most knowledge about software design is contained in Level 3, the specification. The flipside: the information of a program’s design is largely not present in its code1. I don’t just mean that you have to really read the code carefully to understand the design; I mean that there are many designs that correspond to the exact same code, and so recovering all the design information is actually impossible.

In that post, I talked a lot about the example of describing malloc and free, and how it’s quite possible to have a semantic-level notion of “freeing memory” that corresponds to absolutely nothing in the implementation. In this post, I give three more examples. The first discusses the problem of testing too much, and how the decision of what to test is in general unrecoverable from the code. The second is a question of coupling which cannot be answered from the code alone. The third shows how an identical comparison function may have different meanings in different contexts, with different expectations about how the code may evolve. None of these examples will cause your code to break, at least not today. What they will cause are the characteristic problems of bad design: fragile tests, and the risk of extra work and bugs tomorrow. The first two examples are based on real code I’ve encountered (or written!), which caused exactly those problems.

Let’s begin!

1) What should you test?

“Write the tests first,” say advocates of test-driven development, “and the code will follow.” Doing so forces you to think about what the could should do before writing it. What it doesn’t do is get you to separate the code’s goals from its implementation details.

So you update the code, a test fails, and you think “Oh. One of the details changed.” Should you update the test to compensate? If you tried to use TDD as a substitute for thinking about the design, then you probably will. Congratulations, you now have a fragile test. Writing tests in this fashion is mistake #7 in my booklet “7 Mistakes that Cause Fragile Code”.

Here’s some code and its test. Is this a good test?

Code (pseudo-Java):
public void postProfileUpdateToServer(User u, Profile p) {
  HttpParams params = ;
  addProfileToParams(params, p);
  HttpResult res =, params);
  logger.logUpdateEvent(u, p); // calls appState.saveCurrentState()
  if (res.getResponseCode() != Http.HTTP_SUCCESS)
    scheduleRetry(u, this.updateBlock);
Test checkPostProfileUpdateToServerAssumeSuccess (pseudocode):
  Call postProfileUpdateToServer with user Bob,
       and his profile with address changed to 123 Mulberry St
  Check that was called once
  Check that its first argument was
               Check that the first parameter is the users E-mail address,
                  the second their mailing address,
                  and the third their preferred language
  Check that appState.saveCurrentState was called twice

( How do you check that was called? This is quite easy to do using a mocking framework such as Java’s Mockito. I quite recommend it, as a way to unit-test functionality purely in terms of how it’s defined using other operations. )

There are generally two ways to write bad tests. One is to miss cases and not be rigorous enough. This is what happens when developers chase code coverage without actually thinking about what they’re testing. A common blind spot is that a programmer will test what happens when a program goes right, but not test what happens when a server’s down, when it’s fed bad input, etc.

The other is to test too much and be too specific. Is it part of the design that postProfileUpdateToServer must call appState.saveCurrentState twice, and it chooses to do so once by calling logger.logUpdateEvent? I don’t know. The test assumes a certain order about the HTTP parameters, but does the Profile object guarantee it? If the Profile uses a hash table internally, then the order is arbitrary, and may change when upgrading the libraries. Then the test will fail when upgrading to a new version of Java. (This is why Go randomizes hash table iteration order; one of my colleagues built a tool that does the same for Java.) At the extreme, I’ve seen a very experienced programmer write tests like assert(MSG_CODE_UPLOAD_COMPLETE == 12), which, given that these constants are meant to encapsulate knowledge, kinda defeats the point of having them in the first place.

The code should not follow the tests, nor the tests the code. They should both follow the design.

2) Do these pieces of code know about each other?

One nice thing about understanding the design/specification level of code is that it gives us very clean definitions of formerly-fuzzy concepts like “this code knows about that code.” For now, stick with your intuition as you answer this question:

Here are three modules. Module 1 defines a “student” record. Module 2 defines an “employee” record, and creates them by turning students into student employees. Finally, module 3 uses employee records, namely by printing them.

Oh yeah, and this example is in pseudo-JavaScript, so there’s no notion of “type” other than “the set of stuff this function can return.” Here’s the code:

Module 1 :

function makeStudent(firstName, lastName) {
  return {firstName: firstName, lastName: lastName};

Module 2:

function makeStudentEmployee(student) {
  var e =  student.copy();
  e["employeeId"] = nextEmployeeId();
  return e;

Module 3:

function printEmployeeFirstName(employee) {

So, Module 3 accesses the firstName field of the employee value, which is defined in Module 1. Now, the question is: does Module 3 know about Module 1. That is, does the description of what Module 3 does need to be written in terms of the description of what Module 1 does?

The answer is: it depends. If the documentation of Module 2 is “Returns a student employee record with firstName, lastName, and employeeId fields,” then the answer is a clear No; it’s just an implementation detail that it obtains a student-employee record by appending to a student record. If the documentation is “Returns a student record with an additional employeeId” field, then it really is the case that you can’t know what’s in the argument to printEmployeeFirstName without reading Module 1, and so modules 1 and 3 are very much coupled.

3) What does this code accomplish?

Quick! What does this line of code accomplish?

  return x >= 65;

Two possible answers:

  1. x is an ASCII character which is already known to be alphanumeric; this code checks if it’s a letter. 
  2. x is an age of a person; the code checks if the person is past the retirement age in the US. 

In both possible answers, the code is the same. I can probably find examples where the surrounding context is the same too. If this is just a standalone function lying in a Utils file somewhere, it makes absolutely no difference to the program which I use it for; I could even use it for both. But it makes a massive difference to the evolution of the program.

The former may need to be changed for internationalization, while the latter may need to be changed following changes in US laws. The former should live alongside other functionality related to character sets; the latter alongside code relating to people or company policies. And in the former case, since there’s a bunch of non-alphanumeric ASCII characters between “9” and “A”, some crazy-optimizing programmer could change it to “return x >= 64” and then optimize it to use bit-ops like “return (x >> 6)” with no effect on the program. In the latter, that might lead to a lot of erroneously-sent retirement benefits.

Meanwhile, if you were trying to prove the correctness of a program that used this function, the proof environment would not be convinced that Bob is eligible for retirement benefits just because that function returned “true” until you tell it exactly what this function is meant to do.

If you just think about the code, then the letter ‘A’ is the same as the retirement age. So focus on the design.

Credit to Ira Baxter for this example.

Writing Better Code

I gave you three examples of situations where it’s impossible to recover design information from the code alone. For the tests, there’s no solution other than to have the design written somewhere separately, so you can know whether the order of HTTP parameters is stable, or whether to rely on the number of times saveState is called. For the code examples, it’s actually quite possible to change the code so that it only corresponds to one design. For that unclear case of coupling, if you want to make it clear that modules 1 and 3 are separate, have module 2 create records with explicit fields. For that retirement age check, just write it as “return x >= RETIREMENT_AGE”. Even if RETIREMENT_AGE and ASCII_A (usually spelled just 'A') have the exact same runtime values (for now), they represent different concepts. They are not interchangeable in the design, nor in a formal proof.

This gives rise to my first rule of writing good code:

Jimmy Koppel’s rule of good code #1: Make the design apparent in the code (Embedded Design Principle).

Sounds simple, but it’s quite vague and subtle until you’ve really internalized how code is derived from a design. I’ve given multiple workshops exploring different aspects of this idea, especially around coupling.

Some things are better left unsaid

In statistics, we used to think that, if you had enough data, you could learn anything. Then the causal inference guys came along and showed us that, actually, even with infinite data, there can be many processes that generate the same observations. If you just collect data about what days there was rain in town and what days your lawn was muddy, you can’t tell the difference between a universe in which rain causes mud and one where mud causes rain. You need to know something more.

Software design is the same way: there can be many designs that correspond to the exact same code. Those who speak of “self-documenting code” are missing something big: the purpose of documentation is not just to describe how the system works today, but also how it will work in the future and across many versions. And so it’s equally important what’s not documented.


Thanks to Alan Pierce, Alex Reece, and Tej Chajed for comments on earlier drafts of this post.

1 Note that I am considering text in a source file, namely comments and variable names, as part of the documentation, not part of the code. A long comment explaining the structure and guarantees of your program is absolutely a form of specification.

Sunday, January 7, 2018

The Three Levels of Software: Why code that never goes wrong can still be wrong

Have you ever stopped to consider what it means for a program to be wrong? I mean, really stopped to consider? Like, “it’s wrong if it crashes — but what if the crash conditions are only hypothetically achievable — but wait…”. Let’s get to the bottom of this. Here’s a first try:

Definition #1
A program is wrong if it runs and produces a wrong result.
(Where “wrong result” is broadly defined to include performance bugs, usability bugs, etc.)

So I put my bread in the toaster, and, instead of giving me a piece of toast, it gives me a rhinoceros. Not what I wanted. Definitely a bug (err, mammal).

Programs can still usually produce the correct result but be wrong. So I get a new toaster, and I’ve never had a problem with it. Then I find out that, if I put in a thin slice of Bavarian rye when there’s a full moon out, and I’ve used the toaster exactly twice in the last 24 hours, then it will fry its circuits. Sounds like a bug. So we amend it into a 2nd definition.

Definition #2
A program is wrong if there exists some environment, sequence of events, or other “input” under which it produces a wrong result.
Now let’s eliminate all that. I’ve patched the software on my toaster, and I’ve both proven the code correct, and had a trillion monkeys test every conceivable situation. There’s now nothing you can do short of taking a sledgehammer that will make the toaster malfunction.

It’s still possible for the code to be wrong.

How? Simple. Let’s say somewhere in that toaster, it sums all the elements of an array — but it accidentally goes over by one. Memory error? Well, it’s written in C, so it just adds on whatever is in the next address in memory. That address is a byte which is always 0 for whatever reason, so it gets the right answer every time.

The code is still wrong. It may never fail, but the reasoning for why the code should work has a hole: what should have been a simple argument of reading an array now depends on complex assumptions about the compiler and memory layout, as well as a whole-program check that that byte is always 0. And that hole can cause an actual failure in future versions of the program: someone rearranges the fields in a structure, and code that should never have been affected starts failing.

We reach our third definition:

Definition #3
A program is wrong if the reasoning for why it should be correct is flawed.

Progress! We’ve gone from a clear and simple definition to one that’s handwavy and impossible to use. Actually, it’s quite rigorous, but I’ll have to teach you some formal verification to make it more concrete.

All three definitions are correct to use at different times. This forms one of my core teachings: that, when we talk about programs, we speak at one of three levels. These three levels are:

  • Level 1: Runtime. The runtime level deals with specific values and a specific environment from a single execution of the program. A lot of debugging is done at Level 1.
  • Level 2: Code. At the level of code, we think about what the current implementation could do when given arbitrary inputs and an arbitrary environment. Behaviors that cannot happen are not considered, even if it requires global reasoning to rule them out. A lot of implementation work is done at Level 2.
  • Level 3: Design/Logic: At the level of logic, we consider the abstract specification of each unit of a program. When using other units, we only consider the guarantees made by the spec, and assume they may be replaced at any time with a different implementation. Many programs which are correct when viewed at Level 2 are not correct when viewed at Level 3, because they rely on behavior which is not guaranteed to hold in all future versions. Most software design is done at Level 3.

I’ve met programmers who confuse Level 1 and Level 2, but not many. Levels 1 and 2 are concrete enough: you see Level 1 by running under a debugger and inspecting the stack, while you see Level 2 just by reading the code and thinking about what could happen. But most programmers have a much harder time getting that there even is a Level 3, and I’ve seen much confusion come from when one programmer is talking about what a component is guaranteed to do, and the other is talking about what it happens to do. Most programmers will be taught the difference between interface and implementation at a high-level. But very few get to see the full details of what defines an “interface” beyond just “list of functions.” That can only be seen when doing formal verification, where a program’s properties and assumptions are written just as concretely as its source code. In everyday development, all that structure and reasoning is still there; it’s just scattered across documentation, comments, and programmers’ heads.

Statements at different levels do not mix. So if the proposed client/server protocol says the client should send a request twice and discards the first result (a Level 3/design-level statement), and the designer tells you it’s because there are three different kinds of request handlers in the codebase, and Bob’s sometimes gets it wrong the first time (a Level 2/implementation-level statement), you should get confused. You should be as confused as if someone wanted to call a file or write to a function.

Isn’t our goal to deliver working software to customers, and so Level 2 correctness is all that’s important? No, our goal is to be able to continue to deliver working software far into the future. Level 3 is all about how modular are the interactions between the components of your software, and that’s important if you care at all about having different versions of those components, like, say, if you wanted to rewrite one of them tomorrow. Joel Spoelsky relates how the original Sim City had a use-after-free error. At Level 2, this was totally fine, since freed memory in DOS was valid until the next malloc, and so the program worked. At Level 3, this was a defect, because the spec for free says you need to act as if memory gets eaten by a dragon as soon as you free it, and any future free implementation may actually eat it. Sure enough, once Windows 3.1 rolled around with a new memory manager, SimCity would start crashing. Microsoft had to add a special case to check if SimCity was running and switch to a legacy memory manager if so.

People sometimes tell me about how software is so easy and you can just have an idea and make it and it’s so cheap because there’s nothing physical to build. Hogwash. Software is a domain where we make decisions that can’t be undone and have to be supported for all eternity. The HTTP “referer” is forever misspelled, and that SimCity special-casing code is still there 30 years later.

This is why it’s important to get programs right at Level 3, even if it passes all your testing and meets external requirements.

And, lesson for API designers: This is why it’s important to make your APIs conform as strictly to the spec as possible, at least in Debug mode. This all could have been avoided if DOS’s free were to deliberately zero-out the memory.

I now have two years experience teaching engineers a better understanding of how to avoid complexity, improve encapsulation, and make code future-proof. But ultimately, that knowledge all stems from this single master insight: the most important parts of our craft deal not with code, but with the logic underneath. The logical layer may be hidden, but it’s not mystic: we have over 40 years of theory behind it; see the postscript below for a taste of the details. When you learn to see the reasoning behind your system as plainly as the code, then you have achieved software enlightenment.

Next time: why this implies keeping design and implementation separate.

Postscript: Technical Details

I could speak a book on harmony and composition, but you’d learn more from listening to a song and seeing the sheet music. Similarly, I could ramble on about reasoning and assumptions, but to truly understand it, you need to see the objects under study. The three levels deal with different views of a program: executions, code, and specifications. Each corresponds to its own kind of reasoning.1 Let’s bring in the math!

Level 1: Traces and states

The objects of study of Level 1 are traces and states. A trace is a sequence of events that occurred in a program execution. A trace looks something like this:

Enter method postProfileUpdateToServer
Read field profile to local variable p
Enter method saveCurrentState
Leave method saveCurrentState

Traces can be very high level, like which microservices get run, or very low level, involving instruction scheduling on the CPU. They tell you exactly what happened, and give the information to construct the current state.

The state is a collection of cells with their current value. It looks like this:

{ p=Profile(name=Bob, id=42, ), 
  this=NetworkClient(socket=..., baseUrl=”),
  __messages_output=[Initializing MyApp version 5, Welcome, Bob!, ],

Most often, the state just consists of values in memory, though it can be helpful to also include pieces of the environment, or a list of what’s already been output or sent over the wire.

Any statement that can be phrased in terms of specific traces and states is a Level 1 statement. These constructs should be familiar to programmers: printouts let you view fragments of a trace, while a debugger lets you observe the current state.

The corresponding manner of automated reasoning is ground reasoning. Ground reasoning means reasoning only about concrete values with no quantifiers. So, suppose I have this code:

1:       left = x - 10;
2:       if (left < 0)
3:         left = 0;
4:       right = min(x+10, 100);
5:       print(right);

Say I know from the printout that right=50 in a given trace, and I want to know whether it’s possible that the branch on line 3 executed. I could answer this question by asking a solver to find a satisfying assignment for the following formula:

left<0  (left=x-10 
          right=(if x+10<=100 then x+10 else 100)

Level 2: Code

The object of study of Level 2 is the code. Yes, code, the stuff you work with every day. Any statement that can be phrased in terms of the code, but not a specific trace or state, is a Level 2 statement. Let’s look at the following example, which computes a damage bonus in a hypothetical game:

public double computeDamageBonus(int creature1AttackSkill,
                                 int creature1ArmorPiercing,
                                 int creature2DefenseSkill) {

  int adjDefense = creature2DefenseSkill - creature1ArmorPiercing;

  if (adjDefense <= 0) {
    return 10;

  double factor = (double)creature1AttackSkill / 
                     (creature2DefenseSkill - creature1ArmorPiercing);

  if (factor > 10)
    return 10;
    return factor;

By picking specific inputs to this function, it will yield a trace and a sequence of states. But this program encodes an exponentially large number of possible traces and states. At Level 1, we can ask “Did this execution experience a division-by-zero error?” At Level 2, we can ask “Is there any execution of this function that experiences a division-by-zero error?”

The corresponding manner of reasoning is first-order logic. This means we can write down formulas that say “for all inputs X, does this property hold?” or “does there exist an input Y, such that this function crashes?” What we can’t do is quantify over other functions. Here’s a first-order formula that states that the computeDamageBonus function can never have a division-by-zero error:

creature1AttackSkill, creature1ArmorPiercing, creature2DefenseSkill.
  (adjDefense = creature1ArmorPiercing - creature2DefenseSkill
  ¬(adjDefense  0))
     creature1ArmorPiercing - creature2DefenseSkill  0

Level 3: Specifications

The object of study of Level 3 is the specification. There are many, many ways of writing specifications, but a popular one is the Hoare triple: preconditions and postconditions. Here’s one for malloc:

Pre: n > 0
Post: retval  NULL  alloc(retval, n)

At Level 2, we could ask questions about a specific implementation of malloc, like “How much memory overhead does it use?” At Level 3, we can ask questions about all possible implementations of malloc, like “does this program have a memory error?”

The corresponding manner of reasoning is higher-order logic. Higher-order logic is like first-order logic, but we can now quantify over functions. First, let’s translate the spec above2:

MallocSpec(m) = n. n > 0  m(n)  NULL  alloc(m(n), n)

Now, if we hypothetically had a formal specification for the PlaySimCity function, a spec that it works for any implementation of malloc would look something like

m. MallocSpec(m)  SimCitySpec(m, PlaySimCity)

And that, my friends, is modularity reduced to a formula.

A really big thing about specifications is that they can involve properties which do not appear in the code at all. We defined malloc in terms of this alloc predicate, which has no intrinsic meaning other than “stuff returned by malloc.” But what this does is let us relate malloc to other operations. We can give the act of dereferencing pointer x  a precondition ∃a,n. a≤x<a+n ∧ alloc(a,n), and give the free(x) function a spec that destroys the predicate alloc(x,n). Now, having a proof of alloc(a, n) when reasoning about the program means “this memory was returned by malloc, with no intervening free” — exactly our intuitive notion of memory being allocated!

What we’ve shown here is that “this memory is allocated” is a specification-level notion which is independent of the code. And indeed, there may be nothing within the program to indicate that a piece of memory has been allocated.  It’s possible that being allocated corresponds to some internal data structure of the memory manager, but it’s also possible that your code will be compiled for a machine with infinite memory.

Now I can state exactly in what sense the SimCity code was wrong: A precondition of dereferencing a pointer is that the pointer is to allocated memory. They tried to dereference a pointer without meeting that precondition, i.e.: no proof of alloc(a,n). So, their code ran fine for one implementation of malloc, but, as they sadly learned, not for all of them.

When designing software, I always recommend trying to think in pure concepts, and then translate that into the programming language, in the same way that database designers write ER diagrams before translating them into tables. So whether you’re discussing coupling or security, always think of the components of your software in terms of the interface, its assumptions and guarantees, and mentally translate them into formulas like the ones above. So much becomes clearer when you do, for logic is the language of software design.


Thanks to Elliott Jin and Jonathan Paulson for comments on drafts of this post.

1 Don’t get too hung up on the analogies between the levels of software and the modes of reasoning; there are plenty of exceptions. For instance, much of the progress in program analysis/verification/synthesis research comes from finding all sorts of tricks to encode more complicated problems into a form that can be solved by ground reasoning, since we have good solvers. This likely isn’t going to be too relevant to you unless you work in programming tools.
2 I am lying slightly in this example — the kind of logical formula I gave is really just meant for pure functions. To handle malloc properly, we’d want to use separation logic.

Saturday, June 4, 2016

The Partial Control Fallacy

Around the time I started grad school, I applied for a few prestigious fellowships. Winning is determined by several factors. Some are just an application, while some have a follow-up interview, but the applications all get scored on a rubric that looks roughly like this:

  • 50%: Past research
  • 30%: Letters of recommendation
  • 10%: Transcript
  • 10%: Personal Essays

Naturally, I proceeded to pour massive amounts of time into the essays, letting it consume much of my free time for the month of October.

Getting that Fellowship will really help me have a successful graduate career. Writing better essays will help me get the Fellowship. Therefore, to the extent that I care about having a successful graduate career, I should be willing to work hard on those essays.

But if the real goal is a successful graduate career, then at some point shouldn’t I put those essays down and do something else, like reading a few papers or practicing public speaking?

This, I dub the Partial Control Fallacy. It’s where, if there’s some outcome you want, and you only control a couple factors that affect that outcome, you decide how much to try to improve those factors as if you were actually improving the entire outcome. It’s closely connected to the 80/20 principle: it’s when you only have control over that last 20%, but you pretend it’s the whole thing and work on it accordingly. It’s when the 80/20 principle would suggest doing nothing at all.

Here are some more examples:

  • Trying to get any competitive award that’s judged mostly by your past. The best college application is stellar grades and some good awards, the best resume is a great network and lots of success stories, and the best pitch to VCs is a rock-solid business.
  • Thinking really hard about what to say to that cute guy or girl across the room. Most of what happens is determined before you open your mouth by what they’re looking for and whether they’re attracted to you.
  • Worrying about small optimizations when writing code, like avoiding copying small objects. Most of good performance comes from the high-level design of the system.

I think I’ve been guilty of all three of these at one point or another. I don’t want to think about how much time I spent on my Thiel Fellowship application and preparing for my YCombinator interview. Meanwhile, most people who get into either don’t spend much time at all.

In parallel computing, there’s a concept called Amdahl’s law. If your program takes t steps to run, and you can make s steps faster by a factor of f (say, by splitting them across multiple processors), then the new speed is t-s+s/f, for a speedup of t/(t-s+s/f). Therefore, if you optimize those s steps extremely hard and split them across infinite cores, the best speedup you’ll get is t/(t-s).

Applying that to the above, and you can see that, if I worked infinitely hard on my essays, I could only make my apps 11% better versus not doing anything at all. (At least to the extent that it really does follow that rubric, even if I submit a blank essay.)

Sometimes, that outcome is all you care about it, in which case you’re perfectly justified in trying to eke out every advantage you can get. If you’re in a massively competitive field, like sports or finance, where there’s a really big difference between being #1 and being #2 at some narrow thing, then, by all means, go get that last 1%. Wake up early, get that 7th computer monitor, rinse your cottage cheese. But if you’re putting this kind of effort into something because it’s your terminal goal — well, you’re not doing this for anything else, are you?

I think the solution to this fallacy is always to think past the immediate goal. Instead of asking “How can I get this Fellowship,” ask “How can I improve my research career.” When you see the road ahead of you as just a path to your larger mission, something that once seemed like your only hope now becomes one option among many.

Thanks to Nancy Hua, Melody Guan, and Ryan Alweiss for comments on earlier versions of this post.

Tuesday, October 13, 2015

The Prototype Stereotype

It’s a sunny day in Santa Cruz, and Alice is showing Bob her new app:

Alice: This is Frosttly, my new on-demand delivery service for cupcakes. 

Bob: Cool! How does it work?

Alice: It’s based on the Uber and Google Maps APIs. Whenever, you press this button, it texts one of our Cupcake Delivery Specialists your current location and summons an Uber so they can bring you cupcakes. 

Bob: Wait; there’s got to be more to it than that. 

Alice: Naturally, this is just a prototype. The final version will integrate directly into the order system for a specialized cupcake kitchen, and will feature sophisticated order tracking and highly optimized routing of deliveries. It’s fully functioning though. Go ahead; try it. 

Bob presses the button, and hears Alice’s phone vibrate in the other room 

Bob: Wow, it works! I can’t wait to get my cupcakes.

If you’ve ever been to a hackathon, you might have seen plenty of conversations like this. Elsewhere, Billy is showing Alyssa his prototype of a new dessert: he poured confectioners sugar over an antique dinnerware set to create something light and fluffy with a rustic feel (the final version will also have chocolate and goji berries). Billy’s concoction is much closer to a final product than Alice’s cupcake service without cupcakes, and may have even taken more work than her handful of lines of code. Yet still I’d be surprised if I saw anyone calling a pile of sugar a prototype dessert. What makes one a prototype but not the other? The resolution is simple: there’s no such thing as a prototype app.

A few years ago, my working definition of a prototype was simple; you build enough of your plan to get a sense of what the final version looks like and show you could build the whole thing. The prototype of my app for playing arbitrary card games was a screen where you could drag-and-drop rectangles. The prototype of my game mod used hand-crafted assembly to make changes. This breaks down when you stop to think: what aspect exactly are you trying to figure out?

Some people walk into Jesse Schell’s Game Design class expecting an easy time, and are shocked to find themselves pulling multiple all-nighters for a class where getting a 100% on everything is only enough for a B. But those that persevere find themselves with new worldviews on everything from sleep to applied probability theory, and learn why there’s no such thing as a prototype app. As Schell states in his book “The Art of Game Design,” a prototype is defined not by the product it steps towards, but by the question it’s intended to answer. And depending on the question, the form of the prototype can be very surprising.

So, what’s a prototype for Tetris? You mean: something to figure out how the blocks mechanic works in practice? Get a friend to cut some shapes out of paper and start sliding them down a grid. It might not make for the best Tetris experience, but it takes about 15 minutes to get going, and it’s enough to start getting a feel for how the shapes work together.

The team for the game “Prince of Persia: Sands of Time” wanted to figure out how the acrobatics in the game would work. Their prototype was just a few animations of different moves, plus a bit of imagination.

Designing the gameplay for a new first-person shooter? Try flashlight tag. Designing the atmosphere? The prototype is also called “concept art.”

And for hardware? Nintendo showed us the difference between design and technological prototypes in 2005 when they unveiled the slick look of the Wii while having modified Gamecubes running behind their demo booths

It gets even more interesting once you leave the realm of games. Jeff Hawkins of Palm prototyped the user experience of the Palm Pilot by carrying around a block of wood in his shirt pocket. He would frequently take it out to “check his schedule” or “look up a contact.” Whenever someone suggested a new feature in a meeting, he would take it out and ask them where it would fit. Meanwhile, I told a friend at an assistive robotics startup, that, while their current project does serve the purpose of a technological prototype, he could build a prototype to test the value proposition much faster by simply going to his grandparents’ house and pretending to be a robot.

Prototypes and MVPs

Like everything else in Lean Startup, the idea of a “minimum viable product” has been passed around the Valley in a game of telephone until its meaning is perhaps less than that of the three words stuck together.

A minimum viable product is a very special kind of prototype, one that tests the two key factors behind a startup’s success, what Eric Ries calls the value hypothesis and the growth hypothesis. Typically, this constrains the MVP to more resemble the actual product, but not necessarily. In 2008, Dropbox hit a key milestone on the path to their MVP when they released a video demonstrating their product. Tens of thousands joined the waiting list. So actually, the minimum viable product was the video itself: they had already proven that lots of people (growth hypothesis) want what they’re building (value hypothesis).

Similarly, the MVP doesn’t need to work internally at all like the final product. Lean Startup contains a couple examples of this: the ingredient-delivery and meal-planning service “Food on the Table” started with the CEO making deliveries to one woman, and didn’t even try to add automation until forced to. Alice’s app may not be a technical prototype of anything other than the ability to send texts, but if she can do enough behind-the-curtain work to get real users and see how they respond, it’s enough for a perfectly fine MVP.

Unpacking the Confusion

Prototypes for design questions, engineering challenges, usability, market. Demos for users, investors, the press. Why do people seem to want to combine them all into one mythical “prototype?” To show they can build it? Surely they realize that, for a typical web app, the answer to “Can I make this?” is usually “Yes” if not “Yes, but why?” And why is it weird to prototype an RPG game’s combat system with pen and paper, when many of them are just digitized versions of physical predecessors?

I think it’s simply a case of a more general phenomenon. The best way to learn to play a song may first involve drills with nary a bar from the final piece. The charity that makes you feel great and does a lot of good may actually be two charities. When you learn to identify what things you want, it’s often best to get them separately. So let’s lay rest to the idea of a prototype app. Forget about about prototyping your product as a whole. Find the underlying questions, and answer them.

Thanks to Jonathan Paulson, Amy Quispe, Nancy Hua, Michael Poon, and Melody Guan for comments on earlier drafts of this post.