Friday, February 23, 2018

The Practice is not the Performance: Why project-based learning fails

Last night, I encountered an old post by Zach Holman where he pushes the idea that traditional school-based CS is useless; project-based learning is the way to go. I’ve heard this idea repeatedly over the last 10 years, and know at least one person who’s started an education company with that premise.

I don’t want to debate the current way universities do things (I found my undergrad quite useful, thank you), but I do want to dispel the idea that everything would be better if only we switched to project-based learning. The opposite is closer to true. Project-based learning is not a superior new idea. Project-based learning is the most inefficient form of learning that still works.

To see why, we actually have to sit down and think about the learning process. I’ve learned a number of models of learning over the course of my teaching training, but the one I’ve found most useful is taught by the Center for Applied Rationality. It goes like this:

  1. Break down a skill into small components.
  2. Drill each component rapidly, with immediate feedback
  3. Integrate the pieces together into larger components; drill these similarly

In that light, project-based learning definitely has something going for it: if you do a project, you will practice all the skills needed to do projects.

There’s something missing, though: it completely gives up on trying to think about what components it’s trying to teach.

Here’s how project-based learning might play out: After finishing his first two Android apps, Bob decides he wants to learn network programming, and generally get better at working on a software team. He and Alice pair up and decide to build a chat app for sending cat gifs, with a distributed replicated backend. They decide to make a spreadsheet of tasks and claim them. Alice also wants to learn network programming, and she swoops in and takes most of the server components; Bob gets left with most of the client work. Over the next three weeks, Bob spends a lot of time building the GUI for the app, which he already knows how to do, and only a couple hours implementing the client protocol. They start writing tests because the teacher said they had to. Bob has trouble writing the tests, and realizes a couple ways he could have made his code differently to make testing easier. He also discovers that two of his tests were too fragile, and needed to be changed when he updated the code.

It’s now one month later. What has he gotten from the experience? He's learned he needed to be more proactive about taking on tasks that challenge him and grow his skills. He’s learned a smidgeon about network programming, and a couple ideas about how to write better tests. These are good lessons, sure, but expensive. Isn’t there a way for Bob to learn more in that month?

False Temptations

What are the common arguments in favor of project-based learning? Here are two of the main ones.

Real skills

The first big reason for project-based learning is that it teaches real skills used in industry. Why do many schools teach much of their curriculum in Haskell, not in the Tiobe Top 10, or even SML or OCaml, not even in the top 50? Wouldn’t they serve their graduates better teaching Node and React?

The first counterargument is that industrial technologies come and go. Proponents acknowledge this, sure, but still call CS departments “out of date” for not following trends. What really drove home the futility of this argument for me was this essay by software-engineering pioneer Mary Shaw. Had she followed that advice in the 60s, she points out, her students would have spent their time studying JCL, the language used to schedule jobs on IBM mainframes.

The second and bigger counterargument: learning concepts is much more important than learning applications, and the best environment to learn a concept is rarely the one in industrial demand.

People often ask me what’s the best language to learn to study software design. I ask them what’s the best instrument to learn to study music theory. Everyone answers piano. In piano, you can see the chords in a way that you can’t in, say, trombone. We see something similar in other domains. In The Art of Learning, Josh Waitzkin recounts how, unlike others, he started studying chess at the fundamentals, in situations with few pieces on the board. He ultimately beat competitors who studied many times harder. In The Art of Game Design: A Book of Lenses, Jesse Schell advocates looking past modern video games and instead studying the concepts in board games, dice games, playground games.

So, for programming, we need to (1) figure out the core concepts to teach, and (2) pick languages that make the concepts readily available. C and Java --- indeed, all top languages until Swift arrived --- lack elegant ways of expressing the idea of a sum, a value that can be one of many alternatives. Thus, when explaining why adding a new alternative sometimes breaks existing code and sometimes doesn’t, I find myself having to explain using the clumsy manifestations as unions and subclasses. The first time I read Bob Harper’s explanation of why they chose SML for the CMU undergraduate curriculum, I thought he was just rationalizing snobbery. Now, I quite agree.

More like real jobs

The second big argument for project-based learning is that it more closely resembles what students will actually do on the job. This, in turn, is based on the idea that the best way to practice an activity is to do it.

This is false. “The only way to improve at X is to do it” is the advice you give when you actually have no idea how to improve. When you do know, you isolate subskills and drill them. Martial artists punch bags and do kata, and fight with extra constraints like no dodging. Musicians play scales, and practice a single measure over and over. Mathematicians rederive theorems from the book. And to condition a shot-putter, the best way is not to put weights in their hands and have them mimic the throwing motion, but rather to train the body’s ability to produce power, using squats and lots of heavy exercises that don’t even resemble shot-putting.1

Drilling programming

So what am I advocating? I’m advocating that you actually think about what you’re trying to teach, and design drills for it. The first drills should focus on one thing at a time.

So, for Bob trying to learn network programming and the general software engineering skills of being on a team project, here’s my 5-minute attempt to come up with an alternative way to teach these skills:

  1. Writing just the networking component of a larger system.
  2. Being asked to write the test cases for a small program given to you. The program is deliberately designed to tempt you into all the classic mistakes of test-writing.
  3. A simulation where you “coordinate” with the TAs to build something, committing pseudocode to version control. They troll you by deliberately misunderstanding things you say and seeing if their misunderstanding can go undetected.
  4. You and several teammates are assigned a small project. You’re asked to divide it amongst yourselves into unrealistically small units of work, and write test cases for each other’s components. (Integrates the skills of 2 and 3.)

These total much less time than the chat app, and teach much more. If students find they’re not optimal for learning, I as the instructor have much more room for experimenting. And if the students do choose to do a full project afterwards, they’ll be much more prepared.

I don’t teach network programming and haven’t tested these specific ideas. But, in my specialty of software design and code quality, I use exercises built on similar principles all the time. So I may teach a client some of the pitfalls of a naive understanding of information hiding, and then show them code from a real project that has that problem and ask them how they’d solve it. Or I’ll ask them to give just the type definitions and method signatures for a Minesweeper game; if they violate any design principles, then I can give them a feature request they can’t handle, or show why the implementation will be prone to bugs.

Is it better than just assigning projects? That’s the wrong question to ask because project-based learning is incredibly easy to beat. My clients are mostly working professional software engineers; they’re already doing “project-based learning” every day. On my website, I claim.

Programmers learn by making bad design decisions, working on the codebase for a year, and then find themselves wishing they could go back and do things differently. I can give you that experience in an hour. 

Does this sound like a bold statement about my teaching prowess? It’s not. In fact, piano teachers put that claim to shame. You can spend hundreds of hours practicing a piece using too many muscles on every key press. If your body awareness isn’t great, you might not find out until your hand cramps up right before your performance. A couple seconds to catch that mistake, a couple minutes to tell you a story of that happening to others, and the piano teacher’s just saved you months.

(As an aside, this is why I believe in finding a competent private coach for every skill you really care about improving in.)

Replacing a traditional CS education with a “software engineering BFA,” like Spoelsky and Atwood suggest, is no longer a hypothetical exercise. We’ve tried it. And now dev bootcamps are going bankrupt. Instead of substituting for a traditional degree, recruiters are calling bootcamps jokes. Olin College of Engineering is famous for its project-based curriculum, but one student reports that she learned much more from traditional classes.

It’s time to stop looking for panaceas and shortcuts and realize that deliberate learning and deliberate practice --- as a separate activity from the everyday doing --- is the only way to mastery. As famed gymnastics coach Chris Sommer puts it, the fastest way to learn is to do things the slow way. Studying the fundamentals may seem like a distraction keeping you from getting your hands dirty making a Rails app using the Google Maps and Twilio APIs, but when you do get there, you’ll find there is less to learn if you’ve already compressed the knowledge into concepts.

Shameless Plug

My Advanced Software Design Web Course starts next week. It’s based on a lot of the learning principles I mentioned in this post, starting each concept with isolated drills and progressing to case studies from real software, and comes with personalized feedback from me on every assignment.


No, the sum total of knowledge about CS education is not to be found within this post. Yes, I do have some formal training in education; yes, other people have a lot more. Yes, there are a lot of things I didn’t bring up. Yes, the situation with bootcamps is more complicated than a simple referendum on project-based learning. The simple “turbocharging training” model of learning I gave is not a theory of everything. Yes, you need to run into problems in context, find motivation, try things out of order, and even eventually do a full project on a team without guidance. I believe realistic projects do have a place in education, but they still must be coupled with the principles of rapid feedback, and they are a poor substitute for learning the concepts piece-by-piece.


Thanks to Elliott Jin for comments on earlier drafts of this post.

1 When I was searching for a personal trainer, I asked about this to help screen candidates.

Wednesday, February 14, 2018

My favorite principle for code quality

Programming seems to become more about memorization every day, with advocates pushing for memorizing lists of design patterns and refactorings and the difference between “parameter coupling” and “invocation coupling.”

Forget all about that. There’s much more to gain by having general principles which apply to many situations. In this post, I’m going to show you one of my favorite software design principles in action, the Embedded Design principle. The Embedded Design principle, which I briefly introduced in a previous post, states that you should always code in a way that makes the design apparent. It’s like pixie dust in its ability to make code more readable, loosely coupled, and impervious to bugs. It’s also hard to grasp, because programmers rarely see a program’s design in first-class form. So, in this post, I’ll show you an example.

Conveniently, a principle can be used as a “machine” for generating examples that violate it. In software design, these examples always end up being code that looks reasonable to the untrained eye, but will spell doom and gloom — often in the form of hilarious bugs — for any team that tries to build on it. Last week, one of my coaching clients asked me how this principle applies to caching. I turned on my machine and it produced this:

The Problem

Meet Bob. Bob is an engineer at He’s only been there a few months, but he has a project that will affect everyone: he’s going to create the website’s stats page! Now everyone can see just how engaged AwesomeSauce’s users are.

Bob starts coding. There are three data items, and each one will print either a computed or cached value. Bob ended up using one chunk of code for each data item.

public void displayStats() {
  if (lastCachedTime <= lastMidnight()) {
    numUsers = countUsers();
    lastCachedTime =;
    print(Total Users:  + numUsers);
  } else {
    print(Total Users:  + numUsers);

  if (lastCachedTime <= lastMidnight()) {
    numArticles = countArticles();
    lastCachedTime =;
    print("Articles written: " + numArticles);
  } else {
    print("Articles written: " + numArticles);

  if (lastCachedTime <= lastMidnight()) {
    numWords = countWords();
    lastCachedTime =;
    print("Words written: " + numWords);
  } else {
    print("Words written: " + numWords);

So simple! As the stats page gets bigger, it will be easy to add more by repeating the pattern. Bob sends it to Charlie for review.

Now, one thing about Charlie is that he’s not me. You see, I’d probably flip out and tell Bob about all the bad things that will happen because of this code, and give him a lecture on the Embedded Design Principle and a link to this blog post (which conveniently already has this code). Charlie, however, just takes a sip of coconut water, and says “There’s a bug over there because you re-used lastCachedTime. Otherwise, looks good. Ship it!”

- if (lastCachedTime <= lastMidnight()) {
+ if (lastCachedTimeUsers <= lastMidnight()) {
     numUsers = countUsers();
-    lastCachedTime =;
+    lastCachedTimeUsers =;

< similar for articles and words >

It’s 6 months later. has been growing faster than the national debt. AwesomeSauce needs more frequent updates to the stats page to show this. Team lead Denise decides to double the refresh rate, and Charlie does it.

- if (lastCachedTimeUsers <= lastMidnight()) {
+ if (lastCachedTimeUsers <= lastMidnight() || lastCachedTimeUsers <= lastNoon()) {
    numUsers = countUsers();

Bob is displeased. They had previously decided all stats should refresh simultaneously; see how the mockup just has one “Last updated” time. “Oops,” says Charlie, and together they merge the if-statements to prevent a future mistake.

public void displayDashboard() {
  if (lastCachedTime <= lastMidnight() || lastCachedTime <= lastNoon()) {
    numUsers = countUsers();
    numArticles = countArticles();
    numWords = countWords();
    lastCachedTime =;

Another year passes. AwesomeSauce has been taking off like Bitcoin. Their stats page is now an information feast, with dozens of items. Time to start simplifying! Surveys show that no-one is actually wondering about the count of words, so they take it down:

- print(numWords);

AwesomeSauce has hit a jackpot! Writers are making 10,000 posts per day. One day, an AwesomeSauce engineer notices they have a big slowdown twice a day, at noon and midnight. After a full day of sleuthing, he discovers that it has something to do with how, at cache refresh time, the stats page is taking a full three minutes to load. With horror, he discovers the call to countWords(), still living, though its results are unused.

This was not merely an instance of “mistakes happen.” This was a heavy price paid from badly-designed code. Everything from the first bug to the site slowdown could have been prevented by structuring it better, and the way to structure it better is a simple application of the Embedded Design Principle.

Nipping it in the Bud

Standard disclaimer: When reading software design advice, always imagine the examples given are 10x longer. Overengineering is bad. But, if you’re not sure whether applying a technique to your code would be overengineering, error on the side of doing it. Abstract early.

In the first version of the code, the computation, caching, and display of each statistic were all independent knobs, 3*N in total, where N is the number of stats to display. Merging the if-statements combined the N caching behaviors into one knob, bringing the total down to 2*N+1 knobs. Nonetheless, this still made it possible to compute a stat without displaying it. This was the rope that hung AwesomeSauce with the twice-daily slowdown.

This example was extreme, but ones like it are real. A static analysis researcher I know once visited eBay. She was shocked to discover that the checker that excited the engineers the most was one of the most shallow: detecting unused variables. Each unused computation eliminated was a database call saved, and a nibble taken out of server costs.

Conceptually, the set of stats and their caching behavior should be only N+1 knobs: each stat as a whole can be turned on or off, and the caching behavior can be changed. Achieving this will be a non-local change, and won’t be obvious by just thinking about what the code does, thinking at Level 2. Instead, we have to look at the design:

This diagram gives a rough picture of concepts used in the design of the program, showing how the concept of a dashboard stat is instantiated multiple times, and how the concept of each stat is built into the code. This diagram is far from complete; it omits, e.g.: the choice of the English language. But, with it, we can see several things.

The same caching behavior is implemented multiple times, which made the bug in Bob’s first version possible.

The code is indistinguishable from an alternate design where each stat has its own caching policy. This is another example of why design recovery is in general impossible.

And finally, each dashboard stat gets expressed in multiple independent lines of code, each its own “knob.” Hence, the slowdown that plagued AwesomeSauce.

All can be fixed — and the code rendered beautiful — if we take all those many arrows emanating from those boxes on the right, and reduce them down to one per box, one line of code per low-level concept.

This is the true version of the “Don’t Repeat Yourself” (DRY) principle.


In Zen and the Art of Motorcycle Maintenance, Robert Piersig illustrates these higher-level concepts, which he calls forms, in the world of mechanics. A motorcycle is a power assembly and a running assembly; the power assembly can be divided into the engine and the power-delivery assembly; and so on. A mechanic that’s testing the spark plugs or adjusting a tappet is really working on these higher-level forms. And yet you cannot walk into a motorcycle parts shop and ask for a power assembly: these concepts only exist in the mechanic’s head.

But we’re programmers, and we have tools that translate our thoughts into CPU instructions. To a certain extent, we can actually build these forms directly into our code. Let’s create a value representing a DashboardStat.

public interface DashboardStatComputation {
  public Object compute(); // in real code, use a more specific type

public class DashboardStat {
  private DashboardStatComputation computation;
  private String textLabel;
  // constructors and getters

In the design, we already had a concept of a dashboard stat: it’s a computation and a label. Now that concept is in the code too. We can go one level further, and also put the collection of dashboard stats into the code too:

public class Dashboard {
  private List<DashboardStat> stats;
  private Time lastComputedTime;
  private Map<DashboardStat, Object> curValues;
  // constructors and getters
  /** Recomputes all statistics if cache is stale */
  public Map<DashboardStat, Object> getCurrentValues() {  }

With these higher-level concepts turned into code, Bob’s original dashboard page can now be written declaratively. This code is just too simple to fail.

// Initialization (slightly simplified from actual Java)
dash = new Dashboard(list( new DashboardStat(countUsers,  Total Users)
                         , new DashboardStat(countArticles, Articles written)
                         , new DashboardStat(countWords,  Words written));

// Use
public void displayDashboard() {


More (or less) Embedded Design

You will never turn your program into a pure expression of the design, at least not unless you’re a descendant of Zohar Manna. You are always making tradeoffs on how far to go down this spectrum. Here are some other choices for applying the Embedded Design principle to this example:

  • If this example refactoring was too much, there’s a simpler compromise. Don’t have a DashboardStat value; just have a DashboardData which hardcodes each of the computations. This is much less overhead than the way I did it above, but it still dictates that all stats should be computed in an all-or-nothing fashion. 
  • Extract the check for whether to recompute into a shouldUpdate method. Now your decisions about cache timing are reified in a separate chunk of code. To go one step further, the caching policy could be parameterized. 
  • Might you one day want to serve the website in multiple languages? Wrap each string in a call to an internationalization function. For now, this can just be the identity function, but doing so makes the choice of language explicit in your code, and marks all the text in your code which corresponds to a “translatable string” concept. (Another example of how, at the design level, two different identity functions are not identical.) 

Wrapping Up

Many programmers will get a sense that something was off about the first example. Every programmer I’ve shown this example to knew they should merge the if-statements and factor out the prints. But the ones who knew they should refactor out a “dashboard stat” value only did so by flash of insight, and couldn’t articulate how they came up with it. But by instead thinking about how the code is derived from the design, this refactoring becomes just a straightforward application of the Embedded Design Principle. Finding a good structure for the code was easy once we saw the structure of the design.

So always think beyond the code. Ask yourself about the concepts of your program, and the values that define them. Like Piersig, this is what turns you from the mechanic turning screws into the engineer and artist building something to be proud of.

But what if you didn’t have a clear picture of how this code was derived, and you want to just be able to look at the code and realize there’s a deeper concept that should be extracted? There is a way to do that, and, for that, you’ll have to wait for my next post.

Like this post? You may be interested in my Advanced Software Design Web Course.


Thanks to Alex Reece, Elliott Jin, and Mimee Xu for comments on earlier drafts of this post.

Monday, January 22, 2018

The Benjamin Franklin Method of Reading Programming Books

Let’s face it, programming books suck. Those general books on distributed systems or data science or whatever can be tomes for a lifetime, but, with few exceptions, there’s something about the books on how to write code in a language/framework/database/cupcake-maker, the ones with the animal covers and the cutesy sample apps, they just tend to be so forgettable, so trite, so….uneducational.

I think I’ve figured out why I don’t like them, and it’s not just that they teach skills rapidly approaching expiration. It’s their pedagogical approach. The teaching algorithm seems to be: write these programs where we’ve told you everything to do, and you’ll come out knowing this language/framework/database/cupcake-maker. Central to these books are the long code listings for the reader to reproduce. Here’s an example, from one of the better books in this category

class User < ApplicationRecord
  attr_accessor :remember_token
  before_save { = email.downcase }
  validates :name,  presence: true, length: { maximum: 50 }
  VALID_EMAIL_REGEX = /\A[\w+\-.]+@[a-z\d\-.]+\.[a-z]+\z/i
  validates :email, presence: true, length: { maximum: 255 },
                    format: { with: VALID_EMAIL_REGEX },
                    uniqueness: { case_sensitive: false }
  validates :password, presence: true, length: { minimum: 6 }

  # …another 30 lines follows...

Traditionally, there are two ways to study a page like this:

  1. Type out every line of code 
  2. Copy+paste the code from their website, maybe play around and make small changes 

Approach #1 is a method that, like a lecture, causes the code to go from the author’s page to the reader’s screen without passing through the heads of either. The second is like trying to learn how to make a car by taking apart a seatbelt and stereo: you’re just toying with small pieces. Neither is a sound way to learn.

If you had an expert tutor, they wouldn’t teach you by handing you a page of code. Still, these books are what we have. How can we read them in a way that follows the principles of learning? Read on.

Mental Representations

According to K. Anders Ericsson in his book Peak, expertise is a process of building mental representations. We can see this because expert minds store knowledge in a compressed fashion. Musicians can memorize a page of music far faster than a page of random notes. Expert chess players told to memorize a board position will do much better than amateurs, but, when they make a mistake, they’ll misplace whole groups of pieces.

This is possible because music and chess positions have structure that makes them look very different from a page of random notes or a random permutation of pieces. Technically speaking, they have lower perplexity than random noise. So, even though there are 26 letters in the English alphabet, Claude Shannon showed that the information content of English is about 1 bit per letter: given a random prefix of a paragraph, people can guess the next letter about half the time.

This is why a programmer skilled in a technology can look at code using it and read through it like fiction, only pausing at the surprising bits, while the novice is laboring line-by-line. This is also why a smart code-completion tool can guess a long sequence of code from the first couple lines. With a better mental representation, understanding code is simply less work.

(How do these mental representations work? My officemate Zenna Tavares argues they are distribution-sensitive data structures.)

This is exactly what’s missing from the “just type out the code” approach: there’s nothing forcing your mind to represent the program as anything better than a sequence of characters. Yet being able to force your mind to do this would mean being able to learn concepts more rapidly. Here’s a 200 year-old idea for doing so.

The Benjamin Franklin Method

I don’t know what’s more impressive: that Benjamin Franklin became a luminary in everything from politics to physics, or that he did this without modern educational techniques such as schools, teachers, or StackOverflow. As part of this, he discovered a powerful method of self-study. I’ll let him speak for himself (or go read someone else’s summary).

About this time I met with an odd volume of the Spectator. It was the third. I had never before seen any of them. I bought it, read it over and over, and was much delighted with it. I thought the writing excellent, and wished, if possible, to imitate it. With this view I took some of the papers, and, making short hints of the sentiment in each sentence, laid them by a few days, and then, without looking at the book, try'd to compleat the papers again, by expressing each hinted sentiment at length, and as fully as it had been expressed before, in any suitable words that should come to hand. Then I compared my Spectator with the original, discovered some of my faults, and corrected them.

—Benjamin Franklin, Autobiography

This process is a little bit like being a human autoencoder. An autoencoder is a neural network that tries to produce output the same as its input, but passing through an intermediate layer which is too small to fully represent the data. In doing so, it’s forced to learn a more compact representation. Here, the neural net in question is that den of dendrons in your head.

K. Anders Ericsson likens it to how artists practice by trying to imitate some famous work. Mathematicians are taught to attempt to prove most theorems themselves when reading a book or paper --- even if they can’t, they’ll have an easier time compressing the proof to its basic insight. I used this process to get a better eye for graphical design; it was like LASIK.

But the basic version idea applied to programming books is particularly simple yet effective.

Here’s how it works:

Read your programming book as normal. When you get to a code sample, read it over

Then close the book.

Then try to type it up.

Simple, right? But try it and watch as you’re forced to learn some of the structure of the code.

It’s a lot like the way you may have already been doing it, just with more learning.


Thanks to Andrew Sheng and Billy Moses for comments on previous drafts of this post.

Monday, January 15, 2018

The Design of Software is A Thing Apart

Big up-front planning phases are out. Rapid iteration is in. With all this movement towards agile, it’s increasingly tempting to throw out the idea of having a separate design doc for software in favor of just getting started and having self-documenting code.

And that is a fatal mistake.

Last week, I explained that we speak about program behavior at one of three levels: executions, code, and specification. Most knowledge about software design is contained in Level 3, the specification. The flipside: the information of a program’s design is largely not present in its code1. I don’t just mean that you have to really read the code carefully to understand the design; I mean that there are many designs that correspond to the exact same code, and so recovering all the design information is actually impossible.

In that post, I talked a lot about the example of describing malloc and free, and how it’s quite possible to have a semantic-level notion of “freeing memory” that corresponds to absolutely nothing in the implementation. In this post, I give three more examples. The first discusses the problem of testing too much, and how the decision of what to test is in general unrecoverable from the code. The second is a question of coupling which cannot be answered from the code alone. The third shows how an identical comparison function may have different meanings in different contexts, with different expectations about how the code may evolve. None of these examples will cause your code to break, at least not today. What they will cause are the characteristic problems of bad design: fragile tests, and the risk of extra work and bugs tomorrow. The first two examples are based on real code I’ve encountered (or written!), which caused exactly those problems.

Let’s begin!

1) What should you test?

“Write the tests first,” say advocates of test-driven development, “and the code will follow.” Doing so forces you to think about what the code should do before writing it. What it doesn’t do is get you to separate the code’s goals from its implementation details.

So you update the code, a test fails, and you think “Oh. One of the details changed.” Should you update the test to compensate? If you tried to use TDD as a substitute for thinking about the design, then you probably will. Congratulations, you now have a fragile test. Writing tests in this fashion is mistake #7 in my booklet “7 Mistakes that Cause Fragile Code”.

Here’s some code and its test. Is this a good test?

Code (pseudo-Java):
public void postProfileUpdateToServer(User u, Profile p) {
  HttpParams params = ;
  addProfileToParams(params, p);
  HttpResult res =, params);
  logger.logUpdateEvent(u, p); // calls appState.saveCurrentState()
  if (res.getResponseCode() != Http.HTTP_SUCCESS)
    scheduleRetry(u, this.updateBlock);
Test checkPostProfileUpdateToServerAssumeSuccess (pseudocode):
  Call postProfileUpdateToServer with user Bob,
       and his profile with address changed to 123 Mulberry St
  Check that was called once
  Check that its first argument was
               Check that the first parameter is the users E-mail address,
                  the second their mailing address,
                  and the third their preferred language
  Check that appState.saveCurrentState was called twice

( How do you check that was called? This is quite easy to do using a mocking framework such as Java’s Mockito. I quite recommend it, as a way to unit-test functionality purely in terms of how it’s defined using other operations. )

There are generally two ways to write bad tests. One is to miss cases and not be rigorous enough. This is what happens when developers chase code coverage without actually thinking about what they’re testing. A common blind spot is that a programmer will test what happens when a program goes right, but not test what happens when a server’s down, when it’s fed bad input, etc.

The other is to test too much and be too specific. Is it part of the design that postProfileUpdateToServer must call appState.saveCurrentState twice, and it chooses to do so once by calling logger.logUpdateEvent? I don’t know. The test assumes a certain order about the HTTP parameters, but does the Profile object guarantee it? If the Profile uses a hash table internally, then the order is arbitrary, and may change when upgrading the libraries. Then the test will fail when upgrading to a new version of Java. (This is why Go randomizes hash table iteration order; one of my colleagues built a tool that does the same for Java.) At the extreme, I’ve seen a very experienced programmer write tests like assert(MSG_CODE_UPLOAD_COMPLETE == 12), which, given that these constants are meant to encapsulate knowledge, kinda defeats the point of having them in the first place.

The code should not follow the tests, nor the tests the code. They should both follow the design.

2) Do these pieces of code know about each other?

One nice thing about understanding the design/specification level of code is that it gives us very clean definitions of formerly-fuzzy concepts like “this code knows about that code.” For now, stick with your intuition as you answer this question:

Here are three modules. Module 1 defines a “student” record. Module 2 defines an “employee” record, and creates them by turning students into student employees. Finally, module 3 uses employee records, namely by printing them.

Oh yeah, and this example is in pseudo-JavaScript, so there’s no notion of “type” other than “the set of stuff this function can return.” Here’s the code:

Module 1 :

function makeStudent(firstName, lastName) {
  return {firstName: firstName, lastName: lastName};

Module 2:

function makeStudentEmployee(student) {
  var e =  student.copy();
  e["employeeId"] = nextEmployeeId();
  return e;

Module 3:

function printEmployeeFirstName(employee) {

So, Module 3 accesses the firstName field of the employee value, which is defined in Module 1. Now, the question is: does Module 3 know about Module 1. That is, does the description of what Module 3 does need to be written in terms of the description of what Module 1 does?

The answer is: it depends. If the documentation of Module 2 is “Returns a student employee record with firstName, lastName, and employeeId fields,” then the answer is a clear No; it’s just an implementation detail that it obtains a student-employee record by appending to a student record. If the documentation is “Returns a student record with an additional employeeId” field, then it really is the case that you can’t know what’s in the argument to printEmployeeFirstName without reading Module 1, and so modules 1 and 3 are very much coupled.

3) What does this code accomplish?

Quick! What does this line of code accomplish?

  return x >= 65;

Two possible answers:

  1. x is an ASCII character which is already known to be alphanumeric; this code checks if it’s a letter. 
  2. x is an age of a person; the code checks if the person is past the retirement age in the US. 

In both possible answers, the code is the same. I can probably find examples where the surrounding context is the same too. If this is just a standalone function lying in a Utils file somewhere, it makes absolutely no difference to the program which I use it for; I could even use it for both. But it makes a massive difference to the evolution of the program.

The former may need to be changed for internationalization, while the latter may need to be changed following changes in US laws. The former should live alongside other functionality related to character sets; the latter alongside code relating to people or company policies. And in the former case, since there’s a bunch of non-alphanumeric ASCII characters between “9” and “A”, some crazy-optimizing programmer could change it to “return x >= 64” and then optimize it to use bit-ops like “return (x >> 6)” with no effect on the program. In the latter, that might lead to a lot of erroneously-sent retirement benefits.

Meanwhile, if you were trying to prove the correctness of a program that used this function, the proof environment would not be convinced that Bob is eligible for retirement benefits just because that function returned “true” until you tell it exactly what this function is meant to do.

If you just think about the code, then the letter ‘A’ is the same as the retirement age. So focus on the design.

Credit to Ira Baxter for this example.

Writing Better Code

I gave you three examples of situations where it’s impossible to recover design information from the code alone. For the tests, there’s no solution other than to have the design written somewhere separately, so you can know whether the order of HTTP parameters is stable, or whether to rely on the number of times saveState is called. For the code examples, it’s actually quite possible to change the code so that it only corresponds to one design. For that unclear case of coupling, if you want to make it clear that modules 1 and 3 are separate, have module 2 create records with explicit fields. For that retirement age check, just write it as “return x >= RETIREMENT_AGE”. Even if RETIREMENT_AGE and ASCII_A (usually spelled just 'A') have the exact same runtime values (for now), they represent different concepts. They are not interchangeable in the design, nor in a formal proof.

This gives rise to my first rule of writing good code:

Jimmy Koppel’s rule of good code #1: Make the design apparent in the code (Embedded Design Principle).

Sounds simple, but it’s quite vague and subtle until you’ve really internalized how code is derived from a design. I’ve given multiple workshops exploring different aspects of this idea, especially around coupling.

Some things are better left unsaid

In statistics, we used to think that, if you had enough data, you could learn anything. Then the causal inference guys came along and showed us that, actually, even with infinite data, there can be many processes that generate the same observations. If you just collect data about what days there was rain in town and what days your lawn was muddy, you can’t tell the difference between a universe in which rain causes mud and one where mud causes rain. You need to know something more.

Software design is the same way: there can be many designs that correspond to the exact same code. Those who speak of “self-documenting code” are missing something big: the purpose of documentation is not just to describe how the system works today, but also how it will work in the future and across many versions. And so it’s equally important what’s not documented.


Thanks to Alan Pierce, Alex Reece, and Tej Chajed for comments on earlier drafts of this post.

1 Note that I am considering text in a source file, namely comments and variable names, as part of the documentation, not part of the code. A long comment explaining the structure and guarantees of your program is absolutely a form of specification.

Sunday, January 7, 2018

The Three Levels of Software: Why code that never goes wrong can still be wrong

Have you ever stopped to consider what it means for a program to be wrong? I mean, really stopped to consider? Like, “it’s wrong if it crashes — but what if the crash conditions are only hypothetically achievable — but wait…”. Let’s get to the bottom of this. Here’s a first try:

Definition #1
A program is wrong if it runs and produces a wrong result.
(Where “wrong result” is broadly defined to include performance bugs, usability bugs, etc.)

So I put my bread in the toaster, and, instead of giving me a piece of toast, it gives me a rhinoceros. Not what I wanted. Definitely a bug (err, mammal).

Programs can still usually produce the correct result but be wrong. So I get a new toaster, and I’ve never had a problem with it. Then I find out that, if I put in a thin slice of Bavarian rye when there’s a full moon out, and I’ve used the toaster exactly twice in the last 24 hours, then it will fry its circuits. Sounds like a bug. So we amend it into a 2nd definition.

Definition #2
A program is wrong if there exists some environment, sequence of events, or other “input” under which it produces a wrong result.
Now let’s eliminate all that. I’ve patched the software on my toaster, and I’ve both proven the code correct, and had a trillion monkeys test every conceivable situation. There’s now nothing you can do short of taking a sledgehammer that will make the toaster malfunction.

It’s still possible for the code to be wrong.

How? Simple. Let’s say somewhere in that toaster, it sums all the elements of an array — but it accidentally goes over by one. Memory error? Well, it’s written in C, so it just adds on whatever is in the next address in memory. That address is a byte which is always 0 for whatever reason, so it gets the right answer every time.

The code is still wrong. It may never fail, but the reasoning for why the code should work has a hole: what should have been a simple argument of reading an array now depends on complex assumptions about the compiler and memory layout, as well as a whole-program check that that byte is always 0. And that hole can cause an actual failure in future versions of the program: someone rearranges the fields in a structure, and code that should never have been affected starts failing.

We reach our third definition:

Definition #3
A program is wrong if the reasoning for why it should be correct is flawed.

Progress! We’ve gone from a clear and simple definition to one that’s handwavy and impossible to use. Actually, it’s quite rigorous, but I’ll have to teach you some formal verification to make it more concrete.

All three definitions are correct to use at different times. This forms one of my core teachings: that, when we talk about programs, we speak at one of three levels. These three levels are:

  • Level 1: Runtime. The runtime level deals with specific values and a specific environment from a single execution of the program. A lot of debugging is done at Level 1.
  • Level 2: Code. At the level of code, we think about what the current implementation could do when given arbitrary inputs and an arbitrary environment. Behaviors that cannot happen are not considered, even if it requires global reasoning to rule them out. A lot of implementation work is done at Level 2.
  • Level 3: Design/Logic: At the level of logic, we consider the abstract specification of each unit of a program. When using other units, we only consider the guarantees made by the spec, and assume they may be replaced at any time with a different implementation. Many programs which are correct when viewed at Level 2 are not correct when viewed at Level 3, because they rely on behavior which is not guaranteed to hold in all future versions. Most software design is done at Level 3.

I’ve met programmers who confuse Level 1 and Level 2, but not many. Levels 1 and 2 are concrete enough: you see Level 1 by running under a debugger and inspecting the stack, while you see Level 2 just by reading the code and thinking about what could happen. But most programmers have a much harder time getting that there even is a Level 3, and I’ve seen much confusion come from when one programmer is talking about what a component is guaranteed to do, and the other is talking about what it happens to do. Most programmers will be taught the difference between interface and implementation at a high-level. But very few get to see the full details of what defines an “interface” beyond just “list of functions.” That can only be seen when doing formal verification, where a program’s properties and assumptions are written just as concretely as its source code. In everyday development, all that structure and reasoning is still there; it’s just scattered across documentation, comments, and programmers’ heads.

Statements at different levels do not mix. So if the proposed client/server protocol says the client should send a request twice and discards the first result (a Level 3/design-level statement), and the designer tells you it’s because there are three different kinds of request handlers in the codebase, and Bob’s sometimes gets it wrong the first time (a Level 2/implementation-level statement), you should get confused. You should be as confused as if someone wanted to call a file or write to a function.

Isn’t our goal to deliver working software to customers, and so Level 2 correctness is all that’s important? No, our goal is to be able to continue to deliver working software far into the future. Level 3 is all about how modular are the interactions between the components of your software, and that’s important if you care at all about having different versions of those components, like, say, if you wanted to rewrite one of them tomorrow. Joel Spoelsky relates how the original Sim City had a use-after-free error. At Level 2, this was totally fine, since freed memory in DOS was valid until the next malloc, and so the program worked. At Level 3, this was a defect, because the spec for free says you need to act as if memory gets eaten by a dragon as soon as you free it, and any future free implementation may actually eat it. Sure enough, once Windows 3.1 rolled around with a new memory manager, SimCity would start crashing. Microsoft had to add a special case to check if SimCity was running and switch to a legacy memory manager if so.

People sometimes tell me about how software is so easy and you can just have an idea and make it and it’s so cheap because there’s nothing physical to build. Hogwash. Software is a domain where we make decisions that can’t be undone and have to be supported for all eternity. The HTTP “referer” is forever misspelled, and that SimCity special-casing code is still there 30 years later.

This is why it’s important to get programs right at Level 3, even if it passes all your testing and meets external requirements.

And, lesson for API designers: This is why it’s important to make your APIs conform as strictly to the spec as possible, at least in Debug mode. This all could have been avoided if DOS’s free were to deliberately zero-out the memory.

I now have two years experience teaching engineers a better understanding of how to avoid complexity, improve encapsulation, and make code future-proof. But ultimately, that knowledge all stems from this single master insight: the most important parts of our craft deal not with code, but with the logic underneath. The logical layer may be hidden, but it’s not mystic: we have over 40 years of theory behind it; see the postscript below for a taste of the details. When you learn to see the reasoning behind your system as plainly as the code, then you have achieved software enlightenment.

Next time: why this implies keeping design and implementation separate.

Postscript: Technical Details

I could speak a book on harmony and composition, but you’d learn more from listening to a song and seeing the sheet music. Similarly, I could ramble on about reasoning and assumptions, but to truly understand it, you need to see the objects under study. The three levels deal with different views of a program: executions, code, and specifications. Each corresponds to its own kind of reasoning.1 Let’s bring in the math!

Level 1: Traces and states

The objects of study of Level 1 are traces and states. A trace is a sequence of events that occurred in a program execution. A trace looks something like this:

Enter method postProfileUpdateToServer
Read field profile to local variable p
Enter method saveCurrentState
Leave method saveCurrentState

Traces can be very high level, like which microservices get run, or very low level, involving instruction scheduling on the CPU. They tell you exactly what happened, and give the information to construct the current state.

The state is a collection of cells with their current value. It looks like this:

{ p=Profile(name=Bob, id=42, ), 
  this=NetworkClient(socket=..., baseUrl=”),
  __messages_output=[Initializing MyApp version 5, Welcome, Bob!, ],

Most often, the state just consists of values in memory, though it can be helpful to also include pieces of the environment, or a list of what’s already been output or sent over the wire.

Any statement that can be phrased in terms of specific traces and states is a Level 1 statement. These constructs should be familiar to programmers: printouts let you view fragments of a trace, while a debugger lets you observe the current state.

The corresponding manner of automated reasoning is ground reasoning. Ground reasoning means reasoning only about concrete values with no quantifiers. So, suppose I have this code:

1:       left = x - 10;
2:       if (left < 0)
3:         left = 0;
4:       right = min(x+10, 100);
5:       print(right);

Say I know from the printout that right=50 in a given trace, and I want to know whether it’s possible that the branch on line 3 executed. I could answer this question by asking a solver to find a satisfying assignment for the following formula:

left<0  (left=x-10 
          right=(if x+10<=100 then x+10 else 100)

Level 2: Code

The object of study of Level 2 is the code. Yes, code, the stuff you work with every day. Any statement that can be phrased in terms of the code, but not a specific trace or state, is a Level 2 statement. Let’s look at the following example, which computes a damage bonus in a hypothetical game:

public double computeDamageBonus(int creature1AttackSkill,
                                 int creature1ArmorPiercing,
                                 int creature2DefenseSkill) {

  int adjDefense = creature2DefenseSkill - creature1ArmorPiercing;

  if (adjDefense <= 0) {
    return 10;

  double factor = (double)creature1AttackSkill / 
                     (creature2DefenseSkill - creature1ArmorPiercing);

  if (factor > 10)
    return 10;
    return factor;

By picking specific inputs to this function, it will yield a trace and a sequence of states. But this program encodes an exponentially large number of possible traces and states. At Level 1, we can ask “Did this execution experience a division-by-zero error?” At Level 2, we can ask “Is there any execution of this function that experiences a division-by-zero error?”

The corresponding manner of reasoning is first-order logic. This means we can write down formulas that say “for all inputs X, does this property hold?” or “does there exist an input Y, such that this function crashes?” What we can’t do is quantify over other functions. Here’s a first-order formula that states that the computeDamageBonus function can never have a division-by-zero error:

creature1AttackSkill, creature1ArmorPiercing, creature2DefenseSkill.
  (adjDefense = creature1ArmorPiercing - creature2DefenseSkill
  ¬(adjDefense  0))
     creature1ArmorPiercing - creature2DefenseSkill  0

Level 3: Specifications

The object of study of Level 3 is the specification. There are many, many ways of writing specifications, but a popular one is the Hoare triple: preconditions and postconditions. Here’s one for malloc:

Pre: n > 0
Post: retval  NULL  alloc(retval, n)

At Level 2, we could ask questions about a specific implementation of malloc, like “How much memory overhead does it use?” At Level 3, we can ask questions about all possible implementations of malloc, like “does this program have a memory error?”

The corresponding manner of reasoning is higher-order logic. Higher-order logic is like first-order logic, but we can now quantify over functions. First, let’s translate the spec above2:

MallocSpec(m) = n. n > 0  m(n)  NULL  alloc(m(n), n)

Now, if we hypothetically had a formal specification for the PlaySimCity function, a spec that it works for any implementation of malloc would look something like

m. MallocSpec(m)  SimCitySpec(m, PlaySimCity)

And that, my friends, is modularity reduced to a formula.

A really big thing about specifications is that they can involve properties which do not appear in the code at all. We defined malloc in terms of this alloc predicate, which has no intrinsic meaning other than “stuff returned by malloc.” But what this does is let us relate malloc to other operations. We can give the act of dereferencing pointer x  a precondition ∃a,n. a≤x<a+n ∧ alloc(a,n), and give the free(x) function a spec that destroys the predicate alloc(x,n). Now, having a proof of alloc(a, n) when reasoning about the program means “this memory was returned by malloc, with no intervening free” — exactly our intuitive notion of memory being allocated!

What we’ve shown here is that “this memory is allocated” is a specification-level notion which is independent of the code. And indeed, there may be nothing within the program to indicate that a piece of memory has been allocated.  It’s possible that being allocated corresponds to some internal data structure of the memory manager, but it’s also possible that your code will be compiled for a machine with infinite memory.

Now I can state exactly in what sense the SimCity code was wrong: A precondition of dereferencing a pointer is that the pointer is to allocated memory. They tried to dereference a pointer without meeting that precondition, i.e.: no proof of alloc(a,n). So, their code ran fine for one implementation of malloc, but, as they sadly learned, not for all of them.

When designing software, I always recommend trying to think in pure concepts, and then translate that into the programming language, in the same way that database designers write ER diagrams before translating them into tables. So whether you’re discussing coupling or security, always think of the components of your software in terms of the interface, its assumptions and guarantees, and mentally translate them into formulas like the ones above. So much becomes clearer when you do, for logic is the language of software design.


Thanks to Elliott Jin and Jonathan Paulson for comments on drafts of this post.

1 Don’t get too hung up on the analogies between the levels of software and the modes of reasoning; there are plenty of exceptions. For instance, much of the progress in program analysis/verification/synthesis research comes from finding all sorts of tricks to encode more complicated problems into a form that can be solved by ground reasoning, since we have good solvers. This likely isn’t going to be too relevant to you unless you work in programming tools.
2 I am lying slightly in this example — the kind of logical formula I gave is really just meant for pure functions. To handle malloc properly, we’d want to use separation logic.