Sunday, December 9, 2018

My Strange Loop talk: You are a Program Synthesizer (video + transcript)

In my day job, I work with programs that write, analyze, and transform other programs. You can't do this unless you have some special insight into how programs work. Much of my night job is finding these insights and making them accessible to non-specialist programmers. A few months ago, I spent a week in St. Louis, where I gave my first industry conference talk, at Strange Loop on this topic, as well as my first academic conference talk at ICFP: Capturing the Future by Replaying the Past.

I missed most of both conferences to finish preparing this talk, but it was worth it: this is the talk I've wanted to give for two years.


Video: You are a Program Synthesizer



Transcript

This transcript was quite something to prepare! Including a few omitted slides, the slide deck for this talk totals over 100 slides, and over 350 transitions. For the most part, I present a slide before the accompanying speech, and show only a few key frames from each animation.



Hello, my name is Jimmy Koppel, and I live a dual life. By day, I work at MIT. I'm a researcher. I work on program synthesis, programs that write programs. By nights, I lead workshops, teaching software engineers like yourselves how to write better code.

Today, I'm going to talk about something from intersection of both worlds.


Software engineering advice today is confusing. For anything you read online, you'll find someone else arguing the opposite. Today I want to ask the question, how can we find a source of truth in software engineering advice?

This whole situation reminds me of another source for truth: how do you prevent scurvy?


Scurvy is an ancient disease caused by Vitamin C deficiency. After thousands of years of deaths, finally the 1700s, a British scientist named James Lind proved in an experiment the cure: eat citrus. And so the by the end of the century, the Royal Navy had started requiring all of their sailors to drink lemon juice. Scurvy rates went from crippling to almost none.

But then in the 19th century, they lost the cure. They didn't understand it well enough. They thought, "Lemons? Why not limes?" Big mistake. As a result, there were multiple voyages to the poles as late as 1910 in which everyone got scurvy and people died.

[source: http://idlewords.com/2010/03/scott_and_scurvy.htm]

Now imagine that you are an 18th century sailor and you ask the powers that be, "How can I prevent scurvy?" The answer comes, "Eat citrus…..but wait! Lemons, not limes. And wait — do not boil it. And wait — do not store the juice in copper. If you store it in copper, it's worthless." This is kind of starting to sound like witch doctory.
From this, I coin the term citrus advice. The advice to eat citrus is good advice. It saved people’s lives. But it was not precise enough to capture the reality, and so it came with this long list of caveats that made it hard to use.

The precise advice is to get vitamin C. And while it took about 200 years to get from citrus to vitamin C, this is the ideal to be aiming for: the simple thing that explains all of the caveats.
Now compare software engineering advice.

Guru: Discover abstractions to make your code shorter. [Paul Graham]
Student: Okay, I abstracted my code.
Guru: No, that is the wrong abstraction. Duplication is better than the wrong abstraction. [Sandi Metz]
Student: Well, then how do I know if an abstraction is good design?
Guru: Well, you see, good design leads to simplicity. [Max Kanat-Alexander: ]
Student: Okay, I simplified my API.
Guru: No, you broke your API, and if you break your API, that creates complexity for other programmers. [Max Kanat-Alexander]

So, this advice: not so easy. It's citrus advice.

I think that if you were to take these three people — these are all close to but not actual quotes by these people — if you were to show them the same piece of code and ask the same question, you would get similar advice. This goes to show you that they have some deeper intuition about some deeper reality, that these words are not precise enough to capture.

So how do we get to more precise engineering advice?
Many years ago, I got really into program synthesis. Program synthesis is amazing. It's this program where you just give it stuff, you tell it what you want to do: on this input it should do this, maybe give it a little bit of information about what the program should look like, and it thinks for a while with its algorithm, and it gives you a correct program. It's cool, but a really nice thing is: you cannot BS a synthesizer. You cannot BS a synthesizer, and you cannot BS a synthesizer.
If you ask the programmer, "Hey Bob, if I do this thing, will it make it easier for us to program later on," it can be a pretty vague question, hard to answer. But if you ask, "If I do this this way, will it make it easier to synthesize the algorithms on paper," you can just look at it and get an answer. And if you want to know, if I have this data format, of the millions and millions of possible pieces of code that I can write, is there anything that's clean? You can just try it, give it a million examples, get an answer: no.

So we're going to look at the parallels between synthesis and human programming. We're going to dive into the different schools of synthesis. Broadly speaking, there are three kinds. There are inductive synthesizers — theorem proving, logic. There’s enumerative, searching very cleverly through large spaces of programs. This is how your database optimizes your SQL queries, by the way. And there’s constraint-based, using constraint solvers, In the interest of time, we're mostly going to talk about the deductive today, a little bit about constraint based, and no enumerative.
But there is a very important fourth school of synthesis. In fact we have a few hundred examples of them in this room. That is: you.
You are a synthesizer. You write programs. And while you might work a little bit differently than these other algorithms — you do not have a constraint solver in your head, although I do think you do work a bit like the deductive synthesizer and do that kind of deductive reasoning — still, there's only one universe of possible programs, only one universe of possible algorithms for writing programs. And so by looking at these other schools of synthesis, we can learn insights that we can take into our own programming. And that is the thesis of today's talk.
And throughout our journey to synthesis land, we will see three common recurring themes of abstractions, constraints and logic.
Enter deductive synthesis. At a high level, deductive synthesis is about taking a high level idea and through a series of mechanical transformations, making it more and more refined and precise until you reach an implementation. So what is the information a deductive synthesizer works with? Let's look at this.
When we talk about software, we can broadly categorize what we're saying at one of three levels. Here's a max function. You can talk about the runtime values: what is it you want as inputs? Or, we can talk about all inputs and look at the code. So runtime’s level one; code level two. But this is not capturing the why of the code. It doesn't explain why it works. It does not explain how you came up with it. That information is in the logic, in the derivation or proof. And I call this “Level Three.”

As you might guess from the relative size of these boxes, the most information lies in level three. This is the why. This is the stuff that's only in your head normally. But to a synthesizer, this is just as concrete as the code it outputs at the end. So let's look at it from the synthesizer's perspective.
Some quick background: here's what's a deductive synthesizer looks like. It has some goal, like “I need to get an integer which is the maximum of two other integers,” and you write down a logical formula for what that means, specified in some way. And so our goal is we're given two integers and after running the program, then the condition “it's the maximum of these two inputs” will be true. Its tools: it has a bunch of rules that that relates a piece of code to the abstract logical meaning of the program. And now it can do a search.
So we want this maximum function. Okay, let's try running the if rule. We know the outer skeleton’s an if, and we have two subgoals to prove.
We go into the search; maybe we try the if-rule again. Okay, let's try another rule. Another rule. Okay. Maybe using if there was not such a good idea.
It's trying an assignment instead. Ah, that works. There's my assignment. Now I know a little bit more about the program, and it keeps going like this and eventually it gets the actual program.
So that was a 40 year-old synthesis algorithm in a nutshell. But you'll also see this idea in a lot of newer systems. If you were at Nadia's talk an hour ago, you saw her talk about a deductive synthesizer called Synquid. Nadia was a postdoc in my lab; the world of synthesis is very small.

Here are a couple of other systems: Î»2 and Fiat. Note that all these only do small functional programs because synthesis is very hard. This is why instead of actually using this today, I'll tell you to not use it. Instead learn the insights, and use them to improve the algorithms inside your head. By looking at things through this lens, deductive synthesis, we can come up with some deeper insights into code.
Let's look at specification level notions. I'm running a web app and I need sanitized strings; no XSS attacks. So, to sanitize a string, I escape all single quotes. And, now I have two choices in how to use this. Before, I saved my data to this. It must be sanitized and you can do it either by calling this sanitized function that I just wrote or by doing it by hand. Take a few seconds to look at these, decide whether you prefer Option 1 or Option 2.
So, when I've shown this code to people before, a plurality prefer option one, a slightly smaller number have no opinion, and a small minority prefer option two. In a minute, I'll give you a sense in which option one is a correct answer (under some interpretations).
Well how else might you might have come to this conclusion? Well, you might've used the folk engineering advice and it's like: this one’s more abstract, but the other is more direct, and this one’s harder to see what’s happening. It's more work to understand. It's easier to change. And when you're working at this level, it’s basically like two lawyers at the podium arguing their side, and both of them have case law to support them. It's hard to get a real right answer. Even if you do, it might not be good enough.
So here's an engineer that I used to train, let's call him Steven. And his head is full of citrus advice.

He looks at this and says, "Option one, because it centralizes the code." Good job, Steven. But his understanding of the information hiding was not deep enough because five minutes later, I gave him a similar example:
So this time, I have my Twitter box and I type in 139 characters and it says: too long. What?

Because: it escaped my string, and it double-counted every single quote, because every single quote became two characters. So let's not do that. Let's un-double-count the single quotes. (And I'm very glad that the Python standard library has this count function. It made this slide easier to write.) But I propose instead of doing that, we abstract this “counting pre-sanitized string” into a function. Steven looks at this and he says, "No. Option one is over engineering. This is premature abstraction. EVIL!"

In a moment, we shall see how there is a principle that picks Option 1 of both circumstances. And if you adhere to this principle, then not only is Option 2 dispreferable, it is wrong. And by wrong, I mean, compile-error level wrong.
This comes to the idea of encapsulating ideas. Information hiding is not about private methods and encapsulating classes. It's about encapsulating ideas — and what the F does that mean. So, this is one of those vague things you might've heard before, where once you've learned to think like a synthesizer and actually see the logic it's based on, it becomes very precise. And so, let's do it. And so, through this lens of deduction and deductive synthesis, we're going to see how Option 1 is preferable in: information hiding, how easy it is to think about, and how easy it is to write. And we’ll give precise meanings to all three of these.
Let's begin just by talking about what we mean by sanitized. Well, right now it means that every single quote is escaped. I can write this down formally in one of many notations. I'm going to use Coq, because I'm familiar with it, but you can pick any of these others. Maybe you learned one of these earlier today in another session. It looks like this: It says that, before every single quote, there's a backslash. And now other models can use this definition when they’re reasoning about code. But, there's something missing here.
That definition I gave of sanitized, that's just today's definition. Other models can use it, but what if tomorrow I decide I need to escape other things?

I've changed the meaning of being sanitized. Now all this other code, that in their reasoning used the fact that this was what “sanitize” means, those are now broken.
So what I want is to is to put all current and future meanings of sanitized behind some abstract predicate and say, "I'll give you this string. It is sanitized." What do you mean by sanitized? — it's sanitized and that's final. And now this sanitization module can change its mind about what it means by “sanitized” and no one will ever know. It's a secret. And that is an abstraction barrier. Do Not Cross.
(At this point, I unfurl my “Abstraction barrier” caution tape I’ve been passing out at the conference, and get an applause.)

Now, you might've seen a diagram like this before. “Oh, I centralized my functionality. I wrapped it in a function. You can't expose it.” This is not like that. This is purely an idea. This sanitized predicate does not correspond to anything in your code. It is just how you think about the code.
On this slide, you'll see exactly how. So, in Coq, you write this idea of an opaque predicate, an existential predicate with the opaque keyword. Again, you can do this in about any other formalism. This is the abstraction boundary, but written as code. And now other modules do not get to think about that detail.
All I know is, I got some sanitized thing — whatever that means; it's a black box. So this is the sense in which Option 2 is wrong. Because in order to look at that code and read it as a human and say, "I'm going to justify why this is doing what I want," something in your head has to use the information that sanitized strings have single quotes escaped and nothing else escaped. And so, what this is actually doing is, it's piercing the abstraction barrier (do not cross), and relying on this old piece of information that may change. Dangerous.

Whereas, in Option 1, all the information about what “sanitized” means is encapsulated into the sanitization module. So maybe the sanitized strlen function, which is right next to sanitized, is allowed to know what it means to be sanitized, but this user code is not. Thus, it is a secret preserved. It may be changed.
The same view, the sanitized predicates gives us the answer to the first question. So my sanitize function returns a sanitized string, whatever that means. And before I save, I have to feed it in a sanitized string, whatever that means. When I put use function together, they line up, but when I try to sanitize it by hand, BZZZZT.
It's like, “You gave me a string where you escape all single quotes, I need a sanitized string. I don't know that these are the same.” “What do you mean? That is a sanitized string.” Oh no, you can't use that information. It is opaque, hidden. It's behind the abstraction barrier (do not cross).
But, that's just an interpretation, because there’s another way of looking at the program that gives you the opposite conclusion. Suppose the sanitized function is not allowed to know what sanitize means. I don't know what it would return, but you can't show it returning a sanitized string — it doesn't know what that means. But this other code does know what sanitized means then, it works when you do it by hands, but not when you call this foreign function. Who knows what it does.
There is in the third option in which both of them are acceptable. This is the hippie version. There is no abstraction boundary. Everyone gets to know about sanitization — those dirty hippies. And in that case, in both modules, you're allowed to think about the definition of sanitized and see that these line up and do either.
Now when you're actually programming, you don't write down these formal logic things. You don't write preconditions and postconditions in gratuitous detail and everything. No, you have an idea in your head about what you're supposed to do, but it's not given to the compiler, and so for all three of these interpretations, all three of these worlds, whoever's allowed to know what “sanitized” means ... these all result in the same code. The difference is only at level three, in the logic and the Why, not in level two code.
Which one of these worlds you choose as your interpretation determines which option is correct and preferable. So is it just a matter of interpretation? Well, there's some pretty clear ... like everything's a trade off, but there's some pretty clear reasons to pick Option 1. So in Option 1, I'm all sanitized and I need a sanitized string for save so I could have sanitized x. I need to show what sanitizes x in order to save it, and this proof is trivial. Not only is it trivial, but if I'm programming in Coq, I can actually type in “trivial.” and it'll say, "Yep, that's a proof," and more generally ... it can get more complicated when you have aliasing and conditionals, but it's still going to fall within a fragment of first order logic called effectively propositional, or “EPR,” logic, which is generally pretty easy to reason about. They're fast algorithms. In the other world, I'm doing this replaceAll thing by hand. I need to show it satisfies some complicated condition about what goes where. And we can do this, but it's a little bit harder.
So hard, that it was only discovered this year.
So, in the first case when I first write sanitize, I still need to prove that when I sanitize a string, it satisfies the meaning of being sanitized, and so there I need to think a little bit harder about the sanitize function, but when I actually use it, I can use the fast EPR logic solver. It's easy for me. It's easy for this program to think about. Whereas, in the other options, I don't have a clear module boundary, and so when I call save with this complicated by-hand replace-all'ing, now I need to use the very fancy Chen algorithm and thinking about strings.
In world 1, it's also easier for the synthesizer, because if I say, "Okay, I have a string, now I need a sanitized string and I need it to satisfy some condition about what goes where. Everything's escaped." It's like, "Okay, well how can I get a sanitized string? Is there a way to use replace on those, a way to concatenate things or a way to reverse things?" It has a lot of options to choose from when it’s trying to search.

In the other world, when you say “It is sanitized, what does that mean? — That's on a need to know basis.” Then there is only one option for how to get a sanitized string, which is to call the function, which exports an interface saying it gives you a sanitized string. And so, that's ... you get the precise meaning of easy to think about and easier to code, synthesis style.

So we've explored some correspondences between the deductive schools of synthesis and you. And, we've seen the themes of abstraction in the way that we hid behind the abstraction barrier (do not cross). We saw how it has put constraints on the program and what is possible to write. And of course, all this only seen at the level of logic.

Let's look at another example. How is it possible to write a straight line program, no indentation, that contains a conditional. Let's find out.

I have a program. On the server, it returns 4096. Elsewhere, it returns 1024. And my friend Steven sees this code, and he sees that if-statement, and he gets mad. Because the Internet's told him “Do not use if-statements. Death to the if-statement. Join the anti-if campaign. If-statements are bad.
So he goes back to his code. He gets an idea. This is in C. That isOnServer variable is either 0 or 1. We can replace this if with an array index. The guru is not pleased: “You have only moved the conditioner into the array.” So this anti-if advice is a little bit less straightforward than it may have seemed. Let's see what that means.

A quick change of notation: I'm going to denote the array [1024, 4096]by saying, "Start with the empty array. At zero, insert 1024. At one, insert 4096." On top of this notation, we have the array axioms. And I'm not going to go into them in detail, but they tell me exactly what it means to index into an array. And it gives me this.
It says at 1, the index is 4096.
Else, let's go deeper into the array.
At zero, it's 1024. Let's go deeper into the array
We're out of array and we're out of axioms and so now it's undefined, but really the only thing you care about is at the bottom right. This simplifies to the formula: if x equals zero, then the index is 1024, and if X is one, then the index is 4096. That formula is a conditional. If x is this, then it's that, and if x is this, then it's that.
Now using this, we can go back to the program and come up with a logical formula to describe what the program does and we get a conditional: if it's on the server then it's 4096; else, it's 1024. We can use other axioms to do the exact same for the original program, and we get the exact same formula. And so, there's a sense that both these have a conditional, the same conditional. The synthesizer only looks at these formulas, and so, from its perspective, these two pieces of code are not just semantically equivalent, they're actually structurally identical. We haven't changed it at all.
And so, conditionals are a semantic notion, not a syntactic notion. Just because you don't write the letters I-F does not mean there is no if-statement in your code. So, the thing that happened here was: the array index was a conditional in disguise.

Does this mean we should have an anti-array campaign and no arrays — death to the array? No, not every array access is a conditional, but the difference is subtle and it cannot be purely determined from the code. Rather, is a property of the level three, how you think about it. Let's inspect.
Here's another array. It's a list of all the president names. Using that, we can define a “president-after” function. The president after Washington was Adams, the president after Thomas Jefferson was James Madison, and so on. Here's a pretty simple function that does that.

It totally breaks on bad inputs, negatives, the last president. Don't tell me about those. It's a simplified example.

Let's talk about what it means to index into this array.
Like before, we get a giant conditional. At 0, it's Washington and at 1 it's Adams and so on. This does not look promising. It looks like a conditional. But if you already have a notion, an external notion of the i-th presidents, then you can relate this formula to that and you can get this simpler formula. It's the name of the i-th president, which does not involving any branching at all.
Now we can use this abstracted formula to look at this code and think about it in a very straight forward manner. Get this index, I name the next president and I'm done. That's how you would think about it. And then that's also how a verifier or synthesizer would think about it. Just like I've written down here.
There's no case-work here. There's no branching. It just works. And so, that's a sense in which one array index is a conditional and the other is not. So, the difference is that we're able to abstract one and not the other. Maybe you could abstract the size of array, find a better way of thinking about it, but I didn't and because I didn't find such a way, that one is a conditional and this one is not.
So we've again seen this insight from the deductive school synthesis. And we’ve seen abstraction and how we abstract the array access; this lets us put extra constraints in a program like “Death to the if-statement.” But this only can be understood at the level of logic.
Let's go to another school of synthesis, the constraint based — and this section's going to be a little bit different from previous ones. I said at the beginning, that a human and a deductive synthesizer are kind of similar, and so the deductive synthesis can shed light on soft engineering advice. Whereas the enumerative and constraint- based, they're not like a human synthesizer, so they're just going to illuminate the nature of the programs themselves. We're going to be talking about a tool called Sketch built by my advisor, Armando Solar-Lezama. And Sketch is about programming with holes.
Humans are clever, insightful, and slow. Synthesizers are dumb, thorough, and fast. Let's put them together. You're going to write a high level sketch of the program, what the program title looks like; the synthesizer is going to fill in the details. So here's the “Hello, world” example. That question mark is a hole. And so this thing is asking: find me an integer “question mark,” such that X times question mark is equivalent to X plus X for all inputs X. You might figure out the answer: it's 2. And the synthesizer, will figure that out too, after 0.452 seconds.
You can do more complicated things with Sketch. We have a really inefficient linked list reversal. You can get the fast one. You do need a bit more of a complicated hole than the integer, say like, some number of statements where a statement looks like this, but it can do it.
Let me give you the two-minute version of the synthesis algorithm. It's two parts: constraint solving and the CEGIS loop.
So, it looks like this. For simplicity, assume all integers are two bits; you'll thank me for that in a moment. We can just say: "Look at the bits of the inputs and the hole and the arithmetic expressions,” and all of these give us constraints that the the digits of the plus expression need to match the corresponding ones in the multiplication expression. So we get a list of constraints on the bits.
And now, we can also write down bit-level formulas for every bit of the plus and the times and it looks like this and being boolean formulas (thank goodness they use only two-bit numbers), and now we'll just solve for what the bits of the hole are.
And if you've taken a theory of computing course, you might start getting a little scared because in the theory competing course, you might have learned that this is the Boolean-satisfiability problem. We need to find a solution and assign them to these bits, and the thing works. That's the SAT problem. Oh no, that's NP-hard!

And that is a lie, because it's actually NP-easy. And by that, I mean that modern SAT solvers can take in millions of variables and solve them very fast, most of the time. In this case, we're lucky enough to be in the most of the time. And so we figure out the bits of the hole and we get the final answer of: 2. Success!
It makes it easier to think about, if you just pick a few inputs at one time instead of all inputs. So on top of this, we built something called the counterexample-guided inductive synthesis loop, or CEGIS. It's basically a conversation between two parts, the inductive synthesizer and the verifier. You have the synthesizer. It takes an input, takes some tests, gives you a program. The verifier takes in the program and says if it always works. If not, it gives a failing input.

So it goes: “Does this work?” “No, it fails.” “But does this work? No; here’s a failing test.” “Does this work?” “Yes.”

In other words, CEGIS is test-driven development.
And from this, I get the idea: maybe you can somehow use Sketch and CEGIS to tell us something about testing. Let's do it.
What we're going to do is: I want to see if my test case is good enough. If it's good enough, if the test passed, my program works. Is there a way to write a program that passes all my tests, but is the wrong program? Let's use synthesis to find out.
A view of testing is that you have some space of correct programs that you want to find. Every test is going to narrow down the passing programs more and more, hopefully to just the correct ones. But there's something missing from this story.
Here's a very simple kind of program: curves on an (x, y) plane. Here are my test points, and I want a curve that goes through all three points. Can you guess what curve I have in mind?
That's right. It's this guy.
No, no, no. Here's another test.
Oh, you mean this one?
But if we add a structural constraint that the curve must be a line, then those two options go away. The only option is this line, and not only that, but I need fewer points, fewer test cases.
Programs are just generalized polynomials. Program synthesis is just generalized curve-fitting.
So what does this look like for testing programs? Well, if you have some kind of structural constraint in the programs you write…
…then maybe you don't need this third test. Let's put that in action.
So I want to synthesize a function to do something very complicated: compute the length of a linked list. Here are my three test cases on the left. And for this I do need to have some kind of oracle that tells me what the correct answer is. Using this, I'm going to synthesize a length function which passes all the three tests that I gave it, where there is some input “list question-mark,” which differs from the correct answer. So it passed all the test cases, but it's wrong somewhere. Let's synthesize.
So, first I say it'll have at most three branches, each conditional looks something like this, some reasonable list operations. Now, here's where things can happen within each branch, and let's see if you can come up with something. And it does.
(Sketch produces correct output, not readable output.)
And so, in order to make sure I have the right program, I need to add another test like: this list is length three.
And now it says, if it passed all the tests, there's no way to make it in the wrong program.
But if I restrict my program a little bit, then I don't need that test anymore, and just with only three tests, I must have the correct program.
There's all this talk about more coverage, more tests, more precise, but we don't often hear about writing simpler code, in that the more of one we have, the less the other we need. If I write simpler code, then you don't need as many tests. It's a choice, but I know which end I'd prefer, because there's a saying that quality cannot be tested in, quality must be built in, and this is what it means.
So we've gotten some insights into writing programs, writing tests from the school of constraint-based synthesis. And again, we've seen abstraction in our use of this test oracle, constraints and how we restrict the program space to make it work. And the synthesis algorithm itself is based on logic.
So we've seen a lot of stuff today. You learned the high-level of two different synthesis algorithms. You’ve seen a lot of stuff you probably haven't seen before.
But really what we're doing, is reaching into that black box in your head of how we write programs, and kind of opening up a bit and seeing how it works. And in doing so, we can start to stop accepting the vague answers for what we should be doing, start to accept, maybe you can think a little bit deeper about what's going on. What is it possible for me to write right now? Programs aren't a wild west. Not anything goes. There's a structure in what is possible to write. And we're accepting that we can find it.
So my hope is, that by coming here today, you're all joining me on a journey to get away from the citrus advice we've become used to, start to learn how to see the vitamins in our code. Because the first step, the most important step you've already done, is to get to more precise engineering advice. The big step is simply to believe that it's possible. Because you are a program synthesizer.
So, if you want to see more about this, check out my blog. I love this stuff. I also teach it. I have a web course starting in two weeks. We have one of my former students sitting right there.
[Photos:
Left: Zohar Manna, Armando Solar-Lezama, Nadia Polikarpova, Adam Chlipala, Cordell Green, Sumit Gulwani, John Darlington, Zora Tung

Right (not in order): Alex Miller, Ryan Senior, Mario Acquino, Nick Cowan, Bridget Hillyer, Crystal Martin]

I'd like to thank all the people who made this possible: the people who invented what I talked about as well as the organizers.
But most importantly: St. Louis is my hometown, and so we have a very special guest in the audience today.
I'd like to thank my father, for never telling me that I'm crazy, even when he thought it.

Remember: you are a program synthesizer, do not cross the streams, and do not cross the abstraction boundary. I hope you had a great Strange Loop.


Liked this post?


Related Articles

3 comments:

  1. Hi Jimmy - for your sanitize example, in a statically typed language would the implementation be to just have a SanitizedString type, that is required as the argument for X and return value for sanitize? I'm trying to think of counterexamples where types would not be enough in order to handle your abstraction barrier.

    ReplyDelete
    Replies
    1. Hello Andrew,

      Rather than struggling to come up with examples, let's like at the actual mathematical form of an "abstraction barrier" and see what programming ideas it corresponds to.

      An "abstraction barrier" is a broad idea in programming languages, but all instances of the idea can be translated into one fundamental concept: existential types.

      For instance, consider the classic example of queues. There are many implementations of queues, but they all satisfy a signature like "exists q. (, )". The abstraction barrier is then that, for any users of an implementation of this abstract queue type, it is impossible for the code to depend on any details about what "q" is, and cannot rely on any properties not guaranteed to hold for all implementations of the abstract queue signature. So long as the queue module contains these functions and satisfies these theorems, it is then possible to change any implementation details of the queue module without altering any client code.

      Back to sanitized strings: you can think of the concept of a sanitized string as a type like "exists s. ({len :: s -> int, }, )". Your idea is to translate this back into Java/Scala/C++ by writing it as "class SanitizedString { }". Indeed, for any type of the form "exists t. stuff", you can express it as a type "class T { }".

      There is a big thing missing from this story, however. A big part of this abstraction barrier is making sure that users understand that the abstraction barrier is there and to not rely on facts which are not part of the module's contract. In the case of a SanitizedString, it's all too easy for a programmer to write "s.toString().length() - s.toString().count('\'')" instead of s.len(), not realizing that the former is, in a sense, incorrect, even though both versions give the same result in today's implementation.

      A smaller piece is that you may existentially quantify (i.e.: "enact an abstraction barrier around") something other than a single type. A pair of related types (exists (p,q) . ) may correspond to e.g.: related SanitizedString and CompressedSanitizedString types, which have access to each other's implementation details, but hide details from the rest of the world. Meanwhile, the way type theorists view objects, there is an abstraction barrier between an object and all other instances of the same object interface, whether or not they come from the same class.

      So, I think a short answer to your question would be "For any abstraction barrier, it may help you to create a type for it, but there's no guarantee that doing so will be enough to enforce the abstraction barrier." There is no shortcut to teaching your programmers to understand what facts they rely upon and what is at risk of changing.

      Delete
    2. toString is a debug level thing, i don't think it is fair to bring it into this discussion, you might as well start poking at individual bytes

      as for CompressedSanitizedString there are ways to express it in a more composable way, that can't possibly access implementation details, eg see https://github.com/fthomas/refined

      Delete