CUSEC 2010 Keynote: Greg Wilson – “Bits of Evidence”

Here’s the fourth in my series of notes taken from keynotes at CUSEC 2010, the 2010 edition of the Canadian University Software Engineering Conference. These are from Bits of Evidence: What we actually know about software development and why we believe it’s true, a keynote given by my friend Greg Wilson, the computer science prof we all wish we had. He’s also the guy who gave me my shot at my first article for a developer magazine, a review of a couple of Ajax books in Software Development.

My notes from his keynote appear below; he’s posted his slides online.

A Little Family History

  • My great-grandfather on my father’s side came from Australia
  • He was sent there, along with many other criminals from the UK, to Botany Bay
  • Whenever we kids did anything bad, my mother would say to my father: “This is your side of the family”
  • This happened until the day my sister triumphantly discovered that my maternal great-grandfather was a Methodist minister who ran off with his church’s money and a 15-year-old girl from his parish
  • She never brought up my father’s family history again
  • It took years and poring over 70,000 sheets of microfiche to track down my great-grandfather, all to find two lines, which said he was sentenced, but not why
  • These days, students get upset if it takes more than 15 seconds to find answers to last year’s exam
  • Some things haven’t changed: while technology has improved, the way we develop software for it hasn’t


  • In the Seven Years’ War, which lasted longer than seven years (1754-63), Britain lost:
    • 1,512 sailors to enemy action
    • 100,000 sailors to scurvy
  • Scurvy’s a really ugly disease. You get spots on your skin, your gums puff up and go black, you bleed from your mucous membranes, and then the really bad stuff happens
  • In 1747, a Scotsman named James Lind conducted what might have been the first-ever controlled medical experiment
  • Pickled food keeps fresh, he reasoned, how about pickled sailors?
  • Lind tried giving different groups of sailors with scurvy various acidic solutions:
    • Cider
    • Sulfuric acid
    • Vinegar
    • Sea water (this was the control)
    • Oranges
    • Barley water
  • The sailors who had the oranges were the ones who recovered
  • Despite Lind’s discovery, nobody paid attention until a proper Englishman repeated the experiment in 1794
  • This discovery probably won the Napoleonic Wars: the British navy was the deciding factor
  • As a result of this discovery, British sailors planted lime trees at their ports of call and ate the fruit regularly; it’s how the term “limey” got applied to them, and later by association to British people in general

Lung Cancer

  • It took a long time for medical science to figure out that controlled studies were good
  • In the 1920s, there was an epidemic of lung cancer, and no one knew the cause
  • There were a number of new things that had been introduced, so any of them could be blamed – was it cars? Cigarettes? Electricity?
  • In the 1950s, the researchers Hill and Doll took British doctors and split them into 2 groups:
    • Smokers
    • Non-smokers
  • The two discoveries to come from their research were:
    • It is unequivocal that smoking causes lung cancer
    • Many people would rather fail than change
  • In response to the study, the head of the British Medical Association said: "What happens ‘on average’ is of no help when one is faced with a specific patient"
  • The important lesson is to ask a question carefully and be willing to accept the answer, no matter how much you don’t like it

Evidence-Based Medicine

  • In 1992, David Sackett of McMaster University coined the term "evidence-based medicine"
  • As a result, randomized double-blind trials are accepted as the gold standard for medical research
  • now archives results from hundreds of medical studies conducted to that standard
  • Doing this was possible before the internet, but the internet brings it to a wider audience
  • You can go to, look at the data and search for cause and effect

Evidence-Based Development?

  • That’s well and good for medicine. How about programming?
  • Consider this quote by Martin Fowler (from IEEE Software, July/August 2009):
    “[Using domain-specific languages] leads to two primary benefits. The first and simplest, is improved programmer productivity… The second…is…communication with domain experts.”
  • What just happened?
    • One of the smartest guys in our industry
    • Made 2 substantive claims
    • In an academic journal
    • Without a single citation
  • (I’m not disagreeing with his claims – I just want to point out that even the best of us aren’t doing what we expect the makers of acne creams to do)
  • Maybe we need to borrow from the Scottish legal system, where a jury can return one of three verdicts:
    • Innocent
    • Guilty
    • Not proven
  • Another Martin Fowler line:
    “Debate still continues about how valuable DSLs are in practice. I believe debate is hampered because not enough people know how to use DSLs effectively.”
  • I think debate is hampered by low standards of proof


  • The good news is that things have started to improve
  • There’s been a growing emphasis on empirical studies in software engineering research since the mid-1990s
  • At ICSE 2009, papers describing new tools or practices routinely included results from some kind of test study
  • Many of these studies are flawed or incomplete, but standards are constantly improving
  • It’s now almost impossible to publish a paper on a new tool or technology without trying it out in the real world
  • There’s the matter of the bias in the typical guinea pigs for these studies: undergrads who are hungry for free pizza

My Favourite Little Result

  • Anchoring and Adjustment in Software Estimation, a 2005 paper by Aranda and Easterbrook
  • They posed this question to programmers:
    “How long do you think it will take to make a change to this program?”
    • The control group was also told:
      “I have no experience estimating. We’ll wait for your calculations for an estimate.”
    • Experiment group A was told:
      “I have no experience estimating, but I guess this will take 2 months to finish.”
    • Experiment group B was told:
      “I have no experience estimating, but I guess this will take 20 months to finish.”
  • Here were the groups’ estimates:
    • Group A, the lowball estimate: 5.1 months
    • Control group: 7.8 months
    • Group B, the highball estimate: 15.4 months
  • The anchor (the “I guess it will take x months to finish”) mattered more than experience.
  • It was a small hint buried in the middle of the requirements, but it still had a big effect, regardless of estimation method or anything else
  • Engineers give back what they think we want to hear
  • Gantt charts, which are driven by these estimates, often end up being just wild-ass guesses in nice chart form
  • Are agile projects similarly affected, just on a shorter and more rapid cycle?
  • Do you become more percentage-accurate when estimating shorter things?
  • There’s no data to back it up!

Frequently Misquoted

  • You’ve probably heard this in one form or another:
    “The best programmers are up to 28 times more productive than the worst.”
  • It’s from Sackman, Erikson and Grant’s 1968 paper, Exploratory experimental studies comparing online and offline programming performance
  • This quote often has the factor changed – I’ve seen 10, 40, 100, or whatever large number pops into the author’s head
  • Problems with the study:
    • The study was done in 1968 and was meant to compare batch vs. interactive programming
      • Does batch programming resemble interactive programming?
    • None of the programmers had any formal training in computer programming because none existed then
      • (Although formal training isn’t always necessary – one of the best programmers I know was a rabbinical student, who said that all the arguing over the precise meaning of things is old hat to rabbis: “We’ve been doing this for much longer than you”.)
    • What definition of “productivity” were they using? How was it measured?
    • Comparing the best in any class to the worst in the same class exaggerates any effect
      • Consider comparing the best driver to the worst driver: the worst driver is dead!
    • Too small a sample size, too short an experimental period: 12 programmers for an afternoon
    • The next similar “major” study was done with 54 programmers, for “up to an hour”
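The point about comparing the best to the worst can be illustrated with a small simulation. This is only a sketch under stated assumptions: the task times are synthetic, the log-normal distribution and its spread are made up for illustration, and the function name `extreme_vs_typical` is hypothetical. Nothing here comes from the 1968 study itself. The sketch compares the best-to-worst ratio against the ratio between the upper and lower quartiles for groups of 12 and 54 simulated programmers.

```python
import random
import statistics


def extreme_vs_typical(n_programmers, trials=2000, seed=0):
    """Compare two ways of summarizing spread in simulated task times.

    Assumption (not from the study): every programmer's task time is drawn
    from the same log-normal distribution. Returns a pair:
    (median best-to-worst ratio, median upper-to-lower quartile ratio).
    """
    rng = random.Random(seed)
    extreme_ratios = []
    quartile_ratios = []
    for _ in range(trials):
        times = sorted(
            rng.lognormvariate(0.0, 0.5) for _ in range(n_programmers)
        )
        # Slowest time divided by fastest time: the "best vs. worst" figure.
        extreme_ratios.append(times[-1] / times[0])
        # Third quartile divided by first quartile: a more typical spread.
        quartiles = statistics.quantiles(times, n=4)
        quartile_ratios.append(quartiles[2] / quartiles[0])
    return statistics.median(extreme_ratios), statistics.median(quartile_ratios)


if __name__ == "__main__":
    for group_size in (12, 54):
        extreme, typical = extreme_vs_typical(group_size)
        print(group_size, round(extreme, 2), round(typical, 2))
```

In this toy model the best-to-worst ratio is always larger than the quartile ratio, and it tends to grow as the group gets bigger even though the underlying population never changes, which is one reason a headline "N times" figure says little on its own.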

So What Do We Know?

  • Look at Lutz Prechelt’s work on:
    • Productivity variations between programmers
    • Effects of language
    • Effects of web programming frameworks
  • Things his studies have confirmed:
    • Productivity and reliability depend on the length of the program’s text, independent of language level
  • The take-away: Use the highest-level language you can!
    • Might not always be possible: "Platform-independent programs have platform-independent performance"
    • Might require using more/faster/better hardware to compensate
    • That’s engineering – it’s what happens when you take science and economics and put them together


  • Discoveries from Boehm, McClean and Urfrig’s (1975) Some Experience with Automated Aids to the Design of Large-Scale Reliable Software:
    • Most errors are introduced during requirements analysis and design
    • The later a bug is removed, the more expensive it is to take it out
  • This explains the two major schools of software development:
    • Pessimistic, big-design-up-front school: “If we tackle the hump in the error injection curve, fewer bugs get into the fixing curve”
    • Optimistic, agilista school: Lots of short iterations means the total cost of fixing bugs goes down
  • Who’s right? If we find out, we can build methodologies on facts rather than best-sellers

Why This Matters

  • Too many people rely on the "unrefuted hypothesis based on personal observation"
  • Consider this conversation:
    • A: I’ve always believed that there are just fundamental differences between the sexes
    • B: What data are you basing that opinion on?
    • A: It’s more of an unrefuted hypothesis based on personal observation. I have read a few studies on the topic and I found them unconvincing…
    • B: Which studies were those?
    • A: [No reply]
  • Luckily, there’s a grown-up version of this conversation, and it takes place in the book Why Aren’t More Women in Science? Top Researchers Debate the Evidence (edited by Ceci and Williams)
    • It’s an informed debate on nature vs. nurture
    • It’s a grown-up conversation between:
      • People who’ve studied the subject
      • Who are intimately familiar with the work of the other people in the field with whom they are debating
      • Who are merciless in picking apart flaws in each other’s logic
    • It looks at:
      • Changes in gendered SAT-M scores over 20 years
      • Workload distribution from the mid-20s to early 40s
      • The Dweck effect
        • Have 2 groups do a novel task
        • Tell group A that success in performing the task is based on inherent aptitude
        • Tell group B that success comes from practice
        • Both groups will be primed by the suggestions and “fulfill the prophecy”
        • We send strong signals to students that programming is a skill inherent to males, which is why programming conferences are sausage parties
      • Facts, data and logic

Some Things We Know (and have proven)

Increase the problem complexity 25%, and you double the solution complexity. (Woodfield, 1979)

The two biggest causes of project failure (van Genuchten et al, 1991):

    • Poor estimation
    • Unstable requirements

If more than 20 – 25% of a component has to be revised, it’s better to rewrite it from scratch. (Thomas et al, 1997)

  • Caveats for this one:
    • It comes from a group at Boeing
    • Applies to software for flight avionics, a class of development with stringent safety requirements
    • Haven’t seen it replicated 

Rigorous inspections can remove 60 – 90% of errors before the first test is run (Fagan, 1975)

  • Study conducted at IBM
  • Practical upshot: Hour for hour, the most effective way to get rid of bugs is to read the code! 

Cohen 2006: All the value in a code review comes from the first reader in the first hour

  • 2 or more people reviewing isn’t economically effective
  • Also, after an hour, your brain is full
  • Should progress be made in small steps? Look at successful open source projects

Shouldn’t our development practices be built around these facts?

Conway’s Law – often paraphrased as “A system reflects the organizational structure that built it” – was meant to be a joke

Nagappan et al (2007) and Bird et al (2009) got their hands on data collected during the development of Windows Vista and learned:

  • Physical distance doesn’t affect post-release fault rates
  • Distance in organizational chart does
  • If two programmers are far apart in the org chart, their managers probably have different goals [Joey’s note: especially in companies like Microsoft where performance evaluations are metrics-driven]
  • Explains why big companies have big problems
  • I remember once being told by someone from a big company that "we can’t just have people running around doing the right thing."

Two Steps Forward

  • Progress sometimes means saying "oops"
  • We once thought that code metrics could predict post-release fault rates until El Emam et al (2001): The Confounding Effect of Class Size on the Validity of Object-Oriented Metrics, where it was revealed that:
    • Most metrics values increase with code size
    • If you do a double-barrelled correlation and separate the effect of lines of code from the effect of the metric, lines of code accounts for all the signal
  • It’s a powerful and useful result, even if it’s disappointing
  • It raises the bar on what constitutes "analysis"
  • We’re generating info and better ways to tackle problems
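The idea of separating the effect of lines of code from the effect of a metric can be sketched with a partial correlation on synthetic data. This is a minimal illustration, not the statistical method used in the paper: the data are invented, the helper names (`pearson`, `partial_correlation`, `synthetic_classes`) are hypothetical, and the data are deliberately generated so that both the metric and the fault count depend only on size.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def synthetic_classes(n=500, seed=1):
    """Invented data: metric and faults both grow with size, plus noise.

    By construction the metric has no influence on faults beyond size.
    """
    rng = random.Random(seed)
    sizes = [rng.uniform(50, 2000) for _ in range(n)]
    metric = [0.02 * s + rng.gauss(0, 5) for s in sizes]
    faults = [0.01 * s + rng.gauss(0, 3) for s in sizes]
    return sizes, metric, faults


if __name__ == "__main__":
    sizes, metric, faults = synthetic_classes()
    print("raw correlation:", round(pearson(metric, faults), 3))
    print("controlling for size:",
          round(partial_correlation(metric, faults, sizes), 3))
```

On data built this way, the raw correlation between the metric and faults looks strong, while the correlation that remains after controlling for size is close to zero. That is the shape of the result described above: a metric can appear predictive only because it tracks size.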

Folk Medicine for Software

  • Systematizing and synthesizing colloquial practice has been very productive in other disciplines – examples include:
    • Using science to derive new medicines from folk medicine in the Amazon
    • Practices in engineering that aren’t documented or taught in school
  • There’s a whole lot of stuff that people in industry do as a matter of course that doesn’t happen in school
  • If you ask people in a startup what made them a success, they’ll be wrong because they have only one point of data
  • If you’re thinking about grad school, it’s this area where we’ll add value

How Do We Get There?

  • One way is Beautiful Code, a book I co-edited
    • All proceeds from sales of the book go to Amnesty International
    • In its 34 chapters, each contributed by a different programmer, they explain a bit of code that they found beautiful
    • I asked Rob Pike, who contributed to the development of Unix and C, and he replied "I’ve never seen any beautiful code."
    • I asked Brian Kernighan, another guy who contributed to the development of Unix and C, and he picked a regex matcher that Rob Pike wrote in C
  • This has led to a series of "Beautiful" books
  • The next book is "the book without a name"
    • I would’ve called it Beautiful Evidence, but Edward Tufte got there first
    • The book will be about "What we know and why we think it’s true"
    • Its main point is knowledge transfer
    • I’m trying to change the debate
    • I want a better textbook, and I believe that this will be a more useful one
  • "I want your generation to be more cynical than it already is."
  • I want you to apply the same standards that you would to acne medication
    • How many of you trust research paid for by big pharma on their own products?
    • I want you to have higher standards for proof
  • The real reason all this matters? [Shows a slide with a picture of the world and his daughter]
    • “She is empirically the cutest kid in the world”
    • Public debate is mostly evidence-free
    • You’re not supposed to convict or release without evidence
    • I want my daughter to inherit a better world

What University is For

  • [Asks the audience] Hands up if you think you know what university is for
    • One student answers “A piece of paper!”
    • Greg’s reply: "Remember when I said I wanted you to be more cynical? Don’t go that far."
  • In my high school in interior BC, there was no senior math or physics course, because there was no interest
  • I went to university, hungry to learn
  • I later discovered that it wasn’t about the stuff you learned
  • Was it supposed to teach us how to learn?
  • It’s supposed to train us to ask good questions
  • Undergraduates are the price universities pay to get good graduate students
  • This thinking got me through the next 15 years
  • Then I went back to teaching, at U of T, and I think I know what universities are for now:
    • We are trying to teach you how to take over the world.
    • You’re going to take over, whether you want to or not.
    • You’re inheriting a world we screwed up
    • You’ll be making tough decisions, without sufficient experience or information
  • Thank you for listening. Good luck.

Comments from the Q&A Session

  • Who’s going to an anti-proroguing rally this weekend?
    • It won’t do you any good
    • You want to make change? It’s not made at rallies, but by working from within
    • The reason you can’t teach evolution in Texas isn’t because fundamentalists held rallies, but because they ran for the school board
    • We have to do the same
    • To make change, you have to get in power and play the game
    • “Dreadlocks and a nose ring does not get you into the corridors of power”
    • The ACM and IEEE are arguing your case
      • Join them and influence them!
  • Do we want to turn out engineers, who are legally liable for their work, rather than computer scientists, who are not?
    • Some of us will have professional designations and be legally liable, some of us won’t — there’s no one future

This article also appears in Canadian Developer Connection.