
Science 2.0: How Computational Science is Changing the Scientific Method

This article also appears in Canadian Developer Connection.

Victoria Stodden speaking at the Science 2.0 conference    

Here’s the third in a series of notes from the Science 2.0 conference, a conference for scientists who want to know how software and the web are changing the way they work. It was held on the afternoon of Wednesday, July 29th at the MaRS Centre in downtown Toronto and attended by 102 people. It was a little different from most of the conferences I attend, where the primary focus is on writing software for its own sake; this one was about writing or using software in the course of doing scientific work.

My previous notes from the conference:

This entry contains my notes from Victoria Stodden’s presentation, How Computational Science is Changing the Scientific Method.

Here’s the abstract:

As computation becomes more pervasive in scientific research, it seems to have become a mode of discovery in itself, a “third branch” of the scientific method. Greater computation also facilitates transparency in research through the unprecedented ease of communication of the associated code and data, but typically code and data are not made available and we are missing a crucial opportunity to control for error, the central motivation of the scientific method, through reproducibility. In this talk I explore these two changes to the scientific method and present possible ways to bring reproducibility into today’s scientific endeavor. I propose a licensing structure for all components of the research, called the “Reproducible Research Standard”, to align intellectual property law with longstanding communitarian scientific norms and encourage greater error control and verifiability in computational science.

Here’s her bio:

Victoria Stodden is the Law and Innovation Fellow at the Internet and Society Project at Yale Law School, and a Fellow at Science Commons. She was previously a Fellow at Harvard’s Berkman Center and postdoctoral fellow with the Innovation and Entrepreneurship Group at the MIT Sloan School of Management. She obtained a PhD in Statistics from Stanford University, and an MLS from Stanford Law School.

The Notes

  • My research has been on how massive computation has changed the practice of science and the scientific method
    • Do we have new modes of knowledge discovery?
    • Are the standards of what we consider knowledge changing?
    • Why aren’t researchers sharing?
    • One of my concerns is facilitating reproducibility
      • The Reproducible Research Standard
      • Tools for attribution and research transmission
  • Example: Community Climate Model
    • Collaborative system simulation
    • There are community models available
    • Built on open code, data
    • If you want to model something as complex as climate, you need data from different fields
    • Hence, it’s open
  • Example: High energy physics
    • Enormous data produced at LHC at CERN — 15 petabytes annually
    • Data shared through grid
    • CERN director: 10 – 20 years ago, we might have been able to repeat an experiment – they were cheaper, simpler and on a smaller scale. Today, that’s not the case
  • Example: Astrophysics
    • Data and code sharing, even among amateurs uploading their photos
    • Simulations: This isn’t new: even in the mid-1930s, they were trying to calculate the motion of cosmic rays in Earth’s magnetic field via simulation
  • Example: Proofs
    • Mathematical proof via simulation vs deduction
    • My thesis was proof via simulation – the results were not controversial, but the methodology was

Victoria Stodden and her "Really Reproducible Research" slide

  • The rise of a “Third Branch” of the Scientific Method
    • Branch 1: Deductive/Theory: math, logic
    • Branch 2: Inductive/Empirical: the machinery of hypothesis testing – statistical analysis of controlled experiments
    • Branch 3: Large-scale extrapolation and prediction – are we gaining knowledge from computation/simulations, or are they just tools for inductive reasoning?
    • Contention — is it a 3rd branch?
      • See Chris Anderson’s article, The End of Theory (Wired, June 2008)
      • Systems that explain the world without a theoretical underpinning?
      • There’s the “Hillis rebuttal”: Even with simulations, we’re looking for patterns first, then create hypotheses, the way we always have
      • Steve Weinstein’s idea: Simulation underlies both branches:
        • It’s a tool to build intuition
        • It’s also a tool to test hypotheses
      • Simulations let us manipulate systems that can’t fit in a lab
    • Controlling error is central to scientific process

Victoria Stodden at Science 2.0 and her "Top reasons not to share" slide

  • Computation is increasingly pervasive in science
    • In the Journal of the American Statistical Association (JASA):
      • In 1996: 9 out of 20 articles published were computational
      • In 2006: 33 out of 35 articles published were computational
  • There’s an emerging credibility crisis in computational science
    • Error control forgotten? Typical scientific computation papers don’t include code and data
    • Published computational science is near impossible to replicate
    • JASA June 1996: None of the computational papers provided any code
    • JASA June 2006: Only 3 out of the 33 computational articles made their code publicly available
  • Changes in scientific computation:
    • Internet: Communication of all computational research details and data is possible
    • Scientists often post papers but not their complete body of research
    • Changes coming: Madagascar, Sweave, individual efforts, journal requirements
  • A potential solution: Really reproducible research
    • The idea that the article is not the scholarship itself, but merely the advertisement of that scholarship
  • Reproducibility: can a member of the field independently verify the result?

Victoria Stodden at Science 2.0, with her "Controlling error" slide

  • Barriers to sharing
    • Took a survey of computational scientists
    • My hypotheses, based on the literature of scientific sociology:
      • Scientists are primarily motivated by personal gain or loss
      • Scientists are primarily worried about being “scooped”
  • Survey:
    • The people I surveyed were from the same subfield: Machine learning
    • They were American academics registered at a top machine learning conference (NIPS)
    • Respondents: 134 responses from 638 requests (21%, an impressive rate)
    • They were all from the same legal environment of American intellectual property
  • Based on comments, it’s in the back of people’s minds
    • Reported sharing habits
      • 32% made their code available on the web
      • 48% made their data available
      • 81% claimed to reveal their code
      • 84% claimed to reveal their data
      • Visual inspection of their sites revealed:
        • 30% had some code posted
        • 20% had some data posted
  • Preliminary findings:
    • Surprising: They were motivated to share by communitarian ideals
    • Surprising: They were concerned about copyright issues
  • Barriers to sharing: legal
    • The original expression of ideas falls under copyright by default
    • Copyright creates exclusive right of author to:
      • Reproduce work
      • Prepare derivative works
  • Creative Commons
    • Make it easier for artists to share and use creative works
    • A suite of licences that allows the author to determine the terms
    • Licences:
      • BY (attribution)
      • NC (non-commercial)
      • ND (no derived work)
      • SA (share-alike)
  • Open Source Software Licensing
  • Creative Commons follows the licensing approach used for open source software, but adapted for creative works
  • Code licences:
    • BSD licence: attribution
    • GPL: attribution and share-alike
  • Can this be applied to scientific work?
  • The goal is to remove copyright’s block to fully reproducible research
  • Attach a licence with an attribution to all elements of the research compendium

Victoria Stodden at the Science 2.0 conference and her "Real and Potential Wrinkles" slide

  • Proposal: Reproducible research standard
    • Release media components (text, data) under CC BY
    • Code: Modified BSD or MIT licence (attribution only)
  • Releasing data
    • Raw facts alone are generally not copyrightable
    • Selection or arrangement of data results in a protected compilation only if the end result is an original intellectual creation (US and Canada)
    • Subsequently qualified: facts not copied from another source can be subject to copyright protection
  • Benefits of RRS
    • Changes the discussion from “here’s my paper and results” to “here’s my compendium”
    • Gives funders, journals and universities a “hook”
    • If your funding is public, so should your work!
    • Standardization avoids licence incompatibilities
    • Clarity of rights beyond fair use
    • IP framework that supports scientific norms
    • Facilitation of research, thus citation and discovery
  • Reproducibility is Subtle
    • Simple case: Open data and small scripts – fits the simple definition of reproducibility
    • Hard case: Inscrutable code; organic programming
    • Harder case: Massive computing platforms, streaming sensor data
    • Can we have reproducibility in the hard cases?
    • Where are acceptable limits on non-reproducibility?
      • Privacy
      • Experimental design
    • Solutions for harder cases
      • Tools
  • Openness and Taleb’s criticism
    • Scientists are worried about contamination by amateurs
    • Also concerned about the “Prisoner’s dilemma”: they’re happy to share their work, but not until everyone else does

Science 2.0: A Web Native Research Record – Applying the Best of the Web to the Lab Notebook

This article also appears in Canadian Developer Connection.

Cameron Neylon and his "Creative Commons" slide at Science 2.0

Intro

Here’s the second of my notes from the Science 2.0 conference, a conference for scientists who want to know how software and the web are changing the way they work. It was held on the afternoon of Wednesday, July 29th at the MaRS Centre in downtown Toronto and attended by 102 people. It was a little different from most of the conferences I attend, where the primary focus is on writing software for its own sake; this one was about writing or using software in the course of doing scientific work.

My previous notes from the conference:

This entry contains my notes from Cameron Neylon’s presentation, A Web Native Research Record – Applying the Best of the Web to the Lab Notebook.

Here’s the abstract:

Best practice in software development can save researchers time and energy in the critical analysis of data but the same principles can also be applied more generally to recording research process. Successful design patterns on the web tend to be those that successfully couple people into efficient information transfer mechanisms. Can we re-think the way we create, keep, and share our research records by using these design patterns to make it more effective?

Here’s Cameron’s bio:

Cameron Neylon is a biophysicist who has always worked in interdisciplinary areas and is a leading advocate of data availability. He currently works as Senior Scientist in Biomolecular Sciences at the ISIS Neutron Scattering facility at the Science and Technology Facilities Council. He writes and speaks regularly on the interface of web technology with science and is well-known as one of the leading proponents of open science.

The Notes

  • Feel free to copy and remix this presentation – it’s licenced under Creative Commons

 

  • What is the web good for?
    • Publish
    • Subscribe
    • Syndicate
    • Remix, mash up and generally do stuff with content
    • Collaborate
  • What do scientists do?
    • Publish
    • Syndicate (CRC books are a form of syndication)
    • Remix (take stuff from different disciplines, pull things together, remix them)
    • Validate
    • Collaborate
  • So, with this overlap, the web has solved science problems, right?
    • No — papers are dead, broken and disconnected
      • Papers don’t have links
      • The whole scientific record is fundamentally a dead document
    • The links between things make the web go round
    • I want to make science less like a great big monolithic document and make it more like a network of pieces of knowledge, wired together:
      • Fragments of science
      • Loosely coupled
      • Tightly wired

Cameron Neylon and his "Fragments of science / Loosely coupled / Tightly wired" slide at Science 2.0

  • What is a “fragment of science”?
    • A paper is too big a piece, even if it is the "minimal publishable unit"
    • A tweet is too small
    • A blog post would be the right size
  • His lab book is a collection of various electronic documents:
    • Excel files
    • Some basic version control
    • Data linked back to description of process used to create the data
    • As far as possible, the blogging is done automatically by machines
    • It doesn’t have to be complicated
  • [Shows a scatter plot, with each point representing an experiment]:
    • Can we tell an experiment didn’t work by its position on the graph?
    • We can tell which experiments weren’t recorded properly – they have no links to other experiments
  • The use of tagging and “folksonomies” goes some way, but how do you enforce it?
    • Tags are inconsistent – not just between people, but even within a single person: you might tag the same thing differently from day to day
    • Templates create a virtuous circle, a self-assembling ontology
    • We found that in tagging, people were mixing up process and characteristics – this tells us something about the ontology process

Cameron Neylon and his "Physical objects / Digital objects" slide at Science 2.0

  • Put your data in external services where appropriate
    • Flickr for images
    • YouTube for video
    • RCSB PDB (the Protein Data Bank) for protein structures
    • Chemspider
    • Even Second Life can be used as a graphing medium!
    • All these services know how to deal with specific data types
  • Samples can be offloaded
    • LIMS, database, blogs, wiki, spreadsheet
    • Procedures are just documents
    • Reuse existing services
    • Semantic feed of relationships — harness Google: most used is the top result
  • Semantic web creates UI issues
    • Just trying to add meaning to results is one step beyond what scientists are expected to do
    • We need a collaborative document environment
    • The document environment must feel natural for people to work in
    • When they type something relevant, the system should realize that and automatically link it
    • We’re at the point where doc authoring systems can use regular expressions to recognize relevant words and autolink them
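As a toy illustration of that last point, here’s a minimal sketch of regex-based autolinking (my own example, not Cameron’s – the vocabulary terms and URLs below are made up): terms from a known vocabulary are recognized in free text and replaced with links to their records.

```python
import re

# Hypothetical vocabulary mapping terms to record URLs; a real system would
# pull this from the lab's sample database or ontology.
VOCAB = {
    "lysozyme": "https://example.org/samples/lysozyme",
    "PCR": "https://example.org/procedures/pcr",
}

def autolink(text):
    # Build one regex that matches any vocabulary term as a whole word.
    pattern = re.compile(r"\b(" + "|".join(re.escape(term) for term in VOCAB) + r")\b")
    # Replace each recognized term with an HTML link to its record.
    return pattern.sub(lambda m: f'<a href="{VOCAB[m.group(1)]}">{m.group(1)}</a>', text)

print(autolink("Ran PCR on the lysozyme sample."))
```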

Cameron Neylon and his "Open" slide at Science 2.0

  • The current mainstream response to these ideas is:
    • It runs the gamut from "You mean Facebook?" to horror
    • I’m not worried about these ideas not getting adopted
  • Scientists are driven by impact and recognition
    • How do we measure impact?
      • Right now, we do this by counting the number of papers for which you’re an author
      • Most of my output is not published in traditional literature; it’s published freely on the web for other people to use
      • If they’re not on the web, they effectively disappear
      • The future measure of your scientific impact will be its effect on the global body of knowledge
      • Competition will drive adoption

Science 2.0: Choosing Infrastructure and Testing Tools for Scientific Software Projects

Titus Brown at the podium at MaRS, delivering his presentation.

Here’s the first of my notes from the Science 2.0 conference, a conference for scientists who want to know how software and the web are changing the way they work. It was held on the afternoon of Wednesday, July 29th at the MaRS Centre in downtown Toronto and attended by 102 people. It was a little different from most of the conferences I attend, where the primary focus is on writing software for its own sake; this one was about writing or using software in the course of doing scientific work.

This entry contains my notes from C. Titus Brown’s presentation, Choosing Infrastructure and Testing Tools for Scientific Software Projects. Here’s the abstract:

The explosion of free and open source development and testing tools offers a wide choice of tools and approaches to scientific programmers.  The increasing diversity of free and fully hosted development sites (providing version control, wiki, issue tracking, etc.) means that most scientific projects no longer need to self-host. I will explore how three different projects (VTK/ITK; Avida; and pygr) have chosen hosting, development, and testing approaches, and discuss the tradeoffs of those choices.  I will particularly focus on issues of reliability and reusability juxtaposed with the mission of the software.

Here’s a quick bio for Titus:

C. Titus Brown studies developmental biology, bioinformatics and software engineering at Michigan State University, and he has worked in the fields of digital evolution and physical meteorology. A cross-cutting theme of much of his work has been software development for computational science, which has led him to software testing and agile software development practices. He is also a member of the Python Software Foundation and the author of several widely-used Python testing toolkits.

  • Should you do open source science?
    • Ideological reason: Reproducibility and open communication are supposed to be at the heart of good science
    • Idealistic reason: It’s harder to change the world when you’re trying to do good science and keep your methods secret
    • Pragmatic reason: Maybe having more eyes on your project will help!
  • When releasing the code for your scientific project to the public, don’t worry about which open source licence to use – the important thing is to release it!
  • If you’re providing a contact address for your code, provide a mailing list address rather than your own
    • It makes it look less “Mickey Mouse” – you don’t seem like one person, but a group
    • It makes it easy to hand off the project
    • Mailing lists are indexed by search engines, making your project more findable
  • Take advantage of free open source project hosting

 

  • Distributed version control
    • “You all use version control, right?” (Lots of hands)
    • For me, distributed version control was awesome and life-changing
    • It decouples the developer from the master repository
    • It’s great when you’re working away from an internet connection, such as if you decide to do some coding on airplanes
    • The distributed nature is a mixed blessing
      • One downside is "code bombs", which are effectively forks of the project, created when people don’t check in changes often enough
      • Code bombs lead to complicated merges
      • Personal observation: the more junior the developer, the more they feel that their code isn’t “worthy” and they hoard changes until it’s just right. They end up checking in something that’s very hard to merge
    • Distributed version control frees you from permission decisions – you can simply say to people who check out your code "Do what you want. If I like it, I’ll merge it."

 

  • Open source vs. open development
    • Do you want to simply release the source code, or do you want participation?
      • I think participation is the better of the two
    • Participation comes at a cost, in both support time and attitude
      • There’s always that feeling of loss of control when you make your code open to use and modification by other people
      • Some professors hate it when someone takes their code and does "something wrong" with it
      • You’ll have to answer “annoying questions” about your design decisions
      • Frank ("insulting") discussion of bugs
      • Dealing with code contributions is time-consuming – it takes time to review them
    • Participation is one of the hallmarks of a good open source project

 Slide: "The Stunning Realization"

  • Anecdote
  • I used to work on the “Project Earthshine” climatology project
    • The idea behind the project was to determine how much of the sunlight hitting the Earth was being reflected away
    • You can measure this by observing the crescent moon: the bright part is lit directly by the sun; the dark part is also lit – by sunlight reflected from the Earth
    • You can measure the Greenhouse Effect this way
    • It’s cheaper than measuring sunlight reflected by the Earth directly via satellite
  • I did this work at Big Bear Lake in California, at a solar observatory where telescopes were set up to measure this effect
  • I went through the source code of the application they were using, trying to figure out what the grad student who worked on it before me had done
  • It turned out that to get “smooth numbers” in the data, his code applied a correction several times
  • His attitude was that there’s no such thing as too many corrections
  • "He probably went on to do climate modelling, and we know how that’s going"
  • How do we know that our code works?
    • We generally have no idea whether our code works; all we do is gain hints
    • And what does "works" mean anyway, in the context of research programming? Does it mean that it gives results that your PI expects?
  • Two effects of that Project Earthshine experience:
  • Nowadays, if I see agreement between 2 sources of data, I think at least one of them must be wrong, if not both
  • I also came to a stunning realization that:
    • We don’t teach young scientists how to think about software
    • We don’t teach them to be suspicious of their code
    • We don’t teach them good thought patterns, techniques or processes
    • (Actually, CS folks don’t teach this to their students either)
  • Fear is not a sufficient motivator: there are many documented cases where things have gone wrong because of bad code, and they will continue to do so. Famous cases include:
  • If you’re throwing out experimental data because of its lack of agreement with your software model, that’s not a technical problem, that’s a social problem!

 

  • Automated testing
    • The basic idea behind automated testing is to write test code that runs your main code and verifies that the behaviour is expected
    • Example – regression test
      • Run program with a given set of parameters and record the output
      • At some later time, run the same program with the same parameters and record the output
      • Did the output change in the second run, and if so, do you know why?
      • This is a different question from "is my program correct?"
      • If results change unintentionally, you should ask why (see the first sketch after this list)
    • Example – functional test
      • Read in known data
      • Check that the known data matches your expectations
      • Does your data loading routine work?
      • It works best if you also test with "tricky" data (see the second sketch after this list)
    • Example – assertions
      • Put "assert parameter >=0" in your code
      • Run it
      • Do I ever pass garbage into this function?
      • You’ll be surprised that things that "should never happen", do happen
      • Follow the classic Cold War motto: “Trust, but verify”
    • There are other kinds of automated testing (acceptance testing, GUI testing), but they don’t usually apply to scientists
    • In most cases, you don’t need to use specialized testing tools
    • One exception is a code coverage tool
      • Answers the question “What lines of code are executed?”
      • Helps you discover dead code branches
      • Guide test writing to untested portions of code
    • Continuous integration
      • Have several "build clients" building your software, running tests and reporting back
      • Does my code build and run on Windows?
      • Does my code run under Python 2.4? Debian 3.0? MySQL 4?
      • Answers the question: “Is there a chance in hell that anyone else can use my code?”
    • Automated testing locks down "boring" code (that is, code you understand)
      • Lets you focus on "interesting" code – tricky code or code you don’t understand
      • Freedom to refactor, tinker, modify, for you and others
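Here’s a minimal sketch of the regression-test idea from the list above (mine, not Titus’s). The `simulate` function, its parameters and the reference file name are all hypothetical stand-ins for whatever program a project actually runs; the point is only the record-then-compare pattern.

```python
import json

def simulate(alpha, n):
    """Hypothetical stand-in for the real computation under test."""
    return [alpha * i for i in range(n)]

def regression_test(reference_file="reference_output.json", record=False):
    # Run the program with a fixed set of parameters.
    result = simulate(alpha=0.5, n=10)
    if record:
        # First run: record the output produced with these parameters.
        with open(reference_file, "w") as f:
            json.dump(result, f)
        return
    # Later runs: the same parameters should reproduce the recorded output.
    with open(reference_file) as f:
        expected = json.load(f)
    assert result == expected, "Output changed since it was recorded -- do you know why?"

if __name__ == "__main__":
    regression_test(record=True)   # run once to record the baseline
    regression_test()              # run again later to compare
```

Note that this only tells you the output changed, not that either version was correct; a coverage tool can then tell you which lines such tests actually exercise.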
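And here is a second sketch combining the functional-test and assertion ideas; the data-loading routine and its file format are made-up examples, not anything from the talk.

```python
import os
import tempfile

def load_counts(path):
    """Hypothetical data-loading routine: one non-negative integer per line."""
    with open(path) as f:
        counts = [int(line) for line in f if line.strip()]
    for value in counts:
        # "Trust, but verify": garbage should never make it past this point.
        assert value >= 0, f"negative count in {path}: {value}"
    return counts

def functional_test():
    # Read in known data, including a "tricky" case (a blank line),
    # and check that it matches expectations.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("3\n0\n\n42\n")
        path = f.name
    try:
        assert load_counts(path) == [3, 0, 42]
    finally:
        os.remove(path)

if __name__ == "__main__":
    functional_test()
    print("functional test passed")
```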

C. Titus Brown delivering his presentation at MaRS 

  • If you want to suck people into your open source project:
    • Choose your technology appropriately
    • Write correct software
    • Automated testing can help
  • Closed source science is not science
    • If you can’t see the code, it’s not falsifiable, and if it’s not falsifiable, it’s not science!