In this episode, Dr. Bryce Meredig and Prof. Murray-Rust discuss:
- How Peter’s research background in crystallography inspired him to lead the development of tools and communities around open science and open data
- Lessons the materials and chemistry communities can learn from bioscience to create a more open community in scientific publishing
- The impact that open data and open research can have on accelerated industrial materials development
- The role of public funding and policymaking on encouraging a more open scientific community
- The importance of machine-readable data and semantic databases in the physical sciences
- Dr. Murray-Rust’s non-profit ContentMine, which seeks to unlock scientific data through advocacy, community, and software development
“The multiplying factor of the Human Genome Project was over 100x. For every $1 million invested, it led to over $100 million of value created downstream…There’s no doubt that funding these sorts of things leads to a huge amount of realizable public good.” — Dr. Peter Murray-Rust
Dr. Peter Murray-Rust is the Reader Emeritus in Molecular Informatics at the University of Cambridge and Senior Research Fellow Emeritus of Churchill College, where he brings together tools from computer science to chemistry, biosciences and earth sciences, integrating humans and machines in managing information.
Peter has held multiple faculty positions throughout his career, first as a lecturer at the University of Stirling, and later as Professor of Pharmacy at the University of Nottingham. He also led molecular graphics, computational chemistry, and protein structure determination efforts at Glaxo Group Research.
In addition to his industrial and academic work in chemistry and molecular informatics, Peter is well known for his support and work on open access and open data. He led the development of the Chemical Markup Language, co-authored the Panton Principles for open scientific data, and co-founded the Blue Obelisk community to promote open data and develop open source cheminformatics tools. In 2014, Dr. Murray-Rust was granted a Shuttleworth Foundation fellowship in support of his work leading the non-profit ContentMine, where he and his team develop tools to mine literature to make scientific data open and accessible.
Bryce Meredig: Welcome to DataLab, a materials informatics podcast with Bryce Meredig, Chief Science Officer at Citrine Informatics. Our guest today is Dr. Peter Murray-Rust. Dr. Murray-Rust is the Reader Emeritus in Molecular Informatics at the University of Cambridge and Senior Research Fellow Emeritus of Churchill College, where he brings together tools from computer science to chemistry, biosciences, and earth sciences with the goal of integrating humans and machines in managing information. Prior to joining Cambridge, Peter was on the faculty of the University of Stirling and also the University of Nottingham. He also spent some time in industry at Glaxo Group Research. In addition to his industrial and academic work in chemistry and molecular informatics, Peter is well known for his support and work on open access and open data. He led the development of the Chemical Markup Language, coauthored the Panton Principles for open scientific data, and co-founded the Blue Obelisk community to promote open data and develop open source cheminformatics tools. In 2014 Dr. Murray-Rust was granted a Shuttleworth Foundation Fellowship in support of his work leading the non-profit ContentMine, where he and his team develop tools to mine literature to make scientific data open and accessible.
Bryce Meredig: Peter, thank you so much for joining us and welcome to the podcast.
Peter Murray-Rust: Thank you.
Bryce Meredig: Now, we often like to start with a fun fact about our guests, and, Peter, I learned on your Twitter that at one point you showed up to vote in Cambridge wearing a bear suit. I would love to hear a bit more about that story.
Peter Murray-Rust: Well, actually every time I go to vote I go in the bear suit. And you can see my Twitter avatar. Some years ago I was concerned whether people had to show their face when they voted in the UK. So I thought I’ll turn up in the bear suit and see if they ask me to remove my head, and I’ve done this ever since. Almost always they do ask me to remove my head, but there was one year when somebody else also voted in the bear suit. My son picked that up and everybody was very happy about it.
Bryce Meredig: So you’ve started a bit of a movement in Cambridge then?
Peter Murray-Rust: Not really. (laughs)
Bryce Meredig: Peter, of course I want to dive into your background in cheminformatics, open data, and open science. What has inspired you to pursue your work in advocacy in these areas? Why do you feel they’re so important?
Peter Murray-Rust: Well, probably about 45 years ago I was doing crystallography, and it’s an exciting subject. I was getting results, but it was clear that in a small university I wasn’t going to be able to create a lot of results by myself. Our groups were very small.
Peter Murray-Rust: So I thought, why don’t I see what’s in the literature and do a search using other people’s published data instead of collecting data myself? I found there was actually a lot of data in the literature on crystal structures. To put this together, I spent time with Jack Dunitz and Hans Buergi in Zurich, and we developed patterns of extracting themes and correlations from the crystallographic literature. That has sort of led to crystallography being very much aware of data. It coincided with the Cambridge Crystallographic Database, and in 1978 I worked with Sam Motherwell to show that you could actually use machines to do science with what was then essentially an archival database.
Peter Murray-Rust: So I’ve always felt that science publishes stuff in an archival manner, sometimes in a reading matter, but not in a way where it all comes together as a global knowledge object, and that’s my passion. During this talk there will probably be something of the order of 100 papers published in science. Maybe 200. And of those perhaps 30, 40 would be of relevance to materials. That’s the rate at which it’s coming out.
Peter Murray-Rust: So we have more data in the public literature than we know what to do with unless we use machines. And I just feel it’s such an opportunity to bring all this together.
Bryce Meredig: Since your earliest work pioneering this field how have you seen the capabilities and the technology evolve with advances in computation?
Peter Murray-Rust: Well, it hasn’t kept up with computation. Computation has followed an exciting Moore’s Law, but in terms of data, although data also follows a Moore’s Law, there hasn’t been the development of knowledge tools to make it useful.
Peter Murray-Rust: So we’re still seeing data published in a way that Victorian scientists, 19th century scientists, would understand, rather than 21st century scientists connected through so much of the electronic media. Facebook and Amazon and all these people understand the modern world, but scientific publishing still thinks it’s done on paper.
Bryce Meredig: What do you think is holding us back there? What are some of the barriers that are preventing us from looking more like Facebook or Amazon when it comes to scientific data?
Peter Murray-Rust: Well, some of it is a lack of vision. The point is that when people publish, they publish a single paper and they expect people to read it. They don’t think, here’s a paper within the context of, let’s say, 500 papers, or 1,000 papers. I actually now work with communities who read 30,000 papers. They do systematic reviews, particularly in medicine, and so the publication is not something in itself; it’s part of a much larger whole. That has not got through to almost all people in funding, publication, academia, or whatever. It has very slightly in some sciences such as bioscience, but generally we’re trapped in this idea that the paper copy is what matters.
Bryce Meredig: So you would say then that perhaps the biosciences are ahead of chemistry and materials science in these respects?
Peter Murray-Rust: Yes. They’re ahead because for probably about 40 years they have built public databases. The first of these was the Protein Information Resource back in the early 70’s, when people started recording protein sequences in the literature because they realized that this was valuable. It was at that time that people started doing sequence alignment and saying that if you knew the sequence of one protein and what it did, and you had a similar protein and didn’t know what it did, it probably had a similar effect in that organism, and so forth.
Peter Murray-Rust: And there’s been this absolute imperative in bioscience that genes, protein structures, nucleic acids, cell processes, components, all sorts of things were so valuable that they shouldn’t be hidden away; they should be put into databases.
Peter Murray-Rust: Almost all the effort in this comes from public funding, and probably the great success of this was the human genome in the late 1990’s and 2000. That’s what’s inspired the idea of the materials genome, but the materials genome hasn’t really caught up with the volume of excitement, the multidisciplinarity, and the interrelationships that the bioscience community has.
Bryce Meredig: Now of course the materials genome was an initiative launched in 2011 here in the United States announced by the White House and adopted by some of the US federal funding agencies. Is there something like that going on in the United Kingdom as well?
Peter Murray-Rust: Not really. We actually were a little bit ahead: in the 2000’s there was a big impetus in the UK for e-science. The government put a lot of money into it and I was one of the beneficiaries, and we developed tools and protocols that allowed us to do much computational science in an electronic environment, so it was about workflow, about semantics, about job scheduling and so on.
Peter Murray-Rust: And what is particularly good about the UK is that it’s actually very good at collaboration, so we had nine centers that linked together, and we had lots of joint meetings. We shared tools and so forth. So that sort of laid the basis for this. In the US it was often called cyber science; in Australia it’s called e-research.
Peter Murray-Rust: So yes we had something similar, but we don’t have the same level of countrywide funding at the moment.
Bryce Meredig: From my perspective, a lot of the time the discussion around open data and open science, for example around government-funded research, centers on the academic and government research apparatus. But of course you spent time in industry, and Citrine is a company. How do you think some of these themes are relevant to commercial enterprises?
Peter Murray-Rust: Very interesting question. I’m reading people’s opinions at the moment that say most pharmaceutical research is actually funded by public monies and so on, and there’s no doubt in my mind that public funding of research is critically important to the development of an industry.
Peter Murray-Rust: I was very fortunate when I was in Glaxo, now GSK, because I was able to take part in an industry-academia club called the Protein Engineering Club, where the companies put money in, the government put money in, and we did projects which were aimed at benefiting industry by using the innovation from academia. For example, I was able to spawn two projects, one on crystallization techniques, which you would never really get in responsive mode funding, and the other on databases, which at that time was also pretty new. So it was infrastructural funding which helped these things get off the ground, and I think any government funding that does that is going to be welcomed by industry and lead to a general upwelling of value in both directions.
Peter Murray-Rust: You probably know that it’s been computed that the multiplier factor for the human genome was of the order of over 100 times; in other words, for every billion put into the human genome project, 100 billion of value appeared downstream. Now I think it’s probably slightly over the top but there’s no doubt that funding these sorts of things does lead to a huge amount of realizable public good.
Bryce Meredig: There is a tremendous amount of excitement right now in the materials science field around AI and machine learning, but as I’m sure you would say based on your time at Glaxo, a lot of these ideas have existed for some time in the pharmaceutical sector and in chemistry under the banner of cheminformatics or related disciplines. Have you seen much cross-pollination between these fields, or why might chemistry have gotten out ahead of materials science in some of these respects?
Peter Murray-Rust: Well, I believe in the value of AI; I don’t believe in the hype that there is, and often it’s difficult to distinguish between AI and machine learning. I’m excited about the new tools that have come out in the last two or three years like TensorFlow and Apache Spark and so on, which make this accessible to everybody. But ultimately most of that depends on having semantic data annotated so machines can learn from it. Many of the examples we see, the popular things like image recognition and so on, work because there’s a lot of data out there and they found ways of capturing people’s annotations.
Peter Murray-Rust: Now if we take something like materials science data and look at it in a graphical form, so you’ve got something like a phase diagram or a diffractogram or whatever it might be, yes, you can certainly apply machine learning to that, but the more you know about the representation of the knowledge and the science, the more productive the machine learning will be. The idea that you can take huge amounts of data and throw it into a machine and not think about it has been common for 40 years. I’ve seen so much of this where people say, “Just throw it into the machine and you get answers out.” And the answer is that we’ll mainly get rubbish out of it.
Peter Murray-Rust: So you need to think about what you’re doing, and I think the key thing is developing a knowledge representation. What we need, I believe, is a semantic description of materials, and that’s why Henry Rzepa and I developed Chemical Markup. It was to say, when you’ve got a piece of knowledge, we can represent it in semantic form, and that means that machines can handle it happily. But if you get it from a database and the units are missing, or if you get a figure from a paper and you don’t know what the axis really represents, you can’t make progress there. So moving towards communal validated semantics is, in my view, a necessary part of the process.
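[Editor’s illustration of the point about missing units. This is a minimal sketch under stated assumptions: the `Measurement` record, the `TO_KELVIN` table, and the `to_kelvin` function are hypothetical names invented here, and they are not part of Chemical Markup Language or any ContentMine tool. The sketch only shows that a value becomes machine-usable when its property name and units travel with it, and that a bare number has to be rejected rather than guessed at.]

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical conversion table, for illustration only.
TO_KELVIN = {
    "K": lambda v: v,
    "degC": lambda v: v + 273.15,
}


@dataclass(frozen=True)
class Measurement:
    property_name: str            # what was measured, e.g. "melting_point"
    value: float                  # the raw number
    units: Optional[str] = None   # None models a database entry with missing units


def to_kelvin(m: Measurement) -> float:
    """Convert a temperature to kelvin, refusing to guess when units are absent."""
    if m.units is None:
        raise ValueError(f"{m.property_name}: units missing, cannot interpret {m.value}")
    if m.units not in TO_KELVIN:
        raise ValueError(f"{m.property_name}: unknown units {m.units!r}")
    return TO_KELVIN[m.units](m.value)


# An annotated value can be processed; a bare number cannot.
annotated = Measurement("melting_point", 100.0, "degC")
bare = Measurement("melting_point", 100.0)
```

[Calling `to_kelvin(annotated)` converts the value, while `to_kelvin(bare)` raises a `ValueError`. Failing loudly is a design choice in this sketch: it keeps a number with no units from silently entering a dataset.]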
Bryce Meredig: I think you brought up a great point, which is how important it is for us to have machine-readable data in materials science and chemistry, really all of the sciences, and you’ve alluded to some of the work going on in creating semantic databases for these domains. In your view, where does the community stand right now?
Peter Murray-Rust: Well, I’ve been involved in materials science, but I’m not very involved on a day-to-day basis. At the first materials science meeting in Melbourne about five or six years ago, which my colleague Nico Adams ran, I got the impression that not very much had started at that point. The area where it’s best understood is crystallography. I’m a crystallographer, and the crystallographic community has put a huge amount of effort into building communal semantics.
Peter Murray-Rust: If you take chemistry, there’s actually very little public, communal, reusable semantics. Much of the pressure comes from pharmaceuticals, and there the unit of discourse is the molecule, normally the isolated connection table in the gas phase. This leads to things like representation by the InChI, you know, a canonical form of the molecule, and that’s fine, and I and others helped in that, but it doesn’t work for anything beyond molecules. It doesn’t work for the crystalline state, for surfaces, for polymers, for nano objects, and so on, and in my view we really have to start building a language for that. I’d be very happy to help with that because we sort of foresaw that in Chemical Markup Language; the bits are all there if people want them.
Peter Murray-Rust: So to appeal to anyone who’s listening to this, if you’re interested in semantic materials science then I’d love to talk to you and see if there are things where we can move forward together.
Bryce Meredig: I think it’s often bemoaned in the materials science domain that we lack tools or representations like SMILES and InChI strings. Who, in your view, are the right people to put this together? Is this sort of a top-down process, or should it be bottom-up, where it’s individual end-user scientists who are doing it? What do you think is the most efficient way to get us to a better place in that respect?
Peter Murray-Rust: Well first of all it’s hard. Secondly it’s boring. Thirdly it probably needs funding to keep it going year after year, it’s not something where you can have a two year project, do it, and it will keep going. It won’t keep going so you’ve got to have that heartbeat of the project driving it and that’s what the crystallographers have, it’s what the bioscience community has. I would say that it probably needs an institution of some sort. It could be done by a relatively altruistic university, but they’re rarer and rarer because of the requirements to compete and the fact that this is not glamour science and doesn’t get you impact points.
Peter Murray-Rust: It’s particularly appropriate when you’ve got national or international laboratories like NIST or the National Physical Laboratory or whoever, and they do a good job here, but I think many of them feel the funding squeeze as well. A lot of work went on in the Eastern Bloc in Europe, collecting data on, in this case, things like liquid combinations and boiling points and such.
Peter Murray-Rust: The thing is, the idea is still there: collect together and systematize the properties. It’s the properties that matter. You need a dictionary of properties, and you need a semantic representation of what type of property it is. One of the people I worked with was [Henry Keheinen], who built a set of semantic representations particularly for liquid liquids, but the process would work for solids as well.
Peter Murray-Rust: We’ve got to do this sort of thing, and it’s a political task. Somebody’s got to get up and say, look, this industry needs it. The industry needs to come together and say, look, it is more important to work together than to hide and compete and make things difficult and so on, because ultimately everybody benefits. We see this in the subjects I’ve mentioned, where people are able to build on other people’s representations rather than everybody doing it themselves.
Peter Murray-Rust: A bottom-up approach will not generally work because it doesn’t have the push from the people who really need it, who are industry and the national labs.
Bryce Meredig: Now of course you are leading an important initiative around extracting or collecting scientific data from the literature, which is your ContentMine project. Could you tell us a little bit more about that?
Peter Murray-Rust: Yes. So when I retired from Cambridge, I had worked to build semantic tools which would, if the data were available, take that data and turn it into semantic form and so on, and I thought it’s so important to keep doing them. I looked for ways of doing it and about five years ago I applied to the Shuttleworth Foundation which is a non-profit which sponsors in their words “amazing people to change the world.”
Peter Murray-Rust: So this is not just in science; it’s in things like student debt, or injustice in the Middle East, or whatever. I was leading the fellowship to essentially assert that public scientific data should belong to the world and should not be tied up in journals and should not be hidden away and sold, because the whole world was impoverished.
Peter Murray-Rust: So I got two years of funding from this to build up ContentMine, which is based on three pillars. One is advocacy, so A is for advocacy and activism: trying to get people to realize this. So this is exactly what we’re doing, activism and advocacy, saying this really matters and so on.
Peter Murray-Rust: The second is community because you need a community of practice which develops this and looks to what sort of ways are going to work, what sort of tools work, what’s needed getting fresher ideas and so on. I’ve worked to try and build up communities.
Peter Murray-Rust: The third is tools. So I’m developing tools which read the literature and read things like diffraction diagrams, possibly two-dimensional phase diagrams, things of that sort, and across different subjects I can interpret ten diagrams in a second.
Peter Murray-Rust: So if you’ve got the right sort of software you can go through the literature and extract this sort of data. The idea is that the paper based literature leads to a huge communal semantic resource. The problem is that very few people get this vision and unfortunately there are many people who think it’s going to be a problem for them, particularly certain large publishers which I won’t name, but who see the value in keeping data to themselves and selling a data or metadata service for money.
Peter Murray-Rust: Now I’m not against people selling metadata services so long as the data itself is open, but what I do have a problem with is closed, walled gardens where the data is only accessible through a gatekeeper. It’s not just the expense, it almost normally leads to a lack of innovation, falling behind in the type of technology that you use.
Peter Murray-Rust: Where things are open, people develop new tools; where they’re closed, everybody reinvents the wheel and it’s expensive and poor quality. So the idea is to build this world. The technology is there: I can read 10,000 papers a day, I can extract data from them and publish it. But if I do this I will probably have lawsuits from publishers who claim that it is in some way upsetting them, and many universities are so risk averse that they actually side with the publishers and cut people off.
Peter Murray-Rust: Now I haven’t been cut off because I don’t actually do this en masse, but other people have been cut off for mining materials they took from the literature, which is a perfectly legal thing, but the publishers are trying to stop it.
Peter Murray-Rust: I’d like to see the culture change, and this podcast is maybe, I’d guess, another step towards changing that culture. I’d like people to say, look, we really want that data out there so we can use it. Let’s actually regard data as a public good, because data are not copyrightable. So this is not a legal matter; it’s restrictive practices.
Bryce Meredig: Now along these lines my understanding is that back in 2014 the UK created a text and data mining exception to existing copyright law, so I’d be interested to hear from you, have you seen the environment around those topics change in the United Kingdom since the law was changed?
Peter Murray-Rust: No. Absolutely nothing has happened. I’ve been very saddened by this, because one of the things I believed when I set up ContentMine was that we would have this bright new future of mining the literature because it was now legal. So I built tools, but the universities in general say, well, we don’t want you to use these tools because it might get us cut off by the publishers and that will upset the academics. So you need some brave universities who say, “This actually matters to us,” rather than being dictated to by the mega-publishers.
Peter Murray-Rust: So nothing has happened. I have done more than anybody else in this area. There are no meetings, there are no movements, no protocols, no funding in the UK at all.
Bryce Meredig: Peter, looking ahead in the fields that you’re most involved in, text mining, cheminformatics, open data, open science, what are some developments that you can see on the horizon that you’re particularly excited about?
Peter Murray-Rust: Well, I think the most exciting one from the political point of view is preprints. I think preprints change the dynamics because preprints do not belong to publishers. Technically, mining PDFs is a horrible business, but we’ve actually developed tools in ContentMine which can do it. So we see preprints as a very exciting thing because, first, they come out immediately, so they’re often months, maybe even more, ahead of the “print or e-print” publication, and second, they’re not restricted by copyright. So if people wanted, and maybe we can follow this up afterwards, we can build a materials or chemical preprint reader that extracts and indexes everything that’s published every day. We could, for example, work out all the compounds, all the materials, the spectra.
Peter Murray-Rust: One of the things I’m doing: in a week or two I’ve got some volunteers, we’re meeting in Cambridge, and we’re going to look at what we need to extract spectra from supplemental data. But this of course is volunteers and early adopters and so on, and really we need people who believe in the results, people who say, well, we would really like all these spectra, so what are we going to do to get them? We would really like all these diffractograms, all of these lists of material properties or whatever. Because it can be done; it just needs a group of people who say this really matters, so let’s go and do it.
Bryce Meredig: Peter, we are coming up on our time here. I want to thank you again for joining us, and I also wanted to give you the opportunity: if folks want to get in touch with you and learn more about your work, what’s the best way for them to do that?
Peter Murray-Rust: Well, the best way is website contentmine.org. You can also email me, firstname.lastname@example.org is probably the best thing to do. And if you’ve got an idea of data of any sort in the materials or chemical literature that you would like to extract en masse, then just mail me. That’s the simplest thing, and we’ll consider each case on its merits.
Peter Murray-Rust: I’ll be upfront: we need cost recovery on this because we have a modest number of staff, so it can’t be free, but we’re certainly happy to look at projects here, and I think it will be a valuable return on investment.
Bryce Meredig: Well Peter thanks again for joining us and thanks to everyone for listening. Please subscribe and rate our podcast on iTunes, Stitcher, or wherever you listen to podcasts. Listen to past episodes, learn more about our guests, and submit questions and guest suggestions at citrine.io/podcast.
Bryce Meredig: Thanks for listening to DataLab. If you have questions or an idea for an episode contact our team at email@example.com.