
The Big Picture: Kristin Persson on Data and Machine Learning

Distinguished Scientist Fellow Kristin Persson reflects on the importance of data in discovering new materials and her contributions to the field.

Office of Science

April 10, 2025

Shannon Brescher Shea


Shannon Brescher Shea (shannon.shea@science.doe.gov) is the social media manager and senior writer/editor in the Office of Science’s Office of Communications and Public Affairs. She writes and curates content for the Office of Science’s Twitter and LinkedIn accounts as well as contributes to the Department of Energy’s overall social media accounts. In addition, she writes and edits feature stories covering the Office of Science’s discovery research and manages the Science Public Outreach Community (SPOC). Previously, she was a communications specialist in the Vehicle Technologies Office in the Office of Energy Efficiency and Renewable Energy. She began at the Energy Department in 2008 as a Presidential Management Fellow. In her free time, she enjoys bicycling, gardening, writing, volunteering, and parenting two awesome kids.

As the director of the Materials Project at DOE’s Berkeley Lab, Kristin Persson is making it easier to find and analyze important materials.
Image courtesy of Elena Zhukova

Scientists recognized by the Department of Energy (DOE) Office of Science Distinguished Scientist Fellow Award are pursuing answers to science’s biggest questions. Kristin Persson is a professor in the Department of Materials Science and Engineering at the University of California, Berkeley, and the director of the Materials Project at DOE’s Lawrence Berkeley National Laboratory.

 

Thomas Edison tried out more than 2,000 different materials for the lightbulb filament before finding the right one. Until recently, materials scientists were stuck taking a similar approach. They would be inspired, make a new material, and test it for needed properties. They would keep doing that same process over and over until they found an appropriate material.

Even finding out the properties for a specific element took enormous amounts of time. When I was a graduate student in Stockholm in 1996, I spent an entire year figuring out some properties of tungsten. If you wanted curated data about materials back then, you pulled out your reliable reference book on Phase Diagrams and Physical Properties. 

But now there is a better way forward. 

Science advances through different paradigms – shifts in how we work to understand the world. Machine learning and artificial intelligence bring us to the fourth paradigm. This is a major change from what we’ve done in the past. 

The first and second paradigms are the ones that non-scientists are familiar with. The first is empirical science – using experiments and observations to understand the world around us. This was the only paradigm until the 1600s. The second paradigm is theoretical science – developing physical laws, such as the laws of thermodynamics.

The third and fourth paradigms have come about more recently. The third paradigm centers on computational science. The ability to solve equations without working them out by hand revolutionized the speed and complexity of the research we could do. The fourth paradigm focuses on harnessing data to conduct big science. This approach includes artificial intelligence and machine learning. Because of this shift, the process that took me a year in graduate school now takes only seconds. This is a fundamental paradigm change.

Today, we’re able to use all four paradigms simultaneously, each building on the other. With support from the Department of Energy’s (DOE) Office of Science, much of my career has been spent managing data to make that fourth paradigm possible. I work to answer the question, “How can we curate data to accelerate the discovery of new, useful materials?”

Why is data so important?

Machine learning allows scientists to feed data into a computer and receive informed predictions in return. In the case of materials discovery, machine learning allows scientists to break away from the cycle of making and testing new chemical compounds that make up materials. 

Machine learning programs depend on having access to massive amounts of high-quality, well-curated data. It may not seem as exciting or fun as developing programs, but data is the fuel of accelerated learning. 

In fact, the biggest accomplishments in machine learning wouldn’t have happened without valuable databases. The 2024 Nobel Prizes in both chemistry and physics recognized work that relied on machine learning. The winners of the chemistry prize made major advances in computational protein design. The AlphaFold architecture they used was trained on data from the Protein Data Bank. Originally established in 1971 at DOE’s Brookhaven National Laboratory, the Protein Data Bank is free and publicly available for scientists around the world to use. 

Unfortunately for materials science, there’s often a lack of data. Of the 100,000 to 200,000 known inorganic compounds, we have certain fundamental properties for only 200 of them. Experimental data is available in the open scientific literature for fewer than one percent of compounds.

The need for good, open, scientific datasets is more important than ever.

The origins of the Materials Project

Back in 2010, that need inspired me to write a proposal. DOE’s Lawrence Berkeley National Laboratory held a competition to imagine and develop mobile applications that could advance the lab’s mission. With postdoctoral researcher Michael Kocher, I pitched the Materials Genome Browser, an app that would let scientists search for materials with the right types of properties. A few months later, the laboratory accepted my proposal for a Laboratory Directed Research and Development project focused on using data mining to accelerate how we design materials. Together, these two projects became the foundation for the Materials Project.

The Materials Project officially started the next year as one of five inaugural software centers funded through a DOE award competition supporting the goals of the Materials Genome Initiative.

We had three goals: 

  • Accelerate materials discovery by developing advanced scientific computing methods and designs. 
  • Extend the computations so they cover all known inorganic compounds and drive predictions with data.
  • Push the data and design tools out to the larger materials community. 

The Materials Project website now allows scientists to filter potential materials for useful properties before they make them. With specialty search engines, researchers can drastically reduce the number of chemical structures they have to test to find new materials. Running thousands of density functional theory calculations that provide information on these structures requires just three lines of code and a submission command.
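
To give a flavor of what working with the database looks like, here is a minimal, illustrative query using the open-source mp-api client. This is not the three-line workflow submission mentioned above; the API key is a placeholder, and the specific filters, field names, and endpoint spelling are examples that may differ between client versions.

# A minimal, illustrative query against the Materials Project API with the
# open-source mp-api client. "YOUR_API_KEY" is a placeholder; the filters and
# field names are examples, and older client versions expose mpr.summary.search
# instead of mpr.materials.summary.search.
from mp_api.client import MPRester

with MPRester("YOUR_API_KEY") as mpr:
    # Ask for thermodynamically stable oxides with a band gap between 1 and 3 eV,
    # returning only the fields we need.
    docs = mpr.materials.summary.search(
        elements=["O"],
        band_gap=(1.0, 3.0),
        is_stable=True,
        fields=["material_id", "formula_pretty", "band_gap", "energy_above_hull"],
    )

# Print a few of the matching candidates.
for doc in docs[:10]:
    print(doc.material_id, doc.formula_pretty, doc.band_gap)

A screen like this narrows the field of candidates before anyone sets foot in a lab, which is exactly the point of the filtering described above.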

I hope that one day it will be like J.A.R.V.I.S. in the movie Iron Man. In the film, Tony Stark can just pick elements and have the computer synthesize them. While we aren’t there yet, that’s the dream.

Building The Materials Project

Making that dream possible requires a lot of foundational work. The Materials Project relies on a huge amount of backend programming that isn’t obvious to users. We’ve developed workflows, pipelines, cloud software partnerships, and data visualization tools to make it as easy as possible. 

All of our software is open source. Sharing our code not only allows more people to use it, but also lets them add to it. We have more than 250 active developers who write new code, find bugs, and ensure consistency. Dedicated volunteers spend weekends and evenings checking changes to the code. While there was some initial skepticism about making it open source, this decision has truly paid off. 

These programs are never static. In addition to data produced in-house, other people in the research community are contributing large, wonderful, gold-standard data sets. As researchers add more data, the system checks it for accuracy, categorizes it, and ensures that it’s consistent with what is already there. We regularly recalculate the data so that it’s based on state-of-the-art research. The Materials Project is like a living organism that keeps getting better and better.

An innovation multiplier

The Materials Project is now an innovation multiplier. The website includes information on 150,000 materials and millions of properties.

Researchers all over the world use our data and algorithms. We have more than half a million registered users. On average, five papers a day are published that cite the Materials Project. Meta’s Open Catalyst project has been using our data since 2019. Google DeepMind’s announcement that it had found thousands of new materials depended on our data and software infrastructure. 

Needless to say, the Materials Project has made a huge number of new materials possible. Their applications range from sensors to carbon capture to power generation. 

One of the Materials Project’s biggest accomplishments was enabling a better cathode for non-rechargeable batteries. Duracell challenged researchers at the Massachusetts Institute of Technology, including me, to come up with a better material for this fundamental component. They gave us $1 million and a year to do it.

Running our initial computer script, we started with 130,000 potential compounds. Only 30,000 of these were previously known. Based on requirements for voltage, energy capacity, and energy density, we narrowed the list down to only 1,500 compounds. Stability was the hardest requirement to meet. Only 200 compounds were close to stable enough. We handed those 200 compounds over to Duracell. The company chose one material. In 2019, they launched a new brand of batteries that used this material. These batteries had twice the life or power of previous ones.
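
As an illustration of that kind of funnel, here is a hypothetical Python sketch of successive filters over a pool of candidate compounds. The property names and cutoff values are invented for the example; they are not the actual criteria or scripts from the Duracell work.

# Hypothetical sketch of a funnel-style screen over candidate cathode compounds.
# Each candidate is a dict of computed properties; the property names and cutoffs
# below are placeholders, not the real screening criteria.

def screen_candidates(candidates):
    """Apply successively stricter filters and report how the pool shrinks."""
    stages = [
        ("voltage window", lambda c: 2.5 <= c["average_voltage"] <= 4.0),  # V, illustrative
        ("capacity", lambda c: c["gravimetric_capacity"] >= 150.0),        # mAh/g, illustrative
        ("energy density", lambda c: c["energy_density"] >= 500.0),        # Wh/kg, illustrative
        ("stability", lambda c: c["energy_above_hull"] <= 0.05),           # eV/atom, illustrative
    ]
    pool = list(candidates)
    for name, keep in stages:
        pool = [c for c in pool if keep(c)]
        print(f"after {name} filter: {len(pool)} candidates remain")
    return pool

# Tiny usage example with made-up numbers.
example = [
    {"average_voltage": 3.4, "gravimetric_capacity": 180.0,
     "energy_density": 610.0, "energy_above_hull": 0.01},
    {"average_voltage": 1.8, "gravimetric_capacity": 140.0,
     "energy_density": 300.0, "energy_above_hull": 0.12},
]
survivors = screen_candidates(example)

In the real project, each of those properties came out of density functional theory calculations, and the stability cut was by far the most selective step.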

Keepers of data

One of the Materials Project’s strengths is curating data for properties and characteristics of materials that aren’t considered the most “interesting.” While these properties don’t always have immediate applications, they often hold a wealth of useful information. For example, before the Materials Project, the open scientific literature listed only 200 elastic tensors. (The elastic tensor measures the relationship between the amount of strain in an elastic material and how much stress it undergoes.) Through the Materials Project, scientists have now calculated 14,000 tensors for 6,000 materials. One group used this data to train machine learning software to find materials with high hardness. As a result, they were able to synthesize two new materials. 
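
For readers who want the notation behind that parenthetical: in standard continuum mechanics, the elastic tensor C links the stress a material experiences to the strain it undergoes through the generalized Hooke’s law,

\sigma_{ij} = \sum_{k,l} C_{ijkl} \, \varepsilon_{kl}

where \sigma is the stress, \varepsilon is the strain, and the indices run over the three spatial directions.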

Another advantage is the fact that we save all of the data we have calculated. It’s never wasted. The properties we find in materials inform future research, making every search useful beyond a single study. While we’re constantly updating our dataset, we also archive old data for reference. 

In recognition of this data stewardship, DOE’s Office of Science has designated the Materials Project a PuRe (Public Reusable Research) Data Resource. These resources are authoritative providers of data in their subject areas and community leaders in data stewardship. It was a true honor for us to join this fabulous family of data curation. 

The work I and my colleagues have put into the Materials Project has paid huge dividends and will continue to in the future. As a community, we have to recognize that data is a tremendous innovation multiplier and the fuel for discovery of new, improved materials for a better future.