Driven by computational power that is still increasing exponentially, machine learning has made its way into more and more applications. However, a large amount of high-quality, structured, machine-readable data is a prerequisite for successful machine learning approaches. For this reason, data has been described as the new oil of the digital economy.
The field of chemistry is still lagging behind. Data and detailed information on experiments are often incomplete or hidden in the plain text, tables, or figures of research papers. Raw data is commonly stored locally and in proprietary formats, and is lost when the vendor disappears or a computer is decommissioned. As a result, sharing data, confirming completeness, performing meta-analyses across multiple datasets, and many other applications are currently difficult, laborious, and often impossible. The lack of research data management (RDM) that provides findable, accessible, interoperable, and reusable (FAIR) data considerably impedes scientific progress.
We address these problems in the framework of CRC 1333 by establishing an RDM infrastructure for molecular catalysis, ranging from organic synthesis to computational chemistry. Building on our experience in developing our standardised data exchange format EnzymeML, we aim to develop and implement useful, novel, bottom-up solutions for RDM, collaborating closely with the Cluster of Excellence SimTech and NFDI consortia such as NFDI4Chem and NFDI4Cat.