Making data discoverable

SourceData, an initiative by EMBO, is a platform that makes the source data behind scientific publications publicly accessible and citable. SourceData relies on an intuitive representation of the metadata that describes scientific figures. Because the biological entities and their relationships are identified and tagged consistently, related data can be easily linked to one another.
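To make the linking idea concrete, here is a minimal sketch in Python. It is not SourceData's actual schema or API: the `Entity` and `Panel` classes, the `link_panels` function, the panel identifiers and the example entities are all hypothetical and chosen only for illustration. It assumes that each figure panel carries a list of tagged entity pairs, and it shows how consistent tagging lets panels from different papers be linked through a shared entity.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A tagged biological entity (hypothetical schema, illustration only)."""
    name: str
    kind: str  # e.g. "protein" or "small molecule"


@dataclass
class Panel:
    """A figure panel with tagged relationships between pairs of entities."""
    panel_id: str
    # Each relationship is a (first entity, second entity) pair; this pair
    # layout is an assumption of the sketch, not SourceData's data model.
    relationships: list[tuple[Entity, Entity]] = field(default_factory=list)

    def entities(self) -> set[Entity]:
        """Return every entity mentioned in this panel's relationships."""
        return {entity for pair in self.relationships for entity in pair}


def link_panels(panels: list[Panel]) -> dict[Entity, list[str]]:
    """Map each entity that occurs in more than one panel to those panel ids."""
    index: dict[Entity, list[str]] = defaultdict(list)
    for panel in panels:
        for entity in panel.entities():
            index[entity].append(panel.panel_id)
    # Keep only the entities that actually connect two or more panels.
    return {entity: ids for entity, ids in index.items() if len(ids) > 1}


# Hypothetical example data: two panels from different papers share one protein.
mtor = Entity("MTOR", "protein")
rapamycin = Entity("rapamycin", "small molecule")
s6k = Entity("RPS6KB1", "protein")

panels = [
    Panel("paper-A/fig1a", [(rapamycin, mtor)]),
    Panel("paper-B/fig3c", [(mtor, s6k)]),
]
links = link_panels(panels)
```

In this toy example, `links` contains a single key, the `MTOR` entity, mapped to both panel identifiers. The same entity tagged the same way in two places is what makes the two panels discoverable together.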

By making data discoverable, we aim to improve the reuse and reproducibility of research in the scientific community. Currently, valuable biomedical data is locked inside published scientific figures and would benefit from better indexing and accessibility for re-analysis.

Key Features

Integration with EMBO Press

SourceData is integrated into the editorial workflow at EMBO Press. Every paper is curated, and the associated research data are posted on the EMBL-EBI BioStudies database under a unique, citable accession number. The SourceData collection can be explored on BioStudies.

Machine Learning Dataset

SourceData curation has enabled the distribution of a large machine learning dataset of 68,543 annotated experiments. You can access this public dataset on Hugging Face.

Data4Rev: AI-Based Workflow

The team is now developing an AI-based workflow (Data4Rev) to assist authors in performing basic quality checks on figures and their associated data, automating the deposition of source data files to BioStudies, and presenting the underlying data to peer reviewers.

Resources

Publication

SourceData: a semantic platform for curating and searching figures. Robin Liechti, Nancy George, Lou Götz, Sara El-Gebali, Anastasia Chasapi, Isaac Crespo, Ioannis Xenarios & Thomas Lemberger
Nature Methods (2017) 14:1021-1022, DOI: https://doi.org/10.1038/nmeth.4471

Integrating curation into scientific publishing to train AI models. Jorge Abreu-Vicente, Hannah Sonntag, Thomas Eidens, Cassie S. Mitchell & Thomas Lemberger
arXiv:2310.20440 [cs.CL], DOI: https://doi.org/10.48550/arXiv.2310.20440

Links