About

Making data discoverable

The large majority of biomedical data is locked away in published scientific illustrations that search tools cannot index. This makes the data difficult to find or re-analyze and wastes a substantial amount of knowledge and resources, with a negative impact on the scientific community's research output and its reproducibility.

SourceData, an initiative by EMBO, is an openly accessible and easy-to-use data discovery tool that lets biomedical scientists share figures and their underlying source data in machine-readable form. It provides a platform for researchers who wish to make their publications discoverable based on their data content, to find specific data, to test or generate new hypotheses, and to share and connect data.

How it works

SourceData is based on an intuitive representation of the metadata that describes scientific figures, along with the tools to create, search and analyse this information. It consists of a machine-readable description of the data underlying figures and figure legends, which authors submit as part of the normal publication process. By referring to established public databases of biological terms, the specific biological entities in each paper, and their roles as target, intervention or outcome measure, can be consistently identified. Once a figure is represented in the SourceData standard, scientists can easily find it when searching for data at any level, from individual biological entities to full experimental designs.
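To make the idea concrete, the sketch below shows one way such a figure description could be modelled and searched. It is an illustration only, not the SourceData format itself: the class names, the field names, the find_figures helper and the "DB:0001"-style identifiers are all hypothetical placeholders standing in for entries in public databases.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Role(Enum):
    # Roles named in the text; the enum itself is a hypothetical encoding.
    TARGET = "target"
    INTERVENTION = "intervention"
    OUTCOME_MEASURE = "outcome measure"


@dataclass(frozen=True)
class TaggedEntity:
    text: str        # the term as written in the figure legend
    identifier: str  # placeholder for an identifier in a public database
    role: Role


@dataclass
class Figure:
    paper: str
    label: str
    entities: List[TaggedEntity] = field(default_factory=list)


def find_figures(figures: List[Figure], identifier: str,
                 role: Optional[Role] = None) -> List[Figure]:
    """Return figures tagged with the identifier, optionally in a given role."""
    return [
        fig for fig in figures
        if any(e.identifier == identifier and (role is None or e.role == role)
               for e in fig.entities)
    ]


# Two invented figures that name the same entity with different wording.
figures = [
    Figure("paper A", "Fig. 1", [
        TaggedEntity("gene X knockout", "DB:0001", Role.INTERVENTION),
        TaggedEntity("protein Y level", "DB:0002", Role.OUTCOME_MEASURE),
    ]),
    Figure("paper B", "Fig. 3", [
        TaggedEntity("GX expression", "DB:0001", Role.OUTCOME_MEASURE),
    ]),
]

by_entity = find_figures(figures, "DB:0001")
by_role = find_figures(figures, "DB:0001", Role.INTERVENTION)
```

Because the search matches on the shared identifier rather than on the legend text, both figures are found even though the authors used different wording, and adding a role narrows the result to the figure in which the entity was the intervention.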

By consistently identifying and tagging both the biological entities and their relationships, related data can easily be linked to one another. Relevant publications can then be found and cited rather than being missed in search results because of the choice of keywords or descriptive text. The standardized data format allows data from many published papers to be compared, making it easier to directly examine the reproducibility of results across different studies. Thanks to its rich structured data format, SourceData also offers the potential to collate and integrate data for large-scale data-mining and hypothesis generation. With these advantages, SourceData has the potential to contribute significantly to accelerating research and increasing productivity.
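As a rough illustration of how consistent tagging lets related data be linked across papers, the sketch below groups figures by their experimental design, taken here to be an intervention paired with an outcome measure. The record layout, the function names and the "DB:0001"-style identifiers are hypothetical placeholders and the data is invented; this is not how SourceData itself stores or queries figures.

```python
from collections import defaultdict

# Invented records: (paper, figure, intervention id, outcome measure id).
records = [
    ("paper A", "Fig. 1", "DB:0001", "DB:0002"),
    ("paper B", "Fig. 3", "DB:0001", "DB:0002"),
    ("paper C", "Fig. 2", "DB:0003", "DB:0002"),
]


def group_by_design(records):
    """Map each (intervention, outcome measure) pair to the figures reporting it."""
    groups = defaultdict(list)
    for paper, figure, intervention, outcome in records:
        groups[(intervention, outcome)].append((paper, figure))
    return dict(groups)


def shared_designs(records):
    """Keep only the designs reported by more than one paper."""
    return {
        design: figs
        for design, figs in group_by_design(records).items()
        if len({paper for paper, _ in figs}) > 1
    }


comparable = shared_designs(records)
```

Here the figures from paper A and paper B end up in the same group because they test the same intervention against the same outcome measure, which is the kind of grouping that would let results be compared across studies.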