
Synthetic health data can ensure better disease prevention and treatment

17 Dec 2019

Denmark and the other Nordic countries have some of the best and most complete health data in the world. These data have considerable potential to enable the healthcare sector to detect diseases early, improve diagnosis and create individually tailored treatment. However, realizing this potential is difficult because compiled health data cannot easily be shared, and thus cannot be used for purposes such as research across disciplines and national borders.

There is good reason to restrict the sharing of health data, as they are inherently personal and thus sensitive. However, the inability to share data hinders the healthcare sector in finding new treatment options by analysing large quantities of health data, for example in collaboration with the other Nordic countries. A new research project based at the University of Copenhagen will explore a possible solution to this problem by developing and refining a method that uses original data to generate synthetic data sets. The Novo Nordisk Foundation is supporting the project with a grant of DKK 7.5 million.

“Danish health data can help the individual patient but also contain important information that can benefit other patients. The healthcare sector urgently needs to develop new solutions on the basis of the data collected, but this often requires sharing data more flexibly. Synthetic data can help meet this need because they are derived from original data but cleared of any details that could be traced back to the original records and thereby to the people who provided them,” says Henning Langberg, Professor, Department of Public Health, University of Copenhagen, the recipient of the grant from the Foundation.

Open-source access will ensure quality
The project, called Synthetic Health and Research Data (SHARED), is a proof-of-concept project intended to show that a method can be found that can actually transform original data into synthetic data in a way that makes it impossible to trace the data back to their sources. The synthetic data are created by running an original data set through a mathematical program that adds noise to the data set, ensuring that the synthetic data cannot be attributed to specific individuals while maintaining the dispersion and relationships that make them statistically valid. This enables the data to be shared without compromising data security.
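The press release does not describe SHARED's actual algorithm, so the sketch below is only a minimal illustration of the general idea of noise-adding synthesis: resample the original records and perturb each one with noise scaled to the data's spread, so that no synthetic record matches an original one while summary statistics remain close. All names and the example values are invented for illustration.

```python
import random
import statistics

def synthesize(values, noise_scale=0.5, seed=42):
    """Illustrative sketch (not the SHARED method): build a synthetic
    dataset by resampling the original values and adding Gaussian
    noise proportional to the original standard deviation, so that
    individual records cannot be matched back while the mean and
    spread stay close to the original."""
    rng = random.Random(seed)
    spread = statistics.stdev(values)
    return [rng.choice(values) + rng.gauss(0, noise_scale * spread)
            for _ in values]

# Hypothetical "original" measurements (e.g. blood-pressure readings)
original = [118, 121, 135, 127, 142, 110, 131, 125, 138, 122]
synthetic = synthesize(original)

# The synthetic set has a similar mean and spread to the original,
# but no synthetic record equals an original record exactly.
```

In practice the trade-off is governed by the noise scale: more noise gives stronger protection against re-identification but weaker statistical fidelity, which is exactly the tension the project must balance.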

“An elaborate and secure model capable of generating synthetic data can help to harness the great potential inherent in deriving new contexts from our common health data in a safe and secure way. The results of the project can influence both disease prevention and treatment, not only in Denmark’s healthcare sector but throughout the Nordic countries,” says Niels-Henrik von Holstein-Rathlou, Head of Biomed, Novo Nordisk Foundation.

Together with Finnish partners Turku University Hospital and the Institute for Molecular Medicine Finland (FIMM), Henning Langberg will work to develop a mathematical method that transforms original data into synthetic data, and to evaluate the resulting methods and models in a test battery that measures how closely the synthetic data resemble the original data.

“Our major challenge is to include as many parameters as possible in the synthetic dataset without losing the relationships between data. In addition, it is important for us to take an open-source approach to developing the method so that the academic community can ask relevant questions about it during the project. This is essential when working in such a sensitive and regulated area as health data,” explains Henning Langberg.

Further information

Christian Mostrup Scheel, Senior Press Officer, phone: +45 3067 4805, cims@novo.dk

Henning Langberg, Professor, University of Copenhagen, phone: +45 2612 7913, langberg@sund.ku.dk

Synthetic data may be one way the healthcare sector can fulfil the great potential of data sharing. The Novo Nordisk Foundation is supporting a new research project that is working on a method of using original data to generate synthetic data without compromising data security.