Data is the most valuable resource we have today when it comes to solving our biggest challenges. With the right data, and enough of it, there’s no limit to the compelling use cases we can create. Imagine a world where we could stop financial crimes like money laundering, help reduce the number of deaths due to breast cancer, and more accurately track and balance the global import and export of goods between countries. Solving problems like these has the potential to save countless lives and revolutionize industries. Creating solutions to such problems, however, requires massive amounts of data that are spread across the globe.
This data is available but not centrally accessible. It is hidden in the databases of individual hospitals, small bank branches, manufacturing facilities, and in the trenches of other isolated public and private sector databases.
This leads to the concept of centralization. Can we hypothetically centralize the world’s data, or at least the portion relevant to a particular use case? The natural follow-up question is, of course, should we? There are some nuances to this debate, but the short answer is no. A massive database would be incredibly useful if used only for good, but it becomes a huge risk the moment bad actors gain access to it.
So if a massive database is too risky, then how can we put the world’s data into the hands of data scientists and machine learning engineers to accelerate the development of breakthrough solutions? The answer may lie in an open source library called PySyft.
Our ability to develop models and answer hard questions is limited because the data is spread around the world, isolated and locked behind legal contracts and strict partnership agreements. PySyft promotes privacy-enhancing technologies (PETs) that allow data scientists to compute on information they don’t own, on machines they don’t fully control, without ever receiving a copy of the data. This eliminates the need to move potentially sensitive data to a remote server: data owners keep their data on their own machines, while data professionals can still extract value and implement innovative solutions. PySyft is developing the future of data sharing through federated data networks powered by PETs, enabling data scientists to use more data than ever before.
To conceptualize how PySyft can deliver truly revolutionary results, let’s return to our breast cancer use case. Currently, the most effective machine learning models for breast cancer detection use less than 0.1% of the world’s data. Worldwide, more than 750 million mammogram images are taken each decade. A data scientist who wants access to even a fraction of these images needs to sign partnership agreements, go through governance reviews, implement secure data stores, manage access, and more. In both time and money, this process doesn’t scale, and it doesn’t yield enough data to work with.
However, with federated data networks, hospitals around the world could share their data safely, allowing scientists and data developers to compute on it and develop models that greatly improve our understanding of the disease, its progression, and its diagnosis, and ultimately save lives. Those using the data would not have physical access to the medical data sets and would not be able to store the data on their own machines. And instead of going through the process of securing five to ten partnership agreements, they could access a network of hundreds or thousands of hospitals.
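The hospital scenario above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not PySyft’s actual API: the `DataOwner` class, its `run_query` method, the helper functions, and the sample values are all hypothetical names and numbers invented for this example. Each hospital keeps its records locally, runs the data scientist’s function on its own machine, and returns only an aggregate.

```python
from typing import Callable, List, Tuple


class DataOwner:
    """Hypothetical stand-in for a hospital that never hands out raw records."""

    def __init__(self, name: str, records: List[float]) -> None:
        self.name = name
        self._records = list(records)  # stays on the owner's machine

    def run_query(
        self, query: Callable[[List[float]], Tuple[float, int]]
    ) -> Tuple[float, int]:
        # Run the scientist's function locally; only its result leaves the owner.
        return query(list(self._records))


def local_sum_and_count(records: List[float]) -> Tuple[float, int]:
    # Aggregate computed on the owner's side: a sum and a count, no raw rows.
    return sum(records), len(records)


def federated_mean(owners: List[DataOwner]) -> float:
    # Combine per-owner aggregates into a global mean without pooling the data.
    total, count = 0.0, 0
    for owner in owners:
        partial_sum, partial_count = owner.run_query(local_sum_and_count)
        total += partial_sum
        count += partial_count
    if count == 0:
        raise ValueError("no records available across owners")
    return total / count


# Made-up measurements held by two hypothetical hospitals.
hospitals = [
    DataOwner("hospital_a", [2.0, 4.0, 6.0]),
    DataOwner("hospital_b", [10.0]),
]
print(federated_mean(hospitals))  # 5.5
```

The sketch trusts the submitted function to return only aggregates. A real system would also need to review or restrict that function so it cannot simply return the raw records, and small groups can still leak information through aggregates, which is where privacy-enhancing technologies come in.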
In my opinion, PySyft is creating the future of data sharing through the use of federated data networks. The world has enough data to solve many important and unsolved problems. However, strict restrictions on accessing and centralizing that data have hindered progress. We have the computing power and we have the data; PySyft can give us the access we need.