17.11.20
Platforms are being developed to aid data discovery.
Data discovery is an important part of data science. Often companies hold data across multiple sources and systems. Finding different sources and linking them together takes up a large portion of data scientists' time. So it is understandable why people would want to make the process easier.
Lots of big tech firms have been building platforms to help make this process more efficient. Some of these platforms, including Amundsen from Lyft, Apache Atlas, and DataHub from LinkedIn, have been open sourced.
They appear to combine natural language search (mostly built on Elasticsearch) with easily accessible information on how datasets link together, how and when they have been used, and who owns them.
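To make that combination concrete, here is a minimal in-memory sketch of the idea: a catalog that stores a description, owner, upstream links, and a usage count for each dataset, with a simple keyword search on top. Everything in it (`DatasetEntry`, `Catalog`, the field names, the example datasets) is hypothetical and made up for illustration. It is not the API or data model of Amundsen, Atlas, or DataHub, and plain substring matching stands in for a real search engine such as Elasticsearch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DatasetEntry:
    """Hypothetical metadata record for one dataset."""
    name: str
    description: str
    owner: str
    upstream: List[str] = field(default_factory=list)  # datasets this one is built from
    query_count: int = 0  # how often the dataset has been queried (a usage signal)


class Catalog:
    """Toy catalog: keyword search plus ownership, usage, and lineage lookups."""

    def __init__(self) -> None:
        self.entries: Dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        self.entries[entry.name] = entry

    def search(self, query: str) -> List[DatasetEntry]:
        """Return entries matching any query term, ranked by term hits, then usage."""
        terms = query.lower().split()
        scored = []
        for entry in self.entries.values():
            text = f"{entry.name} {entry.description}".lower()
            hits = sum(term in text for term in terms)
            if hits:
                scored.append((hits, entry.query_count, entry))
        scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
        return [entry for _, _, entry in scored]

    def lineage(self, name: str) -> List[str]:
        """Return the names of all upstream datasets, followed transitively."""
        seen: List[str] = []
        stack = list(self.entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.append(current)
            if current in self.entries:
                stack.extend(self.entries[current].upstream)
        return seen


# Example data (invented for this sketch).
catalog = Catalog()
catalog.register(DatasetEntry("raw_rides", "Raw ride events from the app", "data-eng", [], 120))
catalog.register(DatasetEntry("ride_fares", "Fare per ride, derived from raw ride events", "pricing", ["raw_rides"], 45))
catalog.register(DatasetEntry("weekly_revenue", "Weekly revenue rollup", "finance", ["ride_fares"], 300))

for match in catalog.search("ride fare"):
    print(match.name, "| owner:", match.owner, "| upstream:", catalog.lineage(match.name))
```

A real platform would pull this metadata automatically from warehouses and query logs rather than from hand-written records, but the shape of the answer is the same: what the dataset is, who owns it, how much it is used, and what it is derived from.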
I'm not sure how easy it would be to integrate these systems into a new company, since datasets are often spread across many systems, including multiple clouds and various SaaS products. These platforms do, however, look like they could be a useful starting point.