15.12.20
What's the difference between a data lake and a data warehouse? Between an analyst and a scientist?
Recently, Data Driven NYC hosted a conversation between Tristan Handy, founder & CEO of Fishtown Analytics (which makes dbt, a command-line tool for doing data transformations), and Jeremiah Lowin, founder & CEO of Prefect (which makes an orchestration tool that is a bit like, but supposedly superior to, Airflow). It was an interesting conversation.
Handy on Data Lakes:
"A data lake is a bunch of files in some particular file format. It can be any one of a large number of formats, including CSV, that is just shoved into an object store. And then you can take a compute engine and run data processing on top of that object store. A data warehouse, ultimately, is the same... the data lake can do anything, but it also probably takes more work to do anything. Whereas the data warehouse has a more constrained set of use cases, but it is much easier to get up and running for those constrained set of use cases."
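To make Handy's contrast concrete, here is a rough sketch of my own, not anything from the talk. It assumes a local temporary directory can stand in for an object store and an in-memory SQLite database can stand in for a warehouse; the sample rows and every function name are made up for illustration. Both paths answer the same question (total order amount per region), but the lake path has to find the files, parse them, and cast types itself, while the warehouse path declares a schema once and then just runs SQL.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical sample data: (order_id, region, amount), all raw strings.
ROWS = [("1", "east", "10.0"), ("2", "west", "5.5"), ("3", "east", "4.5")]


def build_lake(lake_dir: Path) -> None:
    """Write one CSV file per order, mimicking files landing in a bucket."""
    for order_id, region, amount in ROWS:
        with (lake_dir / f"order_{order_id}.csv").open("w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["order_id", "region", "amount"])
            writer.writerow([order_id, region, amount])


def lake_total_by_region(lake_dir: Path) -> dict:
    """Lake style: scan the raw files and apply the schema at read time."""
    totals = {}
    for path in sorted(lake_dir.glob("*.csv")):
        with path.open(newline="") as f:
            for row in csv.DictReader(f):
                # The store enforces no types, so we cast while reading.
                region = row["region"]
                totals[region] = totals.get(region, 0.0) + float(row["amount"])
    return totals


def build_warehouse() -> sqlite3.Connection:
    """Warehouse style: declare the table schema up front, then load rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)
    return conn


def warehouse_total_by_region(conn: sqlite3.Connection) -> dict:
    """With the schema in place, the question is a single SQL query."""
    cur = conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region")
    return dict(cur.fetchall())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_lake(Path(tmp))
        print(lake_total_by_region(Path(tmp)))
    print(warehouse_total_by_region(build_warehouse()))
```

The lake functions could just as easily read JSON or images, which is the "can do anything" half of the quote; the warehouse functions only work for tabular data that fits the declared schema, but need far less code per question.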
On data analysts:
"a data analyst is somebody who answers business questions with data, and they frequently will have a business or econ degree. And they often have learned all the technologies that are required to do a good job of answering those questions, whether that’s sometimes Excel, sometimes SQL, sometimes Python or R. But they don’t self-identify as technologists."