What is a data lake?
In business computing, a data lake is a system or repository that stores data in its raw format, usually as blobs or files. A data lake is typically a single repository for all enterprise data, including raw copies of source-system data and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. It can contain structured data from relational databases (rows and columns), semi-structured data (CSV, XML, JSON), unstructured data (emails, documents, PDF files), and binary data (images, audio, memory images).
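As a rough illustration of "stored in raw format", the sketch below writes files of different kinds into one repository without parsing or converting them. It is a minimal sketch only: a temporary local directory stands in for blob or object storage, and the names land_raw, list_objects, the raw/<source>/<name> layout, and the sample payloads are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload unchanged under <lake_root>/raw/<source>/<name> (assumed layout)."""
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing or conversion: bytes are kept as received
    return target


def list_objects(lake_root: Path) -> list[str]:
    """Return the relative paths of all stored files, sorted."""
    return sorted(
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )


# A temporary directory stands in for the lake's storage layer.
lake = Path(tempfile.mkdtemp(prefix="lake_"))

# Structured, semi-structured, unstructured and binary inputs land side by side.
land_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n2,Grace\n")
land_raw(lake, "webshop", "order_1001.json",
         json.dumps({"order": 1001, "customer": 1}).encode())
land_raw(lake, "support", "ticket_17.eml", b"Subject: Refund\n\nPlease refund order 1001.")
land_raw(lake, "scans", "logo.png", bytes([0x89, 0x50, 0x4E, 0x47]))

objects = list_objects(lake)
```

Grouping files by source system is one possible convention, chosen here only to keep the example readable.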
Data lakes are used in industries such as retail, banking, hospitality, and travel to track and predict customer preferences and improve the overall customer experience.
Generic analytics can be stored alongside the data, so they are available for everything held centrally and do not have to be compiled before each analysis. As a result, data lakes usually require much more storage capacity than data warehouses. In return, unprocessed raw data remains malleable: it can be analyzed quickly for a wide variety of purposes and is well suited to machine learning.
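The malleability of raw data can be sketched as follows: the same unprocessed CSV text is interpreted only at analysis time, once for a revenue question and once for a customer question, with nothing compiled in advance. The sample data, column names, and function names are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical raw export, kept exactly as it arrived from a source system.
RAW_ORDERS = (
    "order_id,customer,country,amount\n"
    "1,ada,DE,19.90\n"
    "2,grace,US,5.00\n"
    "3,ada,DE,42.10\n"
)


def read_rows(raw: str) -> list[dict]:
    """Parse the raw text on demand; no schema is fixed at load time."""
    return list(csv.DictReader(io.StringIO(raw)))


def revenue_by_country(raw: str) -> dict[str, float]:
    """First use of the raw data: sum order amounts per country."""
    totals: dict[str, float] = {}
    for row in read_rows(raw):
        totals[row["country"]] = totals.get(row["country"], 0.0) + float(row["amount"])
    return totals


def orders_per_customer(raw: str) -> Counter:
    """Second use of the same raw data: count orders per customer."""
    return Counter(row["customer"] for row in read_rows(raw))


revenue = revenue_by_country(RAW_ORDERS)
order_counts = orders_per_customer(RAW_ORDERS)
```

Each question re-reads the raw text, which trades repeated parsing work for the freedom to ask questions nobody planned for when the data was stored.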
A data swamp is an unmanaged data lake that is either inaccessible to its intended users or offers little value. Data swamps arise when adequate data quality and data governance measures are not implemented.
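One simple governance measure is to check that every stored object has a documented owner and description. The sketch below assumes a hypothetical catalog, modeled as a plain dictionary keyed by object path; the function name uncatalogued, the field names, and the sample paths are illustrative and not taken from any particular tool.

```python
def uncatalogued(objects: set[str], catalog: dict[str, dict]) -> list[str]:
    """Return stored object paths that lack a catalog entry, an owner, or a description."""
    missing = []
    for path in sorted(objects):
        entry = catalog.get(path)
        if not entry or not entry.get("owner") or not entry.get("description"):
            missing.append(path)
    return missing


# Hypothetical inventory of what is physically stored in the lake.
objects = {
    "raw/crm/customers.csv",
    "raw/webshop/order_1001.json",
    "raw/tmp/export_final2.bin",
}

# Hypothetical catalog: one entry is complete, one has no owner, one is absent.
catalog = {
    "raw/crm/customers.csv": {"owner": "sales-ops", "description": "Daily CRM customer export"},
    "raw/webshop/order_1001.json": {"owner": "", "description": "Order event"},
}

flagged = uncatalogued(objects, catalog)
```

Objects flagged by a check like this are candidates for documentation or removal before they accumulate into a swamp.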