What is a data lake?
In information systems, a data lake is a system or repository that stores data in its raw format, usually as blobs or files. A data lake is typically a single repository for all enterprise data, holding both raw copies of source system data and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. It can contain structured data from relational databases (rows and columns), semi-structured data (CSV, XML, JSON) and unstructured data such as emails, documents, PDF files and binary data (images, audio, memory images).
Data lakes are used in industries such as retail, banking, hospitality and travel to track and predict customer preferences and improve the overall customer experience.
Generic analysis methods are also stored alongside the data, so they are available for the centrally stored data and do not have to be assembled before each analysis. Because data lakes keep data in unprocessed form, they usually require much more storage capacity than data warehouses. In return, unprocessed raw data is malleable: it can be quickly analyzed for a wide variety of purposes and is well suited to machine learning.
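The idea that raw data stays malleable can be illustrated with a minimal sketch in which data is stored untouched and a structure is applied only when it is read. Everything here is a hypothetical stand-in: the in-memory dictionary `lake`, the helper functions `ingest` and `read_records`, and the object keys are assumptions made for illustration, not part of any particular data lake product.

```python
import csv
import io
import json

# Hypothetical in-memory "lake": object key -> raw bytes, stored unmodified.
lake = {}


def ingest(key: str, raw: bytes) -> None:
    """Store the payload exactly as received; no parsing or schema is applied."""
    lake[key] = raw


def read_records(key: str) -> list:
    """Apply a structure at read time, chosen from the key's file extension."""
    text = lake[key].decode("utf-8")
    if key.endswith(".csv"):
        return list(csv.DictReader(io.StringIO(text)))
    if key.endswith(".json"):
        return json.loads(text)
    raise ValueError(f"no reader registered for {key}")


# Land two sources in different formats without transforming them.
ingest("sales/2024.csv", b"customer,amount\nalice,10\nbob,25\n")
ingest("sales/2024.json", b'[{"customer": "carol", "amount": 40}]')

# Two different analyses over the same untouched raw data.
all_rows = read_records("sales/2024.csv") + read_records("sales/2024.json")
total = sum(float(row["amount"]) for row in all_rows)
customers = sorted(row["customer"] for row in all_rows)
```

Because the stored bytes are never rewritten, a later analysis can interpret the same objects in a different way, at the cost of keeping the full raw payloads around.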
A data swamp is an unmanaged data lake that is either inaccessible to its intended users or offers little value. Data swamps arise when appropriate data quality and data governance measures are not implemented.
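One simple governance measure is to require a catalog entry for every stored object and to flag anything that lacks one. The sketch below is only an illustration under stated assumptions: `CatalogEntry`, `find_uncatalogued`, the object keys and the rule that an entry needs an owner and a description are all hypothetical, and real governance involves far more than this check.

```python
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """Hypothetical minimal metadata kept for each object in the lake."""
    owner: str
    description: str


def find_uncatalogued(object_keys, catalog):
    """Return the keys that have no catalog entry or an incomplete one."""
    problems = []
    for key in sorted(object_keys):
        entry = catalog.get(key)
        if entry is None or not entry.owner or not entry.description:
            problems.append(key)
    return problems


# Hypothetical lake contents and catalog.
object_keys = {"sales/2024.csv", "logs/app.log", "tmp/export_final_v2.bin"}
catalog = {
    "sales/2024.csv": CatalogEntry(owner="finance", description="Daily sales extract"),
    "logs/app.log": CatalogEntry(owner="", description="Application logs"),
}

# Objects nobody owns or has described are candidates for review or cleanup.
flagged = find_uncatalogued(object_keys, catalog)
```

Running a check like this regularly gives an early signal that the lake is accumulating data nobody can find or explain.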