Data ingestion is the process of importing data from one or more sources into a destination for storage and analysis, such as a data warehouse or a data lake. Data moved in this manner often arrives in multiple formats and from different sources. For this reason, an important aspect of the data ingestion process is to extract, transform, and load (ETL) the data into a uniform format. Newer data ingestion tools can speed up the process and enable real-time ingestion.
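As a rough illustration of that extract-transform-load step, the sketch below normalizes records arriving in two different formats (CSV and JSON) into one uniform schema. The field names, sample data, and in-memory "warehouse" are assumptions for the example, not any particular tool's API.

```python
import csv
import io
import json

# Hypothetical raw inputs: the same kind of event arriving in two formats.
CSV_SOURCE = "user_id,amount,ts\n42,19.99,2024-01-05T10:00:00\n"
JSON_SOURCE = '[{"user": 7, "total": "5.50", "timestamp": "2024-01-05T11:30:00"}]'

def extract():
    """Extract: read raw records from each source as dictionaries."""
    yield from csv.DictReader(io.StringIO(CSV_SOURCE))
    yield from json.loads(JSON_SOURCE)

def transform(record):
    """Transform: map differing field names and types onto one uniform schema."""
    return {
        "user_id": int(record.get("user_id") or record.get("user")),
        "amount": float(record.get("amount") or record.get("total")),
        "timestamp": record.get("ts") or record.get("timestamp"),
    }

def load(records, destination):
    """Load: append normalized rows to the destination (a list standing in for a warehouse table)."""
    destination.extend(records)

warehouse_table = []
load((transform(r) for r in extract()), warehouse_table)
print(warehouse_table)
```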
Data can be ingested in real time or as part of a batch. With real-time ingestion, each data point is streamed immediately after it is created. Automated streaming ingestion is common when collecting big data, as it ensures that data is transmitted in small increments rather than large chunks and is available for processing as soon as it is needed. With batch ingestion, rather than streaming immediately, the process waits until an assigned amount of time has elapsed before transmitting the accumulated data for storage. This makes batch sizes predictable, along with the times when the data will be available for access or analysis.
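The sketch below contrasts the two modes using in-memory stand-ins for a data source and a destination; the function names and the time-based batching window are assumptions for illustration, not a specific tool's interface.

```python
import time

def ingest_streaming(events, sink):
    """Real-time ingestion: forward each event to the sink as soon as it arrives."""
    for event in events:
        sink(event)

def ingest_batch(events, sink, window_seconds=5.0):
    """Batch ingestion: accumulate events until the time window elapses, then flush the batch."""
    batch, window_start = [], time.monotonic()
    for event in events:
        batch.append(event)
        if time.monotonic() - window_start >= window_seconds:
            sink(batch)                      # transmit the whole batch at once
            batch, window_start = [], time.monotonic()
    if batch:                                # flush any remainder at the end
        sink(batch)

# Example usage with a generator as the "source" and print as the sink.
ingest_streaming(({"sensor": i, "value": i * 0.1} for i in range(3)), sink=print)
ingest_batch(({"sensor": i} for i in range(3)), sink=print, window_seconds=0.0)
```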
Different data ingestion tools, such as a data ingestion pipeline, can help process the data efficiently and perform analysis as part of the process. A pipeline is a series of data processing elements in which the output of one element is the input of the next. These elements can be configured for delayed or real-time processing, and they automatically push the data along each stage of the pipeline after ingestion. A data pipeline can also help create logical data models as part of database management.
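A minimal sketch of that idea appears below: each stage takes the previous stage's output as its input, and the runner pushes the data through the stages in order. The stage names and record fields are illustrative assumptions, not part of any specific pipeline framework.

```python
def clean(records):
    """Stage 1: drop records missing required fields."""
    return [r for r in records if "user_id" in r and "amount" in r]

def enrich(records):
    """Stage 2: add a derived field to each record."""
    return [{**r, "amount_cents": round(r["amount"] * 100)} for r in records]

def store(records):
    """Stage 3: hand the records off to storage (printed here as a placeholder)."""
    print(f"storing {len(records)} record(s)")
    return records

def run_pipeline(records, stages):
    """Push data through each element in order: one stage's output is the next stage's input."""
    for stage in stages:
        records = stage(records)
    return records

run_pipeline(
    [{"user_id": 1, "amount": 9.99}, {"user_id": 2}],
    stages=[clean, enrich, store],
)
```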
Well-architected data ingestion and analysis can benefit organizations through: