Changed data collecting/capturing is a topic frequently discussed as part of the design of auditing and ETL systems. In auditing system, changed data collection processes are usually just build on the system to be audited. In very few cases, the system needs to audit over the collected historical data, But this is not the case in ETL system. Especially in complex ETL systems, in order to get the desired data shape at target, data from different sources need to be processed into a series of format and saved in the staging area and one or more level posted processes might also be needed.
The simplest way to achieve that is to wipe out everything from the taget and repopulate them from the sources. This is a full process. Such design needs less programming effort, however, users have to suffer from insufficient data at the target while the processes are running. It might not be an bad idea if the data is relatively small and the duration of the reprocessing is rather short. When the data is big, this approach will become problematic, excessive waiting time and resource consumption such as CPU, memory, and disk. The most critical measure is the processing time. Longer waiting time makes process useless, for instance, business requires data refresh once a day, but the process needs 25 hours to run. What causes such long processing time? The answer is the amount of data!
Let’s say, our process needs 20 hours to complete. Business users have 4 hours, half day, to analyse their data. It doesn’t sound too bad, as long as you can train your users to accept such working habit. One step back, can we ask ourself a question how those business users can view or consume those data within 4 hours in which it needs 20 hours processing time on a dedicated server? The fact is that the end users only look at smaller amount of data from a the process – the new data since last full process plus smaller amount of old data. If this is the case, why we bother including everthing in each of full process? Incrementally delivering the new data since last full process is more than enough.
What is the new data since last process? Changed data collection mechanism comes into play. There are many ways helping you to determine the data changes after a point. Depending on the characteristics of the way the new data being entered into the source system, different approaches can give you different result in terms of the performance and complexity of the implementation. Some approaches are just not applicable to the way the data entered even they are nicely advertised by the companies who produce it.
Here is a list of methods detecting data changes which fall into two main categories
As its name, record itself can explain what has been changed. This type of change detecting algoritms has one or more flags on each record that explains the status of the record. The ETL program based on the status filters out the rows had been processed. The flag can be system managed values such as identity values and row versions. It can also be user defined types such as string, integer, datetime, etc. The value in the flag reflecting the changes can be modified by the system such as identity and row version in which it can also be modified by triggers and applications.
Again, as its name, these type of change detecting mechanisms saves list of changes from the source system. The history is read by ETL system and send to the target. Typical approach of doing that is using triggers. SQL Server introduces CT, Change Tracking, and CDC (which is acronym of Change Data Capturing) to supporting that. Many people might neglect that replication system can also be a change detecting (and delivery) system.
Different approaches will give you different result. None of them works the best for all scenarios. I will talk more about it in detail in the series.
One thought on “Collect Changed Data (1) – Overview”
Great topic for the series! Looking forward to the other articles.