I was recently faced with the following challenge at work:
A database process could be improved by processing diffs between two sets of data if there was a checksum that could be calculated for the record, which would subsequently be used to check if a record had changed.
The database was on mysql, and there was already a process in the workflow that calculated a unique key based on other fields in each record.
basic calculation of the unique key was as follows:
key = SHA1(concat(field1, field2));
so I wanted something similar, only for all fields that I cared about changing:
checksum = SHA1(concat(field1, field2, ..., field{n}));
however, the problem with concat in mysql is that it returns null if one of the fields in it is null.
Another method that is similar is concat_ws and looks like this
checksum = SHA1(concat_ws(',', field1, field2, ..., fieldN));
that first character in the quotes before the fields is the delimiter for concat_ws, and would normally produce strings like:
"field1value,field2value,...,fieldNvalue"
but this doesn't work if the delimiter is changed in the future (all records would be updated because all checksums would be different).  What I needed was a string that didn't have a delimiter in it.  Here's what I used:
checksum = SHA1(concat_ws('', field1, ..., fieldN));
by leaving the delimiter blank, I was able to get a string that looked like:
"field1valuefield2value...fieldNvalue"
so why use concat_ws in all this?  Because it ignores (drops) nulls by default.  Because of this, I didn't have to write any additional code to handle that case.
End result is the checksum was calculated in the same amount of time that the unique key was generated, and the diff processing it allowed provided a nearly 80% reduction in the amount of time to process the data. 
 
 
No comments:
Post a Comment