Can ClickHouse be used as a financial data warehouse?
In this post we are going to look into the pitfalls of using ClickHouse as a data warehouse for financial data. The behaviour was tested on version 18.104.22.1689.
Do not use Float/Double for monetary data
We should not use the Float and Double data types for monetary calculations or data. The reason lies in how floats and doubles are stored in memory: they cannot represent currency amounts (euros and cents) exactly, and instead store a close approximation of the value. For example, if we want to store 0.1 Euro (10 cents), it is stored as 0.100000001490116119384765625 in a 32-bit float. The magnitude of this problem grows when we repeatedly perform arithmetic operations on these data types; this is known as loss of significance.
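You can reproduce this loss of significance outside ClickHouse. Here is a small Python sketch (Python's `float` is a 64-bit double, so the stored approximation differs slightly from the 32-bit value above):

```python
from decimal import Decimal

# The exact value a 64-bit double actually stores for 0.1:
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625

# Repeated arithmetic accumulates the error:
total = sum(0.1 for _ in range(10))
print(total == 1.0)  # False
print(total)         # 0.9999999999999999
```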
Thank you, Decimal/BigDecimal
This loss of significance is the reason why we should always use the Decimal/BigDecimal data types for monetary data. Decimal/BigDecimal represents a signed decimal number of arbitrary precision with an associated scale. Precision defines the total number of digits of the number, while scale specifies the number of digits after the decimal point.
ClickHouse supports the following decimal datatypes and ranges:
- Decimal32(S) - ( -1 * 10^(9 - S), 1 * 10^(9 - S) )
- Decimal64(S) - ( -1 * 10^(18 - S), 1 * 10^(18 - S) )
- Decimal128(S) - ( -1 * 10^(38 - S), 1 * 10^(38 - S) )
- Decimal256(S) - ( -1 * 10^(76 - S), 1 * 10^(76 - S) )
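As a rough illustration of precision versus scale (using Python's `decimal` module rather than ClickHouse itself): a Decimal64(4) value has up to 18 significant digits, 4 of them after the decimal point, and by the table above must stay below 10^(18-4):

```python
from decimal import Decimal, getcontext

getcontext().prec = 18                  # precision: 18 total digits, as in Decimal64
price = Decimal("12345678901234.5678")  # 18 digits with scale 4

# Unlike a float, a decimal stores 0.1 exactly:
print(Decimal("0.1") * 10 == Decimal("1"))  # True

# Range bound for Decimal64(S): the value must stay below 10^(18 - S)
S = 4
print(price < Decimal(10) ** (18 - S))      # True
```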
As you can see, the ranges are huge, so there should be no problem storing any financial data. However, there are significant quality issues with the ClickHouse implementation. In practice, only the Decimal32/Decimal64 data types are stable; the Decimal128/Decimal256 types have several issues.
Rounding functions are buggy with Decimal128/Decimal256
select floor(toDecimal128(1232.123123,24)) as floor_decimal128, floor(toDecimal64(1232.123123,12)) as floor_decimal64
Row 1:
──────
floor_decimal128: 1232.12311679615299966394772
floor_decimal64:  1232
As you can see, the floor/ceil functions do not work at all with Decimal128/Decimal256.
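For comparison, a correct arbitrary-precision decimal implementation floors high-scale values without trouble. Python's `decimal` module is used here as the reference behaviour (not ClickHouse code):

```python
from decimal import Decimal, ROUND_FLOOR, getcontext

getcontext().prec = 40  # comfortably beyond Decimal128's 38 digits

# Scale 24, like toDecimal128(1232.123123, 24) in the query above:
x = Decimal("1232.123123").quantize(Decimal("1E-24"))
print(x.to_integral_value(rounding=ROUND_FLOOR))  # 1232
```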
There is no decimal overflow/underflow check for Decimal128/Decimal256
select toDecimal256('10000000000000000000000000000000000000000000000000000000000000000000000000',2) * 100 as decimal_overflow;
Row 1:
──────
decimal_overflow: -157920892373161954235709850086879078532699846656405640394575840079131296399.36
As you can see, ClickHouse does not check for decimal overflow when working with Decimal128/Decimal256; the value silently wraps around, just as it would in C. This bug/behaviour is a particularly nasty one, as it can easily lead to wrong results without any warning. It raises the question of trust in your data warehouse, and once that trust is lost, the whole data and business intelligence project is doomed.
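The wrapped result above is exactly what two's-complement arithmetic on the underlying storage produces: a Decimal256(S) is held as a signed 256-bit integer scaled by 10^S. A Python sketch of that mechanism (the helper name is mine) reproduces the query's output digit for digit:

```python
def wrap_int256(n: int) -> int:
    """Two's-complement wraparound of a signed 256-bit integer, as in C."""
    n %= 2**256
    return n - 2**256 if n >= 2**255 else n

S = 2
raw = 10**73 * 10**S          # underlying integer for 1e73 at scale 2
raw = wrap_int256(raw * 100)  # the multiplication silently overflows past 2^255

sign = "-" if raw < 0 else ""
digits = str(abs(raw))
print(f"{sign}{digits[:-S]}.{digits[-S:]}")
# -157920892373161954235709850086879078532699846656405640394575840079131296399.36
```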
Here is another example of decimal overflow, showing how the scale can impact the correctness of the results:
select ['7758.616920151350000000000000', '9684.662273359581200000000000'] as vs, toDecimal128(vs[2],12) / toDecimal128(vs[1], 12) frac_dec128_12, toDecimal128(vs[2],24) / toDecimal128(vs[1], 24) frac_dec128_24, toFloat64(vs[2]) / toFloat64(vs[1]) frac_float64
Row 1:
──────
vs:             ['7758.616920151350000000000000','9684.662273359581200000000000']
frac_dec128_12: 1.248245966134
frac_dec128_24: -0.00000000000000876895937
frac_float64:   1.2482459661342138
And again, you can see how you can receive misleading results without any warning.
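The likely mechanism (an assumption on my part, based on the usual fixed-point division scheme) is that the dividend is scaled up by 10^S before the integer division. A Python sketch shows why scale 24 breaks while scale 12 survives:

```python
INT128_MAX = 2**127 - 1

# Underlying integers for 9684.6622733595812 at scale 12 and at scale 24:
raw_12 = 9684662273359581
raw_24 = 9684662273359581200000000000

# Division a / b is (presumably) computed as (raw_a * 10**S) // raw_b,
# so the dividend is first multiplied by 10**S:
print(raw_12 * 10**12 <= INT128_MAX)  # True  -- fits in Int128, result is correct
print(raw_24 * 10**24 <= INT128_MAX)  # False -- overflows Int128, result is garbage
```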
There is no integer overflow/underflow check
ClickHouse also does not check for integer overflow/underflow, which again can lead to misleading results being returned without any warning:
select toInt8(127+4) as integer_overflow
Row 1:
──────
integer_overflow: -125
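This is the same two's-complement wraparound as before, sketched in Python: 127 + 4 = 131, which is past Int8's maximum of 127 and wraps down to -125:

```python
def wrap_int8(n: int) -> int:
    """Two's-complement wraparound of a signed 8-bit integer, as in C."""
    n %= 256
    return n - 256 if n >= 128 else n

print(wrap_int8(127 + 4))  # -125
```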
I was on a call with the ClickHouse team several months ago and stressed this point to them. This behaviour is unacceptable for an OLAP system that is to be used as a source for financial reporting. It seems the ClickHouse team does not see it as an issue, or does not give it any priority; Decimal functionality is simply out of their focus. Sadly, this can affect any scenario. It is very hard to teach BI analysts to account for this behaviour during query design, especially when they come from other database systems, where a data type overflow/underflow raises an out-of-range error instead of returning misleading results. As they say: "It just simply works." It also adds massive development overhead during dashboard query design, and in many cases the result is impossible to maintain. Sadly, this undermines ClickHouse deployments, because otherwise ClickHouse is a great tool. As a result, I simply cannot recommend ClickHouse for financial reporting, nor can I recommend switching to the Float/Double data types and accepting the inaccuracies caused by loss of significance.