stonedb icon indicating copy to clipboard operation
stonedb copied to clipboard

feature: Extend the precision of decimal datatype to 32 digits

Open swoapri opened this issue 1 year ago • 0 comments

Describe the problem

  1. Issue 103 In Tianmu, a decimal point is converted to a number of int64_t for storage. Following provides the code:
    auto dec_f = dynamic_cast<Field_new_decimal *>(f);
    *reinterpret_cast<int64_t *>(vc.Prepare(sizeof(int64_t))) = 
    std::lround(dec_f->val_real() * types::PowOfTen(dec_f->dec));
    vc.ExpectedSize(sizeof(int64_t));
    
    As shown in the code above, dec_f->val_real() is called, and then the decimal point is multiplied by 10^scale to convert the point to an int64_t number for storage. Following provides the code to define dec_f->val_real():
    double Field_new_decimal::val_real(void)
    {
      ASSERT_COLUMN_MARKED_FOR_READ;
      double dbl;
      my_decimal decimal_value;
      my_decimal2double(E_DEC_FATAL_ERROR, val_decimal(&decimal_value), &dbl);
      return dbl;
    }
    
    Based on line 6 in the code above, the my_decimal type provided by MySQL needs to be converted to double. Following provides the code to implement this conversion:
    int decimal2double(const decimal_t *from, double *to)
    {
      char strbuf[FLOATING_POINT_BUFFER], *end;
      int len= sizeof(strbuf);
      int rc, error;
      rc = decimal2string(from, strbuf, &len, 0, 0, 0);
      end= strbuf + len;
      DBUG_PRINT("info", ("interm.: %s", strbuf));
      *to= my_strtod(strbuf, &end, &error);
      DBUG_PRINT("info", ("result: %f", *to));
      return (rc != E_DEC_OK) ? rc : (error ? E_DEC_OVERFLOW : E_DEC_OK);
    }
    
    my_decimal is first converted to strbuf, and then my_strtod is called to convert strbuf to double. However, when the number is processed by calling my_strtod, its precision is lost, as mentioned in issue 103. Following provides an example:
    99999999.9999999999 => 100000000
    
    For information about the algorithm used in my_strtod to convert string characters to data of the double data type, see How to Read Floating Point Numbers Accurately.

Solution

  1. Boost Multiprecision Design Boost Multiprecision Library provides integer, rational, and floating-point number types in C++ that have more range and precision than built-in types of C++. The big number types in Multiprecision can be used with a wide selection of basic mathematical operations. Boost Multiprecision provides a generic interface to GMP, MPFR, MPIR, TomMath backends, with support for integer, rational and floating-point types. In addition, user-defined backends can be created and used with the interface. The Multiprecision library consists of two parts: ● An expression-template-enabled frontend number that handles all the operator overloading, expression evaluation optimization, and code reduction. ● A selection of backends that implement the actual arithmetic operations, and need conform only to the reduced interface requirements of the frontend. Supported types of backends include: GMP, MPFR, MPIR, TomMath, and Boost-licensed. Frontend declaration:
template <class Backend, expression_template_option ExpressionTemplates = expression_template_default::value>
class number;

Backend declaration (cpp_int_backend is used as an example):

template <unsigned MinBits = 0, unsigned MaxBits = 0, boost::multiprecision::cpp_integer_type SignType = signed_magnitude, cpp_int_check_type Checked = unchecked, class Allocator = typename mpl::if_c<MinBits && (MinBits == MaxBits), void, std::allocator<limb_type> >::type >
struct cpp_int_backend;

Frontend and backend declaration (boost::multiprecision::int128_t is used as an example):

typedef number<cpp_int_backend<128, 128, signed_magnitude, unchecked, void> >    int128_t;

Another backend declaration (GMP is used as an example):

typedef number<gmp_float<50> >    mpf_float_50;
  1. Integer ● Declaration of the Boost's built-in Integer data type:
typedef number<cpp_int_backend<> >                   cpp_int;

// Fixed precision unsigned types:

typedef number<cpp_int_backend<128, 128, unsigned_magnitude, unchecked, void> >   uint128_t;
typedef number<cpp_int_backend<256, 256, unsigned_magnitude, unchecked, void> >   uint256_t;
typedef number<cpp_int_backend<512, 512, unsigned_magnitude, unchecked, void> >   uint512_t;
typedef number<cpp_int_backend<1024, 1024, unsigned_magnitude, unchecked, void> > uint1024_t;

// Fixed precision signed types:

typedef number<cpp_int_backend<128, 128, signed_magnitude, unchecked, void> >    int128_t;
typedef number<cpp_int_backend<256, 256, signed_magnitude, unchecked, void> >    int256_t;
typedef number<cpp_int_backend<512, 512, signed_magnitude, unchecked, void> >    int512_t;
typedef number<cpp_int_backend<1024, 1024, signed_magnitude, unchecked, void> >  int1024_t;
template <unsigned MinBits, unsigned MaxBits, cpp_integer_type SignType, cpp_int_check_type Checked, class Allocator>
struct cpp_int_backend
: public cpp_int_base<
min_precision<cpp_int_backend<MinBits, MaxBits, SignType, Checked, Allocator> >::value,
max_precision<cpp_int_backend<MinBits, MaxBits, SignType, Checked, Allocator> >::value,
SignType,
Checked,
Allocator,
is_trivial_cpp_int<cpp_int_backend<MinBits, MaxBits, SignType, Checked, Allocator> >::value>
#if defined(BOOST_HAS_INT128)
typedef detail::largest_unsigned_type<64>::type limb_type;
typedef detail::largest_signed_type<64>::type signed_limb_type;
typedef boost::uint128_type double_limb_type;
typedef boost::int128_type signed_double_limb_type;
static const limb_type max_block_10 = 1000000000000000000uLL;
static const limb_type digits_per_block_10 = 18;
.......
#else
typedef detail::largest_unsigned_type<32>::type limb_type;
typedef detail::largest_signed_type<32>::type signed_limb_type;
typedef detail::largest_unsigned_type<64>::type double_limb_type;
typedef detail::largest_signed_type<64>::type signed_double_limb_type;
static const limb_type max_block_10 = 1000000000;
static const limb_type digits_per_block_10 = 9;
.......
#endif

● Analysis on the storage of data of Boost's built-in Integer data type

template <unsigned MinBits, unsigned MaxBits, cpp_integer_type SignType, cpp_int_check_type Checked, class Allocator>
struct is_trivial_cpp_int<cpp_int_backend<MinBits, MaxBits, SignType, Checked, Allocator> >
{
typedef cpp_int_backend<MinBits, MaxBits, SignType, Checked, Allocator> self;
static const bool value = is_void::value && (max_precision::value <= (sizeof(double_limb_type) * CHAR_BIT) - (SignType == signed_packed ? 1 : 0));
};

If is_trivial_cpp_int::value is set to true, the maximum precision is smaller than or equal to the value set for double_limb_type:

typedef typename trivial_limb_type::type  local_limb_type;
typedef local_limb_type*                           limb_pointer;
typedef const local_limb_type*                     const_limb_pointer;
typedef mpl::int_                         checked_type;
protected:
BOOST_STATIC_CONSTANT(unsigned, limb_bits = sizeof(local_limb_type) * CHAR_BIT);
BOOST_STATIC_CONSTANT(local_limb_type, limb_mask = (MinBits < limb_bits) ? local_limb_type((local_limb_type(~local_limb_type(0))) >> (limb_bits - MinBits)) : local_limb_type(~local_limb_type(0)));
private:
local_limb_type    m_data;
bool               m_sign;

If is_trivial_cpp_int::value is set to true, the maximum precision is larger than the value set for double_limb_type:

public:
   BOOST_STATIC_CONSTANT(unsigned, limb_bits = sizeof(limb_type) * CHAR_BIT);
   BOOST_STATIC_CONSTANT(limb_type, max_limb_value = ~static_cast<limb_type>(0u));
   BOOST_STATIC_CONSTANT(limb_type, sign_bit_mask = static_cast<limb_type>(1u) << (limb_bits - 1));
   BOOST_STATIC_CONSTANT(unsigned, internal_limb_count = MinBits / limb_bits + ((MinBits % limb_bits) ? 1 : 0));
   BOOST_STATIC_CONSTANT(bool, variable = false);
   BOOST_STATIC_CONSTANT(limb_type, upper_limb_mask = (MinBits % limb_bits) ? (limb_type(1) << (MinBits % limb_bits)) -1 : (~limb_type(0)));
   BOOST_STATIC_ASSERT_MSG(internal_limb_count >= 2, "A fixed precision integer type must have at least 2 limbs");

private:
   union data_type{
      limb_type          m_data[internal_limb_count];
      limb_type          m_first_limb;
      double_limb_type   m_double_first_limb;

      BOOST_CONSTEXPR data_type() : m_first_limb(0) {}
      BOOST_CONSTEXPR data_type(limb_type i) : m_first_limb(i) {}
      BOOST_CONSTEXPR data_type(double_limb_type i) : m_double_first_limb(i) {}
#if defined(BOOST_MP_USER_DEFINED_LITERALS)
      template <limb_type...VALUES>
      BOOST_CONSTEXPR data_type(literals::detail::value_pack<VALUES...>) : m_data{ VALUES... } {}
#endif
   } m_wrapper;
   boost::uint16_t    m_limbs;
   bool               m_sign;

● Required storage for each data type (BOOST_HAS_INT128 Macro enabled)

Data type Fixed or arbitrary precision Storage (in bytes)
int128_t Fixed 32
int256_t Fixed 48
int512_t Fixed 80
int1024_t Fixed 144
cpp_int Arbitrary >= 32

intX_t data types are used for calculating required memory capacity during compilation. They consume more memory capacity while providing higher performance, compared to integer data types with arbitrary precision. The cpp_int data type is a combination of static scaling and dynamic scaling. This data type requires memory allocation. It sacrifices performance to ensure scalability. ● Required storage for cpp_int, GMP, and TomMath (BOOST_HAS_INT128 Macro enabled)

Type 2~32 digit 64 digit 128 digit 256 digit 512 digit 1024 digit
cpp_int 32+0 32+72 32+72 32+264 32+264 32+1032
gmp 26+14 16+40 16+72 16+120 16+232 16+440
tommath 24+264 24+264 24+264 24+264 24+264 24+264

According to the table above, GMP is less memory-consuming, TomMath requires more times of memory allocation, cpp_int consumes the most memory capacity. The required storage for each data type in the table is presented in the "a+b" format, where a indicates the capacity required for the data type and b indicates the capacity allocated from the stack for the data type. ● Performance comparison Based on the official description, GMP provides higher performance than cpp_int. However, whether GMP provides higher performance than int256_t needs to be tested. ● Performance comparison between GMP and int256_t About the performance test: single thread that is bound to CPU cores. Ten thousands of 1- to 65-bit integers are pupulated, and two of them are randomly choosen to complete an operation (such as add, divide, multiply, and subtract) for 1 billion times. The following table provides the time consumed (unit: ms):

Type 1 2 3 4 5
GMP 201088417 201931033 200657491 200551173 200842938
int256_t 340221396 336306805 337436802 335756615 336096974

Based on the table above, the performance of GMP is 1.68 times of that of int256_t. Following is the test code.

#include <iostream>
#include <string>
#include <quadmath.h>
#include "boost/multiprecision/gmp.hpp"
#include "boost/multiprecision/tommath.hpp"
#include "boost/multiprecision/cpp_int.hpp"
#include "boost/multiprecision/cpp_dec_float.hpp"

#include <vector>
#include <cstdio>
#include <cstddef>
#include <malloc.h>
#include <cstdlib>
#include <sys/time.h>

size_t usectime() {
	struct timeval val;
	gettimeofday(&val, nullptr);
	return val.tv_sec*1000000 + val.tv_usec;
}

std::vector<std::string> vects;
int vects_len;

static const std::string ALPHA = "0123456789";

std::string generate(int len) {
	srand(usectime());
	std::string result;
	bool first = true;
	for (int i=0; i<len; ) {
		int r = rand() % 10;
		if (first) {
			if (ALPHA[r] == '0') {
				continue;
			} else {
				result.push_back(ALPHA[r]);
			}
			first = false;
		} else {
				result.push_back(ALPHA[r]);
		}
		i++;
	}
	std::cout << "generate : " << result << std::endl;
	return result;
}

void init(int len, int max_len) {
	srand(usectime());
	vects.resize(len);
	vects_len = len;
	for (int i=0; i<len; i++) {
		int gl = rand() % max_len + 1;
		vects.push_back(generate(gl));
	}
}


int main(int argc, char* argv[]) {
init(10000, 65);
srand(usectime());	
size_t start = usectime();
for (int i=0; i<1000000000; i++) {
	int rd1 = rand();	
	int rd2 = rand();	
	boost::multiprecision::mpz_int iz1(vects[rd1%vects_len]);
	boost::multiprecision::mpz_int iz2(vects[rd2%vects_len]);
	boost::multiprecision::mpz_int iz3;
	switch((rd1+rd2)%5) {
		case 0 : 
			iz3 = iz1 + iz2;
			break;		
		case 1 : 
			iz3 = iz1 - iz2;
			break;		
		case 2 : 
			iz3 = iz1 * iz2;
			break;		
		case 3 : 
			if (iz2 != 0) {
				iz3 = iz1 / iz2;
			}
			break;		
		case 4 : 
			if (iz2 != 0) {
				iz3 = iz1 % iz2;
			}
			break;		
	}
}

size_t end = usectime();
std::cout << "usectime :"<< end - start << std::endl;

return 0;

}

Command: numactl -C 2 ./mmx 3. Conclusion For integers with 1 to 65 precision, the required storage for GMP is 40 to 56 bytes. If the precision is evenly distributed, the average storage is 48 bytes. int256_t adopts fixed precision, with the required storage of 48 bytes. Its performance is 68% higher than GMP. What's more, using GMP will introduce another third-party library, which will increase the integration complexity, and GMP is not suitable for cross-platform scenarios. Therefore, GMP is not the optimal option. cpp_int provides good scalability but low performance, while int256_t is on the contrary. Currently, scalability is not a factor for us to consider. In conclusion, to address the current problem, int128_t and int256_t can be used to replace cpp_int.

Implementation steps

  1. To convert decimal to binary for storage in Tianmu, modifications must be made to the following parts and features: ● Data type for storage and storage encoding format. ● decimal's support for SQL logical operators, including Greater Than, Smaller Than, and Equal To. ● decimal's support for aggregate functions, including SUM, MAX, and STD. ● decimal's support for GROUP BY, JOIN, SORT, IN, and NOT IN. ● Range check for decimal. ● decimal's support for indexing and filtering. Specifically, the following must be achieved: ● Conversion of data type for storage from PackInt to PackStr, conversion from decimal points to binary strings, and support for all kinds of conditions for decimal. ● core::ValueOrNull's support for decimal[optional]. ● ConstColumn's support for decimal. ● Item_tianmufield's support for decimal. ● core::DataType's support for decimal: Binary strings can be converted to any other data type. ● RCAttr's support for decimal: Filtering and comparison of data packs whose data type is decimal are supported. ● MysqlExpression's support for decimal[option]: The underlying support can remain unchanged. ● MultiValColumn's support for decimal: especially for IN and NOT IN. ● decimal's support for AVG, SUM, MAX, and MIN. ● decimal's support for JOIN: HASH, SORT, and MAP. ● SorterWrapper's support for decimal: ORDER is supported.

CheckLists

  • [x] change the type of Pack to PackDec for adapting Decimal.
  • [x] max_s, min_s, sum_s extend in DPN,and some serialize function in it, add the field of Pack's type in it.
  • [x] add the RCDecimal class,and some transfer method between RCNum.
  • [x] sql logical operator ">,<,=,>=,<=" support in decimal.
  • [x] Item_tianmudecimal supports the operation of Set BString.
  • [x] ValueOrNull supports decimal datatype.
  • [x] MysqlExpression convert Decimal to ValueOrNull.
  • [x] change MagicNumber Precision 18 to MAX_DEC_PRECISION.
  • [x] RCAttr supports PackDec type, eg: LoadData,UpdateData.
  • [x] RCAttr support RCDecimal class.
  • [x] Field2VC support decimal,convert to the string of decimal,then store to the disk.
  • [x] check the data in the Decimal Pack, filter the related data.
  • [x] ValueCache class update the decimal Pack's min, max, sum value for future computing.
  • [x] ValueParserForText support parsing the RCDecimal class from string.
  • [x] ConstColumn supports that getting value from RCDecimal.
  • [ ] ParseRealDecimal method needs transferring the string (eg. 1E3) to RCDecimal.
  • [ ] according the condition, filter the suspect Pack data. eg: IN or NOT IN.
  • [ ] Pack of decimal datatype supports the KG, eg: Histogram, charmap.
  • [ ] AggregationAlgorithm support decimal.
  • [ ] change exponent number to RCDecimal.
  • [ ] ConditionEncoder::TransformINs support decimals.
  • [ ] group by support decimals.
  • [ ] order by support decimals.
  • [ ] sql update operate support decimals.

swoapri avatar Aug 31 '22 03:08 swoapri