
FR: Python 3 Support

Open scottfr opened this issue 4 years ago • 5 comments

It would be great if this could be used with Python 3.

scottfr avatar Aug 16 '20 08:08 scottfr

Agreed. Unfortunately, I don't have the expertise to (efficiently) convert the Google scripts that this repo is based on (which are written in Python 2.7).

If someone successfully converts it though (not just the wrapper script, but also the Google scripts, so that the program can actually run on the latest Python), I'll create a new folder for that version.

Venryx avatar Aug 16 '20 08:08 Venryx

Ok, spent some time on this today. I ended up focusing on a Node.js version as that is more useful for us, but a Python 3 version should be possible.

Here is what I learned, if it is helpful to you or anyone else.

There are two parts to the task:

  1. Building the LevelDB log file reader
  2. Building the Protobuf decoder

Part 1

Here is some very crude Python 3 code for the LevelDB RecordsReader. It is the existing Python 2 code from your repository, with all the needed dependencies pulled into a single file and some slight changes made for Python 3 compatibility.

import os
import io
import json
import sys
import logging
import struct
import array

CRC_TABLE = (
    0x00000000, 0xf26b8303, 0xe13b70f7, 0x1350f3f4,
    0xc79a971f, 0x35f1141c, 0x26a1e7e8, 0xd4ca64eb,
    0x8ad958cf, 0x78b2dbcc, 0x6be22838, 0x9989ab3b,
    0x4d43cfd0, 0xbf284cd3, 0xac78bf27, 0x5e133c24,
    0x105ec76f, 0xe235446c, 0xf165b798, 0x030e349b,
    0xd7c45070, 0x25afd373, 0x36ff2087, 0xc494a384,
    0x9a879fa0, 0x68ec1ca3, 0x7bbcef57, 0x89d76c54,
    0x5d1d08bf, 0xaf768bbc, 0xbc267848, 0x4e4dfb4b,
    0x20bd8ede, 0xd2d60ddd, 0xc186fe29, 0x33ed7d2a,
    0xe72719c1, 0x154c9ac2, 0x061c6936, 0xf477ea35,
    0xaa64d611, 0x580f5512, 0x4b5fa6e6, 0xb93425e5,
    0x6dfe410e, 0x9f95c20d, 0x8cc531f9, 0x7eaeb2fa,
    0x30e349b1, 0xc288cab2, 0xd1d83946, 0x23b3ba45,
    0xf779deae, 0x05125dad, 0x1642ae59, 0xe4292d5a,
    0xba3a117e, 0x4851927d, 0x5b016189, 0xa96ae28a,
    0x7da08661, 0x8fcb0562, 0x9c9bf696, 0x6ef07595,
    0x417b1dbc, 0xb3109ebf, 0xa0406d4b, 0x522bee48,
    0x86e18aa3, 0x748a09a0, 0x67dafa54, 0x95b17957,
    0xcba24573, 0x39c9c670, 0x2a993584, 0xd8f2b687,
    0x0c38d26c, 0xfe53516f, 0xed03a29b, 0x1f682198,
    0x5125dad3, 0xa34e59d0, 0xb01eaa24, 0x42752927,
    0x96bf4dcc, 0x64d4cecf, 0x77843d3b, 0x85efbe38,
    0xdbfc821c, 0x2997011f, 0x3ac7f2eb, 0xc8ac71e8,
    0x1c661503, 0xee0d9600, 0xfd5d65f4, 0x0f36e6f7,
    0x61c69362, 0x93ad1061, 0x80fde395, 0x72966096,
    0xa65c047d, 0x5437877e, 0x4767748a, 0xb50cf789,
    0xeb1fcbad, 0x197448ae, 0x0a24bb5a, 0xf84f3859,
    0x2c855cb2, 0xdeeedfb1, 0xcdbe2c45, 0x3fd5af46,
    0x7198540d, 0x83f3d70e, 0x90a324fa, 0x62c8a7f9,
    0xb602c312, 0x44694011, 0x5739b3e5, 0xa55230e6,
    0xfb410cc2, 0x092a8fc1, 0x1a7a7c35, 0xe811ff36,
    0x3cdb9bdd, 0xceb018de, 0xdde0eb2a, 0x2f8b6829,
    0x82f63b78, 0x709db87b, 0x63cd4b8f, 0x91a6c88c,
    0x456cac67, 0xb7072f64, 0xa457dc90, 0x563c5f93,
    0x082f63b7, 0xfa44e0b4, 0xe9141340, 0x1b7f9043,
    0xcfb5f4a8, 0x3dde77ab, 0x2e8e845f, 0xdce5075c,
    0x92a8fc17, 0x60c37f14, 0x73938ce0, 0x81f80fe3,
    0x55326b08, 0xa759e80b, 0xb4091bff, 0x466298fc,
    0x1871a4d8, 0xea1a27db, 0xf94ad42f, 0x0b21572c,
    0xdfeb33c7, 0x2d80b0c4, 0x3ed04330, 0xccbbc033,
    0xa24bb5a6, 0x502036a5, 0x4370c551, 0xb11b4652,
    0x65d122b9, 0x97baa1ba, 0x84ea524e, 0x7681d14d,
    0x2892ed69, 0xdaf96e6a, 0xc9a99d9e, 0x3bc21e9d,
    0xef087a76, 0x1d63f975, 0x0e330a81, 0xfc588982,
    0xb21572c9, 0x407ef1ca, 0x532e023e, 0xa145813d,
    0x758fe5d6, 0x87e466d5, 0x94b49521, 0x66df1622,
    0x38cc2a06, 0xcaa7a905, 0xd9f75af1, 0x2b9cd9f2,
    0xff56bd19, 0x0d3d3e1a, 0x1e6dcdee, 0xec064eed,
    0xc38d26c4, 0x31e6a5c7, 0x22b65633, 0xd0ddd530,
    0x0417b1db, 0xf67c32d8, 0xe52cc12c, 0x1747422f,
    0x49547e0b, 0xbb3ffd08, 0xa86f0efc, 0x5a048dff,
    0x8ecee914, 0x7ca56a17, 0x6ff599e3, 0x9d9e1ae0,
    0xd3d3e1ab, 0x21b862a8, 0x32e8915c, 0xc083125f,
    0x144976b4, 0xe622f5b7, 0xf5720643, 0x07198540,
    0x590ab964, 0xab613a67, 0xb831c993, 0x4a5a4a90,
    0x9e902e7b, 0x6cfbad78, 0x7fab5e8c, 0x8dc0dd8f,
    0xe330a81a, 0x115b2b19, 0x020bd8ed, 0xf0605bee,
    0x24aa3f05, 0xd6c1bc06, 0xc5914ff2, 0x37faccf1,
    0x69e9f0d5, 0x9b8273d6, 0x88d28022, 0x7ab90321,
    0xae7367ca, 0x5c18e4c9, 0x4f48173d, 0xbd23943e,
    0xf36e6f75, 0x0105ec76, 0x12551f82, 0xe03e9c81,
    0x34f4f86a, 0xc69f7b69, 0xd5cf889d, 0x27a40b9e,
    0x79b737ba, 0x8bdcb4b9, 0x988c474d, 0x6ae7c44e,
    0xbe2da0a5, 0x4c4623a6, 0x5f16d052, 0xad7d5351,
    )



# Initial CRC value and 32-bit mask used by the CRC-32C routines below.
CRC_INIT = 0

_MASK = 0xFFFFFFFF


def crc_update(crc, data):
  """Update CRC-32C checksum with data.
  Args:
    crc: 32-bit checksum to update as long.
    data: byte array, string or iterable over bytes.
  Returns:
    32-bit updated CRC-32C as long.
  """

  if type(data) != array.array or data.itemsize != 1:
    buf = array.array("B", data)
  else:
    buf = data

  crc = crc ^ _MASK
  for b in buf:
    table_index = (crc ^ b) & 0xff
    crc = (CRC_TABLE[table_index] ^ (crc >> 8)) & _MASK
  return crc ^ _MASK


def crc_finalize(crc):
  """Finalize CRC-32C checksum.
  This function should be called as last step of crc calculation.
  Args:
    crc: 32-bit checksum as long.
  Returns:
    finalized 32-bit checksum as long
  """
  return crc & _MASK


def crc(data):
  """Compute CRC-32C checksum of the data.
  Args:
    data: byte array, string or iterable over bytes.
  Returns:
    32-bit CRC-32C checksum of data as long.
  """
  return crc_finalize(crc_update(CRC_INIT, data))


BLOCK_SIZE = 32 * 1024

# Physical record header: <masked CRC-32C (uint32), payload length (uint16), record type (uint8)>.
HEADER_FORMAT = '<IHB'

HEADER_LENGTH = struct.calcsize(HEADER_FORMAT)

# Record types; a logical record is split into FIRST/MIDDLE/LAST physical
# records when it does not fit within a single block.
RECORD_TYPE_NONE = 0
RECORD_TYPE_FULL = 1
RECORD_TYPE_FIRST = 2
RECORD_TYPE_MIDDLE = 3
RECORD_TYPE_LAST = 4

class Error(Exception):
  """Base class for exceptions in this module."""


class InvalidRecordError(Error):
  """Raised when invalid record encountered."""


class FileReader(object):
  """Interface specification for writers to be used with recordrecords module.
  FileReader defines a reader with position and efficient seek/position
  determining. All reads occur at current position.
  """

  def read(self, size):
    """Read data from file.
    Reads data from current position and advances position past the read data
    block.
    Args:
      size: number of bytes to read.
    Returns:
      iterable over bytes. If the number of bytes read is less than the 'size'
      argument, it is assumed that end of file was reached.
    """
    raise NotImplementedError()

  def tell(self):
    """Get current file position.
    Returns:
      current position as a byte offset in the file as integer.
    """
    raise NotImplementedError()


_CRC_MASK_DELTA = 0xa282ead8

def _mask_crc(crc):
  """Mask crc.
  Args:
    crc: integer crc.
  Returns:
    masked integer crc.
  """
  return (((crc >> 15) | (crc << 17)) + _CRC_MASK_DELTA) & 0xFFFFFFFF


def _unmask_crc(masked_crc):
  """Unmask crc.
  Args:
    masked_crc: masked integer crc.
  Returns:
    original crc.
  """
  rot = (masked_crc - _CRC_MASK_DELTA) & 0xFFFFFFFF
  return ((rot >> 17) | (rot << 15)) & 0xFFFFFFFF



class RecordsReader(object):
  """A reader for records format."""

  def __init__(self, reader):
    self.__reader = reader

  def __try_read_record(self):
    """Try reading a record.
    Returns:
      (data, record_type) tuple.
    Raises:
      EOFError: when end of file was reached.
      InvalidRecordError: when valid record could not be read.
    """
    block_remaining = BLOCK_SIZE - self.__reader.tell() % BLOCK_SIZE
    if block_remaining < HEADER_LENGTH:
      return (b'', RECORD_TYPE_NONE)

    header = self.__reader.read(HEADER_LENGTH)
    if len(header) != HEADER_LENGTH:
      raise EOFError('Read %s bytes instead of %s' %
                     (len(header), HEADER_LENGTH))

    (masked_crc, length, record_type) = struct.unpack(HEADER_FORMAT, header)
    crc = _unmask_crc(masked_crc)

    if length + HEADER_LENGTH > block_remaining:

      raise InvalidRecordError('Length is too big')

    data = self.__reader.read(length)
    if len(data) != length:
      raise EOFError('Not enough data read. Expected: %s but got %s' %
                     (length, len(data)))

    if record_type == RECORD_TYPE_NONE:
      return (b'', record_type)

    actual_crc = crc_update(CRC_INIT, [record_type])
    actual_crc = crc_update(actual_crc, data)
    actual_crc = crc_finalize(actual_crc)

    
    if actual_crc != crc:
      raise InvalidRecordError('Data crc does not match')
    return (data, record_type)

  def __sync(self):
    """Skip reader to the block boundary."""
    pad_length = BLOCK_SIZE - self.__reader.tell() % BLOCK_SIZE
    if pad_length and pad_length != BLOCK_SIZE:
      data = self.__reader.read(pad_length)
      if len(data) != pad_length:
        raise EOFError('Read %d bytes instead of %d' %
                       (len(data), pad_length))

  def read(self):
    """Reads record from current position in reader."""
    data = None
    while True:
      last_offset = self.tell()
      try:
        (chunk, record_type) = self.__try_read_record()
        if record_type == RECORD_TYPE_NONE:
          self.__sync()
        elif record_type == RECORD_TYPE_FULL:
          if data is not None:
            logging.warning(
                "Ordering corruption: Got FULL record while already "
                "in a chunk at offset %d", last_offset)
          return chunk
        elif record_type == RECORD_TYPE_FIRST:
          if data is not None:
            logging.warning(
                "Ordering corruption: Got FIRST record while already "
                "in a chunk at offset %d", last_offset)
          data = chunk
        elif record_type == RECORD_TYPE_MIDDLE:
          if data is None:
            logging.warning(
                "Ordering corruption: Got MIDDLE record before FIRST "
                "record at offset %d", last_offset)
          else:
            data += chunk
        elif record_type == RECORD_TYPE_LAST:
          if data is None:
            logging.warning(
                "Ordering corruption: Got LAST record but no chunk is in "
                "progress at offset %d", last_offset)
          else:
            result = data + chunk
            data = None
            return result
        else:
          raise InvalidRecordError("Unsupported record type: %s" % record_type)

      except InvalidRecordError as e:
        logging.warning("Invalid record encountered at %s (%s). Syncing to "
                        "the next block", last_offset, e)
        data = None
        self.__sync()

  def __iter__(self):
    try:
      while True:
        yield self.read()
    except EOFError:
      pass

  def tell(self):
    """Return file's current position."""
    return self.__reader.tell()

Part 2

The Datastore Protobuf definition is here:

https://gitlab.com/gitlab-org/gitlab-runner/-/blob/25fb7833fa9035b85b88b9e56db59e08bc130124/vendor/google.golang.org/appengine/internal/datastore/datastore_v3.proto

This can be used to generate a Python Protobuf decoder as described here:

https://developers.google.com/protocol-buffers/docs/pythontutorial#compiling-your-protocol-buffers
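
As a rough, untested sketch of how the two parts fit together: compiling that .proto with protoc --python_out=. datastore_v3.proto should produce a datastore_v3_pb2.py module (the module name, and any tweaks the vendored .proto may need to its imports or options, are assumptions here), after which each record from the Part 1 reader can be parsed into an EntityProto message:

# Rough sketch: parse one raw record into a Datastore EntityProto message.
# Assumes the generated module is named datastore_v3_pb2 and that the protobuf
# runtime is installed (pip install protobuf).
import datastore_v3_pb2


def decode_record(record_bytes):
  """Parse the raw bytes of one record into an EntityProto message."""
  entity = datastore_v3_pb2.EntityProto()
  entity.ParseFromString(record_bytes)
  return entity


# Combined with the RecordsReader from Part 1 (placeholder path):
# with open('path/to/output-0', 'rb') as f:
#   for record in RecordsReader(f):
#     entity = decode_record(record)
#     print(entity.key, [p.name for p in entity.property])

From there, turning each EntityProto into JSON is mostly a matter of walking its key path and property list.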

Hope this helps!

scottfr avatar Aug 16 '20 18:08 scottfr

@scottfr I now have the Reader and also the Protobuf decoder, both in Python, thanks to you. But how do I use them? How can I call them from the initial script of @Venryx?

cuzureau avatar Feb 08 '21 19:02 cuzureau

@Venryx I have taken a stab at creating the converter in Python 3. It works for my use case, but I'm not sure whether it would work well for subcollections or other use cases. The source can be found at firestore-export-json.

Thanks to @Venryx for the excellent work and to @scottfr for providing the file reader.

labbots avatar Apr 26 '21 19:04 labbots

@labbots Nice work!

For this repo, I'll eventually get around to integrating the Python 3 changes; but till then, I've included a link to your Python 3 version in the readme.

Venryx avatar Apr 26 '21 20:04 Venryx