tinydb
JSON decoder error when reading DB from multiple processes
I was doing some testing with tinydb. It's awesome software. I found that it's not recommended for multi-process use cases like reads from Flask etc., but since I only had parallel read operations, I thought it would work. However, when multiple processes try to read the DB, I get a JSON error:
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/process.py", line 313, in _bootstrap
self.run()
File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "ptyhon_db_test_2.py", line 15, in search_data
result = db.search(Query()['workflowId'] == 'm9imoy9b')
File "/usr/local/lib/python3.8/dist-packages/tinydb/table.py", line 254, in search
for doc_id, doc in self._read_table().items()
File "/usr/local/lib/python3.8/dist-packages/tinydb/table.py", line 704, in _read_table
tables = self._storage.read()
File "/usr/local/lib/python3.8/dist-packages/tinydb/storages.py", line 136, in read
return json.load(self._handle)
File "/usr/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
To reproduce it (you will need to adjust the DB JSON file and the search query):
import multiprocessing
import time
from tinydb import TinyDB, Query

# Create a TinyDB database
db = TinyDB('my_database.json')

# Define the ID you want to search for
search_id = 'm9imoy9b'  # Replace with your desired ID

# Define a function to perform the search operation
def search_data():
    while True:
        # Perform a search for the specific ID
        result = db.search(Query()['workflowId'] == search_id)
        print(f"Search result for ID {search_id}")

# Define the number of processes you want for simultaneous searches
num_processes = 100

# Create and start processes for searching
processes = []
for _ in range(num_processes):
    process = multiprocessing.Process(target=search_data)
    process.start()
    processes.append(process)

try:
    # Keep the processes running in the background
    for process in processes:
        process.join()
except KeyboardInterrupt:
    # Terminate the processes gracefully on Ctrl+C
    for process in processes:
        process.terminate()

# Close the database
db.close()
I have been experiencing the same thing. For me, it seems to clone the JSON file's contents at the end of the file (basically as if you copied everything and then pasted it again at the end of the JSON file).
The problem is that you don't use locks while reading and writing. Because of this, it's possible that one thread or process does a seek operation just before another thread or process wants to read.
For example, below is the read code from JSONStorage.
def read(self) -> Optional[Dict[str, Dict[str, Any]]]:
    # Get the file size by moving the cursor to the file end and reading
    # its location
    self._handle.seek(0, os.SEEK_END)
    size = self._handle.tell()

    if not size:
        # File is empty, so we return ``None`` so TinyDB can properly
        # initialize the database
        return None
    else:
        # Return the cursor to the beginning of the file
        self._handle.seek(0)

        # Load the JSON contents of the file
        return json.load(self._handle)
- Process one does a read.
  - 1.1 Process one does self._handle.seek(0, os.SEEK_END); the cursor is now at the end of the file.
  - 1.2 Process one does self._handle.seek(0); the cursor is now at the beginning of the file.
- Process two starts a read.
  - 2.1 Process two does self._handle.seek(0, os.SEEK_END); the cursor is now at the end of the file.
- Process one does json.load(self._handle) -> it reads from the end of the file -> the file seems empty.
Process one set the cursor to the beginning, but process two moved it to the end just before process one read, resulting in an empty string. This is one way things can go wrong, but you can imagine that there are many other ways this can fail. As @SpiralAPI saw, a process might do a self._handle.seek(0, os.SEEK_END) just before a write, resulting in all the data being appended instead of overwritten.
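To see the read failure in isolation, here is a minimal single-process sketch (assuming my_database.json exists and is non-empty): leaving the cursor at the end of the file before json.load() raises the same "Expecting value: line 1 column 1 (char 0)" error as in the traceback above.

import json
import os

# Simulate another process having just moved the shared cursor to the end of
# the file (step 2.1 above): json.load() then reads an empty string and fails.
with open('my_database.json') as handle:
    handle.seek(0, os.SEEK_END)   # cursor at end of file
    json.load(handle)             # raises json.decoder.JSONDecodeError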
It's not very useful to do a search using multiple processes or threads over a single file. Since you would need to use locks every time you read or write, you have basically turned it into a synchronous operation.
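For illustration, a rough sketch of what such locking could look like, assuming the processes are forked after the lock is created (the Linux default start method, as in the repro script above). LockedJSONStorage is just a hypothetical name, not part of TinyDB:

import multiprocessing

from tinydb import TinyDB
from tinydb.storages import JSONStorage

# One lock shared by all forked processes; every read/write goes through it,
# which serializes file access (i.e. it effectively becomes synchronous).
_db_lock = multiprocessing.Lock()

class LockedJSONStorage(JSONStorage):
    def read(self):
        with _db_lock:
            return super().read()

    def write(self, data):
        with _db_lock:
            super().write(data)

db = TinyDB('my_database.json', storage=LockedJSONStorage)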
If you need to do a CPU-intensive task, it's better to read everything ahead of time (or at least in chunks) and then pass the data to the different processes or threads.
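For example, a rough sketch of that approach (cpu_heavy is a hypothetical stand-in for the actual work; it assumes the documents live in TinyDB's default '_default' table):

import json
import multiprocessing

def cpu_heavy(doc):
    # Stand-in for the actual CPU-intensive work on a single document
    return doc.get('workflowId')

if __name__ == '__main__':
    # Read the whole file once in the parent process...
    with open('my_database.json') as handle:
        tables = json.load(handle)
    docs = list(tables.get('_default', {}).values())

    # ...then hand the already-loaded data to the worker processes.
    with multiprocessing.Pool() as pool:
        results = pool.map(cpu_heavy, docs)
    print(results)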