gutendex
Error when updating the catalog on server
Hey, I am trying to run gutendex on an Azure server, but when I try to update the catalog, I get:
./manage.py updatecatalog
/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/environ/environ.py:628: UserWarning: /tmp/8dbba6c19a61f2e/gutendex/.env doesn't exist - if you're not configuring your environment separately, create one.
warnings.warn(
Starting script at 10:22:34 on September 21, 2023
Making temporary directory...
Downloading compressed catalog...
Decompressing catalog...
Detecting stale directories...
Error: [Errno 2] No such file or directory: '/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub'
Traceback (most recent call last):
File "/tmp/8dbba6c19a61f2e/./manage.py", line 22, in <module>
execute_from_command_line(sys.argv)
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
utility.execute()
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/__init__.py", line 375, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/base.py", line 323, in run_from_argv
self.execute(*args, **cmd_options)
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/base.py", line 364, in execute
output = self.handle(*args, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/8dbba6c19a61f2e/books/management/commands/updatecatalog.py", line 345, in handle
send_log_email()
File "/tmp/8dbba6c19a61f2e/books/management/commands/updatecatalog.py", line 260, in send_log_email
send_mail(
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/__init__.py", line 60, in send_mail
return mail.send()
^^^^^^^^^^^
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/message.py", line 306, in send
return self.get_connection(fail_silently).send_messages([self])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/backends/smtp.py", line 103, in send_messages
new_conn_created = self.open()
^^^^^^^^^^^
File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/backends/smtp.py", line 70, in open
self.connection.login(self.username, self.password)
File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 750, in login
raise last_exception
File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 739, in login
(code, resp) = self.auth(
^^^^^^^^^^
File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 662, in auth
raise SMTPAuthenticationError(code, resp)
smtplib.SMTPAuthenticationError: (530, b'Must issue a STARTTLS command first')
It seems like the main problem here is Error: [Errno 2] No such file or directory: '/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub'.
Do you have any idea what could cause this?
When I manually create a directory at `/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub`, I get an error telling me that a file or directory already exists at that path.
Hi David, it looks like the code is failing to find a folder that is supposed to be created when the Gutenberg XML files are unzipped. Maybe the file was temporarily changed to something unexpected on the Gutenberg server, or maybe it was not fully downloaded due to a network issue, but I am just guessing.
In any case, I would recommend running it again without any manually created folders like catalog_files/tmp/cache/epub.
If the issue still occurs after that, then I would suspect it is caused by some kind of restriction in the execution environment. You seem to be using a tmp folder as a working directory, so maybe the Azure server aggressively deletes files or something like that, but that is also just a guess.
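To narrow this down, it might help to check the extraction step on its own, outside the update script. The following is only a rough sketch, not code from gutendex: `extract_and_check` is a hypothetical helper, and it assumes you already have the downloaded catalog archive saved somewhere on disk. It unpacks the archive into a fresh temporary directory and reports whether the `cache/epub` folder from the error message is there afterwards.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_and_check(archive_path, expected_subdir="cache/epub"):
    """Extract a tar archive into a new temp directory and report whether
    the expected subdirectory exists afterwards.

    Hypothetical helper for debugging; not part of gutendex.
    """
    # A fresh directory avoids clashes with leftovers from earlier runs.
    work_dir = Path(tempfile.mkdtemp(prefix="catalog_check_"))

    # tarfile detects the compression (gzip, bz2, ...) automatically.
    with tarfile.open(archive_path) as archive:
        archive.extractall(work_dir)

    return work_dir, (work_dir / expected_subdir).is_dir()


if __name__ == "__main__":
    import sys

    # Usage: python check_archive.py /path/to/downloaded/archive
    work_dir, found = extract_and_check(sys.argv[1])
    print(f"Extracted to {work_dir}; cache/epub present: {found}")
```

If the folder shows up here but is gone by the time the script looks for it, that would point towards the environment removing files rather than a bad download.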
I hope that helps. Good luck!
Hi,
thank you for the quick response. I have found that putting a sys.exit() right after the download step lets the file download correctly. But when I remove the sys.exit(), the rdf folder ends up empty and the downloaded file no longer exists.
This video shows what I mean. The first run seems to download it correctly due to the manually inserted sys.exit() but removing the sys.exit() doesn't save the data correctly.
https://github.com/garethbjohnson/gutendex/assets/69865187/1d3a4bc2-2f16-42fd-a5a4-34b47fa43bde
@garethbjohnson might this happen due to the environment variables for e.g. the DB access being wrong, or would that give me a proper error?
Ok, I have managed to populate the DB from my local machine and let the API use that DB to serve the data. But inserting the books into the DB takes extremely long. I am at 600 books after 10 minutes, out of about 70,000 AFAIK. Is there a way to multi-thread this?
The script is supposed to delete the files before it finishes, but only after populating the database, and it looks like that is not happening in the video for some reason, even without the initial error. I do not think that either issue would be caused by a lack of compatible environment variables.
As for multi-threading, I had not considered it, but I found a StackOverflow answer that makes it seem pretty simple to me at first glance: https://stackoverflow.com/a/28463266
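For illustration, here is a minimal sketch of the general idea using the standard library's `concurrent.futures.ThreadPoolExecutor`. It is not taken from gutendex or from the linked answer: `process_in_parallel` and `parse_book` are hypothetical names, and `parse_book` is only a stand-in for whatever per-book work the script does.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_parallel(items, worker, max_workers=4):
    """Apply `worker` to every item using a thread pool.

    Results come back in the same order as the input items.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


def parse_book(book_id):
    # Hypothetical stand-in for parsing one book's RDF file and
    # saving it; the real per-book work would go here.
    return {"id": book_id, "title": f"Book {book_id}"}


if __name__ == "__main__":
    books = process_in_parallel(range(1, 6), parse_book, max_workers=4)
    print(f"Processed {len(books)} books")
```

Whether this actually speeds things up depends on where the time goes. Threads mostly help when the work is waiting on I/O such as database round trips; if the time is spent parsing in pure Python, the global interpreter lock would likely limit the gain and separate processes might be needed instead. Each thread would also use its own database connection in Django, so that is something to keep an eye on.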
Multithreading it would be great. I am running the script to set up the DB from my personal machine, since it doesn't seem to work in the cloud environment, and I am 24 hours in now with about 50k of the 70k books done. Running this on my 12 cores would be a huge improvement.