Error when updating the catalog on server

Open DavidLazarescu opened this issue 1 year ago • 7 comments

Hey, I am trying to run gutendex on an Azure server, but when I try to update the catalog, I get:

./manage.py updatecatalog
/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/environ/environ.py:628: UserWarning: /tmp/8dbba6c19a61f2e/gutendex/.env doesn't exist - if you're not configuring your environment separately, create one.
warnings.warn(
Starting script at 10:22:34 on September 21, 2023
Making temporary directory...
Downloading compressed catalog...
Decompressing catalog...
Detecting stale directories...
Error: [Errno 2] No such file or directory: '/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub'

Traceback (most recent call last):
  File "/tmp/8dbba6c19a61f2e/./manage.py", line 22, in <module>
    execute_from_command_line(sys.argv)
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/__init__.py", line 381, in execute_from_command_line
    utility.execute()
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/__init__.py", line 375, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/base.py", line 323, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/management/base.py", line 364, in execute
    output = self.handle(*args, **options)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/8dbba6c19a61f2e/books/management/commands/updatecatalog.py", line 345, in handle
    send_log_email()
  File "/tmp/8dbba6c19a61f2e/books/management/commands/updatecatalog.py", line 260, in send_log_email
    send_mail(
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/__init__.py", line 60, in send_mail
    return mail.send()
           ^^^^^^^^^^^
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/message.py", line 306, in send
    return self.get_connection(fail_silently).send_messages([self])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/backends/smtp.py", line 103, in send_messages
    new_conn_created = self.open()
                       ^^^^^^^^^^^
  File "/tmp/8dbba6c19a61f2e/antenv/lib/python3.11/site-packages/django/core/mail/backends/smtp.py", line 70, in open
    self.connection.login(self.username, self.password)
  File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 750, in login
    raise last_exception
  File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 739, in login
    (code, resp) = self.auth(
                   ^^^^^^^^^^
  File "/opt/python/3.11.4/lib/python3.11/smtplib.py", line 662, in auth
    raise SMTPAuthenticationError(code, resp)
smtplib.SMTPAuthenticationError: (530, b'Must issue a STARTTLS command first')

It seems like the main problem here is Error: [Errno 2] No such file or directory: '/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub'.

Do you have any idea what could cause this?

DavidLazarescu, Sep 21 '23 10:09

When I try to create a directory at '/tmp/8dbba6c19a61f2e/catalog_files/tmp/cache/epub' manually, I get an error telling me that a file or directory already exists at that path.

DavidLazarescu, Sep 21 '23 11:09

Hi David, it looks like the code is failing to find a folder that is supposed to be created when the Gutenberg XML files are unzipped. Maybe the file was temporarily changed to something unexpected on the Gutenberg server, or maybe it was not fully downloaded due to a network issue, but I am just guessing.

In any case, I would recommend running it again without any manually created folders like catalog_files/tmp/cache/epub.

If the issue still occurs after that, then I would suspect it is caused by some kind of restriction in the execution environment. You seem to be using a tmp folder as a working directory, so maybe the Azure server aggressively deletes files or something like that, but that is also just a guess.
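
One way to narrow this down is to check, right after decompression, whether the expected folder exists and what was extracted instead. Here is a minimal sketch, assuming the catalog is a compressed tar archive that unpacks to cache/epub. The function name and its arguments are hypothetical, and this is not the code in updatecatalog.py:

```python
import tarfile
from pathlib import Path


def extract_and_check(archive_path, target_dir, expected_subdir="cache/epub"):
    """Extract a tar archive and confirm the expected folder was created.

    All names here are hypothetical; this is not the updatecatalog.py code.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)

    # "r:*" lets tarfile detect the compression (bz2, gz, or none).
    with tarfile.open(archive_path, "r:*") as archive:
        archive.extractall(target)

    expected = target / expected_subdir
    if not expected.is_dir():
        # Report what was actually extracted to make the failure easier to debug.
        found = sorted(p.name for p in target.iterdir())
        raise FileNotFoundError(
            f"{expected} is missing after extraction; found: {found}"
        )
    return expected
```

If this raises on the server but not locally with the same archive, that would point toward the environment rather than the download.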

I hope that helps. Good luck!

garethbjohnson, Sep 22 '23 07:09

Hi,

thank you for the quick response. I have found that if I put a sys.exit() right after the download step, the file is downloaded correctly. But when I remove the sys.exit(), the rdf folder ends up empty and the downloaded file no longer exists.

This video shows what I mean. The first run seems to download the file correctly because of the manually inserted sys.exit(), but with the sys.exit() removed, the data is not saved correctly.

https://github.com/garethbjohnson/gutendex/assets/69865187/1d3a4bc2-2f16-42fd-a5a4-34b47fa43bde

DavidLazarescu, Sep 22 '23 08:09

@garethbjohnson could this happen because the environment variables, e.g. the ones for DB access, are wrong, or would that give me a proper error?

DavidLazarescu, Sep 22 '23 09:09

OK, I have managed to populate the DB from my local machine and let the API use that DB to serve the data. But inserting the books into the DB takes extremely long: I am at 600 books after 10 minutes, out of about 70,000 as far as I know. Is there a way to multi-thread this?

DavidLazarescu, Sep 22 '23 10:09

The script is supposed to delete the files before it finishes, but only after populating the database, and it looks like that is not happening in the video for some reason, even without the initial error. I do not think that either issue would be caused by a lack of compatible environment variables.

As for multi-threading, I had not considered it, but I found a StackOverflow answer that makes it seem pretty simple to me at first glance: https://stackoverflow.com/a/28463266
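
As a rough illustration of the idea, here is a minimal sketch using a thread pool from the Python standard library. `process_book` and the list of IDs are hypothetical placeholders, not functions from gutendex, and the worker count is arbitrary. Threads mainly help when the work waits on I/O such as database round trips; parsing that is CPU-bound is limited by the GIL and would probably need processes instead.

```python
from concurrent.futures import ThreadPoolExecutor


def process_book(book_id):
    """Hypothetical stand-in for parsing one RDF file and saving the book."""
    return f"processed {book_id}"


def process_all(book_ids, workers=4):
    # map() runs process_book concurrently and returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_book, book_ids))


if __name__ == "__main__":
    print(process_all([1, 2, 3]))
```

In a Django command, each thread would get its own database connection, so the real function would also need to be safe to run for several books at once.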

garethbjohnson, Sep 24 '23 14:09

Multithreading it would be great. I am running the script to set up the DB from my personal machine, since it doesn't seem to work in the cloud environment. I am 24 hours in now and have about 50k of the 70k books. Running this on my 12 cores would be a huge improvement.

DavidLazarescu, Sep 24 '23 14:09