django-dynamic-scraper
Scraper detail admin page loads too slowly
If you have many scrapers (around 100) and try to change one of them, the detail page loads too slowly (around 35 seconds). I inspected the problem with DjDT (Django Debug Toolbar) and saw that loading this page takes 5790 SQL queries, 5780 of them duplicates. I know you can't just use select_related or prefetch_related here as you would in regular Django code, but I think it's important to fix this because DDS with many scrapers is unusable at the moment.
Hi @bezkos, what do you mean by "change one of them", and what exactly are you changing or doing that makes the detail pages load so slowly?
On the scrapers page, I have around 100 scrapers. If I need to change, for example, the XPath in one of them, the page takes around 35 seconds to load, with 5790 SQL queries and 5780 duplicates. This is part of the results from the Debug Toolbar:
SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 104
Duplicated 5758 times. Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(203)
    return self.name + " (" + self.scraped_obj_class.name + ")"
C:\venv27\lib\site-packages\django\contrib\admin\templates\admin\change_form.html
    18 {% trans 'Home' %}
    19 › {{ opts.app_config.verbose_name }}
    20 › {% if has_change_permission %}{{ opts.verbose_name_plural|capfirst }}{% else %}{{ opts.verbose_name_plural|capfirst }}{% endif %}
    21 › {% if add %}{% blocktrans with name=opts.verbose_name %}Add {{ name }}{% endblocktrans %}{% else %}{{ original|truncatewords:"18" }}{% endif %}
    22
    23 {% endblock %}
    24 {% endif %}
    25

SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 89
Duplicated 5758 times. Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(55)
    return self.name + " (" + str(self.obj_class) + ")"

SELECT "dynamic_scraper_scrapedobjclass"."id", "dynamic_scraper_scrapedobjclass"."name", "dynamic_scraper_scrapedobjclass"."scraper_scheduler_conf", "dynamic_scraper_scrapedobjclass"."checker_scheduler_conf", "dynamic_scraper_scrapedobjclass"."comments" FROM "dynamic_scraper_scrapedobjclass" WHERE "dynamic_scraper_scrapedobjclass"."id" = 27
Duplicated 5758 times. Connection: default
C:\venv27\lib\site-packages\dynamic_scraper/models.py in __str__(55)
    return self.name + " (" + str(self.obj_class) + ")"

There are 5758 duplicates for each id.
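The pattern behind these duplicates can be reproduced without Django. The sketch below is a simplified model, not DDS code: `LazyAttr`, `fetch_class_name`, `render_form`, the class names, and the counts (10 inline rows, 50 attributes) are all made up for illustration. It assumes that each dropdown option's label reads the name of a related object and that every such read costs one query, so the total grows as rows times options.

```python
# Simplified model of a lazily loaded foreign key (not the Django ORM).
query_log = []  # records every simulated SQL lookup

CLASS_NAMES = {1: "Article", 2: "Event"}  # hypothetical scraped object classes


def fetch_class_name(class_id):
    """Simulate one SELECT ... WHERE id = <class_id> round trip."""
    query_log.append(class_id)
    return CLASS_NAMES[class_id]


class LazyAttr:
    def __init__(self, name, class_id):
        self.name = name
        self.class_id = class_id

    def __str__(self):
        # Mirrors a __str__ that touches a related object: one lookup per call.
        return self.name + " (" + fetch_class_name(self.class_id) + ")"


def render_form(attrs, inline_rows):
    """Each inline row renders a dropdown that lists every attribute."""
    return [[str(a) for a in attrs] for _ in range(inline_rows)]


attrs = [LazyAttr("attr%d" % i, 1 + i % 2) for i in range(50)]
labels = render_form(attrs, inline_rows=10)

total_queries = len(query_log)          # one lookup per option per row
distinct_queries = len(set(query_log))  # lookups that were actually different
print(total_queries, distinct_queries)
```

In this toy setup almost every lookup repeats an earlier one, which matches the shape of the Debug Toolbar report: a handful of distinct queries, each duplicated many times.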
Ah, I thought you meant the detail pages of the websites you are going to scrape.
By detail page, do you mean the edit form page of a scraper in the admin, or the overview page listing all the scrapers?
Have you got one Scraped Obj Class for every scraper? And how many Scraped Obj Classes have you got?
Can you run a test: edit the C:\venv27\lib\site-packages\dynamic_scraper/models.py file and remove the parenthesized part of the returned name, both in line 55 and in line 203, so that only return self.name is left?
Yes, I ran the test and it fixes the problem. Load time is down from 35 seconds to 1.8 seconds, and the page now issues 29 queries (20 duplicates) instead of 5790.
SELECT "dynamic_scraper_scrapedobjattr"."id", "dynamic_scraper_scrapedobjattr"."name", "dynamic_scraper_scrapedobjattr"."order", "dynamic_scraper_scrapedobjattr"."obj_class_id", "dynamic_scraper_scrapedobjattr"."attr_type", "dynamic_scraper_scrapedobjattr"."id_field", "dynamic_scraper_scrapedobjattr"."save_to_db" FROM "dynamic_scraper_scrapedobjattr" ORDER BY "dynamic_scraper_scrapedobjattr"."order" ASC
Duplicated 11 times. Connection: default
I have around 100 obj classes and 110 scrapers. I mean the edit form page of a scraper in the admin.
This was actually trickier than I thought. I experimented with two or three different approaches, none of them completely satisfying (I had thought I could fix this quickly, since I'm doing a minor release today anyhow).
I actually need the complete names, otherwise users get confused when selecting the scraped object attributes for the scraper, so simplifying the naming is not an option. I also experimented with simple caching of the name, which didn't work either.
Limiting the choices to only the attributes of the corresponding scraped object class is also trickier than one might think, since the object class is not yet determined when adding a new scraper or adding new scraper elems. I have now added such a limitation, but it only works for already saved scrapers and for already added attributes.
Let me know if this improves the performance situation for you. Otherwise you will have to monkey patch this for yourself in your installed DDS version.
Greetings Holger
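OK @holgerd77, I found a way to cut load time and queries by about 75%.
# Imports assumed for a standalone admin.py; adjust to where this class lives.
from django.contrib import admin
from dynamic_scraper.models import ScrapedObjAttr, ScraperElem

class ScraperElemInline(admin.TabularInline):
    model = ScraperElem
    extra = 3

    def formfield_for_foreignkey(self, db_field, request=None, **kwargs):
        if db_field.name == 'scraped_obj_attr':
            # Fetch each attribute's obj_class in the same query, so that
            # building the dropdown labels triggers no per-option lookups.
            kwargs['queryset'] = ScrapedObjAttr.objects.select_related('obj_class').all()
        return super(ScraperElemInline, self).formfield_for_foreignkey(db_field, request, **kwargs)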
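select_related makes the ORM fetch the related row in the same query through a SQL JOIN instead of issuing a separate lookup per object. A rough stand-alone illustration of that difference, using sqlite3 from the standard library rather than Django: the table names, the `run` helper, and the data (100 classes with one attribute each) are invented for the example and are not the DDS schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE obj_class (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE obj_attr (id INTEGER PRIMARY KEY, name TEXT,
                           obj_class_id INTEGER REFERENCES obj_class(id));
""")
conn.executemany("INSERT INTO obj_class VALUES (?, ?)",
                 [(i, "class%d" % i) for i in range(1, 101)])
conn.executemany("INSERT INTO obj_attr VALUES (?, ?, ?)",
                 [(i, "attr%d" % i, i) for i in range(1, 101)])
conn.commit()

counter = {"queries": 0}


def run(sql, params=()):
    """Execute one SELECT and count it (hypothetical helper)."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


def labels_lazy():
    # One query for the attributes, then one more per attribute for its class.
    labels = []
    for name, class_id in run("SELECT name, obj_class_id FROM obj_attr ORDER BY id"):
        (class_name,) = run("SELECT name FROM obj_class WHERE id = ?", (class_id,))[0]
        labels.append("%s (%s)" % (name, class_name))
    return labels


def labels_joined():
    # A single JOIN returns attribute and class name together.
    rows = run("SELECT a.name, c.name FROM obj_attr a "
               "JOIN obj_class c ON c.id = a.obj_class_id ORDER BY a.id")
    return ["%s (%s)" % row for row in rows]


counter["queries"] = 0
lazy = labels_lazy()
lazy_queries = counter["queries"]

counter["queries"] = 0
joined = labels_joined()
joined_queries = counter["queries"]

print(lazy_queries, joined_queries, lazy == joined)
```

Both functions build the same labels; only the number of round trips to the database differs.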
And my last update, with no duplicates and a load time under 1 second (down from 35 seconds). In models.py:

class WithObJClass(models.Manager):
    def get_queryset(self):
        # Always join the related obj_class so __str__ needs no extra query.
        return super(WithObJClass, self).get_queryset().select_related('obj_class')


@python_2_unicode_compatible
class ScrapedObjAttr(models.Model):
    ATTR_TYPE_CHOICES = (
        ('S', 'STANDARD'),
        ('T', 'STANDARD (UPDATE)'),
        ('B', 'BASE'),
        ('U', 'DETAIL_PAGE_URL'),
        ('I', 'IMAGE'),
    )
    name = models.CharField(max_length=200)
    order = models.IntegerField(default=100)
    obj_class = models.ForeignKey(ScrapedObjClass)
    attr_type = models.CharField(max_length=1, choices=ATTR_TYPE_CHOICES)
    id_field = models.BooleanField(default=False)
    save_to_db = models.BooleanField(default=True)
    # Replace the default manager so every queryset uses select_related.
    objects = WithObJClass()

    def __str__(self):
        return self.name + " (" + str(self.obj_class.name) + ")"

    class Meta(object):
        ordering = ['order',]