Duplicate Entries
Hi, I am using Frontera's revisiting backend. The spider is scraping previously scraped items. How can I make sure that there will be no duplicates?
Here are my Frontera settings:
BACKEND = 'frontera.contrib.backends.sqlalchemy.revisiting.Backend'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///olx_frontier_v2.db'
SQLALCHEMYBACKEND_ENGINE_ECHO = False
SQLALCHEMYBACKEND_DROP_ALL_TABLES = False
SQLALCHEMYBACKEND_CLEAR_CONTENT = False
from datetime import timedelta
SQLALCHEMYBACKEND_REVISIT_INTERVAL = timedelta(days=1)
DELAY_ON_EMPTY = 20.0
MAX_NEXT_REQUESTS = 256
MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])
I seem to have had this issue due to an inconsistent State or Metadata table. For some reason, when the worker stopped, it didn't flush its cache.
A quick fix is to add the unique attribute to the fingerprint column in the Queue table.
@isra17 Thank you for your quick response. Can you tell me how to add the unique attribute to the fingerprint column?
There: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/sqlalchemy/models.py#L74
Rewrite it as:
fingerprint = Column(String(40), nullable=False, unique=True)
Here's the code for the model. Just override the fingerprint column in the mixin class:
from sqlalchemy import Column, String
from frontera.contrib.backends.sqlalchemy.models import DeclarativeBase, QueueModelMixin

class CustomQueueModelMixin(QueueModelMixin):
    fingerprint = Column(String(40), nullable=False, unique=True)

class CustomQueueModel(CustomQueueModelMixin, DeclarativeBase):
    __tablename__ = 'queue'

    @classmethod
    def query(cls, session):
        return session.query(cls)

    def __repr__(self):
        return '<Queue:%s (%d)>' % (self.url, self.id)
And in your settings, modify the SQLALCHEMYBACKEND_MODELS setting so that 'QueueModel' points to your custom class.
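For reference, a minimal sketch of that setting, assuming the custom model above lives in a hypothetical module myproject.models (the other entries follow Frontera's defaults):

SQLALCHEMYBACKEND_MODELS = {
    'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
    'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
    'QueueModel': 'myproject.models.CustomQueueModel',
}

One caveat: SQLAlchemy's create_all() will not alter a table that already exists, and SQLite cannot add a unique constraint via ALTER TABLE, so the new constraint only takes effect once the queue table is recreated (e.g. by deleting the database file or running once with SQLALCHEMYBACKEND_DROP_ALL_TABLES = True).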
Oops, @isra17 answered before I did. However, @isra17, I have a question for you: won't simply inserting give a duplicate key error? Isn't an insert-ignore statement needed while inserting?
Excellent point! We are using a custom backend on our side, so that's not an issue for us, but the default revisiting backend doesn't handle those errors, so it will drop all the scheduled links if one of them is a duplicate. I wonder if ignoring duplicates can be done from the model; otherwise it would require changes on the backend as well.
My actual fix has been to set fingerprint as the primary key and use
# `insert` is the dialect-specific construct, e.g. from sqlalchemy.dialects.postgresql import insert
session.execute(
    insert(self.queue_model)
    .values(values)
    .on_conflict_do_nothing()
)
to insert the rows.
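For anyone else hitting this, a minimal, self-contained sketch of that approach, assuming SQLAlchemy 1.4+ (where the SQLite dialect's insert() gained on_conflict_do_nothing; the PostgreSQL dialect has had it since 1.1). The Queue model and values here are illustrative stand-ins, not Frontera's actual classes:

from sqlalchemy import Column, String, create_engine
from sqlalchemy.dialects.sqlite import insert
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Queue(Base):
    __tablename__ = 'queue'
    # fingerprint as the primary key, as described above
    fingerprint = Column(String(40), primary_key=True)
    url = Column(String(1024), nullable=False)

engine = create_engine('sqlite:///:memory:')
Base.metadata.create_all(engine)

values = [
    {'fingerprint': 'aaa', 'url': 'http://example.com/1'},
    {'fingerprint': 'aaa', 'url': 'http://example.com/1'},  # duplicate
    {'fingerprint': 'bbb', 'url': 'http://example.com/2'},
]

with Session(engine) as session:
    # Conflicting rows are silently skipped instead of
    # aborting the whole batch with an IntegrityError.
    session.execute(insert(Queue).values(values).on_conflict_do_nothing())
    session.commit()
    print(session.query(Queue).count())  # prints 2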