Duplicate Entries

Open ijharulislam opened this issue 8 years ago • 7 comments

Hi, I am using the Frontera revisiting backend. The spider is scraping items that were already scraped. How can I make sure that there are no duplicates?

Here are my Frontera settings.

from datetime import timedelta

# MIDDLEWARES has to be imported before it can be extended further down.
from frontera.settings.default_settings import MIDDLEWARES

BACKEND = 'frontera.contrib.backends.sqlalchemy.revisiting.Backend'
SQLALCHEMYBACKEND_ENGINE = 'sqlite:///olx_frontier_v2.db'
SQLALCHEMYBACKEND_ENGINE_ECHO = False
SQLALCHEMYBACKEND_DROP_ALL_TABLES = False
SQLALCHEMYBACKEND_CLEAR_CONTENT = False
SQLALCHEMYBACKEND_REVISIT_INTERVAL = timedelta(days=1)

DELAY_ON_EMPTY = 20.0
MAX_NEXT_REQUESTS = 256

MIDDLEWARES.extend([
    'frontera.contrib.middlewares.domain.DomainMiddleware',
    'frontera.contrib.middlewares.fingerprint.DomainFingerprintMiddleware'
])

ijharulislam avatar Oct 11 '17 10:10 ijharulislam

I seem to have had this issue because of an inconsistent State or Metadata table. For some reason, when the worker stopped, it didn't flush its cache.

A quick fix is to add the unique attribute to the fingerprint column in the Queue table.

isra17 avatar Oct 11 '17 12:10 isra17

@isra17 Thank you for your quick response. Can you tell me how to add the unique attribute to the fingerprint column?

ijharulislam avatar Oct 11 '17 14:10 ijharulislam

There: https://github.com/scrapinghub/frontera/blob/master/frontera/contrib/backends/sqlalchemy/models.py#L74

Rewrite it as:

fingerprint = Column(String(40), nullable=False, unique=True)

isra17 avatar Oct 11 '17 14:10 isra17

Here's the code for the model. Just override the fingerprint column in the mixin class.

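# Imports assumed from the models module linked above.
from sqlalchemy import Column, String
from frontera.contrib.backends.sqlalchemy.models import DeclarativeBase, QueueModelMixin

class CustomQueueModelMixin(QueueModelMixin):
    # Override the inherited column so the database rejects duplicate fingerprints.
    fingerprint = Column(String(40), nullable=False, unique=True)

class CustomQueueModel(CustomQueueModelMixin, DeclarativeBase):
    __tablename__ = 'queue'

    @classmethod
    def query(cls, session):
        return session.query(cls)

    def __repr__(self):
        return '<Queue:%s (%d)>' % (self.url, self.id)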

Then, in your settings, modify the SQLALCHEMYBACKEND_MODELS setting so that QueueModel points to your custom class.
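
A minimal sketch of that settings change, assuming the custom class lives in a hypothetical module named myproject.models. The MetadataModel and StateModel entries are what I believe the Frontera defaults to be, so check them against the version you have installed.

```python
# settings.py (sketch). 'myproject.models' is a hypothetical module path;
# replace it with wherever CustomQueueModel is actually defined.
SQLALCHEMYBACKEND_MODELS = {
    # Assumed defaults, left unchanged.
    'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
    'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
    # Only this entry changes: it now points at the custom class.
    'QueueModel': 'myproject.models.CustomQueueModel',
}
```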

voith avatar Oct 11 '17 14:10 voith

Oops, @isra17 answered before I did. However, @isra17, I have a question for you. Won't a plain insert raise a duplicate key error? Isn't an INSERT IGNORE statement needed when inserting?

voith avatar Oct 11 '17 14:10 voith

Excellent point! We are using a custom backend on our side, so that's not an issue for us, but the default revisiting backend doesn't handle those errors, so it will drop all the scheduled links if one of them is a duplicate. I wonder if ignoring duplicates can be done from the model; otherwise it would require changes to the backend as well.
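
The failure mode can be reproduced outside Frontera. This is only a sketch using the standard-library sqlite3 module and a simplified, assumed queue table, not Frontera's real schema or backend code. It shows that when a batch is inserted in one transaction and a single row violates the unique constraint, rolling back loses the whole batch.

```python
import sqlite3


def insert_batch(fingerprints):
    """Insert a batch in one transaction and return the resulting row count."""
    conn = sqlite3.connect(':memory:')
    # Simplified stand-in for the queue table, with a unique fingerprint column.
    conn.execute(
        'CREATE TABLE queue ('
        'id INTEGER PRIMARY KEY, fingerprint TEXT NOT NULL UNIQUE)'
    )
    try:
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.executemany(
                'INSERT INTO queue (fingerprint) VALUES (?)',
                [(fp,) for fp in fingerprints],
            )
    except sqlite3.IntegrityError:
        # One duplicate aborted the transaction, so no row of the batch remains.
        pass
    count = conn.execute('SELECT COUNT(*) FROM queue').fetchone()[0]
    conn.close()
    return count


clean_batch_rows = insert_batch(['a', 'b', 'c'])
duplicate_batch_rows = insert_batch(['a', 'b', 'a'])
```

Here the clean batch leaves three rows, while the batch containing a repeated fingerprint leaves none.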

isra17 avatar Oct 11 '17 14:10 isra17

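My actual fix has been to make fingerprint the primary key and use

# Assumption: PostgreSQL. on_conflict_do_nothing() comes from a dialect-specific insert().
from sqlalchemy.dialects.postgresql import insert

session.execute(
    insert(self.queue_model).values(values)
    .on_conflict_do_nothing())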

to insert the rows.
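
The effect of a conflict-ignoring insert can be sketched with the standard-library sqlite3 module. This is an illustration only: it uses SQLite's INSERT OR IGNORE as a stand-in for on_conflict_do_nothing(), and the one-column table is an assumed simplification of the real queue table.

```python
import sqlite3


def insert_ignoring_duplicates(conn, fingerprints):
    """Insert fingerprints, silently skipping any that already exist."""
    with conn:
        conn.executemany(
            'INSERT OR IGNORE INTO queue (fingerprint) VALUES (?)',
            [(fp,) for fp in fingerprints],
        )
    return conn.execute('SELECT COUNT(*) FROM queue').fetchone()[0]


conn = sqlite3.connect(':memory:')
# fingerprint is the primary key, so it is unique by definition.
conn.execute('CREATE TABLE queue (fingerprint TEXT PRIMARY KEY)')

rows_after_first = insert_ignoring_duplicates(conn, ['a', 'b'])
# 'b' already exists and 'c' is repeated; only one new row is added.
rows_after_second = insert_ignoring_duplicates(conn, ['b', 'c', 'c'])
```

The second call adds only the new fingerprint and raises no error, so the rest of the batch is kept.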

isra17 avatar Oct 11 '17 14:10 isra17