Create search index

Open Stiivi opened this issue 15 years ago • 8 comments

Create a search index and a process for rebuilding it.

Notes:

  • normalize fields (remove diacritics, remove spaces, ...)
  • compute distances
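
A minimal sketch of the field normalization mentioned in the first note, assuming plain Ruby 2.2 or newer and UTF-8 strings. The helper name `normalize_field` is hypothetical and is not part of Datacamp; it covers only the two steps named above (diacritics and spaces).

```ruby
# Hypothetical helper, not taken from the Datacamp sources.
# Strips diacritics and whitespace from a value before it is indexed.
def normalize_field(value)
  value.to_s
       .unicode_normalize(:nfd) # decompose accented letters into base letter + combining mark
       .gsub(/\p{Mn}/, "")      # drop the combining marks, i.e. the diacritics
       .gsub(/\s+/, "")         # remove all whitespace
end
```

The same function would have to be applied to the search terms as well, otherwise normalized index values and raw queries would not match.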

Stiivi avatar Mar 28 '10 13:03 Stiivi

ThinkingSphinx ;)

jsuchal avatar Mar 28 '10 15:03 jsuchal

Can you be more specific? How would Sphinx be used with the Datacamp data store layer?

Stiivi avatar Apr 07 '10 13:04 Stiivi

Hm, from a brief look at ThinkingSphinx, it does not look usable for Datacamp, at least not for the datastore (for metadata/application data, maybe), as the datastore is not ActiveRecord-based (the ActiveRecord layer on top of it is just a temporary hack that should be removed in the future, as there is no real reason for it).

However, raw Sphinx looks interesting and there might be some use for it. It can also index other sources, so it might still be usable if the datastore were changed to MongoDB or some Amazon storage service.

Stiivi avatar Apr 07 '10 13:04 Stiivi

We need to keep in mind that the structure of the tables might, and will, change over time.

vojto avatar Apr 07 '10 14:04 vojto

@Stiivi I am sorry, but there is no mention here that this issue is related to the data store layer (whatever that is).

Looking at http://github.com/Stiivi/datacamp/tree/master/app/models/ most of them are AR-based.

@vojto I don't see a problem there.

I thought that you were looking for a full-text/search engine solution.

jsuchal avatar Apr 07 '10 14:04 jsuchal

Well, this is not just a simple "scaffold" of AR objects. See the sources and documentation for more information on how datasets are implemented.

You want to search mainly in Datasets, not in application structures. That is, you want to search in datastore schema tables, which have no real ActiveRecord counterparts.
There is a DatasetRecord class, which serves as an AR hack on top of the Datastore schema. However, this should be removed in the future, as there is no reason to have AR for dataset records. The dataset record API should be moved into DatastoreManager, but that is off-topic for this thread.

Yes, we want a full-text/search engine solution, but for data in datasets. Sphinx is good, however it assumes a static table structure, which is not the case for Datacamp datasets. There can be a workaround, though. I was looking at it, and roughly you can do it this way:

  1. dynamically create an indexing configuration file for Sphinx:
     1.1 create a source for each dataset (from the dataset description)
     1.2 create an index for each source
  2. run indexing with the generated configuration file, on demand or scheduled
  3. when forming a search query, create one Sphinx query for each dataset (= Sphinx source)
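
Step 1 could be sketched roughly as follows. This is only an illustration under assumptions: the shape of the dataset description hash (`:name`, `:id_column`, `:columns`), the `db` settings hash, the `index_dir` argument and all three function names are hypothetical and not taken from Datacamp. The directives written into the file (`type`, `sql_host`, `sql_user`, `sql_pass`, `sql_db`, `sql_query`, `source`, `path`) are standard Sphinx configuration keys, and Sphinx expects the first column of `sql_query` to be a unique integer document ID.

```ruby
# Step 1.1: one Sphinx source per dataset, built from its description.
# `dataset` and `db` are hypothetical hashes used only for this sketch.
def sphinx_source(dataset, db)
  columns = [dataset[:id_column], *dataset[:columns]].join(", ")
  <<~CONF
    source #{dataset[:name]}
    {
        type      = mysql
        sql_host  = #{db[:host]}
        sql_user  = #{db[:user]}
        sql_pass  = #{db[:password]}
        sql_db    = #{db[:database]}
        sql_query = SELECT #{columns} FROM #{dataset[:name]}
    }
  CONF
end

# Step 1.2: one index per source, stored under index_dir.
def sphinx_index(dataset, index_dir)
  <<~CONF
    index #{dataset[:name]}
    {
        source = #{dataset[:name]}
        path   = #{File.join(index_dir, dataset[:name])}
    }
  CONF
end

# Whole configuration: a source block followed by its index block, per dataset.
def sphinx_config(datasets, db, index_dir)
  datasets.flat_map { |d| [sphinx_source(d, db), sphinx_index(d, index_dir)] }.join("\n")
end

# Example with made-up dataset descriptions and connection settings.
datasets = [
  { name: "ds_companies", id_column: "id", columns: %w[name address] },
  { name: "ds_contracts", id_column: "id", columns: %w[supplier subject] }
]
db = { host: "localhost", user: "datacamp", password: "secret", database: "datastore" }
puts sphinx_config(datasets, db, "/var/lib/sphinx")
```

For step 2, the generated text would be written to a file and passed to Sphinx's `indexer` tool, for example `indexer --config sphinx.conf --all`. Because the file is regenerated from the dataset descriptions before each run, a changed table structure is picked up at the next reindex. Step 3 (querying each index) depends on the client library used and is not covered here.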

Possible indexing optimisation: add a dirty flag to speed up periodic indexing, and do a full reindex weekly or daily, depending on the required indexing time.
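
The dirty-flag idea could look something like the sketch below. It assumes each dataset description carries a boolean `:dirty` entry and that the time of the last full reindex is stored somewhere; both, along with the names `FULL_REINDEX_INTERVAL` and `datasets_to_reindex`, are hypothetical. The weekly interval is just one of the two options mentioned above.

```ruby
# Hypothetical interval between full reindexes, in seconds (weekly here).
FULL_REINDEX_INTERVAL = 7 * 24 * 60 * 60

# Returns the datasets to index in this run: all of them when a full
# reindex is due (or has never happened), otherwise only the dirty ones.
def datasets_to_reindex(datasets, last_full_reindex, now = Time.now)
  full_due = last_full_reindex.nil? ||
             (now - last_full_reindex) >= FULL_REINDEX_INTERVAL
  full_due ? datasets : datasets.select { |d| d[:dirty] }
end
```

After a successful run the dirty flags of the indexed datasets would be cleared, and the stored timestamp updated when the run was a full one.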

Stiivi avatar Apr 08 '10 09:04 Stiivi

Well, actually it's only one layer of abstraction above a basic scaffold. You define columns and record attributes using some metamodels. Nothing really special, but nice.

Anyway, regarding searching, I do think you are reinventing the wheel, since ThinkingSphinx handles everything you want (even the dirty flag: delta indexing). You just have to use it a little differently, because you will need to define the index by lookup from the database, not by hardcoding column names. No big deal, I think.

jsuchal avatar Apr 08 '10 09:04 jsuchal

If it is no big deal, please provide a patch.

Stiivi avatar Apr 08 '10 10:04 Stiivi