
How to ingest data without allowing duplicates?

Open parselife opened this issue 2 years ago • 4 comments

The documentation says:

Data ID: An identifier for the data represented by this row. We do not impose a requirement that Data IDs are globally unique but they should be unique for the adapter. Therefore, the pairing of Internal Adapter ID and Data ID define a unique identifier for a data element. An example of a data ID for vector data would be the feature ID.

According to that, the Adapter ID and Data ID together define a unique identifier. So how can I ingest data so that duplicates are not allowed?

Right now, my index looks like this:

adapter_id                 data_id   
4	......              places.12
4	......              places.12

Why did this happen?

The values of adapter_id and data_id in these two records are the same.

I want a single record with no duplicate. How can I do that?

parselife avatar Mar 31 '22 04:03 parselife

I found the Cassandra table definition:

**primary key (partition, adapter_id, sort, data_id, vis, nano_time, field_mask, value, num_duplicates)**

Is there any way to customize this?

parselife avatar Apr 01 '22 09:04 parselife

I'm not sure exactly why you'd want to customize that primary key. You can supply a unique data ID, and the other parts, such as the sort key and partition key, come from the index (which, again, you could customize but probably don't want to).

The issue is most likely that you are inserting rows into the index with the same adapter ID and data ID but different sort keys. This would happen, for example, if you were using a spatial index and the rows had different geometries (or, similarly, a temporal index with different dates/times). In these rare cases you would want to delete the row prior to ingesting.

The num_duplicates identifier that we tack onto the primary key is a hint that we are intentionally storing duplicates. This can happen in rare circumstances, such as when you store a time range (consider a track that has a start time and an end time) and that range crosses a periodicity boundary on a temporal index. Because time is unbounded, we place it on the space-filling curve by applying a periodicity, such as a year, which is our default but can be configured. With a year periodicity, if the track started on Dec. 31 and ended on Jan. 1, for example, we have to insert 2 rows, one on each side of the boundary, and we track that with the num_duplicates hint.

Hopefully that adds some clarity to your situation. As mentioned, most likely you are inserting a data ID multiple times with different sort keys, such as different geometries within a spatial index, which requires deleting the previous row prior to insertion.
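To make the explanation above concrete, here is a minimal Python sketch. It is a toy model and not GeoWave's storage code: `ToyIndex`, `year_bins`, the `cell-A`/`cell-B` sort keys, and the simplified four-part key are all hypothetical names chosen for illustration. It shows how the same adapter ID and data ID can end up in two rows when the sort key differs, how deleting first avoids that, and how a time range crossing a year boundary touches two bins.

```python
from datetime import datetime


class ToyIndex:
    """Toy stand-in for an index table (an assumption, not GeoWave code).

    Rows are keyed by a simplified composite primary key:
    (partition, adapter_id, sort_key, data_id).
    """

    def __init__(self):
        self.rows = {}

    def insert(self, adapter_id, data_id, sort_key, value, partition=0):
        # An identical key overwrites; a different sort key adds a new row.
        self.rows[(partition, adapter_id, sort_key, data_id)] = value

    def delete(self, adapter_id, data_id):
        # Remove every row for this (adapter_id, data_id), whatever its sort key.
        stale = [k for k in self.rows if k[1] == adapter_id and k[3] == data_id]
        for key in stale:
            del self.rows[key]

    def count(self, adapter_id, data_id):
        return sum(1 for k in self.rows if k[1] == adapter_id and k[3] == data_id)


def year_bins(start, end):
    """Year bins a time range touches under a hypothetical one-year periodicity."""
    return list(range(start.year, end.year + 1))


index = ToyIndex()
# Same adapter ID and data ID, but a changed geometry yields a new sort key.
index.insert(4, "places.12", sort_key="cell-A", value="POINT(1 1)")
index.insert(4, "places.12", sort_key="cell-B", value="POINT(9 9)")
duplicates_without_delete = index.count(4, "places.12")

# Deleting the old row before ingesting leaves a single row.
index.delete(4, "places.12")
index.insert(4, "places.12", sort_key="cell-B", value="POINT(9 9)")
rows_after_delete_then_insert = index.count(4, "places.12")

# A track from Dec. 31 to Jan. 1 touches two yearly bins, hence two rows.
track_bins = year_bins(datetime(2021, 12, 31), datetime(2022, 1, 1))
```

In this model, re-inserting a row with an identical key simply overwrites it, which is why identical data with identical sort and partition keys would not be expected to duplicate.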

rfecher avatar Apr 01 '22 12:04 rfecher

Thanks for your reply. Where can I find the sort keys? In my situation, the data written twice is exactly the same.

parselife avatar Apr 24 '22 08:04 parselife

Do you have a "ROUND_ROBIN" partition strategy on your index (as described in the add index help output, https://locationtech.github.io/geowave/latest/userguide.html#help-command)? By design, this partition strategy adds random partition keys even to identical rows, which would explain the behavior you're seeing.
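A rough Python sketch of why a varying partition key would produce this. It is a toy model under stated assumptions, not GeoWave's implementation: the `ingest` helper, the four-part key, and the use of a simple rotating counter in place of randomly chosen partition keys are all hypothetical.

```python
from itertools import cycle


def ingest(rows, partition_keys):
    """Store rows in a dict keyed by (partition, adapter_id, sort_key, data_id).

    `partition_keys` is an iterator that supplies one partition key per row.
    """
    table = {}
    for partition, (adapter_id, sort_key, data_id, value) in zip(partition_keys, rows):
        table[(partition, adapter_id, sort_key, data_id)] = value
    return table


# The exact same row written twice.
row = (4, "cell-A", "places.12", "POINT(1 1)")
twice = [row, row]

# With a constant partition key the second write overwrites the first.
no_partitioning = ingest(twice, cycle([0]))

# With a rotating partition key the two writes land under different keys.
round_robin = ingest(twice, cycle([0, 1, 2, 3]))
```

Here `len(no_partitioning)` is 1 while `len(round_robin)` is 2, even though the adapter ID, sort key, data ID, and value are identical in both writes.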

rfecher avatar Apr 25 '22 12:04 rfecher