Semantic detection PoC

Open • brokad opened this issue 3 years ago • 0 comments

This defines a framework for more advanced, statistics-based ways of importing data into synth. It paves the way for more automation in writing synth schemas tailored to a specific data source.

Underpinning this is the semdet crate, which aims to give synth fast, zero-copy, in-memory, trainable analytics over the table instances a user provides as an import data source. It is built on arrow, ndarray and tch.

The PoC is an end-to-end implementation of a dummy model that detects the most likely fake generator based on a simple dictionary lookup. The example is simple enough to implement very quickly, yet it involves enough moving parts to show that more complex, data-driven inference mechanisms are feasible.
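To make the dictionary-lookup idea concrete, here is a minimal sketch in plain Rust. It is not the semdet implementation: the `Generator` enum, `toy_dictionary` and `detect` are hypothetical names invented for illustration, and the real PoC goes through the Encoder/Decoder/Module APIs rather than a bare `HashMap`. The sketch only shows the voting logic: each value in a column is looked up in a dictionary, and the generator with the most hits wins.

```rust
use std::collections::HashMap;

/// Hypothetical generator labels; the real prediction targets in semdet may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Generator {
    FirstName,
    CountryCode,
}

/// Build a toy dictionary mapping known (lowercase) values to a generator label.
/// The entries are made up for this example.
pub fn toy_dictionary() -> HashMap<&'static str, Generator> {
    let mut dict = HashMap::new();
    for name in ["alice", "bob", "carol"] {
        dict.insert(name, Generator::FirstName);
    }
    for code in ["gb", "fr", "us"] {
        dict.insert(code, Generator::CountryCode);
    }
    dict
}

/// Return the generator whose dictionary entries match the most values in
/// `column`, or `None` if no value is found in the dictionary.
/// Ties are broken arbitrarily, since `HashMap` iteration order is unspecified.
pub fn detect(
    column: &[&str],
    dict: &HashMap<&'static str, Generator>,
) -> Option<Generator> {
    let mut votes: HashMap<Generator, usize> = HashMap::new();
    for value in column {
        // Normalise before lookup so "Alice " and "alice" count as the same entry.
        let key = value.trim().to_lowercase();
        if let Some(generator) = dict.get(key.as_str()) {
            *votes.entry(*generator).or_insert(0) += 1;
        }
    }
    votes
        .into_iter()
        .max_by_key(|&(_, count)| count)
        .map(|(generator, _)| generator)
}
```

Calling `detect(&["Alice", "bob", "fr"], &toy_dictionary())` gives two votes to `FirstName` and one to `CountryCode`, so the column would be labelled as first names. A column with no dictionary hits yields `None`.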

How to test it

Running cargo test --features torch in semdet/ runs the dummy E2E scenario, which should pass.

Roadmap to readiness

  • [x] Composable API for the embedding of input data as valid module inputs
  • [x] Composable API for handling prediction targets in our domain-specific application
  • [x] Load a 'pre-trained' dummy module embedded at compile-time
  • [x] Document the Encoder/Decoder/Module APIs
  • [x] Attach to the CLI's import logic
    • [x] Project down string columns from sqlx query results
  • [x] Windows build needs fixing
  • [x] Make tch optional so the built binary does not have to carry a dynamic dependency on libtorch
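The last item, making tch optional, is commonly done with a Cargo feature. The sketch below assumes the `torch` feature named in the test command gates an optional `tch` dependency (in Cargo.toml, a dependency marked `optional = true` and listed under a `torch` feature). The function names are hypothetical and no real tch calls are shown; the point is only that the code compiles to a no-op fallback when the feature is off, so the binary does not link libtorch.

```rust
/// Compiled only when the crate is built with `--features torch`.
/// Assumption: this is where the tch-backed module would be loaded and run;
/// the body here is a placeholder, not real tch code.
#[cfg(feature = "torch")]
pub fn predict_generator(column: &[&str]) -> Option<String> {
    let _ = column;
    Some(String::from("placeholder prediction from the torch backend"))
}

/// Fallback compiled when the `torch` feature is disabled: no model is
/// available, so no prediction is made and libtorch is never referenced.
#[cfg(not(feature = "torch"))]
pub fn predict_generator(column: &[&str]) -> Option<String> {
    let _ = column;
    None
}

/// Report at runtime whether the torch-backed path was compiled in.
pub fn torch_enabled() -> bool {
    cfg!(feature = "torch")
}
```

With this layout, callers such as the CLI import logic can call `predict_generator` unconditionally and treat `None` as "no semantic detection available in this build".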

brokad, Aug 14 '21 15:08