Semantic detection PoC
This PoC defines a framework for more advanced, statistics-based ways of importing data into `synth`. It paves the way for more automation in the process of writing `synth` schemas tailored to a specific data source.
Underpinning this is the `semdet` crate, which aims to provide `synth` with the ability to do fast, zero-copy, in-memory trainable analytics for table instances provided by the user as an import data source. It is built on `arrow`, `ndarray` and `tch`.
The PoC is an end-to-end implementation of a dummy model that detects the most likely `fake` generator based on a simple dictionary lookup. The example is simple enough to be implemented very quickly, yet involves enough moving parts to show that more complex data-driven inference mechanisms are feasible.
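To make the dictionary-lookup idea concrete, here is a minimal sketch using only the standard library. It is not the `semdet` implementation: the function names, the dictionary contents and the generator labels (`first_name`, `city_name`) are all hypothetical. Each value in a string column votes for the label its dictionary entry maps to, and the label with the most votes wins.

```rust
use std::collections::HashMap;

/// Builds a toy dictionary mapping known values to a generator label.
/// Both the entries and the labels are hypothetical, for illustration only.
fn build_dictionary() -> HashMap<&'static str, &'static str> {
    let mut dict = HashMap::new();
    for name in ["alice", "bob", "carol"] {
        dict.insert(name, "first_name");
    }
    for city in ["paris", "london", "tokyo"] {
        dict.insert(city, "city_name");
    }
    dict
}

/// Returns the label matched by the most values in `column`,
/// or `None` when no value is found in the dictionary.
fn detect_generator(
    column: &[&str],
    dict: &HashMap<&'static str, &'static str>,
) -> Option<&'static str> {
    let mut votes: HashMap<&'static str, usize> = HashMap::new();
    for value in column {
        // Normalise before the lookup so "Alice" and " alice " both match.
        let key = value.trim().to_lowercase();
        if let Some(label) = dict.get(key.as_str()) {
            *votes.entry(*label).or_insert(0) += 1;
        }
    }
    // Highest vote count wins; ties go to the alphabetically first label
    // so the result does not depend on hash map iteration order.
    votes
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then_with(|| b.0.cmp(a.0)))
        .map(|(label, _)| label)
}
```

Calling `detect_generator(&["Alice", "Bob", "Paris", "unknown"], &build_dictionary())` gives two votes to `first_name` and one to `city_name`, so it returns `Some("first_name")`. A trained model would presumably replace the vote count with a learned score, while the column-in, label-out shape could stay the same.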
How to test it
Running `cargo test --features torch` in `semdet/` will run the dummy E2E scenario and should succeed.
Roadmap to readiness
- [x] Composable API for the embedding of input data as valid module inputs
- [x] Composable API for handling prediction targets in our domain-specific application
- [x] Load a 'pre-trained' dummy module embedded at compile-time
- [x] Document the `Encoder`/`Decoder`/`Module` APIs
- [x] Attach to the CLI's import logic
- [x] Project down string columns from `sqlx` query results
- [x] Project down string columns from
- [x] Windows build needs fixing
- [x] Make `tch` optional so the built binary does not have to carry a dynamic dependency on `libtorch`
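
The last item can be pictured with Cargo's usual mechanism for optional dependencies: the dependency is marked optional in `Cargo.toml` and any code that needs it is gated behind a feature. The sketch below assumes the feature is named `torch`, matching the `--features torch` flag used in the test command. The function name and the fallback label are hypothetical and not taken from `semdet`.

```rust
/// Compiled only when the crate is built with `--features torch`.
/// Code that links against `libtorch` through `tch` would live behind this gate.
#[cfg(feature = "torch")]
pub fn backend_name() -> &'static str {
    "torch"
}

/// Fallback used when the `torch` feature is disabled, so the resulting
/// binary carries no dynamic dependency on `libtorch`.
/// The "dictionary-only" label is a hypothetical placeholder.
#[cfg(not(feature = "torch"))]
pub fn backend_name() -> &'static str {
    "dictionary-only"
}
```

Exactly one of the two definitions is compiled in any given build, so callers can use `backend_name()` without knowing which feature set was selected.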