PClean
                                
                                
                                
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Read CSV string columns as `String` rather than as fixed-size strings
Added Japanese support for the `StringPrior` distribution. DigitalGarageLab, collaborating with MIT ProbComp.
A good prior distribution on person names (first names, last names, etc.), and on many other types of names including place names, seems important for cases when it is...
The goal is to allow parameters (possibly from different classes) to be transformed before they are used as arguments to distributions. For example, linear combinations of normally distributed parameters can still...
Performance of the Flights model suffers without the subproblem block at https://github.com/probcomp/PClean/commit/f51c9489dda76a6dbfd7c64fc166a5c94b13db7a#diff-2a3b7234fcda10bae8f2e3e677e2add7dc29ea841a266f8a13708c4e57ac069bR14, but it is unclear to me why this should be the case: the flight ID is always observed.
This includes:
* Runtime + accuracy-over-time measurements against baseline inference algorithms (Figure 6)
* Configuration for baseline systems (HoloClean + NADEEF)
* Uncertainty-aware analysis of the Rents dataset
This text implies that `ProposalDummyValue`s are only used for distributions that have infinite (and discrete) support: https://github.com/probcomp/PClean/blob/master/src/distributions/distributions.jl#L10-L14
But `StringPrior` has finite support (there is a maximum length), and it implements...
Also related to https://github.com/probcomp/GenDistributions.jl and https://github.com/probcomp/Gen.jl/issues/362
Here are a bunch of not particularly organized notes I had lying around about this...

## Existing approaches to Split Merge in the literature:

A split-merge algorithm is made up...