BayesianDataFusion.jl icon indicating copy to clipboard operation
BayesianDataFusion.jl copied to clipboard

Upgrading to Julia 1.x

Open suchow opened this issue 4 years ago • 3 comments

I'm opening this issue to begin a conversation about upgrading BayesianDataFusion.jl to be compatible with the most recent version of Julia, 1.x. We've begun upgrading the package on a fork and have succeeded in bringing most it up to date.

I have a few questions for you:

Meta-level questions

  • Are you interested in being included in discussions about the upgrade as it proceeds?
  • Once the upgraded version is feature complete, would you consider a pull request to this repository?

Substantive questions

  • The current implementation allows multithreaded sampling only when there is 1 featureless relation: https://github.com/jaak-s/BayesianDataFusion.jl/blob/7c9b3de22cd61f03913e79058bea86bf01acb7b4/src/macau.jl#L64 Why? Is it impossible in principle to run Macau across multiple threads or machines when there is more than one relation or a relation with features? Or is this rather a limitation of the implementation as it stands?
  • test/sparse_csr.jl tests a function named sparse_csr that casts a matrix of type SparseMatrixCSC to type SparseMatrixCSR. However, SparseMatrixCSR matrices are not used anywhere in the codebase except in tests. Is this a vestige of previous development? Is there any reason why it cannot be removed entirely?
  • macau_blocked.jl appears to be a stub that is not used anywhere in the code. Can it be safely removed?
  • macau_hmc.jl, macau_sgd.jl, and macau_vb.jl appear to implement Hamiltonian Monte Carlo, Stochastic Gradient Descent, and Variational Bayes variants of Macau. However, only the HMC variant has tests. Are all three functional? Are there any limitations? (Specifically, can they be used to fit any model that macau.jl can fit?)

suchow avatar Jun 29 '20 15:06 suchow

Meta

  • I'm happy to help this upgrade process.
  • Yes, pull request to this repository would be welcome!

Details

  • There is no limitation from Macau. I think the reason was that I implemented parallel version only for a special case of a single relation.
  • Yes, you can remove it. It is not currently used, if we need it in the future we can bring it back again.
  • macau_blocked.jl can be removed.
  • I quickly scanned the code and macau_hmc.jl, macau_sgd.jl and macau_vb.jl all solve only the special case of a single matrix without side information, i.e., the standard BPMF. So I don't mind retiring these files as they are not required for the main functionality of the package.

jaak-s avatar Jun 29 '20 18:06 jaak-s

Regarding parallelism, do you have a sense of how much effort would be required to implement parallelism for the general multi-relational plus side-information case?

suchow avatar Jun 30 '20 16:06 suchow

Actually, I think the effort might not be too large at all (unless I've forgotten something). Basically, the algorithm is the same when sampling the latent matrix of one entity, each row can be sampled independently. Thus, one can parallelize over rows. The complexity in the multiple relation case is that for each row one needs to access data from more than one relation. But otherwise the parallelization strategy is the same.

jaak-s avatar Jun 30 '20 20:06 jaak-s