[RFC]: Metalog Distribution

Open Hazelfire opened this issue 2 years ago • 4 comments

Description

This RFC proposes to add a metalog distribution, based on the implementation provided by rmetalog. This implementation will include functions for:

  • quantile
  • pdf
  • cdf
  • mean

I've already started to make this contribution on my local machine.

Related Issues

No response

Questions

  1. Metalog takes an array of coefficients as parameters. What is the standard way to pass arrays in stdlib? Would you be happy with just a JS array?
  2. Metalog has two ways of construction. One is to provide coefficients, but it is also really common to fit a distribution to samples via least squares regression. Where would the fitting functionality be provided in stdlib?
  3. Would you accept a contribution of the metalog distribution?
  4. Would you be ok if I created a draft PR so that you can see the changes that are being made to create a metalog distribution in stdlib? That way I can ask questions related to implementation there?
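
To make question 1 concrete, here is a minimal sketch of a quantile function that takes its coefficients as a plain JS array. It uses the standard unbounded metalog series. The names `basisTerm` and `quantile`, the argument order, and the choice to return `NaN` outside the open interval (0,1) are assumptions for illustration, not an existing stdlib API.

```javascript
'use strict';

// Hypothetical helper (not an existing stdlib API): evaluates the k-th basis
// term (1-indexed) of the unbounded metalog quantile function at probability `y`.
function basisTerm( k, y ) {
    var logit = Math.log( y / ( 1.0 - y ) );
    var d = y - 0.5;
    if ( k === 1 ) {
        return 1.0;
    }
    if ( k === 2 ) {
        return logit;
    }
    if ( k === 3 ) {
        return d * logit;
    }
    if ( k === 4 ) {
        return d;
    }
    if ( k % 2 === 1 ) {
        // Odd k >= 5: (y-0.5)^((k-1)/2)
        return Math.pow( d, ( k - 1 ) / 2 );
    }
    // Even k >= 6: (y-0.5)^(k/2-1) * ln(y/(1-y))
    return Math.pow( d, ( k / 2 ) - 1 ) * logit;
}

// Hypothetical signature: `coefficients` is a plain array [a1, a2, ...].
function quantile( p, coefficients ) {
    var q;
    var k;
    // Assumption: reject NaN, probabilities outside (0,1), and fewer than two terms.
    if ( !( p > 0.0 && p < 1.0 ) || coefficients.length < 2 ) {
        return NaN;
    }
    q = 0.0;
    for ( k = 1; k <= coefficients.length; k++ ) {
        q += coefficients[ k-1 ] * basisTerm( k, p );
    }
    return q;
}
```

With this shape, `quantile( 0.5, [ 1.0, 2.0 ] )` returns the first coefficient, because ln(y/(1-y)) vanishes at the median.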

Other

The implementation will be checked against the rmetalog package for accuracy.

Checklist

  • [X] I have read and understood the Code of Conduct.
  • [X] Searched for existing issues and pull requests.
  • [X] The issue name begins with RFC:.

Hazelfire, Aug 30 '22 00:08

:tada: Welcome! :tada:

And thank you for opening your first issue! We will get back to you shortly. :runner: :dash:

github-actions[bot], Aug 30 '22 00:08

@Hazelfire Thanks for filing this RFC.

The metalog distribution would be a welcome addition. Based on the API and nature of the distribution, I am not sure of the best location for a metalog distribution package. In https://github.com/stdlib-js/stdlib/tree/d5827a51f78e852b59f5a1dd958ffa6afaee7530/lib/node_modules/%40stdlib/stats/base/dists, distributions (almost?) exclusively operate on scalar values. Perhaps @Planeshifter has thoughts?

Had a look at rmetalog. As it is licensed under a permissive MIT license, should be fine to base a stdlib implementation on this reference implementation.

Checking against rmetalog seems reasonable. However, we should only expect approximate accuracy, due to error accumulation (i.e., result divergence) in underlying special functions.

Questions

  1. Passing an array should be fine for now. Can refactor later to support ndarray objects.
  2. For LSTSQ, @Planeshifter may have thoughts.
  3. Yes, we'd be happy to accept a metalog contribution.
  4. A draft PR would be great. My advice would be to start by submitting a README with the proposed API as a WIP PR. Once the API design seems reasonable, moving to implementation should be more straightforward, as we can avoid churn if the API, for whatever reason, doesn't match stdlib design conventions.

kgryte, Aug 30 '22 00:08

Another reference implementation is a pymetalog package, which is also MIT licensed, and, so long as equivalent to R, may provide a better basis, as we can more easily map APIs.

kgryte, Aug 30 '22 01:08

Looking through the implementations, looks like we'll need to add some functionality to stdlib for array ops. When submitting the draft PR, may be good to provide a list of various ops (e.g., matrix inversion, stacking, lstsq, etc).
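
As a rough illustration of the kind of array ops involved, the sketch below stacks a design matrix and solves the least-squares problem through the normal equations (XᵀX)a = Xᵀy. All function names are hypothetical, the design matrix covers only the two-term metalog (columns 1 and ln(p/(1-p))), and the normal equations are used here for brevity; a QR- or SVD-based lstsq would likely be preferable numerically.

```javascript
'use strict';

// Hypothetical helper: solves the square system A*x = b by Gaussian elimination
// with partial pivoting. `A` (an array of rows) and `b` are modified in place.
function solve( A, b ) {
    var n = b.length;
    var x = new Array( n );
    var piv;
    var tmp;
    var f;
    var s;
    var i;
    var j;
    var c;
    for ( c = 0; c < n; c++ ) {
        // Pick the row with the largest magnitude entry in column c:
        piv = c;
        for ( i = c+1; i < n; i++ ) {
            if ( Math.abs( A[ i ][ c ] ) > Math.abs( A[ piv ][ c ] ) ) {
                piv = i;
            }
        }
        tmp = A[ c ]; A[ c ] = A[ piv ]; A[ piv ] = tmp;
        tmp = b[ c ]; b[ c ] = b[ piv ]; b[ piv ] = tmp;
        // Eliminate column c from the rows below:
        for ( i = c+1; i < n; i++ ) {
            f = A[ i ][ c ] / A[ c ][ c ];
            for ( j = c; j < n; j++ ) {
                A[ i ][ j ] -= f * A[ c ][ j ];
            }
            b[ i ] -= f * b[ c ];
        }
    }
    // Back substitution:
    for ( i = n-1; i >= 0; i-- ) {
        s = b[ i ];
        for ( j = i+1; j < n; j++ ) {
            s -= A[ i ][ j ] * x[ j ];
        }
        x[ i ] = s / A[ i ][ i ];
    }
    return x;
}

// Hypothetical helper: least-squares solution of X*a ≈ y via the normal equations.
function lstsq( X, y ) {
    var k = X[ 0 ].length;
    var XtX = [];
    var Xty = [];
    var i;
    var r;
    var c;
    for ( r = 0; r < k; r++ ) {
        XtX.push( new Array( k ).fill( 0.0 ) );
        Xty.push( 0.0 );
    }
    for ( i = 0; i < X.length; i++ ) {
        for ( r = 0; r < k; r++ ) {
            Xty[ r ] += X[ i ][ r ] * y[ i ];
            for ( c = 0; c < k; c++ ) {
                XtX[ r ][ c ] += X[ i ][ r ] * X[ i ][ c ];
            }
        }
    }
    return solve( XtX, Xty );
}

// Stacks the design matrix for a two-term metalog: one row [1, ln(p/(1-p))] per probability.
function designMatrix( probs ) {
    return probs.map( function row( p ) {
        return [ 1.0, Math.log( p / ( 1.0 - p ) ) ];
    });
}
```

Calling `lstsq( designMatrix( probs ), samples )`, where `samples` are sorted observations and `probs` their assumed cumulative probabilities, would return the fitted coefficient array.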

kgryte, Aug 30 '22 01:08

Hey! I've spent about 3 days trying to contribute this distribution, and I have a whole menagerie of development questions:

  1. How do you generate <toc>, <toc-namespace>, <equations> and other markdown pre-processing directives? I know you've told me that I shouldn't need that and you'll do it on your end, but there are a lot of design questions that come around that level, rather than the individual function level. It would be nice to know how to do it (or even better, have it documented). It seems a large amount of your code is automatically generated (for instance, the tests?)
  2. The pre-commit hook is preventing me from contributing to this incrementally (for instance, starting with a simple README that's not correct, and then trying to refine from there) because eslint seems to expect me to have implemented the function along with what looks to be doc tests. Should I ignore the pre-commit hook? If not, how do you set up eslint? Eslint seems to ignore node_modules directories by default, and you have another config at /etc/eslint/.eslintignore to override that. But how do you go about including that ignore file? I can't find documentation about the ignore.
  3. How do you go about generating and viewing the documentation? It would be nice to be able to see whether my readme compiles correctly

I'm also curious about namespaces, because metalog is kind of a family of four distributions: unbounded, bounded, semi-bounded lower, and semi-bounded upper. I'm not sure how we should represent this. My idea was to have four different modules inside metalog for the four different types, and then functions within those modules. Either that, or we can follow the implementation set by rmetalog and simply make the type a parameter to the function (like a string that specifies which type).

Any help would be greatly appreciated

Hazelfire, Oct 19 '22 01:10

The pre-commit hook also provides some extremely mysterious errors (screenshot not preserved). I'm very much at a loss.

Hazelfire, Oct 19 '22 06:10

@Hazelfire I'm sorry that you are experiencing such a difficult time contributing.

For the time being, you can skip linting and bypass the commit hook by doing the following:

    git commit --no-verify
    git push --no-verify

We can sort out styling, etc, once we take a look at the initial API design and details.

kgryte, Oct 19 '22 08:10

Hi Hazelfire,

Apologies for the cryptic lint error messages; I have just been refactoring the pre-commit hooks and our eslint rules to allow automatic fixes in order to streamline the development experience, but apparently there are some bumps in the road along this journey. As @kgryte mentioned, feel free to bypass the hooks for now. But you may pull the latest changes and re-initialize the lint rules via make init, which should avoid the cryptic error messages encountered above.

> How do you generate <toc>, <toc-namespace>, <equations> and other markdown pre-processing directives? I know you've told me that I shouldn't need that and you'll do it on your end, but there are a lot of design questions that come around that level, rather than the individual function level. It would be nice to know how to do it (or even better, have it documented). It seems a large amount of your code is automatically generated (for instance, the tests?)

None of our tests are automatically generated. We are currently investigating scaffolding of packages via AI, but don't plan to auto-generate tests and implementations due to fears that this would lead to a false sense of security. We do have a directory of snippets that may be used when writing a new package.

Equations should be manually inserted into the README.mds via equation comments of the form

<!-- <equation class="equation" label="eq:gamma_function_positive_integers" align="center" raw="\Gamma ( n ) = (n-1)!" alt="Gamma function for positive integers."> -->

<!-- </equation> -->

where raw is the raw LaTeX code for the equation and alt is a human-readable text description for the equation.

We manually insert <toc /> comments for namespace packages, i.e. packages that contain one or several other packages. For example, for many of the distribution packages (e.g., @stdlib/stats/base/dists/normal), we may insert the following into their README.md:

The namespace contains the following distribution functions:

<!-- <toc pattern="*+(cdf|pdf|mgf|quantile)*"> -->

<!-- </toc> -->

The namespace contains the following functions for calculating distribution properties:

<!-- <toc pattern="*+(entropy|kurtosis|mean|median|mode|skewness|stdev|variance)*"> -->

<!-- </toc> -->

Here, the pattern attribute is used to determine which packages will be included inside the <toc /> block. A wildcard pattern pattern="*" will cause all packages to be listed inside the respective table of contents section.
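
To illustrate which names such a pattern selects, here is a small sketch. It assumes the pattern follows extended-glob semantics, where `+(a|b)` means one or more of the alternatives, and approximates it with a hand-written regular expression. The list of package names is hypothetical, and the actual matcher used by the Make recipes may behave differently.

```javascript
'use strict';

// Hand-written approximation of the glob `*+(cdf|pdf|mgf|quantile)*`:
// any name containing at least one of the alternatives.
var RE_DIST_FUNCS = /^.*(?:cdf|pdf|mgf|quantile)+.*$/;

// Hypothetical package names inside a distribution namespace:
var names = [ 'cdf', 'entropy', 'logpdf', 'mean', 'mgf', 'quantile', 'variance' ];

var matched = names.filter( function isMatch( name ) {
    return RE_DIST_FUNCS.test( name );
});

console.log( matched );
// => [ 'cdf', 'logpdf', 'mgf', 'quantile' ]
```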

We have various Make recipes that auto-populate the table of contents sections and generate equation SVG files. These are automatically run upon merging pull requests, so you don't have to worry about them.

> The pre-commit hook is preventing me from contributing to this incrementally (for instance, starting with a simple README that's not correct, and then trying to refine from there) because eslint seems to expect me to have implemented the function along with what looks to be doc tests. Should I ignore the pre-commit hook? If not, how do you set up eslint? Eslint seems to ignore node_modules directories by default, and you have another config at /etc/eslint/.eslintignore to override that. But how do you go about including that ignore file? I can't find documentation about the ignore.

The Make recipes for linting as well as the pre-commit hook will use the correct eslint configuration without any custom configuration. To see eslint warnings and errors inside your IDE, you may need to change the eslint settings for your workspace. Personally, I am using VSCode and get all lint annotations after setting the eslint workingDirectories setting of the workspace to:

    "eslint.workingDirectories": [
        "lib/node_modules/@stdlib"
    ],

We may be better able to assist with more information on your development setup.

> How do you go about generating and viewing the documentation? It would be nice to be able to see whether my readme compiles correctly

I would just preview the Markdown files and check that they render correctly; the compile step for inserting equations and updating tables of contents will happen upon merging a pull request.

> I'm also curious about namespaces because metalog is kind of a family of four distributions. Unbounded, Bounded, and Semi Bounded upper and lower. I'm not sure how we should represent this, my idea was to have four different modules inside metalog for the four different types, and then functions within those modules. Either that or we can follow the implementation set by rmetalog and simply set the different types to be parameters to the function (like a string that specifies what type)

That's a good question. One question, which you may be best suited to answer, is how much code overlap there is between the four different distributions. If there is considerable overlap and re-use of code, then the R approach of having a string parameter for setting the type seems sensible to me. If, in contrast, the four distributions have completely separate implementations, then having four different sub-namespaces under @stdlib/stats/base/dists/metalog, each with its own functions, seems like the way to go. The latter would allow users of the library to depend only on the subset of code they need, without having to pull in the code for all four distributions.
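
For what it's worth, a sketch of how the two designs could look at the quantile level. In the usual formulation of the metalog family, the three bounded variants are transforms of the unbounded quantile value, which suggests a fair amount of shared code. Every name below, including the type strings, is hypothetical, and `m` stands for the unbounded metalog quantile already evaluated at some probability.

```javascript
'use strict';

// Option 1: separate functions, which could live in separate sub-namespaces.

// Semi-bounded below: lower + e^m
function semiBoundedLower( m, lower ) {
    return lower + Math.exp( m );
}

// Semi-bounded above: upper - e^(-m)
function semiBoundedUpper( m, upper ) {
    return upper - Math.exp( -m );
}

// Bounded: (lower + upper*e^m) / (1 + e^m), written in the algebraically
// equivalent form lower + (upper-lower)/(1 + e^(-m)) to avoid Infinity/Infinity.
function bounded( m, lower, upper ) {
    return lower + ( ( upper - lower ) / ( 1.0 + Math.exp( -m ) ) );
}

// Option 2: a single entry point with a string parameter, in the style of rmetalog.
function applyBounds( m, type, lower, upper ) {
    switch ( type ) {
    case 'unbounded':
        return m;
    case 'semi-bounded-lower':
        return semiBoundedLower( m, lower );
    case 'semi-bounded-upper':
        return semiBoundedUpper( m, upper );
    case 'bounded':
        return bounded( m, lower, upper );
    default:
        throw new RangeError( 'unrecognized type: ' + type );
    }
}
```

If the pdf and cdf share this structure, the string-parameter design mostly adds a dispatch layer on top of the same core, while the sub-namespace design would expose each transform as its own package.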

Hope this answers at least some of your questions! Please don't hesitate to follow-up as needed.

Planeshifter, Oct 20 '22 20:10