
docs: add aisample notebooks into community folder

thinkall opened this pull request 3 years ago • 2 comments

Related Issues/PRs

None

What changes are proposed in this pull request?

Add two notebooks for aisample into notebooks/community/aisample folder.

  • AIsample - Book Recommendation.ipynb
  • AIsample - Fraud Detection.ipynb

How is this patch tested?

  • [ ] I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • [x] No. You can skip this section.
  • [ ] Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • [x] No. You can skip this section.
  • [ ] Yes. Make sure you have added samples following below steps.
  1. Find the corresponding markdown file for your new feature in the website/docs/documentation folder. Make sure you choose the correct class (estimator/transformer) and namespace.
  2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to correct API link.
  4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each Python code block to enable auto-tests for Python samples.
  6. Make sure the WebsiteSamplesTests job pass in the pipeline.

AB#1920095

thinkall avatar Aug 09 '22 06:08 thinkall

Hey @thinkall :wave:! Thank you so much for contributing to our repository :raised_hands:. Someone from SynapseML Team will be reviewing this pull request soon. We appreciate your patience and contributions :100:!

github-actions[bot] avatar Aug 09 '22 06:08 github-actions[bot]

Hey @thinkall, welcome to the project! I added you to the SynapseML team, so you can now queue builds with the magical /azp run comment. But before you queue, a few questions and requests:

  1. Do you intend for this notebook to run on platforms like Synapse and Databricks? I know there's a dependency on the lakehouse, but we can just as easily move this data to our public blob so that the demo works on all platforms. If so, we'll want to make some changes so that we can test this notebook as part of our build system; Serena might be able to help you make these changes.

  2. Can you run black on these notebooks so that they get properly formatted into multiple lines to make the code review process easier?

  3. Can you clear the output of the cells prior to check-in? To show key results, include them as images in the markdown so that folks can compare.
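For reference, both requests above can be handled from the command line; a sketch assuming black's Jupyter extra and nbconvert (the notebook paths are illustrative, so the actual formatting commands are shown commented out):

```shell
# Install Black with Jupyter-notebook support, plus nbconvert for clearing outputs.
pip install --quiet 'black[jupyter]' nbconvert
# Format a notebook in place (path is illustrative):
# black "notebooks/community/aisample/AIsample - Book Recommendation.ipynb"
# Strip all cell outputs before check-in:
# jupyter nbconvert --clear-output --inplace "notebooks/community/aisample/AIsample - Book Recommendation.ipynb"
```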

We appreciate these contributions, and from the start these notebooks look like they have some great content!

mhamilton723 avatar Aug 10 '22 03:08 mhamilton723

Hi @mhamilton723, thanks a lot for the suggestions. I've formatted the notebooks with black and cleared the output of the cells. As for the cross-platform issue, we're focusing on just one platform at the moment and prefer not to introduce concepts from other platforms. I think we can add some code to make the notebooks runnable in Synapse/Databricks after the MVP. Does that make sense?

thinkall avatar Aug 10 '22 10:08 thinkall

Hey @mhamilton723, first of all, a huge thank you. Your detailed comments are so helpful. I modified our notebooks following most of your comments. For the rest, please let me explain why we did things that way.

The purpose of these notebooks is to showcase the functionality of our new platform with some end-to-end examples in different scenarios. Users who are not familiar with the techniques (ML, Spark, etc.) can just copy and paste the notebooks and run them directly without changing anything. They can also apply a notebook to their own dataset by simply uploading the dataset to the lakehouse, modifying the params in the first code cell accordingly, and clicking run. Experienced users can also use these notebooks as references.

Loading Data

To achieve these goals, we chose to load an arbitrary dataset and reorder/rename some of its columns to keep the code in the other parts simple. So we'd like to keep the data-loading part as it is. The data-downloading part is also needed; using the lakehouse is actually one of the features we want to showcase.

EDA

  • "coalesce before adding monotonic IDs" is needed to keep the IDs contiguous.
  • "_" is used to avoid naming conflicts as much as possible.

Merge Data

  • "get the columns explicitly" is for reordering the columns.

Model Eval

  • It seems that RegressionEvaluator from pyspark.ml can't evaluate all the metrics at the same time; RegressionMetrics from pyspark.mllib can, but it only accepts RDD inputs. Moreover, the current evaluation time is only a few seconds, so I think it's acceptable.

Model saving and loading

  • Yes, it is intentional. We'll add code using the mlflow plugin for model saving and loading.

Plus, I adjusted the first cell with the comments aligned. It's different from what black wants, but I think it looks better for this cell. Is that OK with you?

thinkall avatar Aug 12 '22 10:08 thinkall

Will give this another full review tonight. One thing I noticed is that the files are still very large due to random Synapse metadata held in the notebooks; can you remove those entries from the JSON so these notebooks are smaller files?

mhamilton723 avatar Aug 15 '22 17:08 mhamilton723

Will give this another full review tonight. One thing I noticed is that the files are still very large due to random Synapse metadata held in the notebooks; can you remove those entries from the JSON so these notebooks are smaller files?

Thanks @mhamilton723, all the unnecessary random metadata has been removed.

thinkall avatar Aug 16 '22 02:08 thinkall

Li, thanks so much for your fixes and changes! Apologies in advance for any annoying comments I leave you, and I appreciate your patience in working with me on these fixes. I think it's become a lot simpler since the first iteration, and I'm hoping we can continue to distill this into its simplest and easiest form for users.

Book Recs

EDA

  • "coalesce before adding monotonic IDs is needed to keep the IDs contiguous" - I don't think they need to be contiguous; it's just a user GUID, and coalesce forces you to go to a single machine, which doesn't scale

  • "_ is used to avoid naming conflicts as much as possible" - I think we should still remove these underscore names and instead reserve _names for private members, as is the standard. Consider simplifying the logic of your queries to avoid the need for these underscores. Also consider this for the variable _df_all, as it's used a lot in your code and hence shouldn't be marked as a private variable. In general, _ should be reserved for private variables/methods in Python classes, as opposed to objects and names used in a public demo.

Model Training

  • I still think this can be simplified with a smaller if statement. Consider the following:
if model_tuning_method == "CrossValidator":
    tuner = CrossValidator(
        estimator=als, estimatorParamMaps=param_grid, evaluator=evaluator, numFolds=5
    )
elif model_tuning_method == "TrainValidationSplit":
    tuner = TrainValidationSplit(
        estimator=als,
        estimator=als,
        estimatorParamMaps=param_grid,
        evaluator=evaluator,
        # 80% of the data will be used for training, 20% for validation.
        trainRatio=0.8,
    )
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

With this, the downstream code becomes:

models = tuner.fit(train)
model = models.bestModel

Likewise, the next if can be simplified:

if model_tuning_method == "CrossValidator":
    metric = models.avgMetrics
elif model_tuning_method == "TrainValidationSplit":
    metric = models.validationMetrics
else:
    raise ValueError(f"Unknown model_tuning_method: {model_tuning_method}")

And the other stuff can be factored out.

  • the code here:
# collect metrics
tmp_list = []
for params in models.getEstimatorParamMaps():
    tmp = ""
    for k in params:
        tmp += k.name + "=" + str(params[k]) + " "
    tmp_list.append(tmp)

This can be simplified to something like:

# collect metrics
param_strings = []
for param_map in models.getEstimatorParamMaps():
    param_strings.append(" ".join(f"{k.name}={v}" for (k, v) in param_map.items()))

Model Evaluation

  • predictions=... can be cleaned up with dot chaining
  • The evaluation section has a lot of repeated code, both in the experiment duplicated for train and test and in the repeated evaluator.evaluate... snippets; consider making these functions or lambdas to tighten this up a bit

Style

  • "It's different from what black wants" - unfortunately our build errors if it's not black compliant; consider shortening the comments so black doesn't need to wrap them onto a new line

Fraud Detection

Cast columns

  • Consider dot chaining to neaten this section

Model evaluation

  • mmlspark is a deprecated namespace; change it to synapse.ml

  • Same comment about duplication in the eval code for SMOTE vs. non-SMOTE; consider making a function or a loop

Saving model

  • We should comment that saveNativeModel is useful if you want to use the model in other LightGBM contexts; otherwise, for later usage with Spark, use the SparkML save/load on LightGBMClassifier. For this demo it might be simplest to use SparkML model serialization instead of LightGBM native.

Thanks again for all your patience. I know it's no fun getting comments, and I really appreciate you!

mhamilton723 avatar Aug 17 '22 03:08 mhamilton723

@mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.

"coalesce before adding monotonic IDs is needed to keep the IDs contiguous" - I don't think they need to be contiguous; it's just a user GUID, and coalesce forces you to go to a single machine, which doesn't scale

There are two reasons for us to apply coalesce.

  1. The newly generated _user_id is used for recommendation models; contiguous integers give a smaller model size and faster inference speed for some models. So it's more than just a GUID.

  2. We used the ALS model in this demo, and the Spark implementation only supports values in the Integer range. If we don't apply coalesce, the generated IDs may exceed that range for big datasets. In fact, I added coalesce after I hit this issue.

thinkall avatar Aug 17 '22 10:08 thinkall

@mhamilton723 , thanks a lot for all the suggestions. All but one are accepted.

"coalesce before adding monotonic IDs is needed to keep the IDs contiguous" - I don't think they need to be contiguous; it's just a user GUID, and coalesce forces you to go to a single machine, which doesn't scale

There are two reasons for us to apply coalesce.

  1. The newly generated _user_id is used for recommendation models; contiguous integers give a smaller model size and faster inference speed for some models. So it's more than just a GUID.
  2. We used the ALS model in this demo, and the Spark implementation only supports values in the Integer range. If we don't apply coalesce, the generated IDs may exceed that range for big datasets. In fact, I added coalesce after I hit this issue.

Thanks @thinkall, I see what you are saying. I think in this case we should instead use the SparkML StringIndexer, as it is designed to learn a mapping from strings representing IDs to integers/longs. This will automatically make the IDs contiguous, but it won't have the aforementioned scalability issues. Also, I still see a lot of _ column names. Sorry to be a pain, but can we try to name those without _? Your use of _evaluate is OK though!

mhamilton723 avatar Aug 17 '22 16:08 mhamilton723

All done.

thinkall avatar Aug 18 '22 03:08 thinkall

/azp run

thinkall avatar Aug 18 '22 04:08 thinkall

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Aug 18 '22 04:08 azure-pipelines[bot]

Codecov Report

Merging #1606 (1c108d1) into master (ac40e5a) will decrease coverage by 0.04%. The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #1606      +/-   ##
==========================================
- Coverage   83.79%   83.75%   -0.05%     
==========================================
  Files         292      292              
  Lines       15473    15473              
  Branches      752      752              
==========================================
- Hits        12966    12959       -7     
- Misses       2507     2514       +7     
Impacted Files Coverage Δ
...crosoft/azure/synapse/ml/io/http/HTTPClients.scala 67.64% <0.00%> (-7.36%) :arrow_down:
.../execution/streaming/continuous/HTTPSourceV2.scala 92.08% <0.00%> (-0.72%) :arrow_down:


codecov-commenter avatar Aug 18 '22 04:08 codecov-commenter

/azp run

thinkall avatar Aug 18 '22 06:08 thinkall

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Aug 18 '22 06:08 azure-pipelines[bot]

/azp run

mhamilton723 avatar Aug 19 '22 00:08 mhamilton723

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Aug 19 '22 00:08 azure-pipelines[bot]