zenml Fix GCP step logging

Describe changes

The issue arises because GCS objects are immutable (see the linked Stack Overflow thread), so a log file in the bucket cannot be appended to in place. To fix the issue, I rewrite the existing file with its old contents and the buffer's contents appended together.
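The read-then-rewrite idea can be sketched roughly as follows. This is only an illustration and not the code in this PR: `ImmutableStore` is a hypothetical in-memory stand-in for a bucket whose objects can only be replaced whole, and `append_by_rewrite` and the URI are made-up names.

```python
from typing import Dict, List


class ImmutableStore:
    """Hypothetical stand-in for an object store without in-place append."""

    def __init__(self) -> None:
        self._objects: Dict[str, str] = {}

    def exists(self, uri: str) -> bool:
        return uri in self._objects

    def read(self, uri: str) -> str:
        return self._objects[uri]

    def write(self, uri: str, data: str) -> None:
        # A write always replaces the whole object.
        self._objects[uri] = data


def append_by_rewrite(store: ImmutableStore, uri: str, buffer: List[str]) -> None:
    """Emulate an append: read the old contents, then write old + new."""
    old = store.read(uri) if store.exists(uri) else ""
    new = "".join(line + "\n" for line in buffer)
    store.write(uri, old + new)


store = ImmutableStore()
uri = "gs://example-bucket/logs.txt"  # hypothetical URI
append_by_rewrite(store, uri, ["line 0", "line 1"])
append_by_rewrite(store, uri, ["line 2"])
result = store.read(uri)
```

Each flush re-reads and re-uploads everything written so far, which is the trade-off of this approach.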

Code to test the change: (Credits: @strickvl)

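Please use a GCS stack.

import gcsfs
from zenml.client import Client
from zenml.logging.step_logging import StepLogsStorage

# Access the active stack before touching the artifact store
# (presumably so that the GCS filesystem gets registered).
client = Client()
_ = client.active_stack

TEST_FILE = "gs://zenml-2211/test.txt"

# With max_messages=5, writing 11 lines forces several flushes.
log_storage = StepLogsStorage(logs_uri=TEST_FILE, max_messages=5)
for i in range(11):
    log_storage.write(f"I'm log line #{i}")
log_storage.save_to_file()

# Read the file back; all 11 lines should be present.
fs = gcsfs.GCSFileSystem()
with fs.open(TEST_FILE, "r") as f:
    all_of_it = f.read()

print(all_of_it)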

Pre-requisites

Please ensure you have done the following:

[x] I have read the CONTRIBUTING.md document.
[x] If my change requires a change to docs, I have updated the documentation accordingly.
[ ] I have added tests to cover my changes.
[x] I have based my new branch on develop and the open PR is targeting develop. If your branch wasn't based on develop, read the Contribution guide on rebasing a branch to develop.
[x] If my changes require changes to the dashboard, these changes are communicated/requested.

Types of changes

[ ] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[x] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Other (add details above)

Jan 26 '24 07:01 adtygan

[!IMPORTANT]

Auto Review Skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository.

To trigger a single review, invoke the @coderabbitai review command.

Jan 26 '24 07:01 coderabbitai[bot]

⚠️ GitGuardian has uncovered 2 secrets following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

🔎 Detected hardcoded secrets in your pull request

| GitGuardian id | GitGuardian status | Secret | Commit | Filename |
| --- | --- | --- | --- | --- |
| - | | Google Cloud Keys | 82c26cb032d72a7f39c35ea3b44e8ccd5261e214 | zenml-key.json |
| - | | Google Cloud Keys | c57dd68644be0fe82dcf079addf51fea8fa42e92 | zenml-key.json |

🛠 Guidelines to remediate hardcoded secrets

- Understand the implications of revoking this secret by investigating where it is used in your code.
- Replace and store your secrets safely, following best practices.
- Revoke and rotate these secrets.
- If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future, consider:

- following best practices for managing and storing secrets, including API keys and other credentials
- installing secret detection on pre-commit to catch secrets before they leave your machine and ease remediation

Jan 26 '24 07:01 gitguardian[bot]

I accidentally included my GCP keys in this PR. I later made another commit to delete them, but GitGuardian still shows an error. I have not contributed to open source before, so I would appreciate some assistance with this.

Jan 26 '24 07:01 adtygan

Thanks for reviewing @htahir1 !

"Will this affect other artifact stores like S3?" I do not think so. The existing solution appended the logs to the file, whereas the proposed solution overwrites the existing file with the updated content, and hence it should work on other stores as well.
"Is this still performant? Somehow I feel like we are doing too many IO operations... Maybe we should slow it down a bit?" I did try to think of better solutions, but the main limitation, GCS files being immutable, led me to this option. Could you suggest any alternative approaches that I could try?
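To put a rough number on the I/O concern, here is a back-of-the-envelope sketch comparing a true append with the rewrite strategy. The figures are assumptions for illustration only: 20 bytes per log line and a buffer of 5 messages (as in the test script); the function name is made up, and reads are not counted.

```python
def bytes_written(total_lines: int, buffer_size: int, line_bytes: int, rewrite: bool) -> int:
    """Total bytes sent to storage after logging `total_lines` lines."""
    written = 0
    stored = 0  # bytes currently in the log file
    for start in range(0, total_lines, buffer_size):
        flushed = min(buffer_size, total_lines - start) * line_bytes
        if rewrite:
            # Old contents are uploaded again together with the new buffer.
            written += stored + flushed
        else:
            # True append: only the new buffer is sent.
            written += flushed
        stored += flushed
    return written


append_cost = bytes_written(10_000, 5, 20, rewrite=False)
rewrite_cost = bytes_written(10_000, 5, 20, rewrite=True)
print(append_cost, rewrite_cost)
```

Under these assumed numbers the rewrite strategy uploads about a thousand times more data (200,100,000 bytes versus 200,000), because the amount written grows quadratically with the number of flushes.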

Thanks

Jan 26 '24 16:01 adtygan

@adtygan As the keys remain in the git history, please invalidate them in GCP ASAP to keep your cloud account safe.

For the performance, maybe a good way to do it would be to benchmark some running pipelines with varying amounts of logs. I have noticed that if you use a rich or TQDM progress bar it slows down a LOT, and I'd love some benchmarks on the local store vs. the GCS store for varying scripts :-)

Jan 26 '24 19:01 htahir1

Thanks for the suggestion @htahir1 , I have invalidated my key. With regard to benchmarks, please give me some time. I will get back on this and update you.

Jan 27 '24 05:01 adtygan

Hello @htahir1 , I want to confirm with you if I understand what you said correctly. I'm planning to measure the running time for logging 100, 1,000 and 10,000 lines. For each of these, I'm going to measure the running times for local, GCP, local with TQDM, GCP with TQDM. In total this should give 12 run time values. I need to measure these 12 values for the current version of the code and my PR version.

Is this correct? Thanks.
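The measurement matrix could be laid out roughly like this. It is only a sketch: `fake_run` is a placeholder workload rather than a real pipeline step, and the default of 10 repeats and all names here are assumptions.

```python
import itertools
import statistics
import time
from typing import Callable, Dict, Tuple

LINE_COUNTS = (100, 1_000, 10_000)
STACKS = ("local", "gcp")
TQDM_OPTIONS = (False, True)

# (number of log lines, stack name, whether TQDM is used)
Config = Tuple[int, str, bool]


def time_run(run: Callable[[Config], None], config: Config, repeats: int = 10) -> float:
    """Average wall-clock seconds over `repeats` calls of `run(config)`."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run(config)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def benchmark(run: Callable[[Config], None], repeats: int = 10) -> Dict[Config, float]:
    """Time every combination: 3 line counts x 2 stacks x 2 TQDM settings."""
    configs = itertools.product(LINE_COUNTS, STACKS, TQDM_OPTIONS)
    return {config: time_run(run, config, repeats) for config in configs}


def fake_run(config: Config) -> None:
    """Placeholder workload; a real run would execute a step that logs."""
    n_lines, _stack, _use_tqdm = config
    for i in range(n_lines):
        _ = f"I'm log line #{i}"


results = benchmark(fake_run, repeats=2)
print(len(results))
```

Running `benchmark` once on the develop branch and once on the PR branch would give the two sets of 12 values.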

Feb 03 '24 13:02 adtygan

Yes this is correct!

Feb 03 '24 14:02 htahir1

I did part of the benchmarking and noticed several issues.

Here are the details of the runs writing 100 lines of logs (averaged over 10 runs):

- My version on GCP stack: 27.807 seconds
- Develop branch version on GCP stack: not run, because it does not write logs properly
- My version on local stack: not measured; it created a file larger than 1 GB, and I don't yet know why (on the GCP stack it works fine)
- Develop branch version on local stack: 0.003 seconds

I see two big issues with my fix:

- It is not fast enough.
- It creates a huge file on the local stack, and I don't understand why.

I need some more time to look into this issue. Thanks.

Feb 06 '24 19:02 adtygan

@adtygan note that you're getting some linting failures on the CI. If you could fix those as well, that'd be great!

Feb 08 '24 14:02 strickvl

Hello @strickvl , I don't think my current code can be optimized to improve performance. Instead, I looked at the Potential Solution you mentioned in the initial post of the issue (https://github.com/zenml-io/zenml/issues/2211#issue-2063754569). This option looks like the best choice, but I want to clarify how to go about incorporating it.

If I understand correctly, you are suggesting opening the log file in write mode and then proceeding with logging, which would write all the contents. I have tested this on the GCP stack and it works. The only issue I see is that it will overwrite past logs.

Can I work on a solution where we create a temporary file to store the logs and then, in the exit() method, append this file's contents to the main log file?
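A rough sketch of the temporary-file idea, to make the question concrete. It assumes the exit() method is the `__exit__` of a context manager, and it uses plain local files via `open` instead of the artifact store's file I/O; `TempFileLogBuffer` is a made-up name, not an existing ZenML class.

```python
import os
import tempfile


class TempFileLogBuffer:
    """Hypothetical sketch: collect logs locally, merge into the main log on exit."""

    def __init__(self, main_log_path: str) -> None:
        self.main_log_path = main_log_path
        self._tmp = None

    def __enter__(self) -> "TempFileLogBuffer":
        # All writes during the step go to a local temporary file.
        self._tmp = tempfile.NamedTemporaryFile("w+", suffix=".log", delete=False)
        return self

    def write(self, line: str) -> None:
        self._tmp.write(line + "\n")

    def __exit__(self, exc_type, exc, tb) -> None:
        # Read back everything logged in this run, then clean up the temp file.
        self._tmp.seek(0)
        new_logs = self._tmp.read()
        self._tmp.close()
        os.remove(self._tmp.name)

        # Keep past logs: read them once, then write old + new in one go.
        old_logs = ""
        if os.path.exists(self.main_log_path):
            with open(self.main_log_path, "r") as f:
                old_logs = f.read()
        with open(self.main_log_path, "w") as f:
            f.write(old_logs + new_logs)


main_log = os.path.join(tempfile.mkdtemp(), "step.log")
with open(main_log, "w") as f:
    f.write("earlier run\n")

with TempFileLogBuffer(main_log) as logs:
    for i in range(3):
        logs.write(f"I'm log line #{i}")

with open(main_log) as f:
    merged = f.read()
print(merged)
```

Compared with rewriting on every flush, the main log file is read and rewritten only once, when the context exits.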

Thanks

Feb 14 '24 20:02 adtygan

@adtygan this sounds like a reasonable plan to try out! Would love to see how this new approach would benchmark against the old one

Feb 16 '24 15:02 htahir1

Sorry @htahir1 , I took a break from work for a few weeks and did not keep you updated. I will get back to working on the issue.

Mar 12 '24 19:03 adtygan

Hello @htahir1 and @strickvl , I have opened a PR (https://github.com/zenml-io/zenml/pull/2533). Please review it. To the best of my knowledge, I think this does not have any errors.

Mar 16 '24 06:03 adtygan
