Building from a local sdist file url broken in 0.21.0
Support for the local source file url scheme was added in https://github.com/prefix-dev/rattler-build/pull/177 and was working in version 0.5.0.
I hadn't tested that in a while. When trying to build a recipe using a local file as source with rattler-build 0.21.0, it fails.
Issue can be reproduced with:
context:
  version: "13.4.2"
package:
  name: "rich"
  version: ${{ version }}
source:
  - url: file:///tmp/rich/rich-13.4.2.tar.gz
    sha256: d653d6bccede5844304c605d5aac802c7cf9621efd700b46c7ec2b51ea914898
build:
  # Thanks to `noarch: python` this package works on all platforms
  noarch: python
  script:
    - python -m pip install . -vv --no-deps --no-build-isolation
requirements:
  host:
    - pip
    - poetry-core >=1.0.0
    - python 3.10
  run:
    # sync with normalized deps from poetry-generated setup.py
    - markdown-it-py >=2.2.0
    - pygments >=2.13.0,<3.0.0
    - python 3.10
    - typing_extensions >=4.0.0,<5.0.0
tests:
  - python:
      imports:
        - rich
      pip_check: true
about:
  homepage: https://github.com/Textualize/rich
  license: MIT
  license_file: LICENSE
  summary: Render rich text, tables, progress bars, syntax highlighting, markdown and more to the terminal
  description: |
    Rich is a Python library for rich text and beautiful formatting in the terminal.
    The Rich API makes it easy to add color and style to terminal output. Rich
    can also render pretty tables, progress bars, markdown, syntax highlighted
    source code, tracebacks, and more — out of the box.
  documentation: https://rich.readthedocs.io
  repository: https://github.com/Textualize/rich
$ rattler-build build
...
╭─ Running build for recipe: rich-13.4.2-pyh4616a5c_0
│
│ ╭─ Fetching source code
│ │ Validated SHA256 values of the downloaded file!
│ │ Using local source file.
│ │ Copying source from url: "/tmp/rich/rich-13.4.2.tar.gz" to "/tmp/rich/output/bld/rattler-build_rich_1725435886/work"
...
│ ╭─ Running build script
│ │ + python -m pip install . -vv --no-deps --no-build-isolation
│ │ Using pip 24.2 from $PREFIX/lib/python3.10/site-packages/pip (python 3.10)
│ │ Non-user install because user site-packages disabled
│ │ Ignoring indexes: https://pypi.org/simple
│ │ Created temporary directory: /tmp/pip-build-tracker-l3rer2po
│ │ Initialized build tracking at /tmp/pip-build-tracker-l3rer2po
│ │ Created build tracker: /tmp/pip-build-tracker-l3rer2po
│ │ Entered build tracker: /tmp/pip-build-tracker-l3rer2po
│ │ Created temporary directory: /tmp/pip-install-149wgmke
│ │ ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
...
$ ls /tmp/rich/output/bld/rattler-build_rich_1725435886/work
build_env.sh conda_build.sh rich-13.4.2.tar.gz
The local file was copied to the work directory but wasn't unarchived.
Thanks! There are a few workarounds, of course (e.g. making pip unarchive the file). I would also be interested to know whether path: /tmp/rich-.tar.gz works differently.
Lastly, I do think you are right and this file should be un-archived to have the same behavior as fetching from a URL.
It's working with path: :-)
╭─ Running build for recipe: rich-13.4.2-pyh4616a5c_0
│
│ ╭─ Fetching source code
│ │ Fetching source from path: "/tmp/rich/rich-13.4.2.tar.gz"
│ │ Extracted to "/tmp/rich/output/bld/rattler-build_rich_1725441621/work"
│ │
│ ╰─────────────────── (took 0 seconds)
We can see in the logs that it is extracted.
Using path: instead of url: file:// is fine for me.
Would still be nice to fix the file url behaviour as you said.
Same here, but unfortunately path: also fails:
- An url: key followed by a file:// URL fails to build:
  source:
    url: file:///path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
  rattler-build just copies the ZIP file, does not unarchive it
  │ ╭─ Fetching source code
  │ │ Validated SHA256 values of the downloaded file!
  │ │ Using local source file.
  │ │ Copying source from url: "/path/to//matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip" to "/tmp/channel/bld
  │ │ /rattler-build_matlab-runtime_1728648393/work"
  │ │
  │ ╰─────────────────── (took 69 seconds)
- An url: key followed by an https:// URL works just fine:
  source:
    url: https://ssd.mathworks.com/supportfiles/downloads/R2019b/Release/9/deployment_files/installer/complete/glnxa64/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
  rattler-build unarchives the ZIP file
  │ ╭─ Fetching source code
  │ │ Validated SHA256 values of the downloaded file!
  │ │ Found valid source cache file.
  │ │ Using extracted directory from cache: "/tmp/channel/src_cache/MATLAB_Runtime_R2019b_Update_9_glnxa64_d213e296"
  │ │ Copying source from url: "/tmp/channel/src_cache/MATLAB_Runtime_R2019b_Update_9_glnxa64_d213e296" to "/tmp/channel/bld/rattler-
  │ │ build_matlab-runtime_1728648935/work"
  │ │
  │ ╰─────────────────── (took 32 seconds)
- A path: key initially seemed to work equally fine, but rattler-build keeps unarchiving forever:
  source:
    path: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
  rattler-build attempts to unarchive the ZIP file, but unzipping lasts forever...
  │ ╭─ Fetching source code
  │ │ Fetching source from path: "/path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip"
  │ │ ⠤ Extracting zip [00:04:53] [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╾──────] 2.16 GiB @ 7.56 MiB/s
The issue ~~is probably~~ might be that rattler-build is unable to handle ZIP files larger than 2 GB:
$ ls -lh /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
-rwxrwx---+ 1 username nogroup 2.6G Aug 12 2021 /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
$
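To put the "larger than 2 GB" hypothesis in numbers, here is a small sketch that compares the archive size against the 32-bit limits that usually matter for ZIP handling. The byte count is the one reported by rsync and zipinfo elsewhere in this thread; the `classify` helper is hypothetical and says nothing about what rattler-build or the zip crate actually do internally.

```rust
/// Size of the archive in bytes, as reported by rsync and zipinfo in this thread.
const ARCHIVE_BYTES: u64 = 2_786_688_287;

/// Hypothetical helper: classifies a size against the 32-bit limits
/// that are relevant when reading ZIP archives.
fn classify(size: u64) -> &'static str {
    if size <= i32::MAX as u64 {
        "fits in a signed 32-bit integer"
    } else if size < u32::MAX as u64 {
        // Classic ZIP fields are unsigned 32-bit; 0xFFFFFFFF is the ZIP64 marker.
        "exceeds signed 32-bit, still fits classic (non-ZIP64) 32-bit fields"
    } else {
        "needs ZIP64"
    }
}

fn main() {
    println!("{} bytes: {}", ARCHIVE_BYTES, classify(ARCHIVE_BYTES));
}
```

So the file is past the signed 32-bit boundary but below the point where the ZIP format itself requires ZIP64 extensions. A 2 GB problem, if there is one, would have to come from a signed offset somewhere rather than from the format.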
@DimitriPapadopoulos thank you for the detailed write-up!
~~4:53 doesn't sound like forever to me. Also the indicator is still going at 7.50 MiB/s. I am wondering if it's just slow? Do you have a reference for how long it should take to extract?~~
Ah, I see that in the URL case it takes only 30 seconds so something is wrong. I'll have to take a look.
While /path/to is indeed on a network (NFS) share, our workstations have 1 Gb/s network interfaces and our storage infrastructure is a CephFS cluster with quite decent throughput:
$ rsync --progress /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip /tmp/
MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
2,786,688,287 100% 448.76MB/s 0:00:05 (xfr#1, to-chk=0/1)
$
I'm not used to building/running Rust programs, but chances are function extract_zip stalls in our context:
extract_zip
/// `.zip` files archived with compression other than deflate would fail.
pub(crate) fn extract_zip(
    archive: impl AsRef<Path>,
    target_directory: impl AsRef<Path>,
    log_handler: &LoggingOutputHandler,
) -> Result<(), SourceError> {
    let archive = archive.as_ref();
    let target_directory = target_directory.as_ref();
    fs::create_dir_all(target_directory)?;
    let len = archive.metadata().map(|m| m.len()).unwrap_or(1);
    let progress_bar = log_handler.add_progress_bar(
        indicatif::ProgressBar::new(len)
            .with_finish(indicatif::ProgressFinish::AndLeave)
            .with_prefix("Extracting zip")
            .with_style(log_handler.default_bytes_style()),
    );
    let mut archive = zip::ZipArchive::new(progress_bar.wrap_read(
        File::open(archive).map_err(|_| SourceError::FileNotFound(archive.to_path_buf()))?,
    ))
    .map_err(|e| SourceError::InvalidZip(e.to_string()))?;
    let tmp_extraction_dir = tempfile::Builder::new().tempdir_in(target_directory)?;
    archive
        .extract(&tmp_extraction_dir)
        .map_err(|e| SourceError::ZipExtractionError(e.to_string()))?;
    move_extracted_dir(tmp_extraction_dir.path(), target_directory)?;
    progress_bar.finish_with_message("Extracted...");
    Ok(())
}
Could it be that MATLAB_Runtime_R2019b_Update_9_glnxa64.zip is "archived with compression other than deflate"?
Would you be able to try with the file on the same filesystem? It could be related to NFS, after all.
Will try next week.
By the way, the compression method is either defX or stor for all entries in the ZIP file, nothing exotic here:
$ zipinfo -l /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip | grep -v -e ' defX ' -e ' stor '
Archive: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
Zip file size: 2786688287 bytes, number of entries: 5487
5487 files, 2989357849 bytes uncompressed, 2785399227 bytes compressed: 6.8%
$
My workstation was updated from Ubuntu 22.04 to Ubuntu 24.04 a few days ago. I wonder whether a filesystem issue could plague it. After "heavy use" (typically running rattler-build to build from simple but large sources) Google Chrome starts complaining (without reason) about invalid site certificates or identifies other sites as non-existent. I couldn't find anything suspicious in the system logs. I will try on a machine still running Ubuntu 22.04. This might be totally unrelated to rattler-build; it could be a Linux kernel bug.
That sounds strange. rattler-build itself should not modify anything system-wide. Of course, I don't know what the build scripts are doing.
Oh, I mean it wouldn't be a rattler-build issue, rather a Linux kernel bug triggered by something specific to rattler-build operation, perhaps manipulating lots of hardlinks.
The scripts are very simple, they just unzip and don't even test. For example: https://github.com/neurospin/neuro-forge/pull/15/files
My issue was probably a Linux kernel issue, or more generally a system issue. Today, ZIP extraction works just fine, either from the local file system:
│ ╭─ Fetching source code
│ │ Fetching source from path: "/tmp/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip"
│ │ Extracted zip to "/tmp/channel/bld/rattler-build_matlab-runtime_1728884183/work"
│ │
│ ╰─────────────────── (took 32 seconds)
or the NFS share:
│ ╭─ Fetching source code
│ │ Fetching source from path: "/path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip"
│ │ Extracted zip to "/tmp/channel/bld/rattler-build_matlab-runtime_1728885071/work"
│ │
│ ╰─────────────────── (took 31 seconds)
Unfortunately, I am again having freezing issues with path: pointing to an NFS share. Yet, unzipping from that same NFS share works without problem:
$ time unzip /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
Archive: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
inflating: sys/os/glnxa64/libgcc_s.so.1
inflating: sys/os/glnxa64/README.libstdc++
linking: sys/os/glnxa64/libstdc++.so.6 -> libstdc++.so.6.0.22
inflating: sys/os/glnxa64/libstdc++.so.6.0.22
extracting: sys/java/jre/glnxa64/jre/LICENSE
extracting: sys/java/jre/glnxa64/jre/bin/ControlPanel
.
.
.
.
.
inflating: productdata/35212.txt
finishing deferred symbolic links:
sys/os/glnxa64/libstdc++.so.6 -> libstdc++.so.6.0.22
bin/glnxa64/libcrypto.so.1 -> libcrypto-mw.so.1.1
bin/glnxa64/libssl.so.1 -> libssl-mw.so.1.1
real 0m28,605s
user 0m24,789s
sys 0m3,682s
$
I don't see anything relevant in the system logs.
Hmm, maybe we need to use a BufferReader or something like that somewhere ...
@DimitriPapadopoulos it was indeed missing a BufReader: https://github.com/prefix-dev/rattler-build/pull/1144 - I believe this will help nicely in your case.
@wolfv Thank you very much for looking into this issue. I don't know much about Rust, but I understand it provides unbuffered I/O by default and that unbuffered I/O can be slow due to repeated system calls. Yet progress_bar.wrap_read really felt like it was frozen. Anyway, I probably won't have time to test a specific commit, but I will make sure to test the next release. Again, thank you very much.
@DimitriPapadopoulos - the progress bar is just for showing the progress. The main problem was the unbuffered read, which results in many more system calls and is generally slow. I am very sure that this can be exacerbated by slow disks / NFS filesystems. We already had this optimization for the Tar-file reader but missed it for Zip.
I already made the release so you can try out 0.28.2 whenever you have time. I am quite sure that it should give you a decent improvement :)
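For readers less familiar with Rust, the buffered vs. unbuffered difference described above can be illustrated with the standard library alone. This is a minimal sketch, not the code from the PR: `CountingReader` is a hypothetical wrapper that counts calls to `read`, each of which would be one system call (or one NFS round trip) on a real `File`. The 16-byte request size and the 8 KiB buffer are arbitrary choices for the demonstration.

```rust
use std::io::{self, BufReader, Read};

/// Hypothetical helper: wraps a reader and counts how many times `read` is
/// called on it. Each call stands in for one system call on a real `File`.
struct CountingReader<R> {
    inner: R,
    calls: usize,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls += 1;
        self.inner.read(buf)
    }
}

/// Drains `reader` in `chunk`-byte requests, as a parser reading small
/// records might, and returns the total number of bytes read.
fn drain_in_small_chunks<R: Read>(reader: &mut R, chunk: usize) -> io::Result<usize> {
    let mut buf = vec![0u8; chunk];
    let mut total = 0;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        total += n;
    }
}

/// Returns (read calls without buffering, read calls behind an 8 KiB BufReader).
fn compare_read_calls(data: &[u8], chunk: usize) -> io::Result<(usize, usize)> {
    let mut raw = CountingReader { inner: data, calls: 0 };
    drain_in_small_chunks(&mut raw, chunk)?;

    let mut buffered = BufReader::with_capacity(
        8 * 1024,
        CountingReader { inner: data, calls: 0 },
    );
    drain_in_small_chunks(&mut buffered, chunk)?;

    Ok((raw.calls, buffered.get_ref().calls))
}

fn main() -> io::Result<()> {
    let data = vec![0u8; 64 * 1024];
    let (raw, buffered) = compare_read_calls(&data, 16)?;
    println!("unbuffered read calls: {raw}, buffered read calls: {buffered}");
    Ok(())
}
```

With the buffer in place, the small requests are served from memory and the underlying reader only sees a handful of large reads. On a local disk the extra calls are cheap; on a network filesystem each one can cost a round trip, which is presumably why the effect is so much more visible there.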
Just upgraded to 0.28., it's still slow. The throughput shown by the progress bar keeps dropping forever:
╭─ Running build for recipe: matlab-runtime-9.7-9-hb0f4dca_0
│
│ ╭─ Fetching source code
│ │ Fetching source from path: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
│ │ ⠦ Extracting zip [00:00:13] [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╾] 2.54 GiB @ 195.72 MiB/s
│ │ Fetching source from path: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
│ │ ⠉ Extracting zip [00:01:13] [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╾─] 2.49 GiB @ 34.93 MiB/s
│ │ Fetching source from path: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
│ │ ⠦ Extracting zip [00:33:10] [━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╾─────────] 1.95 GiB @ 1.00 MiB/s
argh. Just to be sure - 0.28.2, right?
Yes, it's 0.28.2 (I forgot to copy/paste the output of --version):
$ rattler-build --version
rattler-build 0.28.2
$
When I start rattler-build, I see:
- a surge of CPU use (with one proc at 100 %) without much network traffic,
- then (receiving) network traffic kicks in and eventually oscillates well under 1000 KiB/s and CPU use drops to almost nothing (see screen capture), while the throughput displayed by rattler-build drops drastically,
- when I forcibly stop rattler-build with Ctrl+C, network traffic immediately drops to 0.
In short, at the system level, network resources are not used as they should. When running unzip /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip, receiving network traffic steadily peaks at ~ 80 MiB/s which is consistent with the 1 Gb/s link of the workstation.
Nothing in the system logs.
Note that file /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip is > 2 GB, but then it's not a problem with path: /tmp/... or url: https://....
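The throughput figures quoted above can be turned into expected extraction times with a bit of arithmetic. The sketch below uses the archive size reported by rsync and zipinfo in this thread; the helper names are made up for the illustration, and it only models raw transfer time, not decompression.

```rust
/// Archive size in bytes, as reported by rsync and zipinfo in this thread.
const ARCHIVE_BYTES: f64 = 2_786_688_287.0;
const MIB: f64 = 1024.0 * 1024.0;

/// Seconds needed to move `bytes` at `mib_per_s` MiB/s.
fn seconds_at(bytes: f64, mib_per_s: f64) -> f64 {
    bytes / (mib_per_s * MIB)
}

/// Converts a link speed in Gb/s (decimal gigabits) to MiB/s.
fn gbit_to_mib_per_s(gbit: f64) -> f64 {
    gbit * 1e9 / 8.0 / MIB
}

fn main() {
    println!("1 Gb/s link ceiling: {:.1} MiB/s", gbit_to_mib_per_s(1.0));
    println!("at 80 MiB/s: {:.0} s", seconds_at(ARCHIVE_BYTES, 80.0));
    println!("at 1 MiB/s: {:.0} min", seconds_at(ARCHIVE_BYTES, 1.0) / 60.0);
}
```

A 1 Gb/s link tops out at roughly 119 MiB/s, so ~80 MiB/s is a plausible real-world rate, and at that rate the archive transfers in about 33 seconds, in line with the unzip timings. At the ~1 MiB/s shown by the progress bar, the same transfer would take about 44 minutes.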
Where is your output folder located and the corresponding src_cache folder? Is that also on the network drive?
I am not really sure what we're doing wrong .. I had high hopes for the BufReader! :)
The output dir is /tmp/channel, it's the local disk.
Now about the cache. We used to have home dirs on NFS servers, but that's not the case any more. Besides, even with home dirs on NFS servers, we used to point the environment variable XDG_CACHE_HOME to local disk. Where it gets interesting is that I run rattler-build through the script of a colleague, which executes env HOME=/tmp/channel rattler-build in an effort to make doubly sure the cache is local. Let me try to slim that down:
Initial command:
env HOME=/tmp/channel rattler-build build -r /local/disk/recipes/matlab-runtime-9.7 --output-dir /tmp/channel --experimental -c conda-forge -c bioconda
Slimmed-down command:
rattler-build build -r /local/disk/recipes/matlab-runtime-9.7 --output-dir /tmp/channel -c conda-forge
Unfortunately it remains as slow as before. I'm not sure how to further investigate. Do you have a Rust code snippet that unzips a file I could try to build and test locally? I wouldn't be surprised if it were a Rust bug.
What does progress_bar.wrap_read really do? Could it be that it somehow adversely affects disk reads?
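As a rough mental model for the question above: a progress wrapper around a reader typically forwards each read unchanged and only updates a counter. The sketch below is not indicatif's actual implementation; `ProgressReader` and `bytes_seen` are hypothetical names, and a plain counter stands in for the progress bar.

```rust
use std::io::{self, Read};

/// Hypothetical stand-in for a progress-reporting reader: it forwards every
/// read to the inner reader and records how many bytes came back.
struct ProgressReader<R> {
    inner: R,
    bytes_seen: u64,
}

impl<R: Read> Read for ProgressReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.inner.read(buf)?; // the read request itself is unchanged
        self.bytes_seen += n as u64; // only bookkeeping is added
        Ok(n)
    }
}

/// Reads all of `data` through the wrapper; returns the bytes and the counter.
fn read_all_with_progress(data: &[u8]) -> io::Result<(Vec<u8>, u64)> {
    let mut reader = ProgressReader { inner: data, bytes_seen: 0 };
    let mut out = Vec::new();
    reader.read_to_end(&mut out)?;
    Ok((out, reader.bytes_seen))
}

fn main() -> io::Result<()> {
    let (out, seen) = read_all_with_progress(b"hello zip")?;
    println!("read {} bytes, progress counter at {}", out.len(), seen);
    Ok(())
}
```

Under that model the wrapper neither adds buffering nor changes the access pattern: whatever request sizes the ZIP reader issues reach the underlying file as they are. In the extract_zip excerpt quoted earlier, the File is passed straight to wrap_read, so any slowness would come from that access pattern rather than from the progress bar itself.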
I kicked off a build that you could try for debugging: https://github.com/prefix-dev/rattler-build/pull/1146 ...
And when you run unzip locally, you also extract to that same /tmp/... folder?
You can find the binaries here: https://github.com/prefix-dev/rattler-build/actions/runs/11594061424?pr=1146
I unzip in /tmp:
$ mkdir /tmp/channel
$
$ cd /tmp/channel/
$
$ time unzip /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
Archive: /path/to/matlab-runtime/MATLAB_Runtime_R2019b_Update_9_glnxa64.zip
inflating: sys/os/glnxa64/libgcc_s.so.1
inflating: sys/os/glnxa64/README.libstdc++
.
.
.
inflating: productdata/35212.txt
finishing deferred symbolic links:
sys/os/glnxa64/libstdc++.so.6 -> libstdc++.so.6.0.22
bin/glnxa64/libcrypto.so.1 -> libcrypto-mw.so.1.1
bin/glnxa64/libssl.so.1 -> libssl-mw.so.1.1
real 0m44,915s
user 0m30,763s
sys 0m6,814s
$
I do see a x86_64-unknown-linux-musl build, but am not sure how to install/run locally (I am new to Rust). Is it as simple as git clone and cargo build?
EDIT: Ah, just found the binaries.