
Kubler download_portage_snapshot(): dl_name based on $_TODAY can differ from the upstream snapshot name due to timezone differences

Open berney opened this issue 3 years ago • 0 comments

The distfiles.gentoo.org mirror that hosts the portage snapshots has a portage-latest.tar.xz and portage-YYYYMMDD.tar.xz files (plus .bz2 variants). The portage-latest.tar.xz is identical to the latest portage-YYYYMMDD.tar.xz.

The function download_portage_snapshot() downloads the portage snapshot, with $PORTAGE_DATE defaulting to latest. In that case it fetches portage-latest.tar.xz and saves it under a $dl_name derived from $_TODAY. Because of timezone differences, the local file can end up named portage-20220914.tar.bz2 when the equivalent file on the server is portage-20220913.tar.bz2.

If upstream later releases a portage-20220914.tar.bz2 and $PORTAGE_DATE is set locally to 20220914, Kubler will not download the new snapshot, because a file with that name already exists locally ("wrongly" named), even though the two files differ.
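The date mismatch can be reproduced without touching the mirror. This is a minimal sketch, assuming $_TODAY is derived from the local date in YYYYMMDD form and that GNU date is available (the `-d @epoch` form is a GNU extension). The epoch value and the two TZ strings are illustrative choices, not anything Kubler uses.

```shell
# Illustrative instant: 1663110000 is 2022-09-13 23:00:00 UTC.
epoch=1663110000

# Date as seen by a host running in UTC.
utc_day="$(TZ=UTC0 date -d "@${epoch}" +%Y%m%d)"

# Date as seen by a host 10 hours east of UTC (POSIX TZ string, no tzdata needed).
local_day="$(TZ=AEST-10 date -d "@${epoch}" +%Y%m%d)"

echo "UTC host would name it:   portage-${utc_day}.tar.xz"
echo "UTC+10 host would name it: portage-${local_day}.tar.xz"
```

At that instant the two hosts disagree by one day (20220913 vs 20220914), which is the same kind of gap as between the local name and the name on the server.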

I'm working on running Kubler in CI/CD, caching downloads and other files to speed things up, and I want consistent behaviour between runs. If I run one build before midnight and another after midnight, and the distfiles mirror has not changed in between, the second run of kubler will download the same portage-latest.tar.xz file but name it differently.

In CI/CD I want consistency, and I generally want things up to date. I like the default of latest, but I want the local name to match the remote name.

I wrote a kubler cmd to get the latest portage filename.

#!/usr/bin/env bash

# Based off lib/core.sh `fetch_stage3_archive_name()`
# Fetch the latest portage snapshot archive name/type; returns exit code 3 if no archive could be found
function fetch_portage_archive_name() {
    __fetch_portage_archive_name=
    local portage_url portage_regex remote_files remote_line remote_date remote_file_type max_cap
    portage_url="http://distfiles.gentoo.org/snapshots/"
    readarray -t remote_files <<< "$(wget -qO- "${portage_url}")"
    remote_date=0
    get_stage3_archive_regex "portage"
    # shellcheck disable=SC2154
    portage_regex="$__get_stage3_archive_regex"
    for remote_line in "${remote_files[@]}"; do
        if [[ "${remote_line}" =~ href=\"${portage_regex}\" ]]; then
            max_cap="${#BASH_REMATCH[@]}"
            is_newer_stage3_date "${remote_date}" "${BASH_REMATCH[$((max_cap-3))]}${BASH_REMATCH[$((max_cap-2))]}" \
                && { remote_date="${BASH_REMATCH[$((max_cap-3))]}${BASH_REMATCH[$((max_cap-2))]}";
                     remote_file_type="${BASH_REMATCH[$((max_cap-1))]}"; }
            # We keep going to find the latest rather than the first
        fi
    done
    [[ "${remote_date//[!0-9]/}" -eq 0 ]] && return 3
    __fetch_portage_archive_name="portage-${remote_date}.tar.${remote_file_type}"
}

function main() {
    #echo "kubler dir: ${_KUBLER_DIR}"
    #echo "current namespace: ${_NAMESPACE_DIR}"

    #echo "Finding latest portage"
    # We are abusing `fetch_stage3_archive_name()`
    ## shellcheck disable=SC2034
    #STAGE3_BASE="portage"
    ## shellcheck disable=SC2034
    #ARCH_URL="http://distfiles.gentoo.org/snapshots/"
    ## This will find the first
    #fetch_stage3_archive_name
    ## shellcheck disable=SC2154
    #echo "$__fetch_stage3_archive_name"

    # This will find the latest
    fetch_portage_archive_name
    echo "$__fetch_portage_archive_name"
}

main "$@"

This works, and I could use it to set $PORTAGE_DATE to get consistent behaviour.

$ kubler portage
portage-20220907.tar.bz2 <-- fetch_stage3_archive_name abuse
portage-20220914.tar.bz2 <-- fetch_portage_archive_name variant
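Turning the printed file name into a value for $PORTAGE_DATE needs only parameter expansion. A rough sketch follows; the variable names `archive_name` and `snapshot_date` are made up for illustration, and how the value is then handed to Kubler (kubler.conf or the environment) is left open here.

```shell
# Hypothetical input: in practice this would be the last line printed by `kubler portage`.
archive_name="portage-20220914.tar.bz2"

# Strip the leading "portage-" prefix.
snapshot_date="${archive_name#portage-}"

# Strip everything from ".tar." onwards, whatever the archive type is.
snapshot_date="${snapshot_date%%.tar.*}"

echo "PORTAGE_DATE='${snapshot_date}'"
```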

I think it would be good to change Kubler's behaviour to download the latest YYYYMMDD portage snapshot rather than downloading and renaming portage-latest.
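For illustration, the lookup of the latest dated snapshot can also be written without Kubler's stage3 helpers. The function below is a hypothetical standalone sketch, not Kubler code: it reads an HTML directory listing on stdin and assumes the links look like `href="portage-YYYYMMDD.tar.xz"`, as in the listing the command above parses. It relies on `grep -o`, which is widely available but not strictly POSIX.

```shell
# Hypothetical helper: print the newest dated snapshot name found in a
# directory listing read from stdin. $1 is the archive type (default: xz).
# Returns 3 if no dated snapshot of that type is listed.
latest_snapshot_name() {
    local ext="${1:-xz}" newest
    # Keep only links to dated archives of the requested type, reduce each
    # to its YYYYMMDD part, and take the highest one. Fixed-width dates
    # sort correctly as plain strings.
    newest="$(grep -oE "href=\"portage-[0-9]{8}\.tar\.${ext}\"" \
        | sed -E 's/^href="portage-([0-9]{8})\..*$/\1/' \
        | sort | tail -n 1)"
    [ -n "${newest}" ] || return 3
    printf 'portage-%s.tar.%s\n' "${newest}" "${ext}"
}

# Offline sample listing (made-up excerpt) to try the function on.
sample_listing='<a href="portage-20220912.tar.xz">portage-20220912.tar.xz</a>
<a href="portage-20220913.tar.bz2">portage-20220913.tar.bz2</a>
<a href="portage-20220913.tar.xz">portage-20220913.tar.xz</a>
<a href="portage-20220913.tar.xz.md5sum">portage-20220913.tar.xz.md5sum</a>
<a href="portage-latest.tar.xz">portage-latest.tar.xz</a>'

printf '%s\n' "${sample_listing}" | latest_snapshot_name xz
```

Against the real mirror the input would come from `wget -qO- http://distfiles.gentoo.org/snapshots/`. Because the result is taken from the server's own file names, the local name no longer depends on the local clock.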

It might be worth refactoring fetch_stage3_archive_name() into a generic version that optionally exits on the first match (current behaviour) or continues to the latest match (needed for portage snapshots), and generalising the name of get_stage3_archive_regex().

I would also like an option to choose which archive type is preferred, bz2 or xz.
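One possible shape for such an option is a preference list with fallback. The function name `pick_preferred_archive` and the idea of passing the order as a space-separated string are assumptions made for this sketch; nothing like it exists in Kubler as far as this issue is concerned.

```shell
# Hypothetical helper: $1 is a space-separated preference order of archive
# types, the remaining arguments are candidate file names. Prints the first
# candidate matching the most preferred type; returns 3 if none match.
pick_preferred_archive() {
    local prefs="$1" ext name
    shift
    # Unquoted on purpose: split the preference string into words.
    for ext in ${prefs}; do
        for name in "$@"; do
            case "${name}" in
                *".tar.${ext}") printf '%s\n' "${name}"; return 0 ;;
            esac
        done
    done
    return 3
}

# Prefer bz2, fall back to xz if no bz2 archive is offered.
pick_preferred_archive "bz2 xz" portage-20220914.tar.xz portage-20220914.tar.bz2
```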

berney, Sep 15 '22 17:09