peppy
peppy copied to clipboard
Feature request: `peppy.Project.config.to_yaml()` doesn't accurately reproduce original config file
Overview
Over at PEPhub, I am implementing a feature to download a PEP as a zipped folder. To do this, I am recapitulating the PEP using the internal state of the peppy.Project
object returned from pepagent
and the database. However, the peppy.Project.config
object saves file path configurations as relative to where the peppy.Project
object was instantiated, rather than the actual value in the file. This becomes problematic when downloading PEPs since the file path in the project_config.yaml
file and the actual file location are now out of sync. Here is an example:
Example
The config file:
cat examples/demo/basic/project_config.yaml
pep_version: "2.0.0"
sample_table: sample_table.csv
At the root of pephub (/
):
>>> import peppy
>>> p = peppy.Project("examples/demo/basic/project_config.yaml")
>>> p.config
pep_version: 2.0.0
sample_table: examples/demo/basic/sample_table.csv
Inside the PEP folder (/examples/demo/basic/project_config.yaml
):
>>> import peppy
>>> p = peppy.Project("project_config.yaml")
>>> p.config
pep_version: 2.0.0
sample_table: sample_table.csv
Note how the sample_table
attribute is changing depending on where the PEP was instantiated
I suspect peppy
uses this state internally to initialize things and it was never intended that the original config file would need to be reproduced. This is less of a bug, I guess, and more of a feature request. I could manually update the internal state of peppy.Project
and just strip the relative paths using os.path.basename
, but it would be nice to have a method that can recreate the PEP assuming all files are next to one another.
Supplemental Material :D
Here is the zipping function where I build back up the PEP:
def zip_pep(project: peppy.Project) -> Response:
"""Zip a project up to download"""
content_to_zip = {}
if project.config:
cfg_filename = basename(project.config_file)
content_to_zip[cfg_filename] = project.config.to_yaml()
if project.sample_table is not None:
sample_table_filename = basename(project.to_dict().get('sample_table', "sample_table.csv"))
content_to_zip[sample_table_filename] = project.sample_table.to_csv()
if project.subsample_table is not None:
# sometimes the subsample table is a list. So change behavior
# based on this
if not isinstance(project.subsample_table, list):
subsample_table_filename = basename(project.to_dict().get('subsample_table', "subsample_table.csv"))
content_to_zip[subsample_table_filename] = project.subsample_table.to_csv()
else:
subsample_table_filenames = project.to_dict().get('subsample_table', "subsample_table.csv")
for sstable, sstable_filename in zip(project.subsample_table, subsample_table_filenames):
subsample_table_filename = basename(sstable_filename)
content_to_zip[subsample_table_filename] = **sstable.to_csv()**
Hmm.
I'm a little surprised the peppy adjusts the content in the config file, to be honest.
I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?
@Khoroshevskyi with your experience with all the to_dict stuff, can you comment on this?
Hmm.
I'm a little surprised the peppy adjusts the content in the config file, to be honest.
I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?
@Khoroshevskyi with your experience with all the to_dict stuff, can you comment on this?
to_dict
function will take all the content of _config
and _sample_df
variables, it won't change anything. But I agree with @nleroy917 , that this variables are changing during peppy processing.
It won't influence on methods that are initializing peppy from dict, but it will initial paths, that can cause errors in the future (e.g. in this Nathans issue).
I think instead, peppy should use an internal value where it keeps track of where it actually finds the sample tables, rather than updating the config. @stolarczyk can you comment on this?
Yeah, this seems like a sensible approach. I'm not sure what was my motivation for implementing it the other way. I would have done what you suggest if this issue had manifested earlier.