conp-dataset
Size or other cut-off points for building git-annex links
Right now, our general policy when ingesting new datasets is to build a git-annex link to every file in a dataset by default, with a couple of specific exceptions (README.md and DATS.json). However, the utility of building links rather than just storing small files directly in GitHub is questionable. In tests with the microstructure_informed_connectomics dataset, which contains ~11,300 files, building git-annex links to every file took nearly twice as long as building links only to files larger than a 200 kB cut-off (estimated by manual examination of some subdirectories) and downloading the rest directly.
Do we want to consider size-based or other criteria (such as storing all text files directly) for deciding which files get git-annex links?
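To make the proposal concrete, here is a minimal sketch of a size-based split, not the actual ingestion code. The function names `needs_annex_link` and `partition_dataset` are hypothetical, and it assumes the 200 kB figure means 200,000 bytes and that README.md and DATS.json always stay directly in git.

```python
from pathlib import Path

# Assumed values taken from this issue; adjust per dataset.
SIZE_CUTOFF_BYTES = 200 * 1000              # the ~200 kB cut-off discussed above
ALWAYS_IN_GIT = {"README.md", "DATS.json"}  # current exceptions to annexing


def needs_annex_link(path: Path, cutoff: int = SIZE_CUTOFF_BYTES) -> bool:
    """Return True if the file should get a git-annex link (hypothetical rule)."""
    if path.name in ALWAYS_IN_GIT:
        return False
    return path.stat().st_size > cutoff


def partition_dataset(root: Path, cutoff: int = SIZE_CUTOFF_BYTES):
    """Split the files under root into (annexed, direct) lists by size."""
    annexed, direct = [], []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the repository's .git folder.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        (annexed if needs_annex_link(path, cutoff) else direct).append(path)
    return annexed, direct
```

Calling `partition_dataset(Path("path/to/dataset"))` would return the files to link and the files to commit directly. git-annex also has an `annex.largefiles` setting that may be able to express a similar rule natively; whether it fits our ingestion workflow has not been checked here.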
This might be tricky for datasets that require third-party accounts since small files can still include data that should not be in the open.
For fully open datasets, I don't see the harm in doing that. @emmetaobrien maybe this could be added to the agenda for next week, or the week after if we run out of time, since we might be doing the roadmap planning?
Indeed, I was only thinking of this as applying to open datasets.
This issue is stale because it has been open 5 months with no activity. Remove stale label or comment or this will be closed in 3 months.
This issue was closed because it has been stalled for 3 months with no activity.