SPIKE: Improve how Dataverse labels shapefiles to prevent mislabelling of zip files that aren't shapefiles
A depositor uploaded a double-zipped file into a dataset in the Harvard Dataverse Repository, and the file has been incorrectly labelled as a "Shapefile as ZIP Archive".
The file is in the published dataset at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/HWVUER.
There are no shape files in the zip file and the depositor wrote that it isn't a shapefile. The depositor also wrote that they used the UI (their Chrome browser) to upload the file (and not the Dataverse API). The email conversation with the depositor is at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322790.
The file needs to be correctly labelled as a "ZIP Archive". Having it labelled as a "Shapefile as ZIP Archive" might be confusing to anyone looking to download the data.
I tried the redetect file type API endpoint. It reported that it worked, but the file is still labelled as a "Shapefile as ZIP Archive".
Lastly, I downloaded the zip file, double-zipped it again, and uploaded it to Demo Dataverse to see if Demo would label it as a "Shapefile as ZIP Archive". It did. (That dataset was deleted along with other datasets older than 30 days.)
A .zip file would get labeled as a Shapefile if any of the included files has an extension in ["shp", "shx", "dbf", "prj"]. I can't see your example file. Does it have one of these? If so, we could/should tighten up the logic to test for all four, since all four are required and someone may have a .prj or another file with one of these extensions for some other reason. If there are no files with these extensions, then something else is happening.
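To illustrate the tightening suggested above, here's a hypothetical Python sketch. The actual detection lives in the Dataverse Java code; the function names and the "all four extensions sharing one base name" rule below are my own illustration of the idea, not the real implementation:

```python
import io
import zipfile
from collections import defaultdict

SHAPEFILE_PARTS = {"shp", "shx", "dbf", "prj"}

def has_any_shapefile_part(names):
    # Current behavior as described above: one matching extension anywhere
    # in the zip is enough to trigger the "Shapefile as ZIP Archive" label.
    return any(n.rsplit(".", 1)[-1].lower() in SHAPEFILE_PARTS
               for n in names if "." in n)

def has_complete_shapefile(names):
    # Tightened check: require all four extensions sharing one base name.
    parts = defaultdict(set)
    for n in names:
        if "." in n:
            base, ext = n.rsplit(".", 1)
            if ext.lower() in SHAPEFILE_PARTS:
                parts[base].add(ext.lower())
    return any(exts == SHAPEFILE_PARTS for exts in parts.values())

# An in-memory zip resembling the mislabelling case: a stray .prj, no full set.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("replication/notes.prj", "not actually a projection file")
    z.writestr("replication/data.csv", "a,b\n1,2\n")
names = zipfile.ZipFile(buf).namelist()

print(has_any_shapefile_part(names))  # True: would be labelled a shapefile today
print(has_complete_shapefile(names))  # False: tightened check would not
```

With the tightened check, a zip containing a complete set (e.g. `pointZ.shp`, `pointZ.shx`, `pointZ.dbf`, `pointZ.prj`) would still be detected, while a lone matching extension would not.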
Hi @qqmyers. There are no shape files in the zip file.
@pdurbin found files like “pointZ.dbf pointZ.prj pointZ.shp pointZ.shx” in a hidden directory inside of the zip file. "They seem to come from an R package called “maptools”. The path in the zip is replication/rpkgs/.checkpoint/2020-07-30/lib/x86_64-w64-mingw32/4.0.2/maptools/shapes."
The depositor wrote that "the zip file does not contain any shape files." I'm not sure if the depositor's scripts use the maptools package. I've asked the depositor:
- Is the depositor using that R package? Or was it just imported and not used in the R code? If it isn't being used and can be removed, maybe the depositor can leave that R package out of the zip file, and then the Dataverse software won't label the zip file as a "Shapefile as ZIP Archive".
- Should the use of that R package make the zip file a "Shapefile as ZIP Archive"?
I haven't heard from the depositor yet. I just sent a follow-up email. I also took a look at the R files in the zip file and didn't see the maptools package being imported, but I'm not very familiar with R, so I asked the depositor some clarifying questions about that too.
The depositor let me know that they don't think they directly used maptools in the replication, but that it's possible other packages depend on maptools, and it's tough to figure out which packages require which, so they'd rather keep the files as they are.
This sounds to me like the zip file as a whole is not a shapefile, and it shouldn't be labelled as one merely because a library bundled with the code includes shapefiles in a hidden directory. Could the file type detection in the Dataverse software be adjusted so that it doesn't label this zip file, and others like it, as "Shapefile as ZIP Archive"?
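One possible adjustment, sketched hypothetically in Python (this is not the actual Dataverse implementation, just an illustration of the idea), would be to ignore zip entries that sit under a hidden directory, like the `.checkpoint` path above, when deciding whether the zip is a shapefile:

```python
def is_hidden_path(name):
    # True if any path component (directory or the file itself) starts with
    # a dot, e.g. the ".checkpoint" directory the maptools shapes live under.
    return any(part.startswith(".") for part in name.split("/") if part)

entries = [
    "replication/rpkgs/.checkpoint/2020-07-30/lib/x86_64-w64-mingw32/"
    "4.0.2/maptools/shapes/pointZ.shp",
    "replication/analysis.R",
]

# Only non-hidden entries would be considered by shapefile detection.
visible = [e for e in entries if not is_hidden_path(e)]
print(visible)  # ['replication/analysis.R']
```

This would leave detection unchanged for zips whose shapefile components are in ordinary directories, while skipping files that R's package-caching tooling tucks away in dot-directories.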
I remember hearing that the API lets us upload files and specify any file type. I thought that could be a workaround for this depositor, so I tested it on Demo Dataverse:
```shell
curl -H "X-Dataverse-key:$API_TOKEN" -X POST -F "file=@$FILENAME;type=application/zip" "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID"
```
But it doesn't work for this zip file. The uploaded file is still labelled as "Shapefile as ZIP Archive", and the response in my terminal shows `"contentType":"application/zipped-shapefile"`. (The override does work for a PNG file I tried.)
@mreekie, in https://github.com/IQSS/dataverse/issues/8816 we wrote about planning to talk with others who know more about the preservation and use of shapefiles. I'm wondering if those folks can also weigh in on this.
Moving this out of the Harvard Dataverse Repository GitHub repo and into the Dataverse software GitHub repo.
I gave this a 10. Hopefully it's straightforward and we have a file to try to reproduce the problem. ^^
We most recently touched this code here:
- #10305