Dataverse: Advanced Search cannot find files by original file type (files that have been converted to tab)

When uploading a csv file, if it is well formed, Dataverse converts it to a "tab" file. But if a user searches for that file with "File Type = csv" (or any original file type that has been changed to tab), it will not be found. It is found with "File Type = tab".

Actually, searching for "csv" in File Type doesn't find any files at all (even though "csv" is one of the example extensions in the tooltip). And there are some csv files: if Dataverse cannot convert them to tab, they stay csv.

Dec 11 '15 22:12 shlake

@shlake um, sorry for the slow response on this issue. 😄 Please check out this related issue I opened last week about searching file types:

#3597

Feb 02 '17 01:02 pdurbin

I changed the above link to #3597, from #2822 (this issue). That's the one you meant, right @pdurbin?

Feb 07 '17 21:02 jggautier

@jggautier yes, thanks! Also, this new issue is related:

#3952

Jun 23 '17 13:06 pdurbin

I would not really describe #3952 as related. It also deals with the "original format type", but it was something we broke unintentionally in the last release, in a way that's unrelated to the feature requested in this issue.

As for this issue: once a file is ingested as tabular data, its content type changes to "text/tab-separated-values", by design, yes. So no, you cannot search on "file type = csv" to locate tab files produced from csv files.

However, that original format ("text/csv", "application/x-stata", etc.) is preserved in the database in the table Datatable, as "originalfileformat". So in the database, you can search for all the tab files produced from CSV by something like

SELECT f.id FROM datafile f, datatable t WHERE t.datafile_id = f.id AND t.originalfileformat LIKE '%csv%';

BUT we are not indexing this information in Solr, so we can't search on it, either via the API or via the "advanced search" form. (And that is what the original requester is asking for.)

It would be very easy to start indexing it. We would just need to add an extra field to the schema. And add one line to IndexServiceBean.java, something like:

datafileSolrInputDocument.addField("OriginalFileType", fileMetadata.getDataFile().getDataTable().getOriginalFileFormat());
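
One thing to watch with the one-liner above: files that were never ingested as tabular data presumably have no DataTable, so the chained call could throw a NullPointerException. Here is a minimal, self-contained sketch of a null-safe version. Everything in it is a stand-in for illustration: the DataFile and DataTable classes are stripped-down hypothetical versions of the real entities, a plain Map plays the role of the Solr input document, and the helper name addOriginalFileType is made up. It also assumes that getDataTable() returns null for non-tabular files, which I have not verified.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OriginalFileTypeIndexingSketch {

    // Hypothetical stand-in for the DataTable entity (only the field we need).
    static class DataTable {
        private final String originalFileFormat;
        DataTable(String originalFileFormat) { this.originalFileFormat = originalFileFormat; }
        String getOriginalFileFormat() { return originalFileFormat; }
    }

    // Hypothetical stand-in for the DataFile entity.
    // Assumption: dataTable is null when the file was never ingested as tabular data.
    static class DataFile {
        private final DataTable dataTable;
        DataFile(DataTable dataTable) { this.dataTable = dataTable; }
        DataTable getDataTable() { return dataTable; }
    }

    // Hypothetical helper: adds the field only when an original format is recorded.
    // The Map stands in for the Solr input document.
    static void addOriginalFileType(Map<String, Object> solrDoc, DataFile dataFile) {
        DataTable table = dataFile.getDataTable();
        if (table == null) {
            return; // not a tabular file, nothing to index
        }
        String format = table.getOriginalFileFormat();
        if (format != null && !format.isEmpty()) {
            solrDoc.put("OriginalFileType", format);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> ingested = new LinkedHashMap<>();
        addOriginalFileType(ingested, new DataFile(new DataTable("text/csv")));
        System.out.println(ingested);

        Map<String, Object> plain = new LinkedHashMap<>();
        addOriginalFileType(plain, new DataFile(null));
        System.out.println(plain);
    }
}
```

In IndexServiceBean.java the same guard would wrap the addField call; the field name "OriginalFileType" is just the one suggested above and would have to match whatever gets added to the schema.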

We must have just missed this request, so it sat around unaddressed for a while.

(While we are at it, we should think about whether there's anything else specific to tabular data that we are not indexing either... but I can't think of anything right away.)

Jun 23 '17 18:06 landreev

@landreev ah, great to know that the data is in the database and that all we need to do is index it, which, as you say, should be a trivial change. Thanks!

Jun 26 '17 00:06 pdurbin