sgn
Generic Data File Parser
Expected Behavior
Create a generic data file parser, usable by all of the file upload functions, that supports .csv, .xls, .xlsx, and .txt (tab-separated) files for any upload. A few people have requested CSV uploads because of problems and limitations with Excel.
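As a rough illustration of the dispatch such a parser needs, the sketch below maps a file extension to a parser type. The function name `detect_file_type` and the type names `csv`, `excel`, and `tsv` are assumptions made for this example; they are not taken from the branch.

```perl
use strict;
use warnings;

# Hypothetical mapping from file extension to parser type. The type names
# are illustrative only and are not taken from CXGN::File::Parse.
my %TYPE_FOR_EXTENSION = (
    csv  => 'csv',
    xls  => 'excel',
    xlsx => 'excel',
    txt  => 'tsv',     # .txt is treated as tab-separated
);

# Return the parser type for a file path, or undef if the extension
# is missing or unsupported.
sub detect_file_type {
    my ($path) = @_;
    my ($ext) = $path =~ /\.([^.\/\\]+)$/;
    return unless defined $ext;
    return $TYPE_FOR_EXTENSION{ lc $ext };
}

for my $file (qw(data.csv trial.XLSX notes.txt image.png)) {
    my $type = detect_file_type($file) // 'unsupported';
    print "$file => $type\n";
}
```

Matching on the lowercased extension keeps uploads such as `trial.XLSX` working. A real implementation might also inspect the file contents rather than trusting the name.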
I've started working on this on the topic/generic_file_parser branch.
There is a new CXGN::File::Parse class that can be used to parse any of the supported file types into a uniform parsed data format.
For example:
my $parser = CXGN::File::Parse->new(
    file => '/home/production/public/data.csv',
    required_columns => [ 'accession_name', 'species_name' ],
    column_aliases => {
        'accession_name' => [ 'accession', 'name' ],
        'species_name' => [ 'species' ]
    },
    column_arrays => [ 'synonym', 'organization_name' ]
);
my $parsed = $parser->parse();
my $errors = $parsed->{errors};
my $columns = $parsed->{columns};
my $data = $parsed->{data};
my $values = $parsed->{values};
will return:
- errors: an array of error messages encountered during file read / parsing
  - problems with opening the file (file doesn't exist, error from type-specific perl module)
  - missing required columns (when the required columns are specified)
  - rows with no values for required columns
- columns: an array of the column headers in the file
- data: an array of hashes, where each array item is one row of the input file
- values: a hash of the unique values for each column
  - for columns specified in the column_arrays argument, the value will be split by the delimiter (',' by default) and returned as an array
  - for columns specified in the …
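The header and required-column checks described in the list can be sketched in plain Perl as follows. This is a minimal sketch, not the code on the branch: the helper names `resolve_headers` and `check_required`, the exact-match alias lookup, and the wording of the error messages are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Map each header found in the file to its canonical column name, using
# an alias table shaped like the column_aliases argument.
# Assumption: aliases are matched exactly (case-sensitive).
sub resolve_headers {
    my ($headers, $aliases) = @_;
    my %canonical_for;
    for my $name (keys %$aliases) {
        $canonical_for{$_} = $name for ($name, @{ $aliases->{$name} });
    }
    return [ map { $canonical_for{$_} // $_ } @$headers ];
}

# Collect error messages for missing required columns and for rows that
# have no value in a required column. The message text is hypothetical.
sub check_required {
    my ($columns, $rows, $required) = @_;
    my %present = map { $_ => 1 } @$columns;
    my @errors;
    for my $col (@$required) {
        if (!$present{$col}) {
            push @errors, "Required column $col is missing from the file";
            next;
        }
        for my $row (@$rows) {
            my $value = $row->{$col};
            push @errors, "Row $row->{_row}: no value for required column $col"
                unless defined $value && $value ne '';
        }
    }
    return \@errors;
}

my $columns = resolve_headers(
    [ 'accession', 'synonym' ],
    { accession_name => [ 'accession', 'name' ], species_name => [ 'species' ] },
);
my $rows = [
    { _row => 2, accession_name => 'ACC1', synonym => 'a,b' },
    { _row => 3, accession_name => '',     synonym => undef },
];
my $errors = check_required($columns, $rows, [ 'accession_name', 'species_name' ]);
print "$_\n" for @$errors;
```

Here the `accession` header is treated as `accession_name`, row 3 is reported for its empty value, and `species_name` is reported as a missing column.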
Example Input:
username | first_name | last_name | email address | organization | address | country | phone | research_keywords | research_interests | webpage |
---|---|---|---|---|---|---|---|---|---|---|
testing1 | Test, Mȧ | Testing123 | [email protected] | Cornell University | | | | | | |
testing2 | Test | Testing456 | [email protected] | Cornell University | | | | | | |
testing3 | John | Testing | [email protected] | Cornell University | | | | | | |
Example Output:
{
    "errors": [],
    "columns": [
        "username",
        "first_name",
        "last_name",
        "email address",
        "organization",
        "address",
        "country",
        "phone",
        "research_keywords",
        "research_interests",
        "webpage"
    ],
    "data": [
        {
            "address": null,
            "email address": "[email protected]",
            "first_name": "Test, Mȧ",
            "country": null,
            "organization": "Cornell University",
            "_row": 2,
            "research_keywords": null,
            "webpage": null,
            "phone": null,
            "research_interests": null,
            "last_name": "Testing123",
            "username": "testing1"
        },
        {
            "phone": null,
            "research_interests": null,
            "last_name": "Testing456",
            "username": "testing2",
            "_row": 5,
            "organization": "Cornell University",
            "webpage": null,
            "research_keywords": null,
            "country": null,
            "email address": "[email protected]",
            "address": null,
            "first_name": "Test"
        },
        {
            "first_name": "John",
            "email address": "[email protected]",
            "address": null,
            "country": null,
            "webpage": null,
            "research_keywords": null,
            "_row": 6,
            "organization": "Cornell University",
            "last_name": "Testing",
            "research_interests": null,
            "username": "testing3",
            "phone": null
        }
    ],
    "values": {
        "country": [],
        "username": [
            "testing2",
            "testing1",
            "testing3"
        ],
        "research_interests": [],
        "last_name": [
            "Testing123",
            "Testing456",
            "Testing"
        ],
        "phone": [],
        "research_keywords": [],
        "webpage": [],
        "first_name": [
            "Test, Mȧ",
            "Test",
            "John"
        ],
        "organization": [
            "Cornell University"
        ],
        "address": [],
        "email address": [
            "[email protected]",
            "[email protected]",
            "[email protected]"
        ]
    }
}
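To illustrate how a values hash like the one in the example output could be derived from the data rows, here is a minimal sketch in plain Perl. The helpers `split_cell` and `unique_values` are hypothetical and do not come from the branch. The sketch keeps values in first-seen order, whereas the ordering in the example output (testing2 before testing1) suggests the real class does not guarantee an order. It needs Perl 5.14 or later for the `/r` substitution flag.

```perl
use strict;
use warnings;

# Split a raw cell into a list of trimmed, non-empty items.
sub split_cell {
    my ($value, $delimiter) = @_;
    return () unless defined $value;
    return grep { $_ ne '' }
           map  { s/^\s+|\s+$//gr }
           split /\Q$delimiter\E/, $value;
}

# Build a hash of unique values per column, in first-seen order.
# Columns listed in $array_columns are split on $delimiter (',' by default);
# undefined and empty cells contribute nothing.
sub unique_values {
    my ($rows, $columns, $array_columns, $delimiter) = @_;
    $delimiter //= ',';
    my %is_array = map { $_ => 1 } @{ $array_columns // [] };
    my %values;
    for my $col (@$columns) {
        my %seen;
        $values{$col} = [];
        for my $row (@$rows) {
            my $cell = $row->{$col};
            my @items = $is_array{$col} ? split_cell($cell, $delimiter)
                      : (defined $cell && $cell ne '') ? ($cell)
                      : ();
            push @{ $values{$col} }, grep { !$seen{$_}++ } @items;
        }
    }
    return \%values;
}

my $rows = [
    { username => 'testing1', organization => 'Cornell University', synonym => 'a, b', phone => undef },
    { username => 'testing2', organization => 'Cornell University', synonym => 'b,c',  phone => undef },
];
my $values = unique_values($rows, [qw(username organization synonym phone)], ['synonym']);
print "$_: @{ $values->{$_} }\n" for qw(username organization synonym phone);
```

As in the example output, a column with no values (`phone` here) maps to an empty array, and a value shared by several rows (`Cornell University`) appears only once.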