sgn Generic Data File Parser

Open dwaring87 opened this issue 8 months ago • 2 comments

Expected Behavior

Create a generic data file parser, usable by all of the file upload functions, that supports .csv, .xls, .xlsx, and .txt (tab-separated) files for any upload. A few people have requested CSV uploads because of problems and limitations with Excel.

I've started working on this on the topic/generic_file_parser branch.

There is a new CXGN::File::Parse class that can be used to parse any of the supported file types into a uniform parsed data format.

For example:

my $parser = CXGN::File::Parse->new(
    file => '/home/production/public/data.csv',
    required_columns => [ 'accession_name', 'species_name' ],
    column_aliases => {
      'accession_name' => [ 'accession', 'name' ],
      'species_name' => [ 'species' ]
    },
    column_arrays => [ 'synonym', 'organization_name' ]
);
my $parsed = $parser->parse();

my $errors = $parsed->{errors};
my $columns = $parsed->{columns};
my $data = $parsed->{data};
my $values = $parsed->{values};

will return:

errors: an array of error messages encountered while reading or parsing the file
- problems opening the file (the file doesn't exist, or the type-specific Perl module reported an error)
- missing required columns (when required columns are specified)
- rows with no values for required columns
columns: an array of the column headers in the file
data: an array of hashes, where each array item is one row of the input file
values: a hash of the unique values for each column
- for columns specified in the column_arrays argument, the value will be split by the delimiter (',' by default) and returned as an array
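To make the `values` and `column_arrays` behavior described above more concrete, here is a minimal sketch of how unique values per column could be collected from the parsed rows. It is not the actual CXGN::File::Parse code: the helpers `split_cell` and `unique_values` are hypothetical, the sample rows are made up, and trimming whitespace, dropping empty items, and keeping first-seen order are all assumptions (the example output in this issue suggests the real ordering is not guaranteed).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CXGN::File::Parse): split one cell into
# trimmed, non-empty items using the delimiter (',' when none is given).
sub split_cell {
    my ($value, $delimiter) = @_;
    return () unless defined $value;
    $delimiter = ',' unless defined $delimiter;
    my @items;
    for my $item (split /\Q$delimiter\E/, $value) {
        $item =~ s/^\s+|\s+$//g;               # assumption: surrounding whitespace is trimmed
        push @items, $item if length $item;    # assumption: empty items are dropped
    }
    return @items;
}

# Hypothetical helper: build a { column => [unique values] } hash from an
# array of row hashes. Columns listed in $column_arrays are split first.
sub unique_values {
    my ($rows, $columns, $column_arrays, $delimiter) = @_;
    my %is_array = map { $_ => 1 } @{ $column_arrays || [] };
    my %values;
    for my $column (@$columns) {
        my (%seen, @unique);
        for my $row (@$rows) {
            my $cell = $row->{$column};
            next unless defined $cell;         # null cells contribute nothing
            my @items = $is_array{$column}
                ? split_cell($cell, $delimiter)
                : ($cell);
            push @unique, grep { !$seen{$_}++ } @items;
        }
        $values{$column} = \@unique;
    }
    return \%values;
}

# Made-up rows, shaped like the `data` array described above
my @rows = (
    { accession_name => 'A1', synonym => 'a-one, A_1' },
    { accession_name => 'A2', synonym => 'A_1,a-two' },
    { accession_name => 'A1', synonym => undef },
);
my $values = unique_values(\@rows, [ 'accession_name', 'synonym' ], [ 'synonym' ]);
print "$_: ", join(' | ', @{ $values->{$_} }), "\n" for sort keys %$values;
```

A column with no values at all ends up as an empty array, which matches the empty `country` and `phone` entries in the example output.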

Example Input:

username	first_name	last_name	email address	organization
testing1	Test, Mȧ	Testing123	[email protected]	Cornell University


testing2	Test	Testing456	[email protected]	Cornell University
testing3	John	Testing	[email protected]	Cornell University

Example Output:

{
  "errors": [],
  "columns": [
    "username",
    "first_name",
    "last_name",
    "email address",
    "organization",
    "address",
    "country",
    "phone",
    "research_keywords",
    "research_interests",
    "webpage"
  ],
  "data": [
    {
      "address": null,
      "email address": "[email protected]",
      "first_name": "Test, Mȧ",
      "country": null,
      "organization": "Cornell University",
      "_row": 2,
      "research_keywords": null,
      "webpage": null,
      "phone": null,
      "research_interests": null,
      "last_name": "Testing123",
      "username": "testing1"
    },
    {
      "phone": null,
      "research_interests": null,
      "last_name": "Testing456",
      "username": "testing2",
      "_row": 5,
      "organization": "Cornell University",
      "webpage": null,
      "research_keywords": null,
      "country": null,
      "email address": "[email protected]",
      "address": null,
      "first_name": "Test"
    },
    {
      "first_name": "John",
      "email address": "[email protected]",
      "address": null,
      "country": null,
      "webpage": null,
      "research_keywords": null,
      "_row": 6,
      "organization": "Cornell University",
      "last_name": "Testing",
      "research_interests": null,
      "username": "testing3",
      "phone": null
    }
  ],
  "values": {
    "country": [],
    "username": [
      "testing2",
      "testing1",
      "testing3"
    ],
    "research_interests": [],
    "last_name": [
      "Testing123",
      "Testing456",
      "Testing"
    ],
    "phone": [],
    "research_keywords": [],
    "webpage": [],
    "first_name": [
      "Test, Mȧ",
      "Test",
      "John"
    ],
    "organization": [
      "Cornell University"
    ],
    "address": [],
    "email address": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ]
  }
}
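The `_row` values in the output above (2, 5, 6) suggest that blank lines in the input are skipped but still counted, so each row keeps its original line number. A minimal sketch of that bookkeeping for tab-separated text follows. It is an illustration under stated assumptions rather than the parser's real implementation: `rows_from_tsv` is a hypothetical name, the header is assumed to be line 1, and empty cells are assumed to become undef (null in the JSON).

```perl
use strict;
use warnings;

# Hypothetical sketch (not the CXGN::File::Parse implementation): turn
# tab-separated text into row hashes, skipping blank lines while recording
# each row's original line number in `_row`.
sub rows_from_tsv {
    my ($text) = @_;
    my @lines  = split /\r?\n/, $text, -1;
    my $header = shift @lines;
    $header = '' unless defined $header;
    my @columns = split /\t/, $header;

    my @rows;
    my $line_number = 1;                    # assumption: the header is line 1
    for my $line (@lines) {
        $line_number++;
        next if $line =~ /^\s*$/;           # blank row: counted, but not stored
        my @cells = split /\t/, $line, -1;  # -1 keeps trailing empty cells
        my %row = ( _row => $line_number );
        for my $i (0 .. $#columns) {
            my $cell = $cells[$i];
            # assumption: missing or empty cells are stored as undef
            $row{ $columns[$i] } = (defined $cell && length $cell) ? $cell : undef;
        }
        push @rows, \%row;
    }
    return { columns => \@columns, data => \@rows };
}

# Made-up input with two blank lines between the first and second data rows
my $text = join "\n",
    "username\tfirst_name",
    "testing1\tTest",
    "",
    "",
    "testing2\t",
    "";
my $parsed = rows_from_tsv($text);
printf "%s is on row %d\n", $_->{username}, $_->{_row} for @{ $parsed->{data} };
```

Keeping the original line number makes it possible for an error message to point the user at the exact row in their file, even when blank rows have been dropped from `data`.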

Jun 12 '24 18:06 dwaring87
