Generic Data File Parser

Open dwaring87 opened this issue 8 months ago • 2 comments

Expected Behavior

Create a generic data file parser, usable by all of the file upload functions, that supports .csv, .xls, .xlsx, and .txt (tab-separated) files for any upload. We've had a few people request CSV uploads due to problems/limitations with Excel.

I've started working on this on the topic/generic_file_parser branch.

There is a new CXGN::File::Parse class that can be used to parse any of the supported file types into a uniform parsed data format.

For example:

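use CXGN::File::Parse;

# Configure the parser for a single upload file
my $parser = CXGN::File::Parse->new(
    file => '/home/production/public/data.csv',
    # missing columns, or rows without values for them, are reported as errors
    required_columns => [ 'accession_name', 'species_name' ],
    column_aliases => {
        'accession_name' => [ 'accession', 'name' ],
        'species_name' => [ 'species' ]
    },
    # values in these columns are split on the delimiter into arrays
    column_arrays => [ 'synonym', 'organization_name' ]
);

# Read the file and build the uniform parsed data structure
my $parsed = $parser->parse();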

my $errors = $parsed->{errors};
my $columns = $parsed->{columns};
my $data = $parsed->{data};
my $values = $parsed->{values};

will return:

  • errors: an array of error messages encountered during file read / parsing
    • problems with opening the file (file doesn't exist, error from the type-specific Perl module)
    • missing required columns (when the required columns are specified)
    • rows with no values for required columns
  • columns: an array of the column headers in the file
  • data: an array of hashes, where each array item is one row of the input file
  • values: a hash of the unique values for each column
    • for columns specified in the column_arrays argument, the value will be split by the delimiter (',' by default) and returned as an array

Example Input:

username | first_name | last_name | email address | organization | address | country | phone | research_keywords | research_interests | webpage
testing1 | Test, Mȧ | Testing123 | [email protected] | Cornell University | | | | | |
testing2 | Test | Testing456 | [email protected] | Cornell University | | | | | |
testing3 | John | Testing | [email protected] | Cornell University | | | | | |

Example Output:

{
  "errors": [],
  "columns": [
    "username",
    "first_name",
    "last_name",
    "email address",
    "organization",
    "address",
    "country",
    "phone",
    "research_keywords",
    "research_interests",
    "webpage"
  ],
  "data": [
    {
      "address": null,
      "email address": "[email protected]",
      "first_name": "Test, Mȧ",
      "country": null,
      "organization": "Cornell University",
      "_row": 2,
      "research_keywords": null,
      "webpage": null,
      "phone": null,
      "research_interests": null,
      "last_name": "Testing123",
      "username": "testing1"
    },
    {
      "phone": null,
      "research_interests": null,
      "last_name": "Testing456",
      "username": "testing2",
      "_row": 5,
      "organization": "Cornell University",
      "webpage": null,
      "research_keywords": null,
      "country": null,
      "email address": "[email protected]",
      "address": null,
      "first_name": "Test"
    },
    {
      "first_name": "John",
      "email address": "[email protected]",
      "address": null,
      "country": null,
      "webpage": null,
      "research_keywords": null,
      "_row": 6,
      "organization": "Cornell University",
      "last_name": "Testing",
      "research_interests": null,
      "username": "testing3",
      "phone": null
    }
  ],
  "values": {
    "country": [],
    "username": [
      "testing2",
      "testing1",
      "testing3"
    ],
    "research_interests": [],
    "last_name": [
      "Testing123",
      "Testing456",
      "Testing"
    ],
    "phone": [],
    "research_keywords": [],
    "webpage": [],
    "first_name": [
      "Test, Mȧ",
      "Test",
      "John"
    ],
    "organization": [
      "Cornell University"
    ],
    "address": [],
    "email address": [
      "[email protected]",
      "[email protected]",
      "[email protected]"
    ]
  }
}
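As a rough sketch of how an upload function might consume this structure: the hash below is a hand-trimmed stand-in for the example output (fewer columns, same shape), and summarize is a hypothetical helper, not something provided by CXGN::File::Parse. It assumes that empty cells come back as undef (shown as null in the dump above) and that _row is the row number in the input file.

```perl
use strict;
use warnings;

# Stand-in for the hash returned by $parser->parse(), trimmed from the
# example output to four columns.
my $parsed = {
    errors  => [],
    columns => [ 'username', 'last_name', 'organization', 'phone' ],
    data    => [
        { _row => 2, username => 'testing1', last_name => 'Testing123',
          organization => 'Cornell University', phone => undef },
        { _row => 5, username => 'testing2', last_name => 'Testing456',
          organization => 'Cornell University', phone => undef },
        { _row => 6, username => 'testing3', last_name => 'Testing',
          organization => 'Cornell University', phone => undef },
    ],
};

# Hypothetical consumer: report parser errors if there are any,
# otherwise describe each parsed row.
sub summarize {
    my ($result) = @_;
    my @errors = @{ $result->{errors} || [] };
    return map { "ERROR: $_" } @errors if @errors;

    my @lines;
    for my $row (@{ $result->{data} }) {
        # count the columns that have no value in this row
        my @empty = grep { !defined $row->{$_} } @{ $result->{columns} };
        push @lines, sprintf('row %d: %s (%d empty column(s))',
            $row->{_row}, $row->{username}, scalar @empty);
    }
    return @lines;
}

print "$_\n" for summarize($parsed);
```

Checking errors before touching data keeps the caller from processing a partially parsed file, and carrying _row through lets any later validation message point back at the offending line of the upload.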

dwaring87, Jun 12 '24 18:06