alaveteli icon indicating copy to clipboard operation
alaveteli copied to clipboard

Spreadsheet (CSV) import broken

Open kingqueen3065 opened this issue 7 years ago • 30 comments

I can't work out why this won't import into WhatDoTheyKnow: (had to zip it as Github doesn't allow attaching .csv) schoolsimport.zip On "dry runs" it says "error: line 1: invalid email 'tag_string' for authority 'request_email' error: line 2: invalid email 'fei school' for authority '[email protected]'" etc. Odd. Wrong character encoding? Wrong CR,LF? something else? I don't know but would appreciate your assistance. thanks

kingqueen3065 avatar Jul 02 '17 16:07 kingqueen3065

A user has offered to compile a spreadsheet of parish councils for addition to WhatDoTheyKnow. I'm concerned they may spend time doing so and then we can't add them because the spreadsheet import function is broken. Do we know if this is a general problem or one specific only to that spreadsheet, please?

thank you

kingqueen3065 avatar Aug 22 '17 17:08 kingqueen3065

Sorry I didn't get time to look at this a few months back.

It looks like you need the first field to be #id with empty values, contrary to the documentation:

schoolsimport.csv.zip

screen shot 2017-08-23 at 09 15 52

garethrees avatar Aug 23 '17 08:08 garethrees

I've made an initial clarification/fix at https://github.com/mysociety/alaveteli/pull/4184 – it needs a bit more checking, so we'll pick that up next sprint.

garethrees avatar Aug 23 '17 08:08 garethrees

Last night I discovered the dry run passes if the tag column is left out (without the #id column) I tried adding quotation marks to the contents of the tag column, I.e. "fei school academy" and that didn't work.

kingqueen3065 avatar Aug 23 '17 08:08 kingqueen3065

Honestly, the CSV import code is pretty 🍝 . I think it needs a bit of a rebuild. Sorry about the pain its causing :(

garethrees avatar Aug 23 '17 08:08 garethrees

I've just used the csv import feature for 46 new bodies on WhatDoTheyKnow and it worked OK.

RichardTaylor avatar Sep 02 '17 00:09 RichardTaylor

@RichardTaylor could you attach the csv you used so that I can compare?

garethrees avatar Sep 04 '17 08:09 garethrees

I'll email the file to @garethrees (We don't make the WDTK database of FOI contacts available publicly to encourage use of our service - we don't want to make it easier for people to make FOI requests to groups of bodies via private email).

Here's the start of the file, with email addresses removed:


#name,request_email,tag_string,shortname,home_page,notes
Birmingham and Solihull Sustainability and Transformation Partnership,[email address],stp nhs,,,
The Black Country Sustainability and Transformation Partnership,[email address],stp nhs,,,
"Bristol, North Somerset, South Gloucestershire Sustainability and Transformation Partnership",[email address],stp nhs,,,
"Buckinghamshire, Oxfordshire and Berkshire West Sustainability and Transformation Partnership",[email address],stp nhs,BOB STP,https://www.bobstp.org.uk/,"Also known as BOB, or BOB STP"
North Mersey Local Delivery System,[email address],stp nhs,,,"The North Mersey Local Delivery System is responsible for the implementation of the Cheshire and Merseyside Sustainability and Transformation Plan in Liverpool, South Sefton, Southport and Formby and part of Knowsley."

RichardTaylor avatar Sep 07 '17 11:09 RichardTaylor

$ file Downloads/schoolsimport.csv
Downloads/schoolsimport.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators

$ file Downloads/stps.csv
Downloads/stps.csv: UTF-8 Unicode text, with very long lines, with CRLF line terminators

Wonder if the BOM is causing an issue?

In terms of content they seem similar (no ID field).

What programs did you both use to create these CSV files?

garethrees avatar Sep 11 '17 11:09 garethrees

I use either Google Sheets or Excel - exporting as CSV.

I then tweak things in a text editor - Textwrangler - in response to errors highlighted via the "dry run"; which in the latest case were duplicate bodies, and one notes field which contained links causing a problem - so I removed that.

RichardTaylor avatar Sep 13 '17 10:09 RichardTaylor

I used Excel, exporting as a CSV, then Notepad++ to tidy up e.g. quotation marks.

kingqueen3065 avatar Sep 15 '17 19:09 kingqueen3065

Last night I discovered the dry run passes if the tag column is left out (without the #id column)

There does seem to be some weirdness between the id and tag_string attributes. It looks like the documentation should state something along the lines of "You need to include an id column if you have the tag_string column."

Obviously, making this Just Work is more ideal.

garethrees avatar Dec 11 '17 09:12 garethrees

@garethrees the order of the columns seems to make a difference, if the tags column is between the authority name and the email address (contrary to the suggested order given in the spreadsheet upload page) it works more reliably.

kingqueen3065 avatar Dec 11 '17 12:12 kingqueen3065

oh, ouch! useful to know, thanks @kingqueen3065!

lizconlan avatar Dec 11 '17 12:12 lizconlan

I just found the CSV upload / import on WhatDoTheyKnow worked with a file 300 lines long, but timed out with an error if it was 500 lines.

I slightly tediously had to break up my file to upload it.

RichardTaylor avatar Jan 23 '18 13:01 RichardTaylor

Just to note I've used the CSV upload feature again and it worked; it's not totally broken as per the issue subject!

RichardTaylor avatar Oct 24 '18 19:10 RichardTaylor

This may not be the best place to add this but as it seems like the main "fix CSV import" ticket...

when you do an all authority export the csv at that stage has the tags in a column called 'Tags'; whereas, when you do an upload, Alaveteli expects the column to be labelled 'tag_string' and disregards it otherwise

lizconlan avatar Apr 25 '19 08:04 lizconlan

When I tried uploading the attached spreadsheet to WDTK (for https://github.com/mysociety/whatdotheyknow-theme/issues/667) I keep getting '504 Gateway Time-out'.

image

This happens whether I choose the dry run or upload options.

It also happened on my test file of 1 body, although after a couple of attempts that loaded.

While not a big issue in this case and I can update the 15 bodies manually, if it were a bigger spreadsheet with more updates it would be more problematic.

Constabularies.zip

After reading the above log entries, I tried deleting the tag_string column and uploading again. I encountered the same error.

MattK1234 avatar Mar 30 '20 11:03 MattK1234

When I tried uploading the attached spreadsheet to WDTK (for mysociety/whatdotheyknow-theme#667) I keep getting '504 Gateway Time-out'.

image

This happens whether I choose the dry run or upload options.

Oddly, I don't seem to be able to reproduce this - the file you have attached appears to work okay when pushing as a dry run (with either add or update tags chosen). image

What browser are you using @MattK1234?

mdeuk avatar Mar 30 '20 11:03 mdeuk

What browser are you using @MattK1234?

Ah! Didn't think of that! Microsoft Edge 18.

Please do upload it if it allows you to do so. I was using the existing tags option, I haven't made any changes to the tags on this file, just added short names.

MattK1234 avatar Mar 30 '20 11:03 MattK1234

I'm going to download Chrome and see what happens. Never had it installed on this machine and never bothered to do so before!

On Mon, 30 Mar 2020, 12:17 Martyn Dewar, [email protected] wrote:

When I tried uploading the attached spreadsheet to WDTK (for mysociety/whatdotheyknow-theme#667 https://github.com/mysociety/whatdotheyknow-theme/issues/667) I keep getting '504 Gateway Time-out'.

[image: image] https://user-images.githubusercontent.com/47503358/77905261-d6e48380-727d-11ea-8d82-d7bec40cbf1f.png

This happens whether I choose the dry run or upload options.

Oddly, I don't seem to be able to reproduce this - the file you have attached appears to work okay when pushing as a dry run (with either add or update tags chosen). [image: image] https://user-images.githubusercontent.com/249418/77906605-22982c80-7280-11ea-9b1c-7d4ed1c0bcc1.png

What browser are you using @MattK1234 https://github.com/MattK1234?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mysociety/alaveteli/issues/4078#issuecomment-605939744, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALKNP7XFMZXDOJXWKAG4QUDRKB5WDANCNFSM4DROPXEA .

MattK1234 avatar Mar 30 '20 11:03 MattK1234

I've now been able to upload the csv file successfully using Google Chrome. It seems the import doesn't like Microsoft Edge for some reason.

MattK1234 avatar Mar 30 '20 11:03 MattK1234

Wrong CR,LF?

It would be good to have a clear error message if line break formatting is the issue, ideally with a hint on how to address it (try and find a way to save the file with "Unix Style" line breaks presumably)

RichardTaylor avatar Aug 08 '21 17:08 RichardTaylor

The RFC says CSVs should use CRLF.

garethrees avatar Aug 09 '21 12:08 garethrees

Another idea, a checkbox for "don't create new bodies".

Use case - I have a list of bodies, by name, which I want to add a new tag to.

If I upload a csv with the name in the name column, and the new tag in the tag_string column, it does what I want generally, but in a couple of cases where the name is slightly wrong it creates a new body.

I can remove / correct the lines which relate to bodies which are not found, but ideally the system would help me.

RichardTaylor avatar Jun 29 '22 17:06 RichardTaylor

Another idea: fuzzy matching of the name, this could be useful, but sounds dangerous and would probably warrant active confirmation of any suggestions. Making names match if & and and were switched and Ltd and Limited .. and plus or minus an initial The might be pretty safe though.

RichardTaylor avatar Jun 29 '22 17:06 RichardTaylor

I've just ran a CSV upload for 472 bodies, adding one tag to each body.

Whilst it rendered 504 Gateway Time-out checking the admin timeline and a handful of bodies confirmed that the updates were made.

The dry run seemed to work ok.

MattK1234 avatar Jul 15 '22 19:07 MattK1234

We just had an issue on WhatDoTheyKnow where, among other things, bodies were accidentally created rather than updated.

I noted

I do a search on the "dry run" response for "creating" if I'm only expecting "updating".

Another admin reported they do similar.

This should be part of the system - maybe via clear summary of the impact of the upload:

Created X bodies; Updated X bodies.

RichardTaylor avatar Jul 20 '22 18:07 RichardTaylor

I seem to recall I had some odd messages when I was doing the NI import. I'll look up the thread and add a comment here if I find anything.

WilliamWDTK avatar Jul 20 '22 21:07 WilliamWDTK

Here:

When I had done dry runs previously, names matching caused it to "update" an existing body, whereas it now seems to throw an error. The uploading returned 504 gateway timeout errors for all three sheets, though it has (seemed to have) uploaded all the bodies (the numbers and most recent changes add up).

See rfc822msgid:<[email protected]>

WilliamWDTK avatar Jul 20 '22 22:07 WilliamWDTK