alaveteli
alaveteli copied to clipboard
Spreadsheet (CSV) import broken
I can't work out why this won't import into WhatDoTheyKnow: (had to zip it as Github doesn't allow attaching .csv) schoolsimport.zip On "dry runs" it says "error: line 1: invalid email 'tag_string' for authority 'request_email' error: line 2: invalid email 'fei school' for authority '[email protected]'" etc. Odd. Wrong character encoding? Wrong CR,LF? something else? I don't know but would appreciate your assistance. thanks
A user has offered to compile a spreadsheet of parish councils for addition to WhatDoTheyKnow. I'm concerned they may spend time doing so and then we can't add them because the spreadsheet import function is broken. Do we know if this is a general problem or one specific only to that spreadsheet, please?
thank you
Sorry I didn't get time to look at this a few months back.
It looks like you need the first field to be #id
with empty values, contrary to the documentation:
I've made an initial clarification/fix at https://github.com/mysociety/alaveteli/pull/4184 – it needs a bit more checking, so we'll pick that up next sprint.
Last night I discovered the dry run passes if the tag column is left out (without the #id column) I tried adding quotation marks to the contents of the tag column, I.e. "fei school academy" and that didn't work.
Honestly, the CSV import code is pretty 🍝 . I think it needs a bit of a rebuild. Sorry about the pain its causing :(
I've just used the csv import feature for 46 new bodies on WhatDoTheyKnow and it worked OK.
@RichardTaylor could you attach the csv you used so that I can compare?
I'll email the file to @garethrees (We don't make the WDTK database of FOI contacts available publicly to encourage use of our service - we don't want to make it easier for people to make FOI requests to groups of bodies via private email).
Here's the start of the file, with email addresses removed:
#name,request_email,tag_string,shortname,home_page,notes
Birmingham and Solihull Sustainability and Transformation Partnership,[email address],stp nhs,,,
The Black Country Sustainability and Transformation Partnership,[email address],stp nhs,,,
"Bristol, North Somerset, South Gloucestershire Sustainability and Transformation Partnership",[email address],stp nhs,,,
"Buckinghamshire, Oxfordshire and Berkshire West Sustainability and Transformation Partnership",[email address],stp nhs,BOB STP,https://www.bobstp.org.uk/,"Also known as BOB, or BOB STP"
North Mersey Local Delivery System,[email address],stp nhs,,,"The North Mersey Local Delivery System is responsible for the implementation of the Cheshire and Merseyside Sustainability and Transformation Plan in Liverpool, South Sefton, Southport and Formby and part of Knowsley."
$ file Downloads/schoolsimport.csv
Downloads/schoolsimport.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators
$ file Downloads/stps.csv
Downloads/stps.csv: UTF-8 Unicode text, with very long lines, with CRLF line terminators
Wonder if the BOM is causing an issue?
In terms of content they seem similar (no ID field).
What programs did you both use to create these CSV files?
I use either Google Sheets or Excel - exporting as CSV.
I then tweak things in a text editor - Textwrangler - in response to errors highlighted via the "dry run"; which in the latest case were duplicate bodies, and one notes field which contained links causing a problem - so I removed that.
I used Excel, exporting as a CSV, then Notepad++ to tidy up e.g. quotation marks.
Last night I discovered the dry run passes if the tag column is left out (without the #id column)
There does seem to be some weirdness between the id
and tag_string
attributes. It looks like the documentation should state something along the lines of "You need to include an id
column if you have the tag_string
column."
Obviously, making this Just Work is more ideal.
@garethrees the order of the columns seems to make a difference, if the tags column is between the authority name and the email address (contrary to the suggested order given in the spreadsheet upload page) it works more reliably.
oh, ouch! useful to know, thanks @kingqueen3065!
I just found the CSV upload / import on WhatDoTheyKnow worked with a file 300 lines long, but timed out with an error if it was 500 lines.
I slightly tediously had to break up my file to upload it.
Just to note I've used the CSV upload feature again and it worked; it's not totally broken as per the issue subject!
This may not be the best place to add this but as it seems like the main "fix CSV import" ticket...
when you do an all authority export the csv at that stage has the tags in a column called 'Tags'; whereas, when you do an upload, Alaveteli expects the column to be labelled 'tag_string' and disregards it otherwise
When I tried uploading the attached spreadsheet to WDTK (for https://github.com/mysociety/whatdotheyknow-theme/issues/667) I keep getting '504 Gateway Time-out'.
This happens whether I choose the dry run
or upload
options.
It also happened on my test file of 1 body, although after a couple of attempts that loaded.
While not a big issue in this case and I can update the 15 bodies manually, if it were a bigger spreadsheet with more updates it would be more problematic.
After reading the above log entries, I tried deleting the tag_string
column and uploading again. I encountered the same error.
When I tried uploading the attached spreadsheet to WDTK (for mysociety/whatdotheyknow-theme#667) I keep getting '504 Gateway Time-out'.
This happens whether I choose the
dry run
orupload
options.
Oddly, I don't seem to be able to reproduce this - the file you have attached appears to work okay when pushing as a dry run
(with either add or update tags chosen).
What browser are you using @MattK1234?
What browser are you using @MattK1234?
Ah! Didn't think of that! Microsoft Edge 18.
Please do upload it if it allows you to do so. I was using the existing tags option, I haven't made any changes to the tags on this file, just added short names.
I'm going to download Chrome and see what happens. Never had it installed on this machine and never bothered to do so before!
On Mon, 30 Mar 2020, 12:17 Martyn Dewar, [email protected] wrote:
When I tried uploading the attached spreadsheet to WDTK (for mysociety/whatdotheyknow-theme#667 https://github.com/mysociety/whatdotheyknow-theme/issues/667) I keep getting '504 Gateway Time-out'.
[image: image] https://user-images.githubusercontent.com/47503358/77905261-d6e48380-727d-11ea-8d82-d7bec40cbf1f.png
This happens whether I choose the dry run or upload options.
Oddly, I don't seem to be able to reproduce this - the file you have attached appears to work okay when pushing as a dry run (with either add or update tags chosen). [image: image] https://user-images.githubusercontent.com/249418/77906605-22982c80-7280-11ea-9b1c-7d4ed1c0bcc1.png
What browser are you using @MattK1234 https://github.com/MattK1234?
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/mysociety/alaveteli/issues/4078#issuecomment-605939744, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALKNP7XFMZXDOJXWKAG4QUDRKB5WDANCNFSM4DROPXEA .
I've now been able to upload the csv file successfully using Google Chrome. It seems the import doesn't like Microsoft Edge for some reason.
Wrong CR,LF?
It would be good to have a clear error message if line break formatting is the issue, ideally with a hint on how to address it (try and find a way to save the file with "Unix Style" line breaks presumably)
The RFC says CSVs should use CRLF
.
Another idea, a checkbox for "don't create new bodies".
Use case - I have a list of bodies, by name, which I want to add a new tag to.
If I upload a csv with the name in the name column, and the new tag in the tag_string column, it does what I want generally, but in a couple of cases where the name is slightly wrong it creates a new body.
I can remove / correct the lines which relate to bodies which are not found, but ideally the system would help me.
Another idea: fuzzy matching of the name, this could be useful, but sounds dangerous and would probably warrant active confirmation of any suggestions. Making names match if &
and and
were switched and Ltd
and Limited
.. and plus or minus an initial The
might be pretty safe though.
I've just ran a CSV upload for 472 bodies, adding one tag to each body.
Whilst it rendered 504 Gateway Time-out
checking the admin timeline and a handful of bodies confirmed that the updates were made.
The dry run seemed to work ok.
We just had an issue on WhatDoTheyKnow where, among other things, bodies were accidentally created rather than updated.
I noted
I do a search on the "dry run" response for "creating" if I'm only expecting "updating".
Another admin reported they do similar.
This should be part of the system - maybe via clear summary of the impact of the upload:
Created X bodies; Updated X bodies.
I seem to recall I had some odd messages when I was doing the NI import. I'll look up the thread and add a comment here if I find anything.
Here:
When I had done dry runs previously, names matching caused it to "update" an existing body, whereas it now seems to throw an error. The uploading returned 504 gateway timeout errors for all three sheets, though it has (seemed to have) uploaded all the bodies (the numbers and most recent changes add up).