
Alpha to Beta?

Open Stephen-Gates opened this issue 10 years ago • 17 comments

What are your criteria for moving from Alpha to Beta?

Stephen-Gates avatar Dec 15 '14 12:12 Stephen-Gates

Our plan is to work mostly on scalability and robustness, so we can deal with larger datasets, and more of them. There are also a bunch of bugs to fix. We know what needs doing; our only problem is that we need to find some funding to spend time doing it!

Floppy avatar Dec 16 '14 09:12 Floppy

How much funding do you need?

Stephen-Gates avatar Dec 16 '14 10:12 Stephen-Gates

That's a good question! We're putting together the proposal now, so hopefully we'll know that soon :)

Floppy avatar Dec 16 '14 10:12 Floppy

We find Csvlint really helpful, but it does choke on big files. Let me know if you need specific things tested, as I have 2 staff playing with Csvlint, schemas and data packaging now and early in the new year. We plan to use them as part of our standard publishing process.

Stephen-Gates avatar Dec 16 '14 10:12 Stephen-Gates

Yeah, the issue with large files is certainly something we want to find a fix for as a priority.

pezholio avatar Dec 16 '14 10:12 pezholio

We're about to throw some big CSVs at CSVLint. Are you aware of a physical limit for file sizes?

Stephen-Gates avatar Jan 26 '15 23:01 Stephen-Gates

Is there any advice you can offer on what file sizes csvlint can currently handle and should be able to handle in the future? If it helps, I've done a small bit of testing and found that on my PC, as soon as the file gets over 750 KB in size, the wait times start to blow out. Under that size, wait times are a very respectable 10-20 seconds. For the larger files the checker just continues to run, and I've let it go for well over an hour in one case.

TacoSandwich avatar Jan 26 '15 23:01 TacoSandwich

I would say that, currently, CSVlint will struggle with anything over 2 MB. This is due to the way the CSV is processed: it is saved directly into MongoDB, which we've found does not perform well. Once we identify some funding, this will probably be the first item on our wishlist to fix.
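To illustrate the general alternative to storing a whole file before checking it, here is a minimal Python sketch that validates a CSV one row at a time and keeps only the problems it finds. It is not CSVlint's actual code; the function name `ragged_rows` and the sample data are made up for illustration, and the only check performed is whether each row has the same number of fields as the header.

```python
import csv
import io
from typing import Iterable, Iterator, Tuple


def ragged_rows(lines: Iterable[str]) -> Iterator[Tuple[int, int, int]]:
    """Yield (row_number, expected_fields, actual_fields) for each row
    whose field count differs from the header row.

    Rows are read lazily, so memory use does not grow with file size.
    The header counts as row 1.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return  # empty input: nothing to check
    expected = len(header)
    for row_number, row in enumerate(reader, start=2):
        if len(row) != expected:
            yield row_number, expected, len(row)


# Hypothetical sample: row 3 is short, row 4 has an extra field.
sample = io.StringIO("id,name\n1,a\n2\n3,c,extra\n")
problems = list(ragged_rows(sample))
for row_number, expected, actual in problems:
    print(f"row {row_number}: expected {expected} fields, found {actual}")
```

The same function accepts an open file object (for example one opened with `newline=""`), so a large file is never read into memory in full.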

pezholio avatar Jan 27 '15 08:01 pezholio

Honestly, as currently implemented, we don't find CSVLint useful, although it has tons of promise. One of the key limitations is that the tool doesn't report row numbers on data errors. "The data in column 7 is inconsistent with others values in the same column" is too vague when you may have hundreds of rows.

I just ran a 17-record test CSV with 17 different errors that violate my schema. CSVLint reported two warnings. And this GitHub repository seems to be stagnant. I love the schema format and regex support, but I'm not a programmer so I can't help :( I'll post my observations in a separate thread.

Does anyone have any alternative resources or tools to suggest for validating CSVs?

scrybbler avatar Jun 02 '15 19:06 scrybbler

Thanks for commenting! We've not had a chance to develop this for a while, I admit, but we should be able to get some time to work on this very soon! Please create tickets for anything you think needs to be improved, with links to the validations that didn't work as you expected - that would be really helpful :)

Floppy avatar Jun 02 '15 19:06 Floppy

@scrybbler we've moved on to Good Tables and FME:

  • http://okfnlabs.org/blog/2015/03/06/goodtables-web-service.html
  • http://www.safe.com

Stephen-Gates avatar Jun 02 '15 19:06 Stephen-Gates

@scrybbler I've added comments to your other issues. csvlint does try to report on all errors with detailed diagnostics, except when the schema fails to parse and load! I've added issue #186 to address this; it should resolve some frustrations when using the service.

ldodds avatar Jun 02 '15 20:06 ldodds

@Stephen-Gates what were your reasons for migrating? Was it related to file sizes, or were there other problems or limitations? As @Floppy says, it'd be useful for us to know as we plan for any further work.

ldodds avatar Jun 02 '15 20:06 ldodds

File size and lack of any visible progress.

Stephen-Gates avatar Jun 02 '15 20:06 Stephen-Gates

Thank you, Stephen and Leigh! Yes, file size is definitely an issue for us too. Our biggest data sets are 30,000+ records.

scrybbler avatar Jun 02 '15 21:06 scrybbler

Just as an update, we've finally managed to get some time (and money) to spend on CSVlint, so I'm hoping that you'll see some progress in the next few weeks on a lot of this stuff.

Floppy avatar Jul 07 '15 07:07 Floppy

Ok cool - I'll keep watching

Stephen-Gates avatar Jul 07 '15 08:07 Stephen-Gates