unconf18 Lesson/Examples of how to clean 'field' data

This might be too field ecology specific, but I think it could be useful more broadly.

This is a situation I ran into in my grad school work, and I know many others doing field work run into it too: they collect data on hard copy, then enter it every few days over several months of work.

There are data entry errors, spelling issues, etc., and you also end up with dozens of files that have been entered, probably by different people.

I dealt with this in my own field work by writing a script that ran over all the files, checked them for the correct spelling of different things, and then printed out the things that were wrong.

This code is not my finest, but it got me through my PhD.

Maybe something that does this better already exists, and I just need to learn what it is so I can point others with this issue towards it.

But if it doesn't, this would be something I'd love to work on building.

I realize this functionality already exists in OpenRefine, but I personally don't care for OpenRefine, so I did it this way.


```r
# these are the vectors of values that I am ok with, with the correct spellings

# areas are my study areas
areas <- c("nvca","scnwr","fgca","slnwr","tsca","bkca","ccnwr","dcca","osca","tmpca")

# impound is my wetland impoundments
impound <- c("rail","sanctuary","ash","scmsu2","scmsu3","sgd","sgb","pool2","pool2w","pool3w","m11","m10","m13","ts2a","ts4a","ts6a","ts8a","kt9","kt2","kt5","kt6","ccmsu1","ccmsu2","ccmsu12","dc14","dc18","dc20","dc22","os21","os23","pooli","poole","poolc")

# regions are the four regions
regions <- c("nw","nc","ne","se")

# plant spellings that are correct
plant <- c("reed canary grass","primrose","millet","bulrush","partridge pea","spikerush","a smartweed","p smartweed","willow","tree","buttonbush","arrowhead","river bulrush","biden","upland","cocklebur","lotus","grass","cattail","prairie cord grass","plantain","sedge","sesbania","typha","corn","sumpweed","toothcup","frogfruit","canola","crop","rush","goldenrod",NA)

# file_names is assumed to be a character vector of paths to the entered csv files
for (i in seq_along(file_names)) {
  int <- read.csv(file_names[i])
  # this prints out instances where there are things that are not part of the lists above
  # and includes the file name so I can go and find the issue
  print(paste0(int[!(int$region %in% regions), ]$region, " ", file_names[i], " region"))
  print(paste0(int[!(int$area %in% areas), ]$area, " ", file_names[i], " area"))
  print(paste0(int[!(int$impound %in% impound), ]$impound, " ", file_names[i], " impound"))
  print(paste0(int[!(int$plant1 %in% plant), ]$plant1, " ", file_names[i], " plant1"))
  print(paste0(int[!(int$plant2 %in% plant), ]$plant2, " ", file_names[i], " plant2"))
  print(paste0(int[!(int$plant3 %in% plant), ]$plant3, " ", file_names[i], " plant3"))
}

## once I resolve all of the issues identified above, I read in all the files,
## put them in a list, and stitch them together into one master file
vegsheets <- lapply(file_names, read.csv)

## this takes the list and combines it all together into one data frame
masterdat <- do.call(rbind, vegsheets)

# write it out into a master file
write.csv(masterdat, "~/Github/data/2015_veg_master.csv", row.names = FALSE)
```
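
As a minimal sketch, the same kind of check can be wrapped in a function that returns the problems as a data frame instead of printing them, which makes them easier to sort or save. The helper name `find_bad_values`, the file name, and the tiny example data are hypothetical; the allowed-value vectors are shortened stand-ins for the ones in the script.

```r
# Hypothetical helper: returns one row per value that is not in `allowed`.
find_bad_values <- function(dat, column, allowed, file = NA_character_) {
  values <- as.character(dat[[column]])
  bad <- !(values %in% allowed)
  data.frame(
    file = rep(file, sum(bad)),
    column = rep(column, sum(bad)),
    row = which(bad),
    value = values[bad],
    stringsAsFactors = FALSE
  )
}

# Made-up example data standing in for one entered file.
veg <- data.frame(
  region = c("nw", "ne", "sw"),
  plant1 = c("millet", "catail", NA),
  stringsAsFactors = FALSE
)
regions <- c("nw", "nc", "ne", "se")
plant <- c("millet", "cattail", NA)

# One data frame listing every problem, with the file and row it came from.
problems <- rbind(
  find_bad_values(veg, "region", regions, file = "example.csv"),
  find_bad_values(veg, "plant1", plant, file = "example.csv")
)
print(problems)
```

Because the result is an ordinary data frame, the problems from many files could be combined with `rbind` in the same way the script combines the data files themselves.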

Apr 25 '18 17:04 aurielfournier

I have many of these, and they all require something a little different, but the tool I use most is assertr
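
For anyone who has not seen assertr, here is a small sketch of the kind of check it supports, assuming the package is installed. The data are made up. `assert()`, `in_set()`, and `error_df_return` are assertr functions, but the package documentation is the place to confirm their current behaviour.

```r
# Made-up data with one region code that is not in the allowed set.
veg <- data.frame(
  region = c("nw", "ne", "sw"),
  area = c("nvca", "scnwr", "fgca"),
  stringsAsFactors = FALSE
)
regions <- c("nw", "nc", "ne", "se")

# Only run the check when assertr is available.
if (requireNamespace("assertr", quietly = TRUE)) {
  # Ask assertr to return the violations as a data frame
  # instead of stopping with an error.
  errors <- assertr::assert(
    veg,
    assertr::in_set(regions),
    region,
    error_fun = assertr::error_df_return
  )
  print(errors)
}
```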

Apr 25 '18 18:04 noamross

I think this is actually a pretty broad problem, at least in epidemiology. For example, if you're using data from a hospital-based reporting system, the fields in the raw data may change over time, or based on the person doing the query for you, and there are ample opportunities for this to wreak havoc.

Not sure if existing tools are adequate to cover these issues or not, although I think a broader problem is getting people responsible for these kinds of data to 1) pay attention to these issues and 2) take the time to use the tools!

Apr 25 '18 20:04 jzelner

I like the idea of structuring "Lessons" @aurielfournier

Regarding assertr, is there a way to automatically write the commands based on a table with allowed values? Or based on, say, EML metadata?
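
One way to picture the table-driven idea, independent of whether assertr supports it directly, is a rough base R sketch in which a long table of allowed values drives the checks. The table layout (columns `column` and `value`) and the example data are assumptions made for illustration.

```r
# Hypothetical table of allowed values: one row per (column, value) pair.
allowed <- data.frame(
  column = c("region", "region", "area", "area"),
  value = c("nw", "ne", "nvca", "scnwr"),
  stringsAsFactors = FALSE
)

# Made-up data with one bad region and one bad area.
veg <- data.frame(
  region = c("nw", "nr"),
  area = c("nvcs", "scnwr"),
  stringsAsFactors = FALSE
)

# Split the table into one vector of allowed values per column.
allowed_by_column <- split(allowed$value, allowed$column)

# For each column named in the table, collect the values that are not allowed.
bad_values <- lapply(names(allowed_by_column), function(col) {
  setdiff(veg[[col]], allowed_by_column[[col]])
})
names(bad_values) <- names(allowed_by_column)
print(bad_values)
```

Adding a new column to check then means adding rows to the table rather than writing another line of code.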

The list of tips could also feature https://github.com/ChrisMuir/refinr

Apr 26 '18 04:04 maelle

I concur with @jzelner on the broadness of the issue. It confronts me constantly via the bikedata package, where different cities generally use their own data formats, and often inconsistently. I've recently adapted the whole thing to a dictionary-style lookup table of possible column names, but this is just at a first-cut stage. It is nevertheless pure C/C++, so I'd be keen to converse, listen, input on nice ways of interfacing R and C++ in this regard.
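
The dictionary-style lookup can be sketched in a few lines of R. This is not the bikedata implementation, which is in C/C++; the canonical names, the variant spellings, and the function name `standardise_names` are all made up for illustration.

```r
# Hypothetical dictionary: canonical column name -> spellings seen in raw files.
column_dictionary <- list(
  start_time = c("start_time", "starttime", "start date", "started_at"),
  end_time = c("end_time", "stoptime", "end date", "ended_at")
)

# Map raw column names to canonical names; unknown names become NA.
standardise_names <- function(raw_names, dictionary) {
  # Named vector: names are the known spellings, values the canonical names.
  lookup <- stats::setNames(
    rep(names(dictionary), lengths(dictionary)),
    unlist(dictionary, use.names = FALSE)
  )
  unname(lookup[tolower(trimws(raw_names))])
}

result <- standardise_names(c("StartTime", "Ended_At", "bike id"), column_dictionary)
print(result)
```

Names that come back as `NA` are the ones the dictionary does not yet know about, so they can be reported rather than silently dropped.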

Apr 26 '18 07:04 mpadge

I think this is a great topic, and a tricky one. I think the current situation involves both shortcomings of tooling and shortcomings of outreach, and I agree that one-size-fits-all solutions like OpenRefine only go so far -- like many tools in this space, it can feel both overbearing and not smart enough at the same time.

I definitely believe that "Standards" are a key part of finding happiness here, but they can also be a part of the problem. Dates are a good trivial example: sure, Python and Ruby have libraries that can reliably parse "Tuesday after Christmas 2012" into a date, but I think we can all agree that dates are suddenly much easier to work with if we all just use ISO-8601.

In ecology/evolution, we have a similar solution for the whole problem of working with species name issues (spelling, formatting, different higher taxonomy definitions, etc.) by using taxonomic identifiers, but afaik these have had very little penetration among anyone who actually sees their critters in the field, and very much suffer from problem #927.

Ideally, good tooling would pay us immediate dividends for using standards. I think that's the case with dates: there's an immediate benefit in being able to do date-time math this way (instead of, say, having separate columns for day, month, and year, which is still too common). But for others, like taxon IDs, there's no obvious benefit. Spatial descriptions are somewhere in between -- if you already do spatial analysis you already use spatial standards, but for the rest of us it's easy to feel like you need a master's in GIS before that would be useful, so we just name our regions and sites with convenient labels and get on with it. Tooling that made it easier rather than harder to describe our somewhat standard data in a standard way could, IMHO, make a big difference. But I think we still have too few examples of these tools that are easy enough and modular enough to quickly integrate into a field-data-collection workflow. Would love to join a brainstorm on this and bounce some of my no doubt hopelessly idealistic ideas off others!
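
The point about dates can be made concrete with a short base R sketch. With ISO 8601 strings the column parses directly and date arithmetic just works; with separate day, month, and year columns the date has to be assembled first. The dates and column names below are made up.

```r
# ISO 8601 dates parse directly and support arithmetic.
surveys <- as.Date(c("2015-08-28", "2015-09-03"))
gap <- diff(surveys)  # time between the two visits
print(gap)

# Separate day, month, and year columns have to be assembled first.
entered <- data.frame(day = c(28, 3), month = c(8, 9), year = c(2015, 2015))
assembled <- as.Date(ISOdate(entered$year, entered$month, entered$day))

# Both routes end at the same dates, but only one needed no extra step.
print(all(assembled == surveys))
```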

Apr 26 '18 18:04 cboettig

Keen to hear more on this, too! Ditto the need for more examples of tools & standards "outreach" -- at the overall dataset-structure/schema-level, too.

Maelle, were you picturing something like being able to point assertr at an EML schema &/or other standards/vocabs, e.g.:

- EML 2.1.1 (...in the handy ropensci/EML package)
- Darwin Core

RDA's list of data standards defined in XML/RDF could be another resource to include if it helps generalize this example to other fields/domains that have their own formally-defined/under-used data standards--if that's not wandering out of scope here. (& either way, ra for hopeless idealism! :)

May 06 '18 18:05 magpiedin

@magpiedin yes I was imagining a wrapper that'd take your raw-ish data and an EML as arguments and output the discrepancies. I don't know other metadata standards well enough 😉 Btw for EML creation there's also the WIP https://github.com/cboettig/eml2 by @cboettig
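
As a rough sketch of the shape such a wrapper might take: reading a real EML file is left out here, and a plain data frame of attribute names and expected classes stands in for the attribute list an EML document would describe. The function name `check_against_metadata`, the table layout, and the example data are all hypothetical.

```r
# Hypothetical stand-in for the attribute list an EML document would describe.
attributes_table <- data.frame(
  name = c("region", "area", "water_depth"),
  class = c("character", "character", "numeric"),
  stringsAsFactors = FALSE
)

# Made-up raw data: `area` is missing, `water_depth` was read as text,
# and `notes` is not described in the metadata.
raw <- data.frame(
  region = c("nw", "ne"),
  water_depth = c("12", "deep"),
  notes = c("", "windy"),
  stringsAsFactors = FALSE
)

# Compare the columns of `dat` with the attribute table and list the mismatches.
check_against_metadata <- function(dat, attrs) {
  shared <- intersect(attrs$name, names(dat))
  found <- vapply(dat[shared], function(x) class(x)[1], character(1))
  expected <- attrs$class[match(shared, attrs$name)]
  list(
    missing = setdiff(attrs$name, names(dat)),
    undocumented = setdiff(names(dat), attrs$name),
    wrong_class = shared[found != expected]
  )
}

discrepancies <- check_against_metadata(raw, attributes_table)
print(discrepancies)
```

A fuller version would presumably also compare the values in each column against the enumerated codes in the metadata, which is where a tool like assertr could plug in.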

May 08 '18 06:05 maelle

There's a version 2 in the works, you say? This makes my day :)

I threw in Darwin Core here on the off-chance that the field scenario @aurielfournier has in mind includes any species observation/occurrence data (e.g., inventorying wildlife in a particular area?). If it's a more experimental/other scenario, though, all ears, too.

May 08 '18 18:05 magpiedin

Sorry for dropping out for a bit. Just finally had time to look over the resources everyone tagged in here.

assertr is fantastic, thank you for pointing that out @noamross

I am not super familiar with EML, but I need to get up to speed on it for some work-related things, and from what I know of EML, having some kind of wrapper like @maelle described that takes raw data and EML and tells you what doesn't match could be really exciting, especially since it could help more people actually keep metadata (a big issue in ecology). My 'concern' there would be that EML might not be the best one to choose if we thought the problem was broader than 'just' ecology.

@magpiedin for the example I gave originally, I was talking about species occurrence data, in my dissertation, though I think in ecology the issue is certainly present beyond species occurrence data and would also apply to experimental and other data types.

I'd be really interested in pursuing this as either an idea about developing lessons around an already existing tool, like assertr, or building something new on top that would bring in metadata. Both would be hugely helpful for the problem I had in grad school, and I think to many folks more broadly.

May 08 '18 21:05 aurielfournier

@aurielfournier I've used (and learnt about) EML for an epidemiology research project, and could document everything that had to be documented; no term was missing in the EML standard. But I guess other fields that have their own metadata standards (epidemiology doesn't) could be more problematic? Still, a minimal tool combining EML + EML/eml2 + assertr could be useful as a starting point (and could be developed further for other metadata standards)?

May 09 '18 04:05 maelle

Maybe useful for testing such an EML+assertr tool https://github.com/DeclareDesign/fabricatr https://github.com/ropensci/charlatan

May 09 '18 08:05 maelle
