cypher-shell
Parsing problem
Hi, we have this issue in APOC: https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1286 in particular this comment: https://github.com/neo4j-contrib/neo4j-apoc-procedures/issues/1286#issuecomment-530538305
So I tried to reproduce the problem and exported the dataset into this file with the apoc.export.cypher.all procedure.
Alternatively, you can use this file, which contains only the "bad" line that I extracted from the file above.
And if I execute this:
cat import/twitter.cypher | ./bin/cypher-shell --non-interactive
I get this error:
Invalid input ':': expected <init> (line 2, column 1 (offset: 1)) <line omitted>
So I created these two tests looking for an invalid line:
@Test
public void testParsingTwitterFileWithScanner() throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
    String prefix = ":param rows => ";
    int lineNo = 0;
    try (Scanner scan = new Scanner(this.getClass().getClassLoader().getResourceAsStream("twitter.cypher"), "UTF-8")) {
        while (scan.hasNext()) {
            ++lineNo;
            String line = scan.nextLine();
            if (!line.startsWith(prefix)) {
                continue;
            }
            line = line.substring(prefix.length());
            try {
                mapper.readValue(line, Object.class);
            } catch (Exception e) {
                System.err.println("Scanner: " + lineNo);
                System.err.println("Scanner: " + line);
            }
        }
    }
}
@Test
public void testParsingTwitterFileWithFileUtils() throws IOException {
    ObjectMapper mapper = new ObjectMapper();
    mapper.configure(JsonParser.Feature.ALLOW_UNQUOTED_FIELD_NAMES, true);
    String prefix = ":param rows => ";
    List<String> lines = IOUtils.readLines(this.getClass().getClassLoader().getResourceAsStream("twitter.cypher"), Charset.forName("UTF-8"));
    lines.forEach(line -> {
        if (!line.startsWith(prefix)) {
            return;
        }
        line = line.substring(prefix.length());
        try {
            mapper.readValue(line, Object.class);
        } catch (Exception e) {
            System.err.println("IOUtils: " + line);
        }
    });
}
And while testParsingTwitterFileWithScanner fails 7 times, testParsingTwitterFileWithFileUtils works well.
So I looked at the nextLine method of the Scanner class and found that it uses the pattern "\r\n|[\n\r\u2028\u2029\u0085]" to detect the end of a line. If I open the file with Sublime and look at the first line that breaks, right after the last words reported there is a 0x85, which should be the \u0085 used by the Scanner line pattern.
[Image: Screenshot 2019-09-12 at 16.30.02]
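To make the difference concrete, here is a minimal sketch that is not taken from cypher-shell or APOC. It feeds the same string, containing a U+0085 character inside a single logical line, to Scanner.nextLine() and to BufferedReader.readLine(). The class name LineSplitDemo and the sample text are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class, not part of cypher-shell or APOC.
public class LineSplitDemo {

    // U+0085 (NEXT LINE), the character found in the exported file.
    static final char NEL = 0x85;

    // Splits the text the way Scanner.nextLine() does.
    static List<String> withScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                lines.add(scan.nextLine());
            }
        }
        return lines;
    }

    // Splits the text the way BufferedReader.readLine() does
    // (only \n, \r and \r\n end a line).
    static List<String> withBufferedReader(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // One logical line with a NEL in the middle of a string value.
        String text = ":param rows => [{text: \"before" + NEL + "after\"}]\n";
        System.out.println("Scanner lines:        " + withScanner(text).size());
        System.out.println("BufferedReader lines: " + withBufferedReader(text).size());
    }
}
```

With this input the Scanner version should report two lines and the BufferedReader version one, which matches the behaviour of the two tests above.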
I don't know if you use the Scanner class under the hood, but I hope I have provided enough info to understand where the problem is.
@conker84 are you sure it's not that the cypher export function should be escaping the U+0085 character in some way? It seems to be a valid 'new line' character: https://www.compart.com/en/unicode/U+0085
I thought about that, and it's something we can do as a workaround, but I think the correct behaviour is the one shown by testParsingTwitterFileWithFileUtils, so there is a way to parse the string correctly.
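For reference, the export-side workaround could look roughly like the sketch below. It is only an assumption of how such escaping might be done, not the actual APOC code: the class NelEscaper and the method escapeUnicodeLineBreaks are hypothetical names, and it assumes that the consumer of the exported file accepts \uXXXX escapes inside string literals, which would need to be verified against cypher-shell.

```java
// Hypothetical helper, not part of APOC or cypher-shell.
public class NelEscaper {

    // Replaces the Unicode line separators that Scanner treats as line
    // breaks (U+0085, U+2028, U+2029) with a literal \\uXXXX escape, so
    // that an exported value stays on one physical line.
    static String escapeUnicodeLineBreaks(String value) {
        StringBuilder out = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == 0x85 || c == 0x2028 || c == 0x2029) {
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "before" + (char) 0x85 + "after";
        // The escaped form contains a backslash, 'u' and four hex digits
        // in place of the raw U+0085 character.
        System.out.println(escapeUnicodeLineBreaks(raw));
    }
}
```

This keeps the exported file readable line by line with a Scanner, at the cost of relying on the reader to turn the escape back into the original character.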