jsonlines.org and ndjson.org

Open max-mapper opened this issue 7 years ago • 18 comments

hey I noticed http://ndjson.org/ and http://jsonlines.org/ are very similar, I was just wondering if maybe they could link to each other to reduce confusion? I like both names personally and use them interchangeably

cc @chrisdew

max-mapper avatar Jan 29 '17 20:01 max-mapper

This site links to ndjson from http://jsonlines.org/on_the_web/ and ndjson.org links back here from its footer, is that not sufficient?

wardi avatar Jan 29 '17 20:01 wardi

In the interest of converging on a single standard, would it not be beneficial for these two sites to co-ordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense that it's not ready for interoperation.

karmakaze avatar Aug 14 '18 02:08 karmakaze

created similar issue in the ndjson repo:

  • https://github.com/ndjson/ndjson-spec/issues/35

glensc avatar Mar 03 '21 15:03 glensc

:+1: for just one standard and one web site. Based on a quick look there aren't any real differences except for the extension (.jsonl vs .ndjson). Having a common extension would make it more likely that editors and IDEs support this format without extra configuration.

pekkaklarck avatar Mar 23 '21 13:03 pekkaklarck

Cross-linking is not sufficient. Anyone might google the two names with "vs.", which might lead here.

I'm sure the author was not confused. :) But I'm confused.

This format is specified at ndjson.org and documented at the JSON Lines website.

- https://en.wikipedia.org/wiki/JSON_streaming

onacit avatar Feb 23 '22 06:02 onacit

Observation: If repository issue activity is any metric for discoverability, then JSON Lines has an advantage.

> In the interest of converging on a single standard, would it not be beneficial for these two sites to co-ordinate and agree on items, and ideally just be one .org? Having two similar sites each promoting an emerging 'standard' with differences gives a sense that it's not ready for interoperation.

I understand that, historically, there were spec differences due to potential ambiguity (UTF-8 encoding, required JSON data on every line, etc.), but it seems as though they are now aligned. At this point in time, are there any remaining spec differences? And are there any other issues which are preventing convergence (e.g. copyright credit, etc.)? I think the community will greatly benefit from a single, unified standard with an RFC, registered IANA media type, etc. The involved parties appear to be reasonable and responsive. Can we make this happen?

jsejcksn avatar Jul 06 '22 19:07 jsejcksn

I prefer the name "JSON lines" because that seemed like the obvious name to me :-) but, the ndjson folks did go the extra mile and write a spec.

If we're fully aligned I like the idea of settling on a single name. Is there an unbiased measure we can use for deciding?

wardi avatar Jul 06 '22 20:07 wardi

> Is there an unbiased measure we can use for deciding?

@wardi Names are names and will always be arbitrary/subjective. 😅 I think it's just up to the party that submits the RFC and registers. IMO, a unified standard with either name is better than two ambiguously identical alternatives.

jsejcksn avatar Jul 06 '22 21:07 jsejcksn

The ndjson repo hasn't seen any maintainer activity in years. That makes it both impossible to pick this and have them redirect, and a bad idea to pick them and redirect from here.

remram44 avatar Jul 06 '22 21:07 remram44

The owner of the ndjson domain seems to be fine going forward with jsonlines.org.

pekkaklarck avatar Oct 20 '22 15:10 pekkaklarck

This is a mess. Let's finally reach a decision. My proposal is to take the already existing JSON Text Sequences spec, RFC 7464, and extend it: add the file extension jsonl, and make both the RS symbol and the LF optional.

A good overview of all streaming formats https://en.wikipedia.org/wiki/JSON_streaming

  • Plain concatenated JSON. Each opening bracket must be paired with a closing bracket. There is no spec for this.
  • NDJSON: separator \n (LF); parsers also accept \r\n. File ext: .ndjson, MIME: application/x-ndjson
  • JSON Lines: separator \n; parsers also accept \r\n. File ext: .jsonl, MIME: none
  • RFC 7464: file ext: none, MIME: application/json-seq, which is registered with IANA. Additionally, it uses an RS symbol:

Its basic idea is to have "unambiguous JSON" that is resilient to many forms of damage such as truncation, multiple writers incorrectly configured to write to the same file, corrupted JSON, etc. An example sequence:

    ␞{"d":"2014-09-22T21:58:35.270Z","value":6}␤
    ␞{"d":"2014-09-22T21:59:15.117Z","value":12}␤
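A rough Python sketch of how such a sequence could be written and read back. The helper names `encode_json_seq` and `decode_json_seq` are made up for illustration, and this is a simplification: it splits on RS only and does not check for the trailing LF that the RFC relies on to detect truncated top-level numbers.

```python
import json

RS = "\x1e"  # ASCII Record Separator, shown as ␞ above
LF = "\n"    # line feed, shown as ␤ above


def encode_json_seq(values):
    """Serialize each value as RS + JSON text + LF (hypothetical helper)."""
    return "".join(RS + json.dumps(v, separators=(",", ":")) + LF for v in values)


def decode_json_seq(data):
    """Yield (ok, payload) per RS-delimited chunk; bad chunks are reported, not fatal."""
    for chunk in data.split(RS):
        if not chunk.strip():
            continue  # nothing before the first RS, or two RS in a row
        try:
            yield True, json.loads(chunk)
        except json.JSONDecodeError:
            yield False, chunk


records = [
    {"d": "2014-09-22T21:58:35.270Z", "value": 6},
    {"d": "2014-09-22T21:59:15.117Z", "value": 12},
]
stream = encode_json_seq(records)

# Simulate a writer that was cut off mid-record, followed by healthy records.
damaged = RS + '{"d":"2014-09-22T21:5' + stream
decoded = list(decode_json_seq(damaged))
```

Because every record starts with RS, the reader can resynchronize at the next RS after a truncated record instead of failing on the whole stream.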

From the spec:

Phillip Hallam-Baker proposed the use of JSON text sequences for logfiles and pointed out the need for resynchronization. Stephen Dolan created https://github.com/stedolan/jq, which uses something like JSON text sequences (with LF as the separator between texts on output, and requiring only such whitespace as needed to disambiguate on input). Carsten Bormann suggested the use of ASCII RS, and Joe Hildebrand suggested the use of LF in addition to RS for disambiguating top-level number values.

So in the simplest case, when I know the data is not corrupted, I can simply use concatenated JSON. I can use line separators too, and they'll just be ignored, as whitespace is in ordinary JSON. The only requirement is that the parser accepts multiple documents.

Example 1:

{"id":1}{"id":2}

Example 2: two documents but formatted with a newline

{
  "id":1
}
{
  "id":2
}
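One way to read both examples in Python is the standard library's `json.JSONDecoder.raw_decode`, which parses one value and reports where it stopped. This is only a sketch: `iter_concatenated` is a made-up name, and it assumes the input is not corrupted.

```python
import json


def iter_concatenated(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos] in " \t\r\n":
            pos += 1
        if pos == end:
            break
        value, pos = decoder.raw_decode(text, pos)
        yield value


example_1 = '{"id":1}{"id":2}'
example_2 = '{\n  "id":1\n}\n{\n  "id":2\n}\n'

docs_1 = list(iter_concatenated(example_1))
docs_2 = list(iter_concatenated(example_2))
```

Both inputs give the same two documents, since whitespace between values is simply skipped.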

If some JSON documents may be corrupted, a newline can be used as a separator. But then it can be hard to tell whether a newline was used just for formatting or to split two documents.

Example 3: the first document is broken and doesn't have a closing bracket, but the \n still allows splitting them

{"id":
{"id":2}

Example 4: first doc is broken, then newline, and the second doc is formatted with a newline

{"id":
{
  "id": 2,
  "props": {
    "prop1": 1,
    "prop2": 2
  }
}

But visually we can still tell where the first doc ends and the second starts. So we can use a simple rule: the sequence \n{ starts the next document, i.e. a { at the start of a line without indentation. When a \n is followed by spaces, the current document continues until its closing bracket is found. I think this simple rule should work almost always. In any case, indented JSON makes little sense for JSON streaming and is not expected.
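A minimal Python sketch of that rule, applied to Example 4. `split_documents` and `DOC_BOUNDARY` are names invented here. Note that the rule only covers top-level objects, and it would not split documents concatenated on one line.

```python
import json
import re

# A newline immediately followed by "{" in column 0 starts a new document.
DOC_BOUNDARY = re.compile(r"\r?\n(?=\{)")


def split_documents(text):
    """Return (parsed, broken): values that parsed and raw chunks that did not."""
    parsed, broken = [], []
    for chunk in DOC_BOUNDARY.split(text):
        if not chunk.strip():
            continue
        try:
            parsed.append(json.loads(chunk))
        except json.JSONDecodeError:
            broken.append(chunk)
    return parsed, broken


example_4 = (
    '{"id":\n'
    '{\n'
    '  "id": 2,\n'
    '  "props": {\n'
    '    "prop1": 1,\n'
    '    "prop2": 2\n'
    '  }\n'
    '}\n'
)
parsed, broken = split_documents(example_4)
```

The broken first document ends up in `broken`, while the indented second document is kept in one piece because none of its inner lines start with an unindented {.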

If I need top-level values, the RS may be used optionally. It is up to the producer to decide whether to use the RS or not. A parser may simply be configured to require the RS if it expects top-level values or broken data, i.e. when it needs "unambiguous JSON". So this should be an option of the format, not a requirement. To me, the RS at the beginning still makes little sense for disambiguation, because with threading issues you may just get interleaved lines. It looks like overengineering. But it probably came from real-world usage and problems, so I'm not sure.

@nicowilliams you are the author of RFC 7464. Please give us your thoughts. Is it possible to publish errata for the spec?

cc: @hoegertn @finnp @wardi

Related: the idea of using application/json-seq as the MIME type for JSONL was already discussed in #19

The file extension: both ndjson and jsonl are easy to google. jsonl is easier to pronounce and easier to read at first sight, and jsonl files will also sort more naturally next to existing json files. The MIME type is json-seq, so a file extension of jsons would be more appropriate, but it may cause confusion in conversation. So IMHO the existing jsonl is the better choice.

stokito avatar Feb 22 '23 19:02 stokito

@stokito updating RFC 7464 as you describe sounds good to me.

wardi avatar Feb 22 '23 21:02 wardi

Could we include the MIME type application/jsonl, which seems to be used already by others and is suggested in https://github.com/wardi/jsonlines/issues/19?

sp4ce avatar May 05 '23 16:05 sp4ce

@stokito

Based on the issue at https://github.com/wardi/jsonlines/issues/65#issue-1604557768, I don't think jsonlines is moving toward allowing incomplete records, empty lines, or other kinds of line breaks that don't separate valid JSON records.

I am not sure amending RFC 7464 would be valid in that context. The examples you gave seem to allow that.

To me, streaming JSON is a whole other problem. I think jsonlines is about a succession of valid JSON values, like you would make a succession of API calls to batch input or read some process results (we've been using it with Amazon Comprehend to manage training corpora, for example, or recognition job inputs).
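To make that strict reading concrete, here is a small Python sketch. `read_json_lines` is a hypothetical helper, not part of any jsonlines library, and it assumes that empty lines and incomplete records should be rejected rather than skipped.

```python
import json


def read_json_lines(text):
    """Parse text strictly: every line must hold exactly one valid JSON value."""
    # A single trailing newline after the last record is tolerated.
    if text.endswith("\n"):
        text = text[:-1]
    values = []
    for number, line in enumerate(text.split("\n"), start=1):
        if line.endswith("\r"):
            line = line[:-1]  # accept \r\n on input
        if not line.strip():
            raise ValueError(f"line {number}: empty line")
        try:
            values.append(json.loads(line))
        except json.JSONDecodeError as exc:
            raise ValueError(f"line {number}: invalid JSON ({exc.msg})") from exc
    return values


batch = read_json_lines('{"id":1}\n{"id":2}\n')
```

With this reading, the broken first line from Example 3 above is an error for the whole input instead of something the parser recovers from.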

sp4ce avatar May 06 '23 14:05 sp4ce

Imagine the file extension being a format like .lines.json or .stream.json

Taking inspiration from:

  • Kotlin Gradle files (.gradle.kts)
  • Compressed tarballs (.tar.gz)

.stream.json keeps with the idea that .x.y means it is a y file but for x

ciscorucinski avatar Aug 24 '23 15:08 ciscorucinski

The difference is that a .gradle.kts file is a valid .kts file, and a .tar.gz is a valid .gz file. A .lines.json file is not a valid JSON file, since it contains multiple JSON objects. It needs to be split before it yields valid JSON documents.

So .json.lines would make more sense if anything.
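A quick way to see this in Python, using only the standard `json` module (the sample data is made up):

```python
import json

lines_text = '{"id":1}\n{"id":2}\n'

try:
    json.loads(lines_text)
    whole_file_is_json = True
except json.JSONDecodeError:
    # The parser stops after the first object and rejects the rest as extra data.
    whole_file_is_json = False

# Splitting first yields documents that are each valid JSON on their own.
documents = [json.loads(line) for line in lines_text.splitlines()]
```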

remram44 avatar Aug 24 '23 15:08 remram44

Point taken.

I would still be for .json.stream. It's a higher-level concept that fits all current json streaming formats (I mean the concept is already called streaming).

A good overview of all streaming formats https://en.wikipedia.org/wiki/JSON_streaming

  • Plain concatenated JSON. Each opening bracket must be paired with a closing bracket. There is no spec for this.
  • NDJSON: separator \n (LF); parsers also accept \r\n. File ext: .ndjson, MIME: application/x-ndjson
  • JSON Lines: separator \n; parsers also accept \r\n. File ext: .jsonl, MIME: none
  • RFC 7464: file ext: none, MIME: application/json-seq, which is registered with IANA. Additionally, it uses an RS symbol.

Anyways, just throwing this out there. Glad this concept has been seen. Judging by the emoji reactions, people seem to like the concept but would prefer it swapped around. I'm completely down for that.

ciscorucinski avatar Aug 25 '23 10:08 ciscorucinski

I would rather see an extension that specifically says which it is. We don't use .img for PNG, JPG, BMP, and TIF. Similarly I think those 4 (well, 3) different formats should have different extensions.

remram44 avatar Aug 25 '23 13:08 remram44