
How to handle datasets without a paper_id

Open cdeil opened this issue 8 years ago • 9 comments

In https://github.com/gammapy/gamma-cat/pull/36#issuecomment-259271929 and https://github.com/gammapy/gamma-cat/issues/9#issuecomment-259950364 @wegenmat @Konstancja and I started to discuss how to input lightcurves that don't have an ADS bibcode, i.e. no paper_id, into gamma-cat.

My proposal would be that we put data_id: some_string for those, and then add a file or folder that looks something like this:

# Internal gamma-cat data references

- data_id: some_string
  url: if available
  comments: received via email from XYZ on ABC or whatever
- data_id: some_other_string
  comments: found it on my backup disk. origin unknown. Fits my theory, so I'm keeping it!

The data_id values should be short strings that can be used in filenames, just like the paper_ids. I could implement this scheme and document it in the coming days.

You could wait for that, or start putting data_id keys in the LC files now, and then add the available info / comments to the data_references.yaml file later.
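A minimal sketch of how entries in such a file could be checked, assuming they have already been loaded into a list of dicts (e.g. with a YAML parser). The function name `check_data_references` and the rule for what counts as "usable in filenames" are assumptions for illustration, not existing gamma-cat code.

```python
import re

# Assumption: "usable in filenames" means letters, digits, underscore and hyphen only.
FILENAME_SAFE = re.compile(r"^[A-Za-z0-9_-]+$")


def check_data_references(entries):
    """Return a list of problems found in internal data reference entries."""
    problems = []
    seen = set()
    for entry in entries:
        data_id = entry.get("data_id")
        if not data_id:
            problems.append("entry without data_id")
            continue
        if not FILENAME_SAFE.match(str(data_id)):
            problems.append(f"{data_id!r} is not filename-safe")
        if data_id in seen:
            problems.append(f"{data_id!r} is duplicated")
        seen.add(data_id)
    return problems


# Hypothetical entries mirroring the example file above.
entries = [
    {"data_id": "some_string", "url": "if available", "comments": "received via email"},
    {"data_id": "some other string", "comments": "origin unknown"},
]
problems = check_data_references(entries)
print(problems)
```

The second entry is reported because its data_id contains spaces; the first one passes.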

@wegenmat @Konstancja @cboisson - Thoughts?

cdeil avatar Nov 11 '16 14:11 cdeil

@cdeil I do not fully get your solution. You want to add data_id in addition to paper_id or instead of it?

Konstancja avatar Nov 11 '16 14:11 Konstancja

You want to add data_id in addition to paper_id or instead of it?

Instead.

  • Put paper_id where available.
  • If not available, put data_id

Then all info is present in the input files, and it should be easy to process it into any format we like in the output files that we offer to users. For the output files, I also have no better idea than to do the same as for the input files and to explain the scheme to users.
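The "paper_id where available, otherwise data_id" rule could be sketched roughly like this when processing input files. The helper name `get_identifier` and the placeholder values are hypothetical; only the two key names come from the discussion.

```python
def get_identifier(entry):
    """Return (kind, value): paper_id if present, otherwise data_id."""
    if entry.get("paper_id"):
        return "paper_id", entry["paper_id"]
    if entry.get("data_id"):
        return "data_id", entry["data_id"]
    # Neither key is set, so the origin of the dataset cannot be traced.
    raise ValueError("entry has neither paper_id nor data_id")


# Placeholder values, not real identifiers.
with_paper = get_identifier({"paper_id": "some_bibcode", "data_id": "some_string"})
without_paper = get_identifier({"data_id": "some_string"})
```

Raising an error for entries with neither key is one possible choice; a processing script could equally collect such entries and report them all at once.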

cdeil avatar Nov 11 '16 14:11 cdeil

Wouldn't it make sense to use one field for both? E.g. 'data_id' could also reference a paper.

Konstancja avatar Nov 11 '16 14:11 Konstancja

OK, one field.

I don't like paper_id (because some aren't papers) and I don't like data_id (because that sounds more like a dataset ID, but a given paper can have multiple datasets, e.g. spectra, lightcurves, other results).

So how about calling it reference? And using reference: tevcat-123456 for internal reference entries (incrementing integer, like we do for source_id)?
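A rough sketch of how the next internal reference could be generated from the ones already in use. The `tevcat-` prefix is taken from the example above; the function name and the six-digit zero padding are assumptions.

```python
import re

PREFIX = "tevcat"  # prefix from the example above
WIDTH = 6          # assumption: zero-padded to six digits, as in tevcat-123456


def next_internal_reference(existing):
    """Return the next free internal reference, ignoring non-internal ones."""
    pattern = re.compile(rf"^{re.escape(PREFIX)}-(\d+)$")
    numbers = []
    for ref in existing:
        match = pattern.match(ref)
        if match:
            numbers.append(int(match.group(1)))
    next_number = max(numbers, default=0) + 1
    return f"{PREFIX}-{next_number:0{WIDTH}d}"


# Placeholder references: two internal ones and one that is not internal.
new_ref = next_internal_reference(["tevcat-000001", "some_bibcode", "tevcat-000007"])
print(new_ref)
```

References that don't match the internal pattern (e.g. ADS bibcodes) are simply skipped, so both kinds can live in the same `reference` field.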

cdeil avatar Nov 11 '16 15:11 cdeil

It is always better not to multiply fields, so one has to be enough. However, I am not sure I understand which light curves without any reference you are referring to. In catalogs you can only have published (authorized) data, no?

cboisson avatar Nov 14 '16 09:11 cboisson

I don't know which cases @Konstancja and @wegenmat have.

But a typical case is:

  1. source detection gets announced in a presentation or poster
  2. a year later the proceeding appears
  3. three years later the paper appears

There are examples of HESS sources that were detected 5 years ago, but the paper, and sometimes even the proceeding, never came.

IMO we should be able to have a mechanism to add the detected source to gamma-cat already at point 1., and just internally have a reference that links to the slides or just mentions some info about where this comes from.

cdeil avatar Nov 14 '16 09:11 cdeil

@cdeil OK, one field named reference.

@cboisson In our case those are usually old data from HEGRA or CAT that were presented at a conference and a paper never followed :(

Konstancja avatar Nov 14 '16 10:11 Konstancja

If those data were never published (no conference proceeding), how do we ensure the data are of good quality? If no name, nobody? In such a case there should be a warning.

cboisson avatar Nov 14 '16 10:11 cboisson

Yes, yes, there are proceedings! But there are proceedings (or papers) with only a plot (no table with flux values), and the authors do not have the data points any more. Anyway, we'll do our best to recover as much as possible.

Konstancja avatar Nov 15 '16 15:11 Konstancja