
Using TAPAS on domain-specific data, training on large tables of domain-specific data for TAPAS

Open sbhttchryy opened this issue 3 years ago • 1 comments

Dear developers, I have the following questions:

  1. I want to use TAPAS on the MIMIC dataset, whose columns contain domain-specific terms like 'CPT_CD, CPT_NUMBER, CPT_SUFFIX'. How should I proceed to fine-tune on a dataset like this?
  2. Additionally, the tables in MIMIC are large. Is there a way to handle tables of this size?

Thanks!

sbhttchryy, Aug 17 '21 09:08

Hi @sbhttchryy, thanks for your interest.

  1. I suggest you use one of the notebooks, which have examples of how to convert the data into the required tf.Example format. Alternatively, you could follow the original format of one of the datasets we use, such as TabFact or WikiSQL, and run the scripts from there.

  2. There are multiple possible approaches, and it's hard to tell in advance which is best. One is to split the table into multiple chunks and then join the results using a heuristic. Another is to try heuristic ways of trimming the table content to restrict it to whatever is relevant. We tried something like this for https://aclanthology.org/2020.findings-emnlp.27/ (mostly at the column level), but in your case the other approaches may make more sense. We will also be releasing the code for https://aclanthology.org/2021.findings-acl.289/ very soon.
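
To make the two ideas in point 2 concrete, here is a minimal sketch in plain Python. It is not TAPAS code: `split_table`, `trim_columns`, `answer_over_chunks` and the `predict` callable are hypothetical names introduced for illustration, and `predict` stands in for whatever model call you use, assumed here to return a `(score, row, column)` triple for one chunk. The "keep the highest score" merge and the token-overlap column filter are just two simple heuristics among many, and scores from different chunks are not guaranteed to be comparable.

```python
from typing import Callable, List, Optional, Tuple

Rows = List[List[str]]
# Hypothetical model interface: (header, rows, question) -> (score, row, column)
Predict = Callable[[List[str], Rows, str], Tuple[float, int, int]]


def split_table(header: List[str], rows: Rows, max_rows: int):
    """Split rows into consecutive chunks, repeating the header for each chunk.

    Returns a list of (row_offset, header, chunk_rows) triples.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    return [
        (start, header, rows[start:start + max_rows])
        for start in range(0, len(rows), max_rows)
    ]


def trim_columns(header: List[str], rows: Rows, question: str):
    """Keep only columns whose name shares a token with the question.

    Assumed heuristic: lowercase token overlap, with '_' treated as a
    separator. If no column matches, the table is returned unchanged.
    """
    q_tokens = set(question.lower().replace("?", " ").split())
    keep = [
        i for i, name in enumerate(header)
        if set(name.lower().replace("_", " ").split()) & q_tokens
    ]
    if not keep:
        keep = list(range(len(header)))
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]


def answer_over_chunks(
    header: List[str], rows: Rows, question: str,
    predict: Predict, max_rows: int,
) -> Optional[Tuple[float, int, int]]:
    """Run predict on every chunk and keep the highest-scoring answer.

    The returned row index refers to the full table, not to the chunk.
    Returns None for a table with no rows.
    """
    best = None
    for offset, chunk_header, chunk_rows in split_table(header, rows, max_rows):
        score, local_row, col = predict(chunk_header, chunk_rows, question)
        if best is None or score > best[0]:
            best = (score, offset + local_row, col)
    return best
```

A typical use would be to call `trim_columns` first to narrow the table to the relevant columns, then pass the result to `answer_over_chunks` with a `max_rows` small enough that each chunk fits the model's input length.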

eisenjulian, Aug 23 '21 17:08