transform-and-tell
dataset
Hi Alasdair,
I observed the same issue that you mentioned in the paper about the GoodNews dataset: "Many of the articles in GoodNews are partially extracted because the generic article extraction library failed to recognize some of the HTML tags specific to The New York Times."
- Have you tried to re-crawl these articles via their links?
- Is there a similar issue with NYTimes800k? Thanks.
Since about 94% of the captions in GoodNews are also in NYTimes800k (the other 6% now point to dead links I think), you can (almost) reconstruct a cleaner version of GoodNews by taking a subset of NYTimes800k.
We don't have the same issue with NYTimes800k (I wrote a custom parser that takes care of the corner cases). To see this, you can select all articles in NYTimes800k that also appear in GoodNews, and you will see that the average article length in the NYTimes800k subset is 960 tokens, whereas it's only 450 in GoodNews.
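If you want to reproduce that comparison yourself, here is a minimal sketch, assuming the two MongoDB databases are named goodnews and nytimes as in this repo's setup, that both keep an articles collection, and that matching on web_url with a whitespace-tokenized context field is a fair proxy; the exact field layout is an assumption, not this repo's guaranteed schema.

    # Sketch: compare average article length between GoodNews and the
    # matching subset of NYTimes800k. Database, collection, and field
    # names (goodnews, nytimes, articles, web_url, context) are
    # assumptions based on this repo's MongoDB setup.
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    goodnews, nytimes = client.goodnews, client.nytimes

    def article_urls(db):
        # Collect the URL of every article in a database
        return {a['web_url'] for a in db.articles.find({}, projection=['web_url'])}

    def avg_length(db, urls):
        # Average whitespace-token count over the articles at the given URLs
        lengths = []
        for url in urls:
            article = db.articles.find_one({'web_url': url}, projection=['context'])
            if article and article.get('context'):
                lengths.append(len(article['context'].split()))
        return sum(lengths) / max(len(lengths), 1)

    shared = article_urls(goodnews) & article_urls(nytimes)
    print('GoodNews avg tokens:', avg_length(goodnews, shared))
    print('NYTimes800k avg tokens:', avg_length(nytimes, shared))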
In the paper, we didn't fix the GoodNews dataset because we wanted to compare against the numbers reported in the GoodNews paper.
Gotcha, that is good to know.
Hi, I am wondering why the MongoDB dataset is over 40 GB. As far as I can tell from goodnews_flattened.py, we only need the articles, image IDs, and captions, which we can retrieve with the following code. I stored them in JSON, and it only takes about 3 GB. Can you explain this? Thanks.
for sample_id in ids:
    sample = self.db.splits.find_one({'_id': {'$eq': sample_id}})
    # Find the corresponding article
    article = self.db.articles.find_one({
        '_id': {'$eq': sample['article_id']},
    }, projection=['_id', 'context', 'images', 'web_url'])
    # Load the image
    image_path = os.path.join(self.image_dir, f"{sample['_id']}.jpg")
    try:
        image = Image.open(image_path)
    except (FileNotFoundError, OSError):
        continue
    yield self.article_to_instance(article, image, sample['image_index'], image_path)
The database contains the pretrained face embeddings and object embeddings, which are used in the full model. All of the captions and article texts also carry POS and NER annotations from spaCy.
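If you want to see where the space goes, one way is to measure the BSON size of each top-level field in a sample document; the embedding and annotation fields will show up in the output, so none of their names need to be guessed. A quick sketch, assuming the goodnews database name from this repo:

    # Sketch: print the approximate BSON size of each top-level field in
    # one article document, to see how much the embeddings and spaCy
    # annotations add on top of the raw text. The database name
    # 'goodnews' is an assumption based on this repo's setup.
    from bson import BSON
    from pymongo import MongoClient

    client = MongoClient('localhost', 27017)
    article = client.goodnews.articles.find_one()

    for key, value in sorted(article.items()):
        # Encode each field on its own to estimate its share of the document
        size_kib = len(BSON.encode({key: value})) / 1024
        print(f'{key}: {size_kib:.1f} KiB')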
Hi, for the NYTimes800k dataset, is location_aware applied in nytimes_faces_ner_matched.py? Is it used here to extract the 512 tokens around the image? If I want to extract 1000 tokens, do I just change the 512 to 1000?
Thank you.
Yes, location_aware is implemented in nytimes_faces_ner_matched.py. You can see that the code tries to extract the text above the image into the list before, and the text below the image into the list after.
Yes, change it to 1000 if you want 1000 tokens. But note that there's another hard cutoff in the token indexer here, because the BERT/RoBERTa encoders only support sequences of at most 512 tokens.
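For intuition, here is a rough sketch of that kind of windowing (not the repo's exact code): it takes up to n_tokens around the image, alternating between the before and after lists so the text nearest the image is kept. Everything beyond the names before and after is an assumption.

    def extract_context(before, after, n_tokens=512):
        """Keep up to n_tokens nearest the image.

        `before` holds tokens above the image (nearest last) and `after`
        holds tokens below it (nearest first). A sketch, not the repo's code.
        """
        context = []
        i, j = len(before) - 1, 0
        while len(context) < n_tokens and (i >= 0 or j < len(after)):
            if i >= 0:                        # take the next-nearest token above
                context.insert(0, before[i])
                i -= 1
            if len(context) < n_tokens and j < len(after):
                context.append(after[j])      # and the next-nearest token below
                j += 1
        return context

    # e.g. extract_context(['a', 'b', 'c'], ['d', 'e'], n_tokens=4)
    # returns ['b', 'c', 'd', 'e'], dropping the farthest token 'a'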
Yes, thank you.